Friday, September 24, 2010

Troubleshooting Localization

I've been gathering some interesting and useful information while working with Pentaho Reporting, Pentaho Metadata, and characters outside the standard ASCII character set. This bucket of tips will make it into our documentation ASAP, but I thought it prudent to share it with our community even sooner.

IMPORTANT CAVEAT: Where I specify UTF-8, I am only using it as a reference encoding. In most cases the encoding I speak of can be any encoding that covers the characters you need; UTF-8 is a common choice for multi-national apps because it can represent characters from virtually every language.

Character encoding is key to displaying multi-byte or special characters from outside the standard ASCII character set. Any text-based file that contains special characters in their glyph form must be encoded as UTF-8, or in another character encoding that covers the language you are attempting to display.

The character encoding is significant no matter where these characters reside or travel. If the file or database stores the characters as UTF-8, then Java must handle those characters as UTF-8, and wherever the characters end up, be it a browser window or a system file, the destination must also render them using the same character encoding.
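To make the mismatch concrete, here's a minimal Java sketch that writes a string as UTF-8 and then reads the same bytes back as ISO-8859-1. The class and method names are mine, purely for illustration; nothing here is a Pentaho API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Pentaho.
class EncodingMismatchDemo {

    /** Encodes text to bytes with one charset, then decodes those bytes with another. */
    static String writeThenRead(String text, Charset writtenAs, Charset readAs) {
        byte[] stored = text.getBytes(writtenAs);
        return new String(stored, readAs);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII, so the demo does not
        // itself depend on the encoding the compiler assumes.
        String original = "caf\u00e9";

        String matched = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("same encoding, text intact:      " + matched.equals(original));    // true
        System.out.println("different encoding, text intact: " + mismatched.equals(original)); // false
        System.out.println("length before/after mismatch:    "
                + original.length() + "/" + mismatched.length());                            // 4/5
    }
}
```

The mismatched read turns the single accented character into two unrelated ones, which is the classic symptom of a writer and a reader disagreeing on the encoding.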

So, this means:

1. Check any and all TXT or CSV files in an appropriate editor to verify that they are saved in the correct character encoding. In a pinch, Notepad will do, but if you are serious about localization, it's in your best interest to invest in or download a good Unicode text editor.

2. Make sure that your HTML and XML files declare your chosen encoding as the character set: XML files in the XML declaration, HTML files in a meta tag. For example:

<?xml version="1.0" encoding="UTF-8"?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">


If the tag actually appears in an XHTML document (as suggested by the XML declaration), the content type should be application/xhtml+xml, which is the registered media type for XHTML, and the meta tag should be self-closing, like so:

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

3. The Pentaho BI Server allows you to specify a default encoding in a context parameter in the web.xml file of the webapp. This "default encoding" applies to any XML documents that the server generates: the platform adds an XML prolog to these documents and sets the encoding to that of the BI Platform, which comes from web.xml. By default, the server assumes this is UTF-8. If you want a different default encoding, specify it in web.xml.

4. You also want to make sure that the default encoding that Java (specifically, the JVM that is running the Pentaho application) is using matches the encoding that the Pentaho application is using. We just mentioned that the default encoding for the Pentaho BI Server is UTF-8. So, what is the default encoding for the JVM? The JVM determines its encoding from the system property "file.encoding". As of Java 1.4.2, this property is available and is set from the default OS locale. However, on Windows systems, this default locale may not exist, so Java makes a best guess. As you can see, knowing what the default encoding is can be a bit nebulous, so we recommend setting the encoding for Java on the command line:

java -Dfile.encoding=UTF-8

You will want to add this command line parameter to the startup script of any Pentaho application that you are attempting to use internationally. For the Pentaho BI Server specifically, set this parameter in the start-pentaho.bat or start-pentaho.sh script.

It's important to note that we don't demand UTF-8. We do (for now) demand that file.encoding match whatever the web.xml context parameter "encoding" says. So, as long as this param says ISO-8859-1 and file.encoding says ISO-8859-1, you're still good.
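Since the JVM's default can be hard to predict, a quick way to find out what you actually got is to ask the JVM itself. Here's a minimal sketch (class and method names are illustrative, not from Pentaho) that prints the file.encoding property and the default charset and compares the latter with the encoding you expect. It only covers the JVM side; you still have to compare the result with the "encoding" context parameter in web.xml by hand.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of Pentaho.
class JvmEncodingCheck {

    /** True if both names resolve to the same charset (names are case-insensitive). */
    static boolean sameCharset(String first, String second) {
        return Charset.forName(first).equals(Charset.forName(second));
    }

    /** True if the JVM's default charset is the expected one. */
    static boolean jvmDefaultMatches(String expected) {
        return Charset.defaultCharset().equals(Charset.forName(expected));
    }

    public static void main(String[] args) {
        // The expected encoding is passed as the first argument; UTF-8 if omitted.
        String expected = args.length > 0 ? args[0] : "UTF-8";

        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("JVM default charset:    " + Charset.defaultCharset().name());
        System.out.println("matches " + expected + ": " + jvmDefaultMatches(expected));
    }
}
```

Run it with the same -Dfile.encoding flag that your startup script uses, for example java -Dfile.encoding=ISO-8859-1 JvmEncodingCheck ISO-8859-1, so that you are inspecting the same configuration the server will see.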

Next, understand that common fonts do not contain glyphs for every character that Unicode (or another extended character set) can represent. So, while your encoding may be correct, if you specify a font that doesn't include the glyph for a multi-byte character, it's likely to render as a square, a question mark, or some other seemingly unrelated character.

A good test font on Windows systems is "Arial Unicode MS", which is distributed with MS Office and is claimed to have a glyph for every Unicode character. Its ability to represent every character makes it a good TEST font, but that comes with a price: the font is nearly 24 MB. You do not want to recommend this as a production font, since as a best practice guideline we tell customers to embed their fonts with certain output formats, and this font would equate to staggering overhead in download sizes. The proper recommendation is to tell customers to find the font that best represents the languages of that report's consumer base.
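If you want a rough idea of whether a candidate font covers the characters in your reports, here's a sketch that uses the standard java.awt.Font.canDisplay call. Two assumptions to flag loudly: the class and method names are mine, and this asks Java AWT about its fonts, whereas the reporting engine keeps its own font registry, so treat the answer as an approximation only.

```java
import java.awt.Font;
import java.util.function.IntPredicate;

// Hypothetical helper; it queries Java AWT, not the Pentaho font registry.
class GlyphCoverageCheck {

    /** Returns the first code point that canDisplay rejects, or -1 if all pass. */
    static int firstUnsupportedCodePoint(String text, IntPredicate canDisplay) {
        return text.codePoints()
                .filter(codePoint -> !canDisplay.test(codePoint))
                .findFirst()
                .orElse(-1);
    }

    public static void main(String[] args) {
        String fontName = args.length > 0 ? args[0] : "Arial Unicode MS";
        // Sample text as Unicode escapes: three CJK characters plus accented Latin letters.
        String sample = "\u65e5\u672c\u8a9e \u00e9\u00f1\u00fc";

        Font font = new Font(fontName, Font.PLAIN, 12);
        // Java substitutes a default font when the name is unknown, so print
        // the family that was actually resolved.
        System.out.println("requested: " + fontName + ", resolved: " + font.getFamily());

        int missing = firstUnsupportedCodePoint(sample, codePoint -> font.canDisplay(codePoint));
        if (missing < 0) {
            System.out.println("every sample character has a glyph");
        } else {
            System.out.println("no glyph for U+" + Integer.toHexString(missing).toUpperCase());
        }
    }
}
```

Swap the sample string for text that is representative of your own report data; a font that passes for one language tells you nothing about another.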

So how do we control which fonts and encodings are used in Pentaho Reports? It's a bucket of valuable information I'm attempting to summarize here:

First, encodings:

In Pentaho reports, there are global configuration properties for the different output formats. The global report engine configuration can be found in the Pentaho BI Server installation under the pentaho webapp: pentaho/WEB-INF/classes/classic-engine.properties.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.Encoding=UTF-8
org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.Encoding=UTF-8
org.pentaho.reporting.engine.classic.core.modules.output.table.csv.Encoding=UTF-8
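To check that all three output formats agree with the encoding you intend, something like the following sketch could help. The class and method names are hypothetical; only the three property keys come from the configuration above. Note that it compares encoding names literally (ignoring case), so an alias of the same encoding would be flagged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper; not part of the Pentaho reporting engine.
class ReportEncodingCheck {

    static final String PREFIX = "org.pentaho.reporting.engine.classic.core.modules.output.";

    // The three output-format keys listed above.
    static final String[] ENCODING_KEYS = {
        PREFIX + "table.html.Encoding",
        PREFIX + "pageable.pdf.Encoding",
        PREFIX + "table.csv.Encoding"
    };

    /**
     * Returns "key=value" for every key whose value differs from the expected
     * encoding. A missing key is reported as "key=null", because the engine
     * would then use a default that this sketch does not know.
     */
    static List<String> mismatches(Properties props, String expected) {
        List<String> problems = new ArrayList<>();
        for (String key : ENCODING_KEYS) {
            String value = props.getProperty(key);
            if (value == null || !value.trim().equalsIgnoreCase(expected)) {
                problems.add(key + "=" + value);
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to classic-engine.properties, args[1]: expected encoding
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            props.load(in);
        }
        List<String> problems = mismatches(props, args[1]);
        System.out.println(problems.isEmpty()
                ? "all output encodings are " + args[1]
                : "check these entries: " + problems);
    }
}
```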


And fonts:

1. If you have a metadata model in play, make sure that the metadata concept properties for the font-family are all set to a font that is installed on the server serving up the model and is capable of rendering the special characters you need represented. There is a Base concept (found in the Concept Editor) with a default font-family; verify that it is configured correctly, and modify it if it is not.

2. If you are using any of the templates designed for Report Design Wizard or Web Adhoc Query and Reporting (WAQR), verify that those templates use a font that is capable of rendering the special characters you need represented, and modify them if they do not. The templates for Report Design Wizard are found in the Report Designer's /templates directory. The templates for WAQR are found in the Pentaho BI Server solutions directory under pentaho-solutions/system/waqr/templates.

3. On Windows, what determines whether Pentaho can find an installed font? A few things! First, look in the Windows Control Panel (or modern equivalent), under Fonts... these are the fonts that should be available to the reports generated by the Pentaho BI Server. If for some reason you want to include a font not in the system fonts directory, you can add additional directories of fonts.

This is done using a configuration file that you create and place in the Pentaho webapp WEB-INF/classes directory. It overrides the configuration file found in the libfonts-x.x.x.jar library on the Pentaho webapp's primary classpath. The name of the libfont report configuration is libfont.properties. Create this file, place it in the classes directory, and add the following configuration property to it, substituting your own font location.

org.pentaho.reporting.libraries.fonts.extra-font-dirs.myNewDir=c:/myNewDir/myFonts

Note: There is an open issue with this property that should be fixed with the SUGAR release of the Pentaho BI Server: http://jira.pentaho.com/browse/PRD-2145.

4. Our best practice recommendation for ensuring the proper rendering of special characters in PDF reports is:
a. Embed the font. This can be accomplished using the following global reporting configuration property: org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.EmbedFonts=true
b. The font should be a TrueType font.
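For the extra font directories in item 3, a typo in the path is an easy mistake to make. Here's a small sketch that reads a libfont.properties file and lists any extra-font-dirs entries that do not point at an existing directory. The class and method names are hypothetical, and the check says nothing about whether the reporting engine will actually pick the directory up; it only confirms the path exists on disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper; not part of the Pentaho font library.
class ExtraFontDirCheck {

    static final String PREFIX = "org.pentaho.reporting.libraries.fonts.extra-font-dirs.";

    /** Returns "name -> path" for each extra font dir entry that is not an existing directory. */
    static List<String> missingDirs(Properties props) {
        List<String> missing = new ArrayList<>();
        // TreeSet gives a stable, alphabetical order for the report.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            if (!key.startsWith(PREFIX)) {
                continue;
            }
            String dir = props.getProperty(key);
            if (!Files.isDirectory(Paths.get(dir))) {
                missing.add(key.substring(PREFIX.length()) + " -> " + dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to libfont.properties
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            props.load(in);
        }
        List<String> missing = missingDirs(props);
        System.out.println(missing.isEmpty()
                ? "every extra font directory exists"
                : "not found on disk: " + missing);
    }
}
```

Forward slashes in the path, as in the example property above, avoid the backslash-escaping rules of .properties files on Windows.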

Also important to note is that you can confirm what fonts the Pentaho BI Server is aware of, as the reporting engine creates a cache of the fonts it has registered. If you are at all concerned that the server hasn't correctly registered a new font from the system, you can blow away the cache, restart the server, and the reporting engine will load all system fonts anew.

The cache exists at $HOME/.pentaho/cache/libfonts.
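If you clear the cache often, for instance while testing new fonts, a small utility along these lines can save some clicking. This is only a sketch: the class and method names are mine, and it assumes that $HOME is the home directory of the account that runs the server, which is what Java reports as user.home. Stop the server before deleting anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical utility; not shipped with Pentaho.
class FontCacheCleaner {

    /** Assumes $HOME is the home directory of the account that runs the server. */
    static Path defaultCacheDir() {
        return Paths.get(System.getProperty("user.home"), ".pentaho", "cache", "libfonts");
    }

    /** Deletes root and everything under it; returns the number of entries removed. */
    static int deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return 0;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            List<Path> entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path entry : entries) {
                Files.delete(entry);
            }
            return entries.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path cache = defaultCacheDir();
        // Nothing is deleted unless --delete is passed explicitly.
        if (args.length > 0 && args[0].equals("--delete")) {
            System.out.println("removed " + deleteRecursively(cache) + " entries from " + cache);
        } else {
            System.out.println("font cache location: " + cache + " (pass --delete to remove it)");
        }
    }
}
```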

3 comments :

rpbouman said...

Hi Gretchen!

thanks for this info - very useful.

I think I spotted one tiny error: I think that for an HTML document, the meta tag should be:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >

(so, http-equiv instead of equiv, and no self-closed tag)

And, if it actually appears in an xhtml document (as suggested by the xml declaration) the content type should probably be text/xhtml, and the meta tag should be closed in itself like so:

<meta http-equiv="Content-Type" content="text/xhtml; charset=UTF-8" />

Daniël van Eeden said...

The DejaVu fonts do have good unicode coverage and are available for both Windows and UNIX systems. And they're available under a free license which can be handy if you want to distribute them with your application.

http://en.wikipedia.org/wiki/DejaVu_fonts

Gretchen said...

Thanks for the correction Roland, I've incorporated it into the post :)

Daniël, thanks for the font tip!

kindest regards,
G