Question
Infosys Limited
US
Last activity: 1 Oct 2015 7:29 EDT
Non-English Characters in PDF
Hi, I am generating a PDF in which one of the fields contains Japanese characters. When Japanese characters are assigned, "???" is displayed instead.
We tried to upload the font file 'MS Gothic Regular' in '/app/IBM/WebSphere/AppServer/java/jre/lib/fonts/' directory and set below parameters before calling 'HTMLToPDF' activity
Param.pyPDFEmbedFont = "true"
Param.pyPDFFontsDirectory = "Java/jre/lib/fonts"
These changes didn't work, and then we realised that the font file is of type 'ttc' and not 'ttf'. Is this file type an issue for displaying Japanese characters in a PDF?
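Whether the converter accepts a collection file is not settled by this question alone, but it is easy to confirm what kind of file you actually have, regardless of its extension. The sketch below reads the first four bytes of a font file and compares them with the well-known container tags ('ttcf' for a TrueType Collection, 0x00010000 or 'true' for a single TrueType font, 'OTTO' for OpenType with CFF outlines). The class and method names are made up for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: classifies a font file by its first four bytes. */
final class FontContainerSniffer {

    /** Maps the big-endian 32-bit tag at offset 0 to a container type. */
    static String classify(int tag) {
        switch (tag) {
            case 0x74746366: return "TrueType Collection (.ttc)";        // 'ttcf'
            case 0x00010000: return "TrueType (.ttf)";                   // sfnt version 1.0
            case 0x74727565: return "TrueType (.ttf)";                   // 'true'
            case 0x4F54544F: return "OpenType with CFF outlines (.otf)"; // 'OTTO'
            default:         return "unknown";
        }
    }

    /** Reads the tag from a stream; DataInputStream.readInt is big-endian. */
    static String classify(InputStream in) throws IOException {
        return classify(new DataInputStream(in).readInt());
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FontContainerSniffer <font file> [<font file> ...]
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                System.out.println(name + ": " + classify(in));
            }
        }
    }
}
```

Knowing the container type at least tells you whether "ttc versus ttf" is a variable in your test, before you start changing activity parameters.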
PEG
IN
Can you place the font files in the Windows fonts directory and try again?
C:\Windows\Fonts and Param.pyPDFFontsDirectory = "C:\Windows\Fonts"
Infosys Limited
US
Hi, the server we are using runs Linux as its OS.
PEG
IN
Please refer below link. It may help you.
Updated: 8 Sep 2015 13:18 EDT
Pegasystems Inc.
GB
Hi Vandana,
Converting HTML to PDFs that contain non-ASCII characters requires you to check a number of things; Japanese/Chinese and Thai introduce further complexity as well - so I would suggest you first start with confirming you can produce a PDF that contains (say) accented characters - or Greek/Russian instead.
So : here's what I would do first:
1. Create a simple HTML INPUT file - containing a non-ascii character - I'm just using a Greek 'alpha' here:
<html> <head> </head> <body> α </body> </html>
2. First make sure you can view this with a 'Show-Stream' activity - just to make sure you have correctly setup the input file; this is a two-liner ACTIVITY:
Page-New "ContextPage" (which should be defined on your "Pages and Classes" as being the same class where your HTML rule is).
Show-Stream
When you run this - confirm you see the ALPHA character correctly:
3. Check your 'file.encoding' for your JVM: most probably you will want 'UTF-8'.
In order to support non-ascii characters for HTMLTOPDF ; you have to either specify a CHARSET attribute in your input HTML, or (easier in my view) you can specify the JVM-WIDE setting 'file.encoding' property - and set this to 'UTF-8'.
You can check your existing file.encoding property from SMA - it is on the very first page; scroll to the bottom:
If this is NOT set to UTF-8 and your input file doesn't provide a CHARSET, then generally you are not going to be able to convert anything that isn't ASCII (you *might* have a wider setting than ASCII there - for instance cp1252 - but in general UTF-8 covers this better).
Assuming you have no other reason to specify a 'file.encoding' other than UTF-8, change it to UTF-8 with a JVM startup option like this:
-Dfile.encoding=utf-8
On Tomcat, this is typically done in your 'setenv.bat/sh' file - on other App Servers, it will be done via the 'admin' screen.
You will need to restart after making this change; check again in SMA that the setting has worked.
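If you would rather see the effect from code than from SMA, here is a minimal sketch (the class name is made up) that prints the JVM settings and shows why a narrow encoding produces question marks: when a Java String is encoded with a charset that cannot represent a character, that character is replaced by the charset's replacement byte, which is '?' for US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical check: shows what the JVM default encoding does to non-ASCII text. */
final class EncodingCheck {

    /** Encodes text with the given charset and decodes it again. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // Greek alpha followed by the word "hiragana" written in hiragana.
        String sample = "\u03B1 \u3072\u3089\u304C\u306A";
        Charset[] candidates = {
            StandardCharsets.US_ASCII, Charset.defaultCharset(), StandardCharsets.UTF_8
        };
        for (Charset cs : candidates) {
            boolean intact = roundTrip(sample, cs).equals(sample);
            System.out.println(cs.name() + ": " + (intact ? "sample survives" : "sample is damaged"));
        }
    }
}
```

Run it with the same JVM options as the application server. If the line for the default charset reports damage, the encoding is too narrow for your test characters before fonts even come into play.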
4. Make sure you have a TrueType Fonts directory referenced in your input PARAMs for HTMLTOPDF.
As mentioned on earlier replies: you will almost always need to have FONTS directory configured for the ACTIVITY.
I'm testing on Windows here - so I just make sure I pass in the value "C:\WINDOWS\FONTS" into my "pyPDFFontsDirectory" parameter.
On Linux: this path will vary.
You can actually install some Free Microsoft Fonts on your Linux machine by following this article (or similar) : How to install Microsoft fonts in Linux office suites | PCWorld
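Since the path varies on Linux, it can help to confirm that the directory you pass in really is readable by the server process and really contains font files. This is a small sketch for inspection only; the default path '/usr/share/fonts' is an assumption about a typical distro, and the recursive walk is my choice (I don't know whether the converter itself descends into subdirectories).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists the font files found under a directory. */
final class FontDirectoryLister {

    /** True for the common font file extensions, ignoring case. */
    static boolean isFontFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ttf") || lower.endsWith(".ttc") || lower.endsWith(".otf");
    }

    /** Walks the directory tree and returns the font files in sorted order. */
    static List<Path> listFonts(Path dir) throws IOException {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> isFontFile(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default; pass the directory you give to pyPDFFontsDirectory instead.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/usr/share/fonts");
        if (!Files.isDirectory(dir) || !Files.isReadable(dir)) {
            System.out.println("Not a readable directory: " + dir);
            return;
        }
        List<Path> fonts = listFonts(dir);
        System.out.println(fonts.size() + " font file(s) under " + dir);
        fonts.forEach(p -> System.out.println("  " + p));
    }
}
```

Run it as the same OS user as the application server, so that permission problems show up in the same way.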
5. Try converting the single GREEK character to PDF.
So I tested this on my system and can confirm that the PDF contains the Greek character:
Great: that confirms three things:
1. We are using a file.encoding wide-enough for non-ASCII characters. (UTF-8 here).
2. We have configured our FONTs such that one was found when we converted the PDF.
3. Whatever FONT was used: it contains the correct glyph for the Greek 'alpha' character.
6. Now move onto Double-Byte charactersets (Japanese etc):
Setup a simple input HTML as before, but this time include Japanese Characters.
I have just copied the 'Hiragana' character set here from the Wikipedia page : we could also include Katakana and some Kanji as well here:
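When you build a test string like this, it is useful to know exactly which scripts it contains, because each one needs glyphs in whatever font ends up being used. A minimal sketch using the standard Character.UnicodeScript class (the helper class itself is made up):

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical helper: reports which Unicode scripts occur in a string. */
final class ScriptInventory {

    /** Collects the script of every code point, ignoring shared punctuation and spaces. */
    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            .filter(s -> s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED)
            .forEach(scripts::add);
        return scripts;
    }

    public static void main(String[] args) {
        // Greek alpha, then hiragana, katakana and two kanji, written as escapes
        // so that the source file encoding does not matter.
        String sample = "\u03B1 \u3072\u3089\u304C\u306A \u30AB\u30BF\u30AB\u30CA \u6F22\u5B57";
        System.out.println(scriptsIn(sample));
    }
}
```

For the sample above the set contains GREEK, HIRAGANA, KATAKANA and HAN, so a font that only covers the kana would still leave the kanji as question marks.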
First: confirm with a Show-Stream (not shown here) - and then run it through the conversion again; on my (windows) system - it works - hurrah !
This is probably the point that you find it doesn't work on your Linux system: Windows comes with a full set of comprehensive FONTs containing 'glyphs' for many languages: your Linux distro may or may not have them all installed by default.
If you find that it is still causing issues on your Linux system: move your test to a Windows System (you can even share the same backend DB, so you won't have to re-write any activities or anything) - and confirm you can get that working.
If you can get it working on WINDOWS but not LINUX: then consider this test:
1. COPY your C:\WINDOWS\FONTS directory to a Linux directory.
2. Reference that linux-copy in your activity.
If this works: it is NOT a suitable solution for anything but a test - I am pretty sure Microsoft hold copyright (or something) over those fonts, and you probably shouldn't be using them from Linux.......
Your task will be to find a suitable licensed True Type Font that you can use on your Linux system......
One other thing: you can 'embed' the fonts with HTMLTOPDF: and you may find that you need to use this option - if you are moving your PDF around different operating systems.....
Let us know how you get on, cheers John
Pegasystems Inc.
JP
Cool!
Infosys Limited
US
Hi John
Thank you for the detailed explanation!
I tried to display Greek alpha character you were referring above and it worked fine
I observed that the file encoding parameter is set to UTF-8
And I tried replacing the character with Japanese characters and it works on my Windows PC
Our Infrastructure team is trying to do a configuration upgrade in RHEL to support international language..I'll post the updates if it works out..
Updated: 9 Sep 2015 5:44 EDT
Pegasystems Inc.
GB
[EDIT: I had forgotten to include the notes about the custom CONTROL used for this - now fixed, cheers John]
Hi Vandana,
I wrote an ACTIVITY a while back to automatically generate a HTML input file that contains various UNICODE character sets : including CJK ones.
I have attached a RAP file of this ACTIVITY (TestConvert) and supporting rules:
It works by setting up a Page List of properties 'StartCodePoint, EndCodePoint, Title, UNICODE_Reference' using a Data Transform (SetupInputHTMLRulesToProcess)
The 'StartCodePoint', 'EndCodePoint' are taken from the UNICODE reference here: Code Charts
We use a HTML (PDFTestHTML) rule to loop through the Page List like this -
Here's the main 'forEach' loop in that HTML rule as text:
<pega:forEach name=".CharacterSetEntry">
  <hr/>
  <div class="with_font">
    <pega:include name="ShowCharacters" type="Rule-HTML-Property">
      <pega:param name="Title" ref="$this.Title"/>
      <pega:param name="StartCodePoint" ref="$this.StartCodePoint"/>
      <pega:param name="EndCodePoint" ref="$this.EndCodePoint"/>
      <pega:param name="UNICODE_Reference" ref="$this.UNICODE_Reference"/>
    </pega:include>
  </div>
</pega:forEach>
This HTML rule makes use of a custom CONTROL ("ShowCharacters"), which does the nitty-gritty of converting the codepoint HEX value to an actual character:
Here's a snippet of the main bit of code above:
[...]
<%
  // TBD: put a safety check in here - sanity-check the input number - stop infinite loops (etc)
  try {
    int start = Integer.parseInt(tools.getParamValue("StartCodePoint"), 16);
    int end = Integer.parseInt(tools.getParamValue("EndCodePoint"), 16);
    if (end >= start) {
      for (int i = start; i <= end; i++) {
        // Skip unassigned code points. The loop header already advances i,
        // so an extra i++ here would also skip the next (possibly valid) code point.
        if (!java.lang.Character.isDefined(i)) {
          continue;
        }
        char[] c = java.lang.Character.toChars(i);
%>
<tr>
  <td> <%=c%> </td>
  <td> <%=String.format("0x%02X",i)%>
[...]
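For anyone who wants to try the same idea outside PRPC, here is a stand-alone sketch of that loop in plain Java. It is not the code from the RAP file: the class name and the MAX_RANGE limit are made up, and it emits numeric character references (for example &#x3042;) instead of raw characters, which is my own choice so that the generated HTML stays pure ASCII.

```java
/** Hypothetical stand-alone version of the code point loop. */
final class CodePointTable {

    /** Made-up safety limit so that a bad range cannot produce a huge page. */
    static final int MAX_RANGE = 0x10000;

    /** Builds one HTML table row per assigned code point in [startHex, endHex]. */
    static String rows(String startHex, String endHex) {
        int start = Integer.parseInt(startHex, 16);
        int end = Integer.parseInt(endHex, 16);
        if (start < 0 || end > Character.MAX_CODE_POINT || end < start || end - start >= MAX_RANGE) {
            throw new IllegalArgumentException("Bad code point range: " + startHex + ".." + endHex);
        }
        StringBuilder html = new StringBuilder();
        for (int cp = start; cp <= end; cp++) {
            // Surrogates are not characters on their own and unassigned code points have no glyph.
            boolean surrogate = cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (surrogate || !Character.isDefined(cp)) {
                continue;
            }
            String hex = Integer.toHexString(cp).toUpperCase();
            html.append("<tr><td>&#x").append(hex).append(";</td><td>")
                .append(String.format("0x%02X", cp)).append("</td></tr>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Hiragana letters U+3041..U+3096 as a sample range.
        System.out.println("<html><head><meta charset=\"UTF-8\"></head><body><table>");
        System.out.print(rows("3041", "3096"));
        System.out.println("</table></body></html>");
    }
}
```

Because the output contains no raw non-ASCII bytes, a page generated this way separates the two failure modes: if it still converts to question marks, the encoding is unlikely to be the cause and the fonts are the more likely suspect.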
Then we push the output through HTMLTOPDF - the result is (if everything is working correctly) a PDF containing a range of charactersets including (optionally) CJK characters - and offers some other options (whether to embed fonts, and whether to skip the 'HTML' tidy mechanism).
The result looks like this (attached also):
The RAP file 'PDFTestPageGeneratorV3.jar' is attached: exported from a 717 System - it should work on other versions (it was developed on a V6 system originally in fact).
I haven't actually tried the import on a virgin PRPC717 system - but you should be able to import this as '[email protected]' and then login back as '[email protected]/rules'.
The locked rulesets have the password 'pega'.
Let me know if you have any issues importing or using this - feel free to add your improvements to this and share back to the GCS Community !
Cheers
John
NOTES / TBD for improvement:
It would be nice to generate a Table of Contents (TOC) for the various characters generated at the top of the PDF - with links (if possible) to take you to the correct section.
It would be nice to be able to control which character sets you are interested in.
It would be nice to be able to specify a FONT name to use (per charset?)
It *might* be better to use a Data Table to hold the values, rather than a Data Transform here ? (We are pretty much only doing property-sets).
Check that the custom CONTROL is robust: should guard against infinite loops etc.
I also feel that a facility to generate a test PDF like this, would be a useful facility to have OOTB in PRPC - so I logged a feedback item ( FDBK-11033 : 'Generate A Test PDF' - wizard'); so that this proposal can be reviewed by our Product Management teams as to whether this would be a useful and feasible thing to do or not.
Infosys Limited
US
Hi John,
Thank you so much!!
I am trying to relate this to my scenario of generating a PDF with Japanese text. We need to generate the complete PDF in Japanese (the entire PDF content should be in Japanese). For testing, I am setting only the Customer address value in the PDF below, and it is displayed as '????'.
Could you please guide me on how the above solution provided should be incorporated..
Pegasystems Inc.
GB
Hi Vandana,
Is it specifically Japanese Characters that do not work in this scenario - are you able to add in a Greek (say) character into the input document - and if so - does that appear or not ?
Are you embedding the fonts in your output PDF file (if not, try that option first).
If you do a 'show-stream' of the same input HTML - does this display correctly in your browser or not ?
It may be that the FONTs that are being used do not contain the relevant Japanese Glyphs here.
You could also switch on DEBUG logging for the following class:
com.pega.pegarules.integration.engine.internal.util.PDFUtilsImpl
And re-run your test: look for any errors relating to the 'p4ml.properties<xxx>' file in particular....
If you load the JAR file I provided on your test system - can you use that to ascertain which character sets are and are not working ?
Thanks,
John
Infosys Limited
US
Hi John,
Sorry for the delayed response. As you suggested, I tried setting Greek characters for two fields, and even those characters are displayed as '???'.
And I couldn't find this option 'com.pega.pegarules.integration.engine.internal.util.PDFUtilsImpl' in the Logger.
Server upgrade is still in progress...I will update if it works out..
Pegasystems Inc.
GB
Hi Vandana,
mmmh: I wonder if the Property-Set is working properly here ?
Can you alter your test to use a HTML rule (as before), as this is known to work? We can look into whether the Property-Set is working correctly after that.....
Cheers
John
Pegasystems Inc.
GB
Hi Vandana,
I'm a bit lost here though: you previously demonstrated that you were able to convert an Alpha character ?
Is this a different environment you are working on here ?
Thanks again!
John
Infosys Limited
US
Hi John,
Sorry for the confusion.
I meant I am able to see alpha character when activity is run by adding Show-Stream..
Pegasystems Inc.
JP
If you are looking for open source CJK font, see http://www.google.com/get/noto/#/family/noto-sans-jpan
Updated: 9 Sep 2015 11:56 EDT
Pegasystems Inc.
GB
Hi Chunzhi,
Thanks for the google fonts link : I tried this, but so far getting mixed-results from the 'PDFTestPageGenerator'.
I changed the input HTML file (so that it no longer hardcoded the Microsoft Tahoma font); and instead changed to:
<html> <head> <title> PDF FONTS TEST PAGE </title> <style type="text/css"> .with_font { font-family: "Noto Sans", "Noto Sans CJK JP", sans-serif; } table { border-collapse: collapse; width: 50%; } [...]
And I repointed my activity to use my 'C:\NOTO' fonts (on my server, I didn't install them on my client).
I ran the activity with EMBED fonts switched on:
And this correctly (I can see the font has changed) converted GREEK ok, but none of the others....
Can anybody else try this ? Can they get it to work ?
(I think it must be the styles/fonts settings that need to be changed....)
EDIT: instead of using a separate 'NOTO' directory, I actually 'installed' the FONTS (dragged them to my C:\WINDOWS\FONTS directory) and re-ran the ACTIVITY, again using 'C:\WINDOWS\FONTS' as my fonts directory (and altering the CSS to reference the font name), but still no luck..... I suspect that PD4ML (the underlying library) is 'falling back' to a different font when it can't find the font I specified.
I also don't know where the link between the font name 'Noto Sans CJK JP' and the name of the OTF file is held. Does anyone know about fonts ?
Pegasystems Inc.
JP
Hi John,
It seems we need to use "@font-face" to specify the location of the font file, see attached HTML sample.
And due to the large size of the original font file, making a subset is the more practical way to apply the Noto CJK font.
You can download NotoCJKJP Subset here
Another pitfall is that you have to restart the browser to make the newly installed noto CJK font to take effect.
Pegasystems Inc.
GB
Thanks for this - I got a mixed result when trying to convert this to PDF....
So the first font worked (and I can just about read it - 'Hi Ra Ga Na. Ka Ta Ka Na' (I am a lapsed student of evening classes of Japanese but I never did learn Kanji properly ;-) )
I'm not sure why this is, because you seem to have set up the CSS @font-face directives in the same way, and I have all the Noto Fonts downloaded to my local system....
<!DOCTYPE html>
<html>
<head>
<style>
@font-face {
  font-family: 'notocjkjp';
  font-weight: 100;
  src: local('Noto Sans CJK JP'), /* if the viewer of this html page has installed the noto cjk font locally, then use local font */
       url('file://C:/NOTO/NotoSansCJKjp-Thin.otf'); /* if local font couldn't be found, download it from the specified URL */
}
@font-face { font-family: 'notocjkjp'; font-weight: 200; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-Light.otf'); }
@font-face { font-family: 'notocjkjp'; font-weight: 300; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-DemiLight.otf'); }
@font-face { font-family: 'notocjkjp'; font-weight: 400; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-Regular.otf'); }
@font-face { font-family: 'notocjkjp'; font-weight: 500; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-Medium.otf'); }
@font-face { font-family: 'notocjkjp'; font-weight: 600; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-Bold.otf'); }
@font-face { font-family: 'notocjkjp'; font-weight: 700; src: local('Noto Sans CJK JP'), url('file://C:/NOTO/NotoSansCJKjp-Black.otf'); }
.notocjkjp_thin { font-family: 'notocjkjp'; font-weight: 100; }
.notocjkjp_light { font-family: 'notocjkjp'; font-weight: 200; }
.notocjkjp_demiLight { font-family: 'notocjkjp'; font-weight: 300; }
.notocjkjp_regular { font-family: 'notocjkjp'; font-weight: 400; }
.notocjkjp_medium { font-family: 'notocjkjp'; font-weight: 500; }
.notocjkjp_bold { font-family: 'notocjkjp'; font-weight: 600; }
.notocjkjp_black { font-family: 'notocjkjp'; font-weight: 700; }
[...]
I tried converting with both EMBEDDED fonts and NON-EMBEDDED fonts (the filesize comes out at ~50k for the former and ~8k for the latter - so I think the mechanism is working) - but when I view the Properties|Fonts (I'm using Foxit here) I see that it hasn't used the Fonts specified in the HTML - it has fallen back on 'MS-Mincho' it seems ?
The other thing probably worth noting here: I actually extracted the Noto Fonts to my C:\WINDOWS\FONTS directory as well as to a directory called C:\NOTO.
If I specify my FONTS param ("pyPDFFontsDirectory") as C:\WINDOWS\FONTS - I get the result above (one good line of Japanese Characters, rest are question-marks ('?') characters).
If instead I specify the directory C:\NOTO - then I get ALL question marks:
So I don't think the underlying PD4ML library is choosing the correct Font here - and is defaulting to 'MS-Mincho' for the first line only.....
I copied the Microsoft Font file 'msmincho.ttc' to my C:\NOTO directory and tried again - similar result - although interestingly the English text at the bottom now changes fonts as well:
So I don't think the underlying PD4ML library is picking up the Font specified in the original HTML for some reason here ?
Pegasystems Inc.
JP
Actually the first line of Japanese string doesn't have any style specified.
It seems that with the c:/windows/fonts directory PD4ML can find a default font, but with c:/noto it can't.
I have got some feedback from Krishna in kanban team, https://mesh.pega.com/message/192804?et=watches.email.thread#192804
Updated: 10 Sep 2015 4:21 EDT
Pegasystems Inc.
JP
Just heard from my teammate that they have noticed the same issue with creating PDF which contains Japanese characters.
SR-A7487 has been raised.
Pegasystems Inc.
JP
It turns out that PD4ML is very picky with the name of font files.
Instead of "Noto Sans CJK JP", I used more specific font name like "Noto Sans CJK JP Thin" or "Noto Sans CJK JP Black" which appears at the top of the font file display window when you double click a font file, and then the PDF could show the Japanese characters, but the layout was unacceptable, see attached PDF.
With the default Windows fonts, I only succeeded with "Arial Unicode MS" and "MS UI Gothic".
Pegasystems Inc.
GB
I see what you mean: the formatting of the NOTO fonts seems off - but in other cases it's ok ?
mmmh - definitely good progress though ! - I can see from the attached PDF (again, Foxit) that the fonts are being embedded:
I can also select out the text and paste into here (and other apps - such as notepad) and the formatting seems to sort itself out then:
Black ひらがなカタカナ漢字
(of course: I *think* the font has actually changed here once I copied/pasted it : but the codepoints are preserved...)
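One way to check that claim rather than trust the eye is to dump the code points of the pasted text and compare them with the input. A minimal sketch (the class name is made up):

```java
import java.util.stream.Collectors;

/** Hypothetical helper: prints the code points of a string in U+XXXX notation. */
final class CodePointDump {

    /** Formats every code point, including supplementary ones, as U+XXXX. */
    static String dump(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Paste the text copied from the PDF as the first argument;
        // the fallback is two hiragana and one kanji written as escapes.
        String text = args.length > 0 ? args[0] : "\u3072\u3089\u6F22";
        System.out.println(dump(text));
    }
}
```

If the dump of the pasted text matches the dump of the source string, the PDF holds the right characters and only the rendering (font or layout) is off.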
Can you supply the HTML rule that you used as well ?
Cheers
John
Pegasystems Inc.
JP
Hi John,
Attached is the html that was used to generate PDF.
CZ
Pegasystems Inc.
JP
Just found another open source Japanese true type fonts here, IPAãã©ã³ãã®ãã¦ã³ãã¼ã.
You can use the attached HTML to test the IPA fonts.
Infosys Limited
US
Hi Chunzhi & John,
Thank you very much for the information. We have difficulty uploading and trying different fonts, as this requires multiple approvals from the client side. We have taken it to the PEGA team, who will support us here. We are already generating reports through ADOBE, so we are planning to use that until the issue is resolved.
Regards
Vandana B
-
John Pritchard-Williams