Question
Rabobank
NL
Last activity: 4 Oct 2018 13:54 EDT
PDF Parsing to Clipboard
I need to parse a PDF from an external vendor; it is not an eForm.
Can anyone suggest the best approach? Should I decode the PDF into Base64 data? After parsing, I also have to update a decision table.
Kindly help, as this is an urgent requirement.
**Moderation Team has archived post**
This post has been archived for educational purposes. Contents and links will no longer be updated. If you have the same/similar question, please write a new post.
Pegasystems Inc.
GB
If you want to extract all the text of the PDF into a PRPC Text Property, you could start by using the 'PDFBox' library - this is included in PRPC by default.
See the PDFBox utility class "PDFTextStripper":
https://pdfbox.apache.org/docs/1.8.10/javadocs/org/apache/pdfbox/util/PDFTextStripper.html
I would suggest something like this:
1. Upload a test PDF as a 'Binary' File.
2. Create a test Activity which does an OBJ-OPEN on that binary file.
3. Extract the 'pyFileSource' property (which is base64 encoded) and convert this base64 into a byte array (there is an OOTB PRPC function for this - I can't quite remember the name at this point).
4. In a Java Step: create an instance of the PDFBox class 'org.apache.pdfbox.pdmodel.PDDocument'.
5. Create an instance of PDFTextStripper and extract the text to a Java String (define a 'local' variable for this on the PARAMs tab).
6. Transfer the Local Variable holding the text into a PRPC Text Property.
Take a look at the OOTB activities HTMLTOPDF and Code-Pega-PDF.View: for examples of how to deal with byte arrays that represent PDFs.
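Step 3 can be sketched in plain Java. This is a minimal sketch that uses the JDK's `java.util.Base64` in place of the OOTB PRPC function (whose name is not given above); the class and method names are made up for illustration. It decodes the base64 text, checks that the bytes start with the `%PDF-` marker that PDF files begin with, and wraps them in a stream that a PDF library can load.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: not part of PRPC or PDFBox.
final class PdfBytes {

    // Decode base64 text (such as the value read from pyFileSource) into raw bytes.
    // The MIME decoder skips line breaks that some encoders insert.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    // PDF files begin with the ASCII marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Wrap the bytes in a stream, ready to hand to a PDF library.
    static InputStream asStream(byte[] data) {
        return new ByteArrayInputStream(data);
    }
}
```

Checking the marker before loading gives a clearer error when the property turns out to hold something other than a PDF.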
Updated: 12 Feb 2016 9:12 EST
Pegasystems Inc.
GB
As per my notes above, the following works on PRPC 7.2 (probably lower versions as well, but not checked):
JAVA STEP (Step #4) is:
    // See also: https://stackoverflow.com/questions/14700241/remove-encryption-from-pdf-file-using-apache-pdfbox/14700523
    com.pega.apache.pdfbox.pdmodel.PDDocument doc = null;
    com.pega.apache.pdfbox.util.PDFTextStripper pdfStripper;
    // B64Data holds the base64 text of the PDF (see step 3 above)
    java.io.InputStream is = new java.io.ByteArrayInputStream(Base64Util.decodeToByteArray(B64Data));
    try {
        doc = com.pega.apache.pdfbox.pdmodel.PDDocument.load(is);
        if (doc.isEncrypted()) {
            oLog.info("Document is encrypted: trying to decrypt with blank password");
            try {
                doc.decrypt("");
                doc.setAllSecurityToBeRemoved(true);
            } catch (Exception e) {
                throw new PRRuntimeException(e);
            }
        }
        // Extract all the text into the local variable ExtractedText
        pdfStripper = new com.pega.apache.pdfbox.util.PDFTextStripper();
        ExtractedText = pdfStripper.getText(doc);
    } catch (Exception e) {
        throw new PRRuntimeException(e);
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (java.io.IOException e) {
                // Nothing more to do if closing fails
            }
        }
    }
You need to upload a PDF to a Rule File Binary: I used the Pega 7.2 Platform Upgrade Guide as my test PDF.
(But this is just for testing: you could also, for instance, create PDFs in memory using HTMLTOPDF or fetch a PDF from a website. So long as you can get the PDF bytes, the same approach should work.)
Running the Activity shows the extracted text in the Clipboard Property.
You would need to write additional logic to parse the text, of course; alternatively, you could probably use different PDFBox APIs to parse the structure of the PDF in a different way.
Saheli Ghosh
Rabobank
NL
Thanks John,
That was really helpful. I am now trying to parse the String data.
I was thinking that, instead of text, it would be easier if we could parse the PDF directly to XML.
Thanks,
Sumit
Pegasystems Inc.
GB
It would be easier if it was XML; the tricky bit is getting it into XML though! :-)
Are you dealing with known structures of PDFs as input? Are you only interested in particular bits of the document?
Are you able to provide an example PDF?
PDFBox has more APIs than just extracting all the text - you will need to check the Javadocs for all the features though!
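Once the values you care about have been pulled out of the extracted text, wrapping them in XML is the easy part. Below is a minimal sketch using only the JDK's DOM and Transformer APIs (not PDFBox or PRPC). The class name and the element names `domains` and `domain` are made up for illustration; the real element names would come from whatever schema the consumer expects.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: element names are placeholders.
final class ValuesToXml {

    // Build <domains><domain>...</domain>...</domains> from already-parsed values.
    static String toXml(List<String> values) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("domains");
            doc.appendChild(root);
            for (String value : values) {
                Element item = doc.createElement("domain");
                // Special characters such as & and < are escaped on serialization.
                item.setTextContent(value);
                root.appendChild(item);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }
}
```

Building the document through the DOM rather than by string concatenation means escaping is handled for you.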
Cheers
John
Rabobank
NL
Thanks John,
Yes, we can consider it to be in a structured format. Please find the screenshot attached.
I should also mention that the header (Business partner and Domain(s)) does not appear on every page; it is only on the first page.
I was trying to parse the text by pattern (.CO/.COM), but that is not appropriate.
Can you please point me to the API function for XML conversion, if there is one?
Thanks,
Sumit
Updated: 15 Feb 2016 10:16 EST
Pegasystems Inc.
GB
Hi Sumit,
Thanks for the additional information. You should realize that PDFs are not simply a 'wrapper' with some hidden XML inside them; they are a printable/presentation format, so structures like TABLEs (that exist in HTML/XML) are not necessarily present in a nice easy-to-parse format.
You can get at structures such as 'pages' within PDFs if that will help you: see this Stack Overflow post for more information on that. Possibly you can get other structures such as paragraphs or blocks of text, but I've never gone to that level myself. The Javadocs/examples for PDFBox (or perhaps 'itext', which is also present in PRPC OOTB, although it is quite an old version) may provide examples.
See this Stack Overflow post for more information; it references Adobe's specification for PDFs as well.
Are you always looking for URLs in the PDFs? If so, you can probably use a regular expression for this. Perhaps you will need a 'human-approval' stage at the end, but it should be able to grab a lot of the information that way.
(I'm not sure why you said looking for 'co'/'com' is not appropriate here. Do you mean it doesn't find all the text you need?)
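A minimal sketch of the regular-expression idea, using only `java.util.regex`. The pattern and the class name are assumptions for illustration: it matches anything that looks like a domain name (labels separated by dots, ending in at least two letters), rather than only '.co' or '.com', and is not tuned to any particular PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the pattern is a rough approximation of a domain name.
final class DomainFinder {

    // One or more dot-terminated labels followed by a final label of 2+ letters.
    private static final Pattern DOMAIN = Pattern.compile(
            "\\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z]{2,}\\b",
            Pattern.CASE_INSENSITIVE);

    // Return every domain-like token in the text, in order of appearance.
    static List<String> findDomains(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DOMAIN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

A pattern like this will also pick up other dotted tokens (file names such as `report.pdf`, for example), which is one reason a human-approval stage is worth keeping.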
Additionally: are all the PDFs essentially comprised of two columns of data?
If so, you should be able to use the API to differentiate between text on the left-hand side and text on the right-hand side. (You *might* also be able to use the background colour to help you identify the text.)
One more thing: the PDFTextStripper should have returned you a big block of text that includes line endings, so you should be able to parse this text one line at a time, which should then allow you to start locating the text you need.
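Parsing one line at a time could look something like this. It is a minimal sketch in plain Java with a made-up class name: it splits the extracted text on any line ending, trims each line, and drops the blank ones, leaving a list you can then search.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for working through extracted text line by line.
final class ExtractedLines {

    // Split on any line ending (\n, \r\n, \r), trim, and skip blank lines.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Splitting on `\R` rather than `\n` avoids stray carriage returns when the text uses Windows-style line endings.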
Thanks,
John
Virtusa
AU
Thanks John, this was really helpful for solving one of my issues. Can you please suggest how to use the same approach to parse PDFs attached to my Case? These PDFs will be attached during case creation through an email Listener. I need to parse some information from the PDF and show it on the Case UI.
Instellars Global Consulting
IN
Hi,
I have a requirement:
When a user uploads a PDF file, it should be parsed into XML which will be sent as an input for SOAP integration.
The PDF shouldn't get attached to the case.
Please could you suggest how to achieve this?
Regards,
Benny
Capgemini
IN
Hi ,
I have a requirement to parse a PDF using eForm. I am using the activity ExtractDataFromEForm and am able to get the binary data on pyEform, but I get an error that it is unable to extract data from pyEform. I am using 7.1.9.
Please help with this.
Bank of Nova Scotia
CA
Hey Folks,
Has anyone successfully implemented this: open a fillable PDF in a section, edit it, and on submit save it back to the case as an attachment?
I am trying to achieve this without any plugin (if at all possible!).