Sumit_PEGA

Member since 2015

6 posts

Rabobank

Posted: 12 Feb 2016 5:10 EST
Last activity: 4 Oct 2018 13:54 EDT

Closed

Solved

PDF Parsing to Clipboard

I need to parse a PDF from an external vendor. It is not an eForm.

Can anyone suggest the best approach? Should we decode the PDF into Base64 data? Also, after parsing I have to update a decision table.

Kindly help with this, as it is an urgent requirement.

Data Integration

Low-Code App Development

Posted: 12 Feb 2016 5:22 EST

JOHNPW_GCS replied to Sumit_PEGA

If you want to extract all the text of the PDF into a PRPC Text Property, you could start by using the 'PDFBox' library, which is included in PRPC by default.

See the PDFBox utility class "PDFTextStripper":

https://pdfbox.apache.org/docs/1.8.10/javadocs/org/apache/pdfbox/util/PDFTextStripper.html

I would suggest something like this:

1. Upload a test PDF as a 'Binary' File.

2. Create a test Activity which does an OBJ-OPEN on that binary file.

3. Extract the 'pyFileSource' property (which is base64 encoded) and convert this base64 into a byte array (there is an OOTB PRPC function for this; I can't quite remember the name at this point).

4. In a Java step, create an instance of the PDFBox class 'org.apache.pdfbox.pdmodel.PDDocument'.

5. Create an instance of PDFTextStripper and extract the text to a Java String (define a 'local' variable for this on the PARAMs tab).

6. Transfer the local variable holding the text into a PRPC Text Property.

Take a look at the OOTB activities HTMLTOPDF and Code-Pega-PDF.View for examples of how to deal with byte arrays that represent PDFs.
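
To make step 3 concrete outside of PRPC, here is a minimal plain-Java sketch of turning base64 text into a byte array and checking that the result looks like a PDF. It uses the standard java.util.Base64 class as a stand-in for the OOTB PRPC function, and the class and method names (PdfBytesSketch, decodeBase64, looksLikePdf) are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PdfBytesSketch {

    /** Decodes base64 text (for example the content of pyFileSource) into raw bytes. */
    public static byte[] decodeBase64(String b64Data) {
        // The MIME decoder tolerates line breaks inside the encoded text.
        return Base64.getMimeDecoder().decode(b64Data);
    }

    /** Returns true if the bytes begin with the "%PDF-" file signature. */
    public static boolean looksLikePdf(byte[] bytes) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (bytes.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (bytes[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for a real file: only the first bytes of a PDF header.
        String b64 = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        byte[] pdfBytes = decodeBase64(b64);
        System.out.println("Looks like a PDF: " + looksLikePdf(pdfBytes));
    }
}
```

A cheap signature check like this can catch the case where the property did not contain a PDF at all, before the bytes are handed to PDFBox.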

Posted: 12 Feb 2016 8:58 EST
Updated: 12 Feb 2016 9:12 EST

JOHNPW_GCS replied to JOHNPW_GCS

As per my notes above, the following works on PRPC 7.2 (probably on lower versions as well, but I have not checked):

JAVA STEP (Step #4) is:

// See also: https://stackoverflow.com/questions/14700241/remove-encryption-from-pdf-file-using-apache-pdfbox/14700523

// B64Data is assumed to hold the base64 text taken from pyFileSource (step 3).
// ExtractedText is the local variable that receives the text (step 5).
com.pega.apache.pdfbox.pdmodel.PDDocument doc = null;
com.pega.apache.pdfbox.util.PDFTextStripper pdfStripper;

java.io.InputStream is = new java.io.ByteArrayInputStream( Base64Util.decodeToByteArray( B64Data ) );
try {
    doc = com.pega.apache.pdfbox.pdmodel.PDDocument.load( is );
    if (doc.isEncrypted()) {
        oLog.info("Document is encrypted: trying to decrypt with blank password");
        try {
            doc.decrypt("");
            doc.setAllSecurityToBeRemoved(true);
        }
        catch (Exception e) {
            throw new PRRuntimeException(e);
        }
    }
    pdfStripper = new com.pega.apache.pdfbox.util.PDFTextStripper();
    ExtractedText = pdfStripper.getText(doc);
}
catch (Exception e) { throw new PRRuntimeException(e); }
finally {
    if (doc != null) {
        // Always release the document; a failure to close is ignored here.
        try { doc.close(); }
        catch (Exception e) { /* nothing more to do */ }
    }
}

You need to upload a PDF to a Rule File Binary; I used the Pega 7.2 Platform Upgrade Guide.

(But this is just for testing. You could also, for instance, create PDFs in memory using HTMLTOPDF or fetch a PDF from a website; so long as you can get the PDF bytes, the same approach should work.)

Running the Activity shows the extracted text in the Clipboard Property.

You would need to write additional logic to parse the text, of course; or you could probably use different PDFBox APIs to parse the structure of the PDF in a different way.

Posted: 12 Feb 2016 12:08 EST

Sumit_PEGA

Rabobank

replied to JOHNPW_GCS

Thanks John,

That was really helpful. I am now trying to parse the String data.

I was thinking that if we could parse directly to XML instead of text, it would be easier.

Thanks,

Sumit

Posted: 12 Feb 2016 12:46 EST

JOHNPW_GCS replied to Sumit_PEGA

It would be easier if it were XML; the tricky bit is getting it into XML, though! :-)

Are you dealing with known structures of PDFs as input? Are you only interested in particular bits of the document?

Are you able to provide an example PDF?

PDFBox has more APIs than just extracting all the text, but you will need to check the Javadocs for all the features.

Cheers

John

Posted: 15 Feb 2016 9:57 EST

Sumit_PEGA

Rabobank

replied to JOHNPW_GCS

Thanks John,

Yes, we can consider it a structured format. Please find the attached screenshot.

I should also mention that the header (Business partner and Domain(s)) is not present on all pages. It appears only on the first page.

I was trying to parse the text based on a pattern (.CO/.COM), but that does not work well.

Can you please point me to the API function for XML conversion, if there is one?

Thanks,

Sumit

Posted: 15 Feb 2016 10:10 EST
Updated: 15 Feb 2016 10:16 EST

JOHNPW_GCS replied to Sumit_PEGA

Hi Sumit,

Thanks for the additional information. You should realize that PDFs are not simply a 'wrapper' with some hidden XML inside them; they are a printable/presentation format, so structures like TABLEs (that exist in HTML/XML) are not necessarily present in a nice easy-to-parse format.

You can get at structures such as 'pages' within PDFs if that will help you; see this Stack Overflow post for more information on that. Possibly you can get other structures such as paragraphs or blocks of text, but I've never gone to that level myself. The Javadocs/examples for PDFBox (or perhaps 'iText', which is also present in PRPC OOTB, although it is quite an old version) may provide examples.

See this Stack Overflow post for more information; it also references Adobe's specification for PDFs.

Are you always looking for URLs in the PDFs? If so, you can probably use a regex for this. Perhaps you will need a 'human-approval' stage at the end, but it should be able to grab a lot of the information that way.

(I'm not sure why you said looking for 'co' / 'com' is not appropriate here. Do you mean it doesn't find all the text you need?)

Additionally: are all the PDFs essentially comprised of two columns of data?

If so, you should be able to use the API to differentiate between text on the left-hand side and text on the right-hand side; you *might* also be able to use the background colour to help you identify the text.
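
As a rough illustration of the left/right idea, the plain-Java sketch below splits text fragments into two columns by comparing each fragment's x position with a boundary. The Fragment type, the boundary value, and the sample coordinates are all hypothetical; in practice the positions would have to come from PDFBox (its PDFTextStripperByArea class may be a good fit, although I have not checked it against the copy bundled with PRPC).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnSplitSketch {

    /** Hypothetical text fragment: horizontal position on the page plus its text. */
    public static final class Fragment {
        final double x;
        final String text;

        public Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Splits fragments into a left and a right column.
     * Element 0 of the result is the left column, element 1 the right column.
     */
    public static List<List<String>> splitColumns(List<Fragment> fragments, double boundaryX) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Fragment f : fragments) {
            // Anything starting left of the boundary belongs to the left column.
            (f.x < boundaryX ? left : right).add(f.text);
        }
        return Arrays.asList(left, right);
    }

    public static void main(String[] args) {
        // Made-up coordinates for a page whose second column starts around x = 300.
        List<Fragment> page = Arrays.asList(
                new Fragment(50, "Acme Ltd"),
                new Fragment(320, "acme.co"),
                new Fragment(50, "Beta BV"),
                new Fragment(320, "beta-shop.nl"));
        List<List<String>> columns = splitColumns(page, 300);
        System.out.println("Left:  " + columns.get(0));
        System.out.println("Right: " + columns.get(1));
    }
}
```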

One more thing: the PDFTextStripper should have returned you a big block of text that includes line endings, so you should be able to parse this text one line at a time, which should then allow you to start locating the text you need.
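
A minimal sketch of that line-by-line idea combined with a regex, in plain Java. The pattern is only an assumption about what a domain name looks like in these documents (dot-separated labels ending in two or more letters) and will need tuning against real PDFs; the class name, method name, and sample text are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainLineScanner {

    // Assumed shape of a domain: one or more dot-terminated labels, then a 2+ letter suffix.
    private static final Pattern DOMAIN = Pattern.compile(
            "\\b(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\\.)+[A-Za-z]{2,}\\b");

    /** Scans the extracted text one line at a time and collects anything domain-like. */
    public static List<String> findDomains(String extractedText) {
        List<String> found = new ArrayList<>();
        // \R matches any line ending (\n, \r\n, ...).
        for (String line : extractedText.split("\\R")) {
            Matcher m = DOMAIN.matcher(line);
            while (m.find()) {
                found.add(m.group().toLowerCase(Locale.ROOT));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample standing in for the text returned by PDFTextStripper.
        String text = "Business partner  Domain(s)\n"
                + "Acme Ltd  acme.co, ACME.COM\n"
                + "Beta BV  beta-shop.nl";
        for (String domain : findDomains(text)) {
            System.out.println(domain);
        }
    }
}
```

Because a pattern like this can also match things that are not domains (for example two words joined by a full stop with no space), a human-approval step on the results is still worth having.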

Thanks,

John

Posted: 29 Mar 2018 5:55 EDT

NavakanthMannem

Macquarie Bank

replied to JOHNPW_GCS

Thanks John, this was really helpful for solving one of my issues. Can you please suggest how to use the same approach to parse PDFs attached to my Case? These PDFs will be attached during case creation through an email listener. I need to parse some information from the PDF and show it on the Case UI.

Posted: 4 Jul 2016 23:16 EDT

BENNVINO

Instellars Global Consulting

replied to Sumit_PEGA

Hi,

I have a requirement:

When a user uploads a PDF file, it should be parsed into XML which will be sent as an input for SOAP integration.

The PDF shouldn't get attached to the case.

Please could you suggest how to achieve this?

Regards,

Benny

Posted: 12 Aug 2016 1:35 EDT

SUSHILTHAKUR

Capgemini

replied to Sumit_PEGA

Hi,

I have a requirement to parse a PDF using an eForm. I am using the activity ExtractDataFromEForm and am able to get the binary data in pyEform, but I get an error saying it is unable to extract data from pyEform. I'm using 7.1.9.

Please help with this.

Posted: 18 Aug 2017 1:46 EDT

PinakiBhattacharya

Bank of Nova Scotia

replied to Sumit_PEGA

Hey Folks,

Has anyone successfully implemented this: opening a fillable PDF in a section, editing it, and on submit saving it back to the case as an attachment?

I am trying to achieve this without any plugin (if at all possible).
