How to parse a Word document? Do we have an OOB option without involving much Java code? I am using 7.1.8 now. Need help, please!

Question

ravisharma1986

Member since 2024

7 posts

HSBC

Posted: 21 Jun 2016 17:37 EDT
Last activity: 12 Jul 2016 9:39 EDT

Closed

Solved

I need help with parsing a Word document in PRPC 7.1.8. I am not very familiar with Java, so any hints, including Java code if it is needed, would be great!

Message was edited by: Marissa Rogers - Added Category

Data Integration

Accepted Solution

Posted: 1 Jul 2016 11:02 EDT

ravisharma1986

HSBC

replied to ravisharma1986

Thanks a lot Osborn for such a detailed explanation. This is very helpful.

However, I approached this requirement with a different method:

Step 1: I converted the Word document into an editable PDF using Adobe Acrobat, so that data can be entered into the PDF just as it would be in Word.

Step 2: I used the eForm Accelerator with that PDF to create an eForm rule and the corresponding Map eForm rule.

Step 3: I then created an activity that calls the OOTB ExtractDataFromeFrom activity. With all the mandatory parameters passed correctly, the data entered in the PDF document was mapped onto my clipboard.

I understand that I had to change the file type, but this way I could easily grab the data I needed from a fixed-format editable PDF document.

Thanks a lot everyone who helped me in this discussion. Appreciate the community.

Thanks

Ravi Sharma

Posted: 23 Jun 2016 6:06 EDT

BaigHabeeb

Virtusa IT Consulting

replied to ravisharma1986

Hi Ravi,

Please check the Mesh discussion "Re: Mapping word document to pega"; I think you have a similar requirement. As far as I know, there is no OOTB activity to parse a Word document, but there are activities to parse Excel and PDF files.

Thanks,

Habeeb Baig

Posted: 23 Jun 2016 9:38 EDT

ravisharma1986

HSBC

replied to BaigHabeeb

Habeeb Baig,

Sorry, I am unable to access the link that you shared. Maybe it has been removed or has restricted access. Can you please share the content that was discussed there?

Thanks

Ravi Sharma

Posted: 23 Jun 2016 11:49 EDT

mjosborn85

Jabil

replied to ravisharma1986

Ravi, Can you comment further on the parsing objectives? We might be able to advise depending on what you're doing after that.

Posted: 23 Jun 2016 14:13 EDT

MikeTownsend_GCS replied to mjosborn85

Ravi,

I agree with Matthew that we need more information to be truly helpful. What is the business problem you are ultimately trying to solve? There isn't an easy out-of-the-box Word parsing API in Pega, so perhaps you would be better served by having us help you find an alternate approach that meets your business needs.

Thanks,

Mike

Posted: 24 Jun 2016 11:50 EDT

ravisharma1986

HSBC

replied to ravisharma1986

We are trying to parse the data in a .docx file. The data is structured in tables in the Word document, with Yes/No answers as checkboxes. The requirement is that when we upload such a Word document to PRPC, we should be able to parse this information and map it to properties.

Posted: 24 Jun 2016 12:52 EDT

MikeTownsend_GCS replied to ravisharma1986

Ravi,

The good news is the data is fairly well structured. While I don't know of a specific API, if docx is an XML file at heart (not really my area), you might be able to use ParseXML rules to digest it and get the values on the clipboard?

Thanks,

Mike

Posted: 24 Jun 2016 13:06 EDT

mjosborn85

Jabil

replied to MikeTownsend_GCS

Yeah, the bad news is all the OOB utilities in this neighborhood are too specific to help with the first step. Good news is that it really isn't that complicated. It is Java, though.

Ravi Sharma, the secret is that the .docx file is just a zip file containing XML. The technique is to identify which zip file entry contains your checkbox, read its XML to a String, parse the string for your values, handoff to the clipboard. If you wish to keep the Java to an absolute minimum, a small function would take some sort of reference to your file and return the string of the zipfile entry. Do everything else in an activity or even data transform. Everything is done in-memory.
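
A minimal sketch of the small function described above, assuming Java 8 or later and that the document is already available as a byte array. The class name DocxEntryReader and its methods are hypothetical names chosen for illustration; they are not Pega APIs. sampleZip exists only so the sketch runs without a real .docx file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: the names are illustrative, not part of Pega.
final class DocxEntryReader {

    // Returns the named zip entry decoded as UTF-8 text, or null if it is absent.
    static String readEntry(byte[] zipBytes, String entryName) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entryName.equals(entry.getName())) {
                    // Copy in a loop: a single read() may return only part of the entry.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zis.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a one-entry zip in memory so the sketch can run without a real .docx.
    static byte[] sampleZip(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip("word/document.xml", "<w:t>Yes</w:t>");
        System.out.println(readEntry(zip, "word/document.xml"));
    }
}
```

Keeping the function to "bytes in, string out" leaves the parsing and the clipboard mapping to an activity or data transform, which keeps the Java small.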

I happen to have a junior developer working on something very similar right now, except we are performing a template replacement on a source document, re-zipping, and sending back to the browser -- all in memory. Mike Townsend, does Pega care if we share the shape of that solution here? I imagine many people have wished they knew how to do this.

Posted: 24 Jun 2016 13:17 EDT

MikeTownsend_GCS replied to mjosborn85

Matthew,

If you have some sample code that solves this problem that other folks could use to kickstart their customization and you want to share it, I'm fine with it. To the best of my knowledge, Pega doesn't object either. If I'm wrong, I'm sure someone will chime in and correct me.

Thanks,

Mike

Posted: 25 Jun 2016 20:21 EDT

ravisharma1986

HSBC

replied to mjosborn85

Could you kindly share code for parsing a .docx file that can be used in an activity? That would be a great help.

Posted: 26 Jun 2016 10:54 EDT

mjosborn85

Jabil

replied to ravisharma1986

The thing to remember is that MS Office documents are just a collection of XML files zipped up. The Java APIs present a virtual filesystem for the file. See https://docs.oracle.com/javase/7/docs/api/index.html?java/util/zip/package-summary.html

Treat the zip contents like you would a regular file on disk -- don't overcomplicate it. Unless the files are very large, you do not need to write anything to disk in your Pega routines -- everything is in memory.

Step 1: Prepare for the Pega coding by unzipping the docx file. Use a command line utility, or else rename the .docx extension to .zip to more easily use your favorite unzip tool. Whatever works. The Java jar command works well...

> jar -tf test.docx

> jar -xvf test.docx

Step 2: Find the XML file that contains whatever you are interested in. It's just text, so a search tool or IDE will work well. In my simple example, the phrase I typed in Word was in word/document.xml.
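
If you would rather let Java do the searching in Step 2, something like the following could work. It is only a sketch: DocxPhraseFinder and its method names are made up for this example, and it assumes Java 8 or later and a document small enough to hold in memory. It returns the name of every zip entry whose text contains the phrase.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: the names are illustrative only.
final class DocxPhraseFinder {

    // Lists the names of all entries whose UTF-8 text contains the phrase.
    static List<String> entriesContaining(byte[] zipBytes, String phrase) {
        List<String> hits = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
                if (text.contains(phrase)) {
                    hits.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    // Builds a small in-memory zip from (name, content) pairs for demonstration.
    static byte[] sampleZip(String... nameThenContent) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < nameThenContent.length; i += 2) {
                zos.putNextEntry(new ZipEntry(nameThenContent[i]));
                zos.write(nameThenContent[i + 1].getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip(
                "[Content_Types].xml", "<Types/>",
                "word/document.xml", "<w:t>Approved</w:t>");
        System.out.println(entriesContaining(zip, "Approved"));
    }
}
```

Word can split a sentence across several runs in the XML, so a long phrase may not match even though it is visible in the document. Searching for one short, distinctive word is more reliable.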

Step 3: Try this in an IDE like Eclipse or NetBeans. Please be kind to yourself: unless you are an old pro, don't develop Java inside Pega. It is death by a thousand cuts. After you run perform(), exercise zipUp() for fun and create a zip file from thin air.

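package replaceme;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class TestClass {

    public static void main(String[] args) {
        TestClass tc = new TestClass();
        tc.perform();
        // tc.zipUp();
    }

    // Reads word/document.xml out of the .docx (a zip archive) and prints it.
    private void perform() {
        byte[] b = null;
        // try-with-resources closes the stream even if an exception is thrown
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(new File("path-to-my-file/test.docx")))) {
            ZipEntry entry = null;
            while (null != (entry = zis.getNextEntry())) {
                //     System.out.println(entry.getName() + " " + entry.isDirectory());
                if ("word/document.xml".equals(entry.getName())) {
                    // entry.getSize() can be -1 and read() may return fewer bytes than
                    // requested, so copy in a loop until the end of the entry
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zis.read(buf, 0, buf.length)) != -1) {
                        out.write(buf, 0, n);
                    }
                    b = out.toByteArray();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        if (b == null) {
            System.out.println("word/document.xml not found");
            return;
        }
        // The XML parts of a .docx are normally UTF-8
        String s = new String(b, StandardCharsets.UTF_8);
        System.out.println(s);
        System.out.println(s.contains("Some text I’m looking for"));
    }

    // Creates a zip file containing two small text entries.
    private void zipUp() {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(new File("path-to-create-my-outputfile/testout.zip")))) {
            ZipEntry ze1 = new ZipEntry("foo.txt");
            ZipEntry ze2 = new ZipEntry("bar.txt");
            byte[] b = "This little piggy went to market.".getBytes(StandardCharsets.UTF_8);
            zos.putNextEntry(ze1);
            zos.write(b, 0, b.length);
            zos.closeEntry();
            b = "This little piggy stayed home.".getBytes(StandardCharsets.UTF_8);
            zos.putNextEntry(ze2);
            zos.write(b, 0, b.length);
            zos.closeEntry();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}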

Step 4: Once you understand the above, revise this code to work into your Pega activity. You are responsible for obtaining a page reference to your Word document. For this example I attached a file to a work object and examined the clipboard to find the pzInsKey. I won't belabor how Pega stores binary content... it's all in the developer help.

Pay attention to the file encoding of the Base64 decode call. This could vary depending on your desktop settings. If in doubt, UTF-8 might be forgiving. Also note that Pega has a couple functions for UTF-8 decoding, but they wrap only the purely string-based decode calls and are useless here. Consider writing your own.
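
The reason string-based decoding does not help is that a .docx is binary zip data, and pushing those bytes through a String can corrupt them. A rough sketch of a byte-oriented decode follows. It assumes the attachment content reaches you as a Base64 string and that java.util.Base64 is available (Java 8 or later, which may not be the case on every Pega 7.1.8 install). AttachmentDecoder and its methods are hypothetical names, not Pega functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: the names are illustrative, not Pega APIs.
final class AttachmentDecoder {

    // Decodes Base64 text straight to raw bytes, never through a String.
    // The MIME decoder is used because it ignores line breaks in the input.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    // A zip archive, and therefore a .docx, starts with the bytes 'P' 'K'.
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 2 && data[0] == 'P' && data[1] == 'K';
    }

    public static void main(String[] args) {
        // Stand-in content: real input would be the attachment's Base64 stream.
        String encoded = Base64.getEncoder()
                .encodeToString("PK-not-a-real-archive".getBytes(StandardCharsets.US_ASCII));
        System.out.println(looksLikeZip(decode(encoded)));
    }
}
```

Checking for the leading "PK" bytes right after decoding is a cheap way to find out whether the decode step went wrong before you start reading zip entries.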

If you haven't done this before, notice the pageContent variable. This Local Variable is declared on the Parameters tab. Then you can use it outright in a Java step and refer to it in other activity steps with the "local." prefix...

[Screenshots of the activity steps appeared here in the original post and are not reproduced.]

If you run that activity, the last step will look odd because Pega won't understand how to render that XML. Nevertheless, I was able to see the text from my Word document. As for parsing your Word text and mapping the result to clipboard properties, I recommend regular expressions for the former, and using a local variable in the Java step for the latter.
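
As a rough illustration of the regular-expression approach, the sketch below pulls the visible text runs and the checkbox states out of the document.xml string. Treat the checkbox pattern as an assumption: it expects content-control checkboxes written as w14:checked elements with w14:val set to 0 or 1, and legacy form-field checkboxes use different markup. Unzip your own document and look at its XML before relying on either pattern. DocumentXmlScanner and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names and patterns are illustrative only.
final class DocumentXmlScanner {

    // Text runs: <w:t>...</w:t>, optionally with attributes such as xml:space="preserve".
    // The pattern does not match other elements that start with "w:t", such as w:tbl or w:tab.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // ASSUMPTION: checkboxes are content controls serialized as <w14:checked w14:val="0|1"/>.
    private static final Pattern CHECKBOX =
            Pattern.compile("<w14:checked\\s+w14:val=\"([01])\"\\s*/>");

    // Returns the raw text of each run, in document order (XML entities are not decoded).
    static List<String> textRuns(String xml) {
        List<String> runs = new ArrayList<>();
        Matcher m = TEXT_RUN.matcher(xml);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    // Returns true for each checked box and false for each unchecked one, in document order.
    static List<Boolean> checkboxStates(String xml) {
        List<Boolean> states = new ArrayList<>();
        Matcher m = CHECKBOX.matcher(xml);
        while (m.find()) {
            states.add("1".equals(m.group(1)));
        }
        return states;
    }

    public static void main(String[] args) {
        String xml = "<w:p><w:r><w:t>Approved?</w:t></w:r></w:p>"
                + "<w14:checkbox><w14:checked w14:val=\"1\"/></w14:checkbox>";
        System.out.println(textRuns(xml));
        System.out.println(checkboxStates(xml));
    }
}
```

Because both lists come back in document order, a fixed-format table can be mapped to properties by position, for example the first checkbox to the first Yes/No property.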

Illustrating all this in an activity is easiest to understand, but personally I would put the Java in one or two Pega functions. Consider working in sanity checks for the size of the file or content.
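
One possible shape for such a size check is a read helper that gives up once a limit is passed. This is a sketch under assumptions: BoundedReader and readAtMost are made-up names, and the right limit depends on your documents. It matters for zip content in particular because a compressed entry can expand to far more than the size of the uploaded file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: the names are illustrative only.
final class BoundedReader {

    // Reads the stream to its end, but fails as soon as more than maxBytes have arrived.
    static byte[] readAtMost(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IllegalStateException(
                            "Content exceeds limit of " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        System.out.println(readAtMost(new ByteArrayInputStream(data), 1024).length);
    }
}
```

The same helper works on a ZipInputStream positioned at an entry, so it can replace the plain copy loop when reading word/document.xml.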

Cheers!

Posted: 24 Jun 2016 18:03 EDT

Mr. Ravi Kumar Pisupati

GovCIO

replied to ravisharma1986

Ravi and I had an offline discussion about this requirement, and I asked him to post the query here because I have not worked on this kind of requirement before. I would also like to see the Java code, or whatever code solves his current issue. I will make a note of it in case it is useful for a future requirement.

Thanks,

Ravi Kumar.
