Question
HSBC
IN
Last activity: 12 Jul 2016 9:39 EDT
How do I parse a Word document? Is there an OOB option that doesn't involve much Java code? I am using 7.1.8. Need help please!
I need help parsing a Word document in PRPC 7.1.8. I am not very familiar with Java, so if Java is needed, any hints or sample code would be great!
Message was edited by: Marissa Rogers - Added Category
Accepted Solution
HSBC
IN
Thanks a lot Osborn for such a detailed explanation. This is very helpful.
However, I approached this requirement with a different method:
Step 1: I converted the Word document into an editable PDF document using Adobe Acrobat, so that data can be entered into the PDF just as we do in Word.
Step 2: I used the eForm Accelerator to create an eForm rule and the corresponding Map eForm rule from the PDF I created.
Step 3: I then created an activity that calls the OOTB ExtractDataFromeFrom activity. After passing all the mandatory parameters correctly, I was able to map the data entered in the PDF document onto my clipboard.
I understand that I had to change the file type, but this way I could easily grab the data I needed from a fixed-format editable PDF document.
Thanks a lot everyone who helped me in this discussion. Appreciate the community.
Thanks
Ravi Sharma
Virtusa IT Consulting
AE
Hi Ravi,
Please check this Mesh discussion, "Re: Mapping word document to pega". I think you have a similar requirement. As far as I know there is no OOTB activity to parse a Word document, but there are activities to parse Excel and PDF files.
Thanks,
Habeeb Baig
HSBC
IN
Habeeb Baig,
Sorry, I am unable to access the link you shared. Maybe it has been removed or has restricted access. Can you please share the content that was discussed there?
Thanks
Ravi Sharma
Jabil
US
Ravi, Can you comment further on the parsing objectives? We might be able to advise depending on what you're doing after that.
Pegasystems Inc.
US
Ravi,
I agree with Matthew that we need more information to be truly helpful. What is the business problem you are ultimately trying to solve? There isn't an easy out of the box word parsing API in Pega, so perhaps you would be better served by having us help you find an alternate approach that meets your business needs.
Thanks,
Mike
HSBC
IN
We are trying to parse the data in a .docx file. The data is structured in tables in the Word document, with Yes/No answers as checkboxes. The requirement is that when we upload such a Word document to PRPC, we should be able to parse this information and map it to properties.
Pegasystems Inc.
US
Ravi,
The good news is the data is fairly well structured. While I don't know of a specific API, if docx is an XML file at heart (not really my area), you might be able to use ParseXML rules to digest it and get the values on the clipboard?
Thanks,
Mike
Jabil
US
Yeah, the bad news is all the OOB utilities in this neighborhood are too specific to help with the first step. Good news is that it really isn't that complicated. It is Java, though.
Ravi Sharma, the secret is that the .docx file is just a zip file containing XML. The technique is to identify which zip file entry contains your checkbox, read its XML into a String, parse the string for your values, and hand off to the clipboard. If you wish to keep the Java to an absolute minimum, a small function would take some sort of reference to your file and return the string of the zip file entry. Do everything else in an activity or even a data transform. Everything is done in memory.
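The "small function" idea above could look roughly like the following. This is a minimal sketch in plain Java, not Pega code: the class and method names are made up for illustration, and it assumes you already have the document as a byte array and that the entry is UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: names are illustrative, not part of any Pega API.
class DocxEntryReader {

    /** Returns the text of one zip entry, or null if the archive has no such entry. */
    static String readEntry(byte[] docxBytes, String entryName) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entryName.equals(entry.getName())) {
                    // Copy until end of entry; the entry size may be unknown while streaming.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zis.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    // Assumption: the entry is UTF-8 encoded XML.
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in archive in memory so the sketch runs without a real .docx.
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipped)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<w:t>Hello</w:t>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        System.out.println(readEntry(zipped.toByteArray(), "word/document.xml"));
    }
}
```

Returning null for a missing entry keeps the caller's check simple; an activity could test the result before doing any parsing.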
I happen to have a junior developer working on something very similar right now, except we are performing a template replacement on a source document, re-zipping, and sending back to the browser -- all in memory. Mike Townsend, does Pega care if we share the shape of that solution here? I imagine many people have wished they knew how to do this.
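For readers curious what an in-memory template replacement might involve, here is a rough sketch. It is not the solution mentioned above, just one plausible shape: the class name, method name, and the {{name}} placeholder are all invented for illustration, and it assumes the placeholder sits in word/document.xml as one unbroken string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: copies every zip entry, swapping a placeholder in word/document.xml.
class DocxTemplateFiller {

    static byte[] fill(byte[] docxBytes, String placeholder, String value) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(docxBytes));
             ZipOutputStream zos = new ZipOutputStream(result)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                // Read the whole entry into memory.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    content.write(buf, 0, n);
                }
                byte[] bytes = content.toByteArray();
                if ("word/document.xml".equals(entry.getName())) {
                    // Assumption: UTF-8 XML, and the value is already XML-escaped.
                    String xml = new String(bytes, StandardCharsets.UTF_8);
                    bytes = xml.replace(placeholder, value).getBytes(StandardCharsets.UTF_8);
                }
                // Use a fresh ZipEntry so sizes and checksums are recomputed on write.
                zos.putNextEntry(new ZipEntry(entry.getName()));
                zos.write(bytes);
                zos.closeEntry();
            }
        }
        // The archive is complete only after the ZipOutputStream has been closed.
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipped)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<w:t>Dear {{name}}</w:t>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        byte[] filled = fill(zipped.toByteArray(), "{{name}}", "Ravi");
        System.out.println("Filled archive size: " + filled.length + " bytes");
    }
}
```

One caveat worth knowing: Word sometimes splits text that looks contiguous on screen across several XML runs, so a placeholder typed in Word may not appear as a single string in the XML. Unzip the template and check before relying on a plain string replace.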
Pegasystems Inc.
US
Matthew,
If you have some sample code that solves this problem that other folks could use to kickstart their customization and you want to share, I'm fine with it. To the best of my knowledge, Pega doesn't object either. If I'm wrong, I'm sure someone will chime in and correct me.
Thanks,
Mike
HSBC
IN
Could you kindly share the code for parsing a docx that can be used in an activity? That would be greatly helpful.
Jabil
US
The thing to remember is that MS Office documents are just a collection of XML files zipped up. The Java APIs present a virtual filesystem for the file. See https://docs.oracle.com/javase/7/docs/api/index.html?java/util/zip/package-summary.html
Treat the zip contents like you would a regular file on disk -- don't overcomplicate it. Unless the files are very large, you do not need to write anything to disk in your Pega routines -- everything is in memory.
Step 1: Prepare for the Pega coding by unzipping the docx file. Use a command line utility, or else rename the .docx extension to .zip to more easily use your favorite unzip tool. Whatever works. The Java jar command works well...
> jar -tf test.docx
> jar -xvf test.docx
Step 2: Find the XML file that contains whatever you are interested in. It's just text, so a search tool or IDE will work well. In my simple example the phrase I typed in Word was at word/document.xml.
Step 3: Try this in an IDE like Eclipse or Netbeans. Please be kind to yourself: Unless you are an old pro, don't develop Java inside Pega. Death by a thousand cuts. After you run perform(), exercise zipUp() for fun and create a zip file from thin air.
package replaceme;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class TestClass {

    public static void main(String[] args) {
        TestClass tc = new TestClass();
        tc.perform();
        // tc.zipUp();
    }

    // Reads word/document.xml out of the .docx (a zip archive) and prints it.
    private void perform() {
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(new File("path-to-my-file/test.docx")))) {
            ZipEntry entry;
            String s = null;
            while (null != (entry = zis.getNextEntry())) {
                // System.out.println(entry.getName() + " " + entry.isDirectory());
                if ("word/document.xml".equals(entry.getName())) {
                    // entry.getSize() can be -1 when streaming, and a single read() may
                    // return fewer bytes than requested, so copy until end of entry.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zis.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    // Decode explicitly as UTF-8 rather than the platform default.
                    s = new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            if (s == null) {
                System.out.println("word/document.xml not found");
                return;
            }
            System.out.println(s);
            System.out.println(s.contains("Some text I’m looking for"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Creates a zip file with two small text entries.
    private void zipUp() {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(new File("path-to-create-my-outputfile/testout.zip")))) {
            ZipEntry ze1 = new ZipEntry("foo.txt");
            ZipEntry ze2 = new ZipEntry("bar.txt");
            byte[] b = "This little piggy went to market.".getBytes(StandardCharsets.UTF_8);
            zos.putNextEntry(ze1);
            zos.write(b, 0, b.length);
            zos.closeEntry();
            b = "This little piggy stayed home.".getBytes(StandardCharsets.UTF_8);
            zos.putNextEntry(ze2);
            zos.write(b, 0, b.length);
            zos.closeEntry();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Step 4: Once you understand the above, revise this code to work into your Pega activity. You are responsible for obtaining a page reference to your Word document. For this example I attached a file to a work object and examined the clipboard to find the pzInsKey. I won't belabor how Pega stores binary content... it's all in the developer help.
Pay attention to the file encoding of the Base64 decode call. This could vary depending on your desktop settings. If in doubt, UTF-8 might be forgiving. Also note that Pega has a couple functions for UTF-8 decoding, but they wrap only the purely string-based decode calls and are useless here. Consider writing your own.
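To make the decode step concrete, here is a small sketch in plain Java. It assumes the attachment content is available to you as a Base64 string; the class and method names are invented, and java.util.Base64 requires Java 8 or later, so an older JVM would need a different decoder. The Base64 text itself is plain ASCII, so the charset choice mostly matters later, when the decoded and unzipped bytes are turned into a String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: names are illustrative, not part of any Pega API.
class AttachmentDecoder {

    /** Decodes Base64 text to raw bytes. The MIME decoder tolerates embedded line breaks. */
    static byte[] decode(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text);
    }

    /** Cheap sanity check: zip archives (and so .docx files) start with the bytes 'P' 'K' 3 4. */
    static boolean looksLikeZip(byte[] bytes) {
        return bytes.length >= 4
                && bytes[0] == 'P' && bytes[1] == 'K'
                && bytes[2] == 3 && bytes[3] == 4;
    }

    public static void main(String[] args) {
        // Stand-in for real attachment content: the zip signature followed by two padding bytes.
        String encoded = Base64.getEncoder().encodeToString(new byte[] {'P', 'K', 3, 4, 0, 0});
        byte[] decoded = decode(encoded);
        System.out.println("Looks like a zip: " + looksLikeZip(decoded));
        System.out.println("Plain text looks like a zip: "
                + looksLikeZip("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Checking the signature before unzipping gives an early, readable failure when someone uploads a legacy .doc or some other file type.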
If you haven't done this before, notice the pageContent variable. This Local Variable is declared on the Parameters tab. Then you can use it outright in a Java step and refer to it in other activity steps with the "local." prefix...
[Screenshots of the activity steps appeared here in the original post.]
If you run that activity, the last step will look odd because Pega won't understand how to render that XML. Nevertheless, I was able to see the text from my Word document. As for parsing your Word text and mapping the result to clipboard properties, I recommend regular expressions for the former, and using a local variable in the Java step for the latter.
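As a rough illustration of the regular expression approach, the sketch below pulls text runs and checkbox states out of a document.xml string. The markup it looks for is an assumption: it expects content-control checkboxes written as <w14:checked w14:val="1"/> or "0". Legacy form-field checkboxes use different elements, so unzip your own document and adjust the patterns to what you actually see. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: patterns are assumptions about the markup, verify against your own file.
class CheckboxScanner {

    // Assumed checkbox markup: val "1" means ticked, "0" means unticked.
    private static final Pattern CHECKED =
            Pattern.compile("<w14:checked\\s+w14:val=\"([01])\"\\s*/>");

    // Matches <w:t> and <w:t xml:space="preserve">, but not <w:tbl>, <w:tr> or <w:tc>.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    /** Checkbox states in document order. */
    static List<Boolean> checkboxStates(String documentXml) {
        List<Boolean> states = new ArrayList<>();
        Matcher m = CHECKED.matcher(documentXml);
        while (m.find()) {
            states.add("1".equals(m.group(1)));
        }
        return states;
    }

    /** Raw text of each run in document order (still XML-escaped). */
    static List<String> textRuns(String documentXml) {
        List<String> runs = new ArrayList<>();
        Matcher m = TEXT_RUN.matcher(documentXml);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Simplified stand-in markup, not a complete Word document.
        String xml = "<w:tbl><w:tr><w:tc><w:t>Approved</w:t></w:tc>"
                + "<w14:checkbox><w14:checked w14:val=\"1\"/></w14:checkbox></w:tr></w:tbl>";
        System.out.println(textRuns(xml) + " " + checkboxStates(xml));
    }
}
```

Because both lists come back in document order, a fixed-format table can be mapped to properties by position. If the layout is likely to change, a real XML parser would be a sturdier choice than regular expressions.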
Illustrating all this in an Activity is easiest to understand, but personally I would put the Java in one or two Pega Functions. Maybe work in safety sanity checks for the size of the file or content.
Cheers!
GovCIO
US
Ravi and I discussed this requirement offline, and I asked him to post the query here because I have not worked on such a requirement before. I would also like to see the Java code, or whatever code solves his current issue. I will make a note of it in case it is useful for my future requirements.
Thanks,
Ravi Kumar.