Need to read data from the doc and pdf

Question

redds9

Member since 2014

2 posts

PEGA

Posted: Nov 9, 2016

Last activity: Nov 10, 2016

Posted: 9 Nov 2016 2:05 EST
Last activity: 10 Nov 2016 9:30 EST

Closed


Hi,

I have a requirement where different types of files (doc, docx, pdf) arrive on a filesystem path. Based on a filename input, I pull the file from the server, read its contents, and copy them to a property whose control is a rich text editor. I am able to read the file contents; however, the formatting, alignment, images, and tables come through as plain text, so the data is displayed without any formatting or alignment. I read the file from Java code because we have to pick only a specific file, which cannot be achieved with a file listener.

Please suggest an approach, or let me know whether I need to change my code, copied below.

This is for PDF.

com.pega.apache.pdfbox.pdmodel.PDDocument pdDoc = null;
ParameterPage pp = tools.getParameterPage();
try {
    // Full path of the file to read, passed in as an activity parameter
    String filePath = pp.getString("FullFilePathName");
    PRFile prfCheck = new PRFile(filePath);
    // Parse the PDF from a PRInputStream opened on the file
    com.pega.apache.pdfbox.pdfparser.PDFParser parser = new com.pega.apache.pdfbox.pdfparser.PDFParser(new PRInputStream(prfCheck));
    parser.parse();
    com.pega.apache.pdfbox.cos.COSDocument cosDoc = parser.getDocument();
    pdDoc = new com.pega.apache.pdfbox.pdmodel.PDDocument(cosDoc);
    // PDFTextStripper returns plain text only; formatting, tables and images are not carried over
    com.pega.apache.pdfbox.util.PDFTextStripper pdfStripper = new com.pega.apache.pdfbox.util.PDFTextStripper();
    String parsedText = pdfStripper.getText(pdDoc);
    tools.putParamValue("ContentSourceAuthored", parsedText);
} catch (Exception e) {
    throw new PRRuntimeException("Unable to read file: " + e);
} finally {
    // Close the document so its resources are released even when parsing fails
    if (pdDoc != null) {
        try { pdDoc.close(); } catch (java.io.IOException ignored) { }
    }
}
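
The text returned by PDFTextStripper is plain text, so line and paragraph breaks tend to collapse once it is shown in a control that renders HTML. As a partial workaround, the extracted text could be wrapped in basic markup before it is stored. The sketch below uses only the JDK and assumes the rich text editor property holds HTML; the class and method names (PlainTextToHtml, toHtml, escapeHtml) are hypothetical and not part of PRPC or PDFBox. It keeps paragraph and line breaks only and does not recover tables, images, or alignment.

```java
public class PlainTextToHtml {

    /** Escapes the characters that have special meaning in HTML. */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps blank-line-separated blocks in <p> and turns single line breaks into <br/>. */
    static String toHtml(String plainText) {
        if (plainText == null) {
            return "";
        }
        // Normalize Windows and old Mac line endings to \n
        String normalized = plainText.replace("\r\n", "\n").replace('\r', '\n').trim();
        if (normalized.isEmpty()) {
            return "";
        }
        StringBuilder html = new StringBuilder();
        // One or more blank lines separate paragraphs
        for (String block : normalized.split("\n\\s*\n")) {
            html.append("<p>")
                .append(escapeHtml(block.trim()).replace("\n", "<br/>"))
                .append("</p>");
        }
        return html.toString();
    }
}
```

In the activity above, this would amount to passing PlainTextToHtml.toHtml(parsedText) to tools.putParamValue instead of parsedText, assuming the helper is made available to the Java step.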


Java and Activities


Posted: 10 Nov 2016 8:50 EST

JOHNPW_GCS replied to redds9


Since this isn't something PRPC offers out of the box (OOTB), and you are already using the third-party library PDFBox (which does ship with PRPC), I think you should also check more general forums for advice on this.

Stack Overflow has a few hits on extracting images and other content from PDFs using PDFBox, for instance: http://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox

The same goes for the Word file types you mention (doc, docx): you can use the Apache POI library for these, which also ships with PRPC (in this case I believe it is actually a repackaged version of the library).

Sorry I couldn't provide more specific information here. Maybe somebody else has done something similar with PRPC and could help out?
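
For the Word files mentioned above, it may help to see what a library such as Apache POI is reading. A .docx file is a ZIP archive whose main body is stored in the entry word/document.xml, with paragraphs as w:p elements and text runs as w:t elements. The sketch below is not POI; it is a JDK-only illustration that pulls the paragraph texts out of that entry. The class and method names (DocxParagraphs, readDocumentXml, paragraphs) are hypothetical. It ignores styles, images, and table structure, may repeat text that sits in text boxes nested inside a paragraph, and does not handle the binary .doc format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxParagraphs {

    // Namespace of the WordprocessingML elements (w:p, w:r, w:t)
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the bytes of word/document.xml from a .docx stream, or null if the entry is absent. */
    static byte[] readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        }
        return null;
    }

    /** Returns the text of every w:p element, in document order. */
    static List<String> paragraphs(byte[] documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(documentXml));

            List<String> result = new ArrayList<>();
            NodeList paras = xml.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paras.getLength(); i++) {
                // Concatenate all text runs that belong to this paragraph
                NodeList texts = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    sb.append(texts.item(j).getTextContent());
                }
                result.add(sb.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable word/document.xml", e);
        }
    }
}
```

A call such as DocxParagraphs.paragraphs(DocxParagraphs.readDocumentXml(inputStream)) would return one string per paragraph, which could then be wrapped in markup for the rich text editor. For anything beyond plain paragraphs, and for .doc files, POI's XWPF (docx) and HWPF (doc) components are the more realistic route.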
