Question


Pegasystems Inc.
US
Last activity: 18 Jul 2017 10:57 EDT
How to read/parse Excel worksheet saved as MHT/MHTML
Our Pega 7 app is using Microsoft Excel files for inbound integration files from United Parcel Service (UPS). They send us files in 3 different versions of Excel, .XLS, .XLSX, and a 3rd hybrd *.XLS file that is actually an HTML document (MHTML)
We are using the Apache POI libraries in our app to read and parse the XLS and XLSX files.
We are unable to read/parse/consume the 3rd file type with the HTML content in an XLS wrapper.
Does anyone have any experience with consuming a MHTML/XLS file in a Peaga 7 app?
***Updated by moderator: Lochan to close post***
This post has been archived for educational purposes. Contents and links will no longer be updated. If you have the same/similar question, please write a new post.
-
Like (0)
-
Share this page Facebook Twitter LinkedIn Email Copying... Copied!
Updated: 2 Jul 2015 11:37 EDT


Pegasystems Inc.
GB
Hi Bob,
I don't believe there is an OOTB PRPC method for this.....and also I wasn't able to come up with an answer here - but I'll put down what I looked into - in case it gives somebody else another idea.
In summary: it looks to be a bit of a messy business here - the MHTM format generated by EXCEL looks to be non-standard HTML - but there is a perhaps a glimmer of hope by using this StackOver Flow post : parsing - How to read or parse MHTML (.mht) files in java - Stack Overflow
Cheers
John
NOTES:
I tried opening an EXCEL-Generated 'MHT' file using Apache POI (just in 'Netbeans' here - I'm trying totally outside of PRPC to start with).
This failed - as expected:
[...] InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht"); Workbook wb = WorkbookFactory.create(inp); [..]
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream
at com.pega.gcs.Main.<init>(Main.java:69)
Hi Bob,
I don't believe there is an OOTB PRPC method for this.....and also I wasn't able to come up with an answer here - but I'll put down what I looked into - in case it gives somebody else another idea.
In summary: it looks to be a bit of a messy business here - the MHTM format generated by EXCEL looks to be non-standard HTML - but there is a perhaps a glimmer of hope by using this StackOver Flow post : parsing - How to read or parse MHTML (.mht) files in java - Stack Overflow
Cheers
John
NOTES:
I tried opening an EXCEL-Generated 'MHT' file using Apache POI (just in 'Netbeans' here - I'm trying totally outside of PRPC to start with).
This failed - as expected:
[...] InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht"); Workbook wb = WorkbookFactory.create(inp); [..]
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream
at com.pega.gcs.Main.<init>(Main.java:69)
at com.pega.gcs.Main.main(Main.java:74)
Caused by: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:177)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:143)
at com.pega.gcs.Main.<init>(Main.java:54)
... 1 more
I then found a few links regarding using 'JTIDY' (which does ship with PRPC) - but It seems the format that my Excel 2010 client created was found to be too proprietary for it to actually work.
Just for reference here's the code I tried:
package com.pega.gcs; import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.InputStream; import java.io.StringWriter; import javax.xml.transform.Transformer; import javax.xml.transform.TransformerException; import javax.xml.transform.TransformerFactory; import javax.xml.transform.dom.DOMSource; import javax.xml.transform.stream.StreamResult; import org.w3c.dom.Document; import org.w3c.tidy.Tidy; public class Main { public String getStringFromDocument(Document doc) { try { DOMSource domSource = new DOMSource(doc); StringWriter writer = new StringWriter(); StreamResult result = new StreamResult(writer); TransformerFactory tf = TransformerFactory.newInstance(); Transformer transformer = tf.newTransformer(); transformer.transform(domSource, result); return writer.toString(); } catch (TransformerException ex) { ex.printStackTrace(); return null; } } Main() { try { Tidy tidy = new Tidy(); InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht"); FileOutputStream fileOut = new FileOutputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\tidy.txt"); Document doc = tidy.parseDOM(inp, null); System.out.println(getStringFromDocument(doc)); //tidy.parse( inp, fileOut); } catch (Exception e) { throw new RuntimeException(e); } } public static void main(String[] args) { Main main = new Main(); } }
This generates the following output from JTidy:
line 12 column 1 - Warning: missing <!DOCTYPE> declaration
line 12 column 1 - Warning: plain text isn't allowed in <head> elements
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: <html> unexpected or duplicate quote mark
line 12 column 2 - Warning: unknown attribute "xmlns:x"
line 12 column 2 - Warning: unknown attribute "xmlns:o"
line 12 column 2 - Warning: unknown attribute "xmlns:v"
line 12 column 2 - Warning: discarding unexpected <html>
line 17 column 2 - Warning: <head> isn't allowed in <body> elements
line 18 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 18 column 1 - Warning: <meta> attribute with missing trailing quote mark
line 18 column 1 - Warning: unknown attribute "workbook"
line 18 column 1 - Warning: <meta> lacks "content" attribute
line 18 column 1 - Warning: <meta> isn't allowed in <body> elements
line 19 column 2 - Warning: <meta> unexpected or duplicate quote mark
line 19 column 2 - Warning: <meta> attribute with missing trailing quote mark
line 19 column 2 - Warning: <meta> isn't allowed in <body> elements
line 21 column 2 - Warning: <meta> isn't allowed in <body> elements
line 22 column 2 - Warning: <meta> unexpected or duplicate quote mark
line 22 column 2 - Warning: <meta> attribute with missing trailing quote mark
line 22 column 2 - Warning: unknown attribute "excel"
line 22 column 2 - Warning: <meta> isn't allowed in <body> elements
line 23 column 2 - Warning: <link> unexpected or duplicate quote mark
line 23 column 2 - Warning: <link> unexpected or duplicate quote mark
line 23 column 2 - Warning: <link> isn't allowed in <body> elements
line 25 column 2 - Warning: <link> unexpected or duplicate quote mark
line 25 column 2 - Warning: <link> unexpected or duplicate quote mark
line 25 column 2 - Warning: <link> unexpected or duplicate quote mark
line 25 column 2 - Warning: <link> unexpected or duplicate quote mark
line 25 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""
line 25 column 2 - Warning: <link> isn't allowed in <body> elements
line 26 column 2 - Warning: <link> unexpected or duplicate quote mark
line 26 column 2 - Warning: <link> unexpected or duplicate quote mark
line 26 column 2 - Warning: <link> unexpected or duplicate quote mark
line 26 column 2 - Warning: <link> unexpected or duplicate quote mark
line 26 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""
line 26 column 2 - Warning: <link> Anchor "3D"shLink"" already defined
line 26 column 2 - Warning: <link> isn't allowed in <body> elements
line 27 column 2 - Warning: <link> unexpected or duplicate quote mark
line 27 column 2 - Warning: <link> unexpected or duplicate quote mark
line 27 column 2 - Warning: <link> unexpected or duplicate quote mark
line 27 column 2 - Warning: <link> unexpected or duplicate quote mark
line 27 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""
line 27 column 2 - Warning: <link> Anchor "3D"shLink"" already defined
line 27 column 2 - Warning: <link> isn't allowed in <body> elements
line 29 column 2 - Warning: <link> unexpected or duplicate quote mark
line 29 column 2 - Warning: <link> unexpected or duplicate quote mark
line 29 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""
line 29 column 2 - Warning: <link> Anchor "3D"shLink"" already defined
line 29 column 2 - Warning: <link> isn't allowed in <body> elements
line 31 column 2 - Warning: <script> unexpected or duplicate quote mark
line 31 column 2 - Warning: <script> unexpected or duplicate quote mark
line 31 column 2 - Warning: <script> lacks "type" attribute
InputStream: Document content looks like HTML proprietary
71 warnings, 41 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
The code is actually a hybrid of various stolen code from around the Web: the 'Document' DOM actually does output - but it is just the 'top-level' of the doc - the 'sheets' (where the actual useful data is) is totally missing.
Here's an abridged version (I removed a lot of the javascript here):
<html> <head> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" name="generator"> <meta workbook="workbook" name="3D"Excel"> <meta charset="3Dwindows-125=" content="3D"text/html;" http-equiv="3DContent-Type"> <meta content="3DExcel.Sheet" name="3DProgId"> <meta excel="excel" content="3D"Microsoft" name="3DGenerator"> <link href="3D%22Book1_files/filelist.xml%22" rel="3DFile-List"> <link href="3D%22Book1_files/sheet001.htm%22" id="3D"shLink""> <link href="3D%22Book1_files/sheet002.htm%22" id="3D"shLink""> <link href="3D%22Book1_files/sheet003.htm%22" id="3D"shLink""> <link id="3D"shLink""> <title></title> </head> <body>MIME-Version: 1.0 X-Document-Type: Workbook Content-Type: multipart/related; boundary="----=_NextPart_01D0B4D9.B80124D0" This document is a Single File Web Page, also known as a Web Archive file. If you are seeing this message, your browser or editor doesn't support Web Archive files. Please download a browser that supports Web Archive, such as Windows® Internet Explorer®. ------=_NextPart_01D0B4D9.B80124D0 Content-Location: file:///C:/CE594991/Book1.htm Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="windows-1252" <script language="3D"JavaScript""><!-- var c_lTabs=3D3; [...] </script> <!--[if gte mso 9]><xml> <x:ExcelWorkbook> <x:ExcelWorksheets> <x:ExcelWorksheet> <x:Name>Sheet1</x:Name> <x:WorksheetSource HRef=3D"Book1_files/sheet001.htm"/> </x:ExcelWorksheet> <x:ExcelWorksheet> <x:Name>Sheet2</x:Name> <x:WorksheetSource HRef=3D"Book1_files/sheet002.htm"/> </x:ExcelWorksheet> <x:ExcelWorksheet> <x:Name>Sheet3</x:Name> <x:WorksheetSource HRef=3D"Book1_files/sheet003.htm"/> </x:ExcelWorksheet> </x:ExcelWorksheets> <x:Stylesheet HRef=3D"Book1_files/stylesheet.css"/> <x:WindowHeight>9525</x:WindowHeight> <x:WindowWidth>22995</x:WindowWidth> <x:WindowTopX>480</x:WindowTopX> <x:WindowTopY>150</x:WindowTopY> <x:ProtectStructure>False</x:ProtectStructure> <x:ProtectWindows>False</x:ProtectWindows> </x:ExcelWorkbook> </xml><![endif]--> </body> </html>
The 'glimmer' of hope I mentioned above may come from the accepted answer here : http://stackoverflow.com/questions/3230305/how-to-read-or-parse-mhtml-mht-files-in-java
I haven't tried this yet : but it looks possible that the code in that post above should be able to extract out the various MIME sections of the input document.
Then (using some flakey assumptions...) looking for each of the 'sheetxxx.htm' files in the result - should give a method of extracting out the data in each sheet - which looks like it might be well-formed XML (HTML) - in a TABLE format....
So in theory - it should be amenable to a bit of XSLT transformation / XML Parse rules in PRPC....
------=_NextPart_01D0B4E3.D4105380
Content-Location: file:///C:/D0594992/book2_files/sheet001.htm
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="windows-1252"
<html xmlns:o=3D"urn:schemas-microsoft-com:office:office" xmlns:x=3D"urn:schemas-microsoft-com:office:excel" xmlns=3D"http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv=3DContent-Type content=3D"text/html; charset=3Dwindows-125= 2"> [...] <body link=3Dblue vlink=3Dpurple> <table border=3D0 cellpadding=3D0 cellspacing=3D0 width=3D64 style=3D'borde= r-collapse: collapse;table-layout:fixed;width:48pt'> <col width=3D64 style=3D'width:48pt'> <tr height=3D20 style=3D'height:15.0pt'> <td height=3D20 align=3Dright width=3D64 style=3D'height:15.0pt;width:48p= t'>123</td> </tr> <tr height=3D20 style=3D'height:15.0pt'> <td height=3D20 align=3Dright style=3D'height:15.0pt'>123</td> </tr> <![if supportMisalignedColumns]> <tr height=3D0 style=3D'display:none'> <td width=3D64 style=3D'width:48pt'></td> </tr> <![endif]> </table>
APPENDIX : This is my project 'Maven' file for the (failed) attempt at using JTIDY and POI here - so you can see what versions of libraries were tried.
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.pega.gcs</groupId> <artifactId>exceltest</artifactId> <version>1.0-SNAPSHOT</version> <packaging>jar</packaging> <dependencies> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-ooxml</artifactId> <version>3.12</version> </dependency> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-ooxml-schemas</artifactId> <version>3.12</version> </dependency> <dependency> <groupId>net.sf.jtidy</groupId> <artifactId>jtidy</artifactId> <version>r938</version> </dependency> </dependencies> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <maven.compiler.source>1.7</maven.compiler.source> <maven.compiler.target>1.7</maven.compiler.target> </properties> </project>
Updated: 2 Jul 2015 11:59 EDT


Pegasystems Inc.
GB
A second idea: also not very pretty....
You could create a Macro Enabled Excel file and store it within PRPC as a Binary File (along with the MHTM files) ....then have the user Launch this file.
Using VBA - you can get EXCEL to perform the conversion for you - like this:
Sub Convert() Workbooks.Open Filename:="Book1.mht" ActiveWorkbook.SaveAs Filename:="Book1.xlsx", FileFormat:=xlOpenXMLWorkbook, CreateBackup:=False End Sub
But you would also have to put extra code in there to fetch the 'Book1.mht' file to a local directory (which is what the 'Edit In Excel' function does for (say) editing Data Tables) : it's a fairly complex task as you have to deal with HTTP/HTTPs, proxies etc; and then have the converted version re-uploaded to PRPC....
This would also almost certainly require you to sign the EXCEL Macro code with a Certificate...
And possibly use the 'dotNet' connector to call a separately written .NET program (providing a Webservice) that could fire up a copy of EXCEL and poke it as an object to perform the conversion.
http://<host>:<port>/prhelp/procomhelpmain.htm#rule-/rule-connect-/rule-connect-dotnet/main.htm


Pegasystems Inc.
GB
And you could even (and for the record, I don't like this solution at all!) call Powershell (again, assuming PRPC running on a Windows Box) via a Java Step, to do something like:
$x = New-Object -com Excel.Application
# $x.visible = $True
$x.Workbooks.Open("c:\Test.mht")
$x.ActiveWorkbook.SaveAs("c:\Book1.xlsx", [Microsoft.Office.Interop.Excel.XlFileFormat]::xlOpenXMLWorkbook)
$x.quit()
To convert the format before feeding it the standard Apache POI code....


Pegasystems
US
I think if you try the datatables feature or prpc, in particular the "edit table" feature, it passes the contents to excel, and then consumes the result after the user edits the excel sheet.
So, try tracing that and see what the consummation part does to get ideas. /Eric