How to read/parse Excel worksheet saved as MHT/MHTML

Question

kangb

Member since 2014

6 posts

PEGA

Posted: Jul 2, 2015

Last activity: Jul 18, 2017

Posted: 2 Jul 2015 9:58 EDT
Last activity: 18 Jul 2017 10:57 EDT

Closed

Solved

How to read/parse Excel worksheet saved as MHT/MHTML

Report

Our Pega 7 app is using Microsoft Excel files for inbound integration files from United Parcel Service (UPS). They send us files in 3 different versions of Excel, .XLS, .XLSX, and a 3rd hybrd *.XLS file that is actually an HTML document (MHTML)

We are using the Apache POI libraries in our app to read and parse the XLS and XLSX files.

We are unable to read/parse/consume the 3rd file type with the HTML content in an XLS wrapper.

Does anyone have any experience with consuming a MHTML/XLS file in a Peaga 7 app?

***Updated by moderator: Lochan to close post***
This post has been archived for educational purposes. Contents and links will no longer be updated. If you have the same/similar question, please write a new post.

To see attachments, please log in.

Data Integration

Like (0)
Share this page Facebook Twitter LinkedIn Email Copying... Copied!

Posted: 10 years ago

Updated: 10 years ago

Posted: 2 Jul 2015 11:35 EDT
Updated: 2 Jul 2015 11:37 EDT

JOHNPW_GCS replied to kangb

Report

Hi Bob,

I don't believe there is an OOTB PRPC method for this.....and also I wasn't able to come up with an answer here - but I'll put down what I looked into - in case it gives somebody else another idea.

In summary: it looks to be a bit of a messy business here - the MHTM format generated by EXCEL looks to be non-standard HTML - but there is a perhaps a glimmer of hope by using this StackOver Flow post : parsing - How to read or parse MHTML (.mht) files in java - Stack Overflow

Cheers

John

NOTES:

I tried opening an EXCEL-Generated 'MHT' file using Apache POI (just in 'Netbeans' here - I'm trying totally outside of PRPC to start with).

This failed - as expected:

[...]
InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht");
Workbook wb = WorkbookFactory.create(inp);
[..]

Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

at com.pega.gcs.Main.<init>(Main.java:69)

Hi Bob,

I don't believe there is an OOTB PRPC method for this.....and also I wasn't able to come up with an answer here - but I'll put down what I looked into - in case it gives somebody else another idea.

Cheers

John

NOTES:

I tried opening an EXCEL-Generated 'MHT' file using Apache POI (just in 'Netbeans' here - I'm trying totally outside of PRPC to start with).

This failed - as expected:

[...]
InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht");
Workbook wb = WorkbookFactory.create(inp);
[..]

Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

    at com.pega.gcs.Main.<init>(Main.java:69)

    at com.pega.gcs.Main.main(Main.java:74)

Caused by: java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:177)

    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:143)

    at com.pega.gcs.Main.<init>(Main.java:54)

    ... 1 more

I then found a few links regarding using 'JTIDY' (which does ship with PRPC) - but It seems the format that my Excel 2010 client created was found to be too proprietary for it to actually work.

Just for reference here's the code I tried:

package com.pega.gcs;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class Main {

    public String getStringFromDocument(Document doc) {
        try {
            DOMSource domSource = new DOMSource(doc);
            StringWriter writer = new StringWriter();
            StreamResult result = new StreamResult(writer);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            return writer.toString();
        } catch (TransformerException ex) {
            ex.printStackTrace();
            return null;
        }
    }
    Main() {
        try {
            Tidy tidy = new Tidy();
            InputStream inp = new FileInputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\Book1.mht");
            FileOutputStream fileOut = new FileOutputStream("C:\\projects\\java\\exceltest\\src\\main\\resources\\com\\pega\\gcs\\tidy.txt");
            Document doc = tidy.parseDOM(inp, null);
            System.out.println(getStringFromDocument(doc));

            //tidy.parse( inp, fileOut);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Main main = new Main();

    }
}

This generates the following output from JTidy:

line 12 column 1 - Warning: missing <!DOCTYPE> declaration

line 12 column 1 - Warning: plain text isn't allowed in <head> elements

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: <html> unexpected or duplicate quote mark

line 12 column 2 - Warning: unknown attribute "xmlns:x"

line 12 column 2 - Warning: unknown attribute "xmlns:o"

line 12 column 2 - Warning: unknown attribute "xmlns:v"

line 12 column 2 - Warning: discarding unexpected <html>

line 17 column 2 - Warning: <head> isn't allowed in <body> elements

line 18 column 1 - Warning: <meta> unexpected or duplicate quote mark

line 18 column 1 - Warning: <meta> attribute with missing trailing quote mark

line 18 column 1 - Warning: unknown attribute "workbook"

line 18 column 1 - Warning: <meta> lacks "content" attribute

line 18 column 1 - Warning: <meta> isn't allowed in <body> elements

line 19 column 2 - Warning: <meta> unexpected or duplicate quote mark

line 19 column 2 - Warning: <meta> attribute with missing trailing quote mark

line 19 column 2 - Warning: <meta> isn't allowed in <body> elements

line 21 column 2 - Warning: <meta> isn't allowed in <body> elements

line 22 column 2 - Warning: <meta> unexpected or duplicate quote mark

line 22 column 2 - Warning: <meta> attribute with missing trailing quote mark

line 22 column 2 - Warning: unknown attribute "excel"

line 22 column 2 - Warning: <meta> isn't allowed in <body> elements

line 23 column 2 - Warning: <link> unexpected or duplicate quote mark

line 23 column 2 - Warning: <link> unexpected or duplicate quote mark

line 23 column 2 - Warning: <link> isn't allowed in <body> elements

line 25 column 2 - Warning: <link> unexpected or duplicate quote mark

line 25 column 2 - Warning: <link> unexpected or duplicate quote mark

line 25 column 2 - Warning: <link> unexpected or duplicate quote mark

line 25 column 2 - Warning: <link> unexpected or duplicate quote mark

line 25 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""

line 25 column 2 - Warning: <link> isn't allowed in <body> elements

line 26 column 2 - Warning: <link> unexpected or duplicate quote mark

line 26 column 2 - Warning: <link> unexpected or duplicate quote mark

line 26 column 2 - Warning: <link> unexpected or duplicate quote mark

line 26 column 2 - Warning: <link> unexpected or duplicate quote mark

line 26 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""

line 26 column 2 - Warning: <link> Anchor "3D"shLink"" already defined

line 26 column 2 - Warning: <link> isn't allowed in <body> elements

line 27 column 2 - Warning: <link> unexpected or duplicate quote mark

line 27 column 2 - Warning: <link> unexpected or duplicate quote mark

line 27 column 2 - Warning: <link> unexpected or duplicate quote mark

line 27 column 2 - Warning: <link> unexpected or duplicate quote mark

line 27 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""

line 27 column 2 - Warning: <link> Anchor "3D"shLink"" already defined

line 27 column 2 - Warning: <link> isn't allowed in <body> elements

line 29 column 2 - Warning: <link> unexpected or duplicate quote mark

line 29 column 2 - Warning: <link> unexpected or duplicate quote mark

line 29 column 2 - Warning: <link> attribute "id" has invalid value "3D"shLink""

line 29 column 2 - Warning: <link> Anchor "3D"shLink"" already defined

line 29 column 2 - Warning: <link> isn't allowed in <body> elements

line 31 column 2 - Warning: <script> unexpected or duplicate quote mark

line 31 column 2 - Warning: <script> unexpected or duplicate quote mark

line 31 column 2 - Warning: <script> lacks "type" attribute

InputStream: Document content looks like HTML proprietary

71 warnings, 41 errors were found!

This document has errors that must be fixed before

using HTML Tidy to generate a tidied up version.

The code is actually a hybrid of various stolen code from around the Web: the 'Document' DOM actually does output - but it is just the 'top-level' of the doc - the 'sheets' (where the actual useful data is) is totally missing.

Here's an abridged version (I removed a lot of the javascript here):

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" name="generator">
<meta workbook="workbook" name="3D&quot;Excel">
<meta charset="3Dwindows-125=" content="3D&quot;text/html;" http-equiv="3DContent-Type">
<meta content="3DExcel.Sheet" name="3DProgId">
<meta excel="excel" content="3D&quot;Microsoft" name="3DGenerator">
<link href="3D%22Book1_files/filelist.xml%22" rel="3DFile-List">
<link href="3D%22Book1_files/sheet001.htm%22" id="3D&quot;shLink&quot;">
<link href="3D%22Book1_files/sheet002.htm%22" id="3D&quot;shLink&quot;">
<link href="3D%22Book1_files/sheet003.htm%22" id="3D&quot;shLink&quot;">
<link id="3D&quot;shLink&quot;">
<title></title>
</head>
<body>MIME-Version: 1.0 X-Document-Type: Workbook Content-Type: multipart/related; boundary="----=_NextPart_01D0B4D9.B80124D0" This document is a Single File Web Page, also known as a Web Archive file. If you are seeing this message, your browser or editor doesn't support Web Archive files. Please download a browser that supports Web Archive, such as Windows&reg; Internet Explorer&reg;. ------=_NextPart_01D0B4D9.B80124D0 Content-Location: file:///C:/CE594991/Book1.htm Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="windows-1252"            <script language="3D&quot;JavaScript&quot;"><!--
var c_lTabs=3D3;
[...]
</script>
<!--[if gte mso 9]><xml>
<x:ExcelWorkbook>
  <x:ExcelWorksheets>
   <x:ExcelWorksheet>
    <x:Name>Sheet1</x:Name>
    <x:WorksheetSource HRef=3D"Book1_files/sheet001.htm"/>
   </x:ExcelWorksheet>
   <x:ExcelWorksheet>
    <x:Name>Sheet2</x:Name>
    <x:WorksheetSource HRef=3D"Book1_files/sheet002.htm"/>
   </x:ExcelWorksheet>
   <x:ExcelWorksheet>
    <x:Name>Sheet3</x:Name>
    <x:WorksheetSource HRef=3D"Book1_files/sheet003.htm"/>
   </x:ExcelWorksheet>
  </x:ExcelWorksheets>
  <x:Stylesheet HRef=3D"Book1_files/stylesheet.css"/>
  <x:WindowHeight>9525</x:WindowHeight>
  <x:WindowWidth>22995</x:WindowWidth>
  <x:WindowTopX>480</x:WindowTopX>
  <x:WindowTopY>150</x:WindowTopY>
  <x:ProtectStructure>False</x:ProtectStructure>
  <x:ProtectWindows>False</x:ProtectWindows>
</x:ExcelWorkbook>
</xml><![endif]-->
</body>
</html>

The 'glimmer' of hope I mentioned above may come from the accepted answer here : http://stackoverflow.com/questions/3230305/how-to-read-or-parse-mhtml-mht-files-in-java

I haven't tried this yet : but it looks possible that the code in that post above should be able to extract out the various MIME sections of the input document.

Then (using some flakey assumptions...) looking for each of the 'sheetxxx.htm' files in the result - should give a method of extracting out the data in each sheet - which looks like it might be well-formed XML (HTML) - in a TABLE format....

So in theory - it should be amenable to a bit of XSLT transformation / XML Parse rules in PRPC....

------=_NextPart_01D0B4E3.D4105380

Content-Location: file:///C:/D0594992/book2_files/sheet001.htm

Content-Transfer-Encoding: quoted-printable

Content-Type: text/html; charset="windows-1252"

<html xmlns:o=3D"urn:schemas-microsoft-com:office:office"
xmlns:x=3D"urn:schemas-microsoft-com:office:excel"
xmlns=3D"http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=3DContent-Type content=3D"text/html; charset=3Dwindows-125=
2">
[...]
<body link=3Dblue vlink=3Dpurple>

<table border=3D0 cellpadding=3D0 cellspacing=3D0 width=3D64 style=3D'borde=
r-collapse:
collapse;table-layout:fixed;width:48pt'>
<col width=3D64 style=3D'width:48pt'>
<tr height=3D20 style=3D'height:15.0pt'>
  <td height=3D20 align=3Dright width=3D64 style=3D'height:15.0pt;width:48p=
t'>123</td>
</tr>
<tr height=3D20 style=3D'height:15.0pt'>
  <td height=3D20 align=3Dright style=3D'height:15.0pt'>123</td>
</tr>
<![if supportMisalignedColumns]>
<tr height=3D0 style=3D'display:none'>
  <td width=3D64 style=3D'width:48pt'></td>
</tr>
<![endif]>
</table>

APPENDIX : This is my project 'Maven' file for the (failed) attempt at using JTIDY and POI here - so you can see what versions of libraries were tried.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.pega.gcs</groupId>
    <artifactId>exceltest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <dependencies>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.12</version>
        </dependency>
        <dependency>
            <groupId>net.sf.jtidy</groupId>
            <artifactId>jtidy</artifactId>
            <version>r938</version>
        </dependency>
    </dependencies>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
</project>

Show Less

To see attachments, please log in.

Like (0)

Posted: 10 years ago

Updated: 10 years ago

Posted: 2 Jul 2015 11:57 EDT
Updated: 2 Jul 2015 11:59 EDT

JOHNPW_GCS replied to kangb

Report

A second idea: also not very pretty....

You could create a Macro Enabled Excel file and store it within PRPC as a Binary File (along with the MHTM files) ....then have the user Launch this file.

Using VBA - you can get EXCEL to perform the conversion for you - like this:

Sub Convert()
    Workbooks.Open Filename:="Book1.mht"
    ActiveWorkbook.SaveAs Filename:="Book1.xlsx", FileFormat:=xlOpenXMLWorkbook, CreateBackup:=False
End Sub

But you would also have to put extra code in there to fetch the 'Book1.mht' file to a local directory (which is what the 'Edit In Excel' function does for (say) editing Data Tables) : it's a fairly complex task as you have to deal with HTTP/HTTPs, proxies etc; and then have the converted version re-uploaded to PRPC....

This would also almost certainly require you to sign the EXCEL Macro code with a Certificate...

And possibly use the 'dotNet' connector to call a separately written .NET program (providing a Webservice) that could fire up a copy of EXCEL and poke it as an object to perform the conversion.

http://<host>:<port>/prhelp/procomhelpmain.htm#rule-/rule-connect-/rule-connect-dotnet/main.htm

To see attachments, please log in.

Like (0)

Posted: 10 years ago

Posted: 3 Jul 2015 10:09 EDT

JOHNPW_GCS replied to JOHNPW_GCS

Report

And you could even (and for the record, I don't like this solution at all!) call Powershell (again, assuming PRPC running on a Windows Box) via a Java Step, to do something like:

$x = New-Object -com Excel.Application

# $x.visible = $True

$x.Workbooks.Open("c:\Test.mht")

$x.ActiveWorkbook.SaveAs("c:\Book1.xlsx", [Microsoft.Office.Interop.Excel.XlFileFormat]::xlOpenXMLWorkbook)

$x.quit()

To convert the format before feeding it the standard Apache POI code....

To see attachments, please log in.

Like (0)

Posted: 10 years ago

Posted: 2 Jul 2015 14:17 EDT

ericosman_GCS replied to kangb

Report

I think if you try the datatables feature or prpc, in particular the "edit table" feature, it passes the contents to excel, and then consumes the result after the user edits the excel sheet.

So, try tracing that and see what the consummation part does to get ideas. /Eric

To see attachments, please log in.

Like (0)

Question

How to read/parse Excel worksheet saved as MHT/MHTML

Need help or want to help others?

Experience the benefits of Support Center when you log in.

Question

How to read/parse Excel worksheet saved as MHT/MHTML

Related content:

Need help or want to help others?

Experience the benefits of Support Center when you log in.

We'd prefer it if you saw us at our best.