Discussion

Areteans
Areteans
NL
Areteans
Posted: Feb 5, 2019
Last activity: Feb 5, 2019
Last activity: 5 Feb 2019 11:59 EST
Apache POI – Parsing Excel Workbook
Requirement :
The requirement was to parse an excel file which is in tabular format but number of columns, column names can be different for each excel, i.e. an unstructured excel file.
Since API activities like pxParseExcelFile , MSOParseExcel works with only excel files with a predefined structure, we needed a custom solution to fulfill our requirement. We have used Apache POI api to implement this.
Notes :
Apache POI supports both XLS and XLSX type of files.
API Docs for Apache POI:
- The solution used only XSSFWorkbook class to parse XLSX file but the same code can be extended with HSSFWorkbook to work with XLS files too.
- The Java code has been attached in the post, and that can be reused as a function.
Here are some documentation about the java code.
- To read an XLSX file :
PRInputStream myxls = new PRInputStream(FileName);
org.apache.poi.xssf.usermodel.XSSFWorkbook wb = new org.apache.poi.xssf.usermodel.XSSFWorkbook(myxls);
Filename can be set from pxRequestor.pyFileUpload (pzFilePathWithForm control updates this property with file location string like : "file://web:/StaticContent/global/ServiceExport/filename.xlsx" )
- To get the number of sheets in a workbook
int max_sheets = wb.getNumberOfSheets();
- To get the name of current sheet
String currSheetname = wb.getSheetName(sheetnum) ;
- To access a specific sheet:
org.apache.poi.xssf.usermodel.XSSFSheet sheet = wb.getSheetAt(sheetnum);
- To get a specific row :
org.apache.poi.xssf.usermodel.XSSFRow row = sheet.getRow(j);
- To get data in a specific cell:
org.apache.poi.xssf.usermodel.XSSFCell cellData = row.getCell(k); -- K represents the number of the cell
Output :
The java function will parse the excel and store the data in a pagelist inside a pagelist structure.
We are storing column level information in a pagelist because the number of columns is not predefined.
tempParsedData.pxResults() -- will contain data for each row and .pxResults(1) will have column names
-->.SingleValueProp() -- a pagelist property for each .pxResults() which hold information cellwise i.e. column wise.