Question
HTC
IN
Last activity: 5 Feb 2024 5:58 EST
How to extract a table from a PDF when the table spans multiple pages
Hi,
We are trying to extract a table that spans multiple pages from a readable PDF file.
When we try to select the table region, the whole PDF is highlighted (please refer to the attachment).
Is there any way to extract a table that spans multiple pages from a readable PDF?
***Edited by Moderator Marissa to add Support Case details***
***Edited by Moderator Marije to remove INC-B1427 (Pega.ServerDeploy issue) and replace with INC-B434 (pdf issue) ***
***Edited by Moderator Marije to add new BUG-849540 ***
Accepted Solution
Pegasystems Inc.
GB
@AbhishekR17024116 @ManikandanT17003272 BUG-849540 and INC-B434 (Issue with PDF Table in 22.1.24) were closed with the conclusion that this PDF file is not supported:
The issue under investigation was why the vertical and horizontal lines are not counted correctly after reduction.
The PDF was found to contain non-visible table lines: lines that are not displayed but are listed in the structure of the PDF. As a result, the table recognition code interprets the table as running from the top to the bottom of the PDF page.
The PDF connector's table recognition code therefore identifies the table as the full page. This structure is atypical of an ordinary PDF table.
A feature enhancement would be required for the PDF connector to distinguish visible from non-visible table lines in such a PDF structure.
If you need support for such a PDF, please ask our GCS team for assistance in placing a feature enhancement request through the ticketing system.
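To illustrate the explanation above, here is a toy sketch in Python of why non-visible lines can stretch a detected table to the full page. It is not the PDF connector's code; the `RulingLine` type, the `visible` flag, and the page coordinates are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RulingLine:
    """Hypothetical vertical ruling line found in a PDF page's structure."""
    y_top: float      # distance from the top of the page, in points
    y_bottom: float
    visible: bool     # False for lines listed in the PDF but never drawn


def table_vertical_extent(
    lines: List[RulingLine], visible_only: bool
) -> Optional[Tuple[float, float]]:
    """Return (top, bottom) of the region covered by the considered lines."""
    considered = [ln for ln in lines if ln.visible or not visible_only]
    if not considered:
        return None
    return (min(ln.y_top for ln in considered),
            max(ln.y_bottom for ln in considered))


# Assumed layout: a 792-point-tall page with a drawn table between 300 and 500,
# plus one undrawn line that runs the whole height of the page.
lines = [
    RulingLine(300.0, 500.0, visible=True),
    RulingLine(300.0, 500.0, visible=True),
    RulingLine(0.0, 792.0, visible=False),
]

all_lines_extent = table_vertical_extent(lines, visible_only=False)
visible_extent = table_vertical_extent(lines, visible_only=True)
```

With every line counted, the extent covers the whole page; with only visible lines counted, it matches the drawn table. That distinction is what the feature enhancement would need to add.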
cc @ThomasSasnett cc @Mitchell
Pegasystems Inc.
US
@ManikandanT17003272 Is it possible to attach an example of that PDF (without any real data, of course)? I believe it is possible to ignore the headers and footers, but it is not something I do regularly, so having the PDF to test with would be helpful.
Updated: 4 Jan 2024 1:17 EST
HTC Global Services
IN
Continuing from Mani's post: please find the test PDF attached.
Pegasystems Inc.
US
@AbhishekR17024116 I believe there is something odd with this specific PDF. I have opened a support request to get an explanation as to why this table is being misread. The INC is INC-B434.
Normally, you can simply elect to have the table span pages; in this case, however, the table seems to include the entire page. It is possible to read and work with the table this way, but it is not ideal. If you had to work with this PDF now, you would have an extra column that essentially splits the Amount column into two parts. You could join them together in your automation to get the full value. In addition, the table would contain most of the values on each page, so you would need to exclude the information that you do not want. I believe there is an explanation for this PDF, though, and I will update once I get word back from support.
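As a rough illustration of that workaround, here is a minimal Python sketch that joins the two halves of a split Amount column and drops rows that are really other page text. It is not automation code for the PDF connector: the row layout, the column positions, and the helper names (`merge_amount`, `looks_like_table_row`, `clean_table`) are assumptions for the example, and the sample rows are invented.

```python
from typing import List


def merge_amount(row: List[str], left: int, right: int) -> List[str]:
    """Join two adjacent cells (right == left + 1) that hold one Amount value."""
    merged = (row[left] + row[right]).strip()
    return row[:left] + [merged] + row[right + 1:]


def looks_like_table_row(row: List[str], amount_index: int) -> bool:
    """Keep only rows whose merged Amount cell parses as a number."""
    text = row[amount_index].replace(",", "")
    try:
        float(text)
    except ValueError:
        return False
    return True


def clean_table(raw_rows: List[List[str]], left: int = 2, right: int = 3) -> List[List[str]]:
    """Merge the split Amount cells, then discard rows that are not table data."""
    merged = [merge_amount(r, left, right) for r in raw_rows if len(r) > right]
    return [r for r in merged if looks_like_table_row(r, left)]


# Invented sample: the full-page "table" also picked up a title and a footer,
# and the Amount value is split across columns 2 and 3.
raw_rows = [
    ["Invoice summary", "", "", ""],
    ["2024-01-02", "Widget", "1,2", "50.00"],
    ["2024-01-03", "Gadget", "", "75.10"],
    ["Page 1 of 2", "", "", ""],
]

cleaned = clean_table(raw_rows)
```

Filtering on whether the Amount parses as a number is only one possible rule; a real document may need a different test, such as matching a date pattern in the first column.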
Pegasystems Inc.
US
@ThomasSasnett Here is a link to the documentation on working with PDFs.
Pegasystems Inc.
US
@ThomasSasnett The customer has opened INC-B1427 on this issue and the one I opened has been closed.
Updated: 5 Feb 2024 5:47 EST
Pegasystems Inc.
GB
@ThomasSasnett INC-B1427 does not relate to the PDF issue but to Pega.ServerDeploy overrides.
The Pega Support ticket INC-B434 (Issue with PDF Table in 22.1.24) is still open!
GCS will contact you today regarding testing the PDF connector.
Update: BUG-849540 has been logged and the team is investigating further.