Ultimately, it comes down to two things:
- At minimum, a PDF document contains nothing but rendering instructions. A piece of text does not exist in the file as such; there are only instructions to draw glyphs at given positions. Even plain text extraction is therefore a non-trivial problem.
- Detecting (let alone processing) tables, lists and other structured content in PDF documents is very hard. It's the kind of problem people write theses about.
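To make the first point concrete, here is a minimal sketch in Python of what a text extractor has to do once it has the positioned fragments from the page. It is not how iText or pdf2Data work internally. The `Chunk` type, the `rebuild_lines` function and both threshold values are assumptions made up for this illustration, and real documents add rotation, fonts, columns and much more on top.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical positioned text fragment, roughly what one
    'show text' instruction in a PDF content stream gives you."""
    x: float      # left edge of the fragment
    y: float      # baseline; in PDF, y grows upward
    width: float  # rendered width of the fragment
    text: str


def rebuild_lines(chunks, y_tolerance=2.0, gap_threshold=1.0):
    """Group fragments into lines and guess where the spaces go.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    # Top of the page first (largest y), then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))

    # Fragments whose baselines are close enough share a line.
    lines = []
    for chunk in ordered:
        if lines and abs(lines[-1][0].y - chunk.y) <= y_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])

    result = []
    for line in lines:
        line.sort(key=lambda c: c.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            # A visible gap between fragments is treated as a space;
            # the file itself may contain no space character at all.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_threshold:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


# Fragments in drawing order, which need not be reading order.
sample = [
    Chunk(x=100, y=700.0, width=28, text="world"),
    Chunk(x=72, y=700.0, width=12, text="Hel"),
    Chunk(x=84, y=700.5, width=10, text="lo"),
    Chunk(x=72, y=680.0, width=30, text="Second"),
    Chunk(x=106, y=680.0, width=20, text="line"),
]
print(rebuild_lines(sample))
```

Note that even this toy version has to guess at line membership and word boundaries. Tables and lists need further layers of guessing on top of that, which is why they are so much harder.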
If all of your incoming PDF documents look similar, you can use iText 7 together with pdf2Data, an iText 7 add-on that features some good algorithms for table detection, sentence and paragraph detection, and so on.
If you have further questions, please post them on StackOverflow (unless you are a paying customer, in which case you can access our Jira board directly). We make a point of checking StackOverflow at least daily.
Posted: 21 Sep 2017 7:32 EDT
Mitchell Vega (Mitchell)
Associate Product Manager, Robotics Engine