When it comes to PDF-to-text conversion, you have three options:
For PDFs with a text layer, you can use the Rule-File-Binary#pyExtractText extension point, which uses PDFBox to extract the text.
For PDFs that require OCR (no text layer), or for PNG or JPG images, you can use either the PegaOCR component (for on-premises solutions) or DPS (for cloud solutions). In that case pyExtractText invokes pyOcrAttachmentAnalysis to perform the OCR.
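The choice between these options can be sketched as a small decision function. This is only an illustration of the routing described above, not Pega code: the names choose_route, Route, has_text_layer and is_cloud are hypothetical, and the sketch assumes you already know whether the PDF has a text layer and whether you run on-premises or in the cloud.

```python
import os
from enum import Enum


class Route(Enum):
    """Hypothetical labels for the three options described above."""
    TEXT_LAYER = "pyExtractText (PDFBox text extraction)"
    OCR_ON_PREM = "PegaOCR"
    OCR_CLOUD = "DPS"


# Image types mentioned in the post; they always need OCR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_route(filename: str, has_text_layer: bool, is_cloud: bool) -> Route:
    """Pick an extraction option for an attachment.

    has_text_layer and is_cloud are assumed inputs; how you determine them
    is outside the scope of this sketch.
    """
    ext = os.path.splitext(filename)[1].lower()

    # A PDF with a text layer needs no OCR.
    if ext == ".pdf" and has_text_layer:
        return Route.TEXT_LAYER

    # Scanned PDFs and images go through OCR; the deployment type
    # decides which OCR option applies.
    if ext == ".pdf" or ext in IMAGE_EXTENSIONS:
        return Route.OCR_CLOUD if is_cloud else Route.OCR_ON_PREM

    raise ValueError(f"Unsupported attachment type: {ext or filename}")
```

For example, choose_route("scan.pdf", has_text_layer=False, is_cloud=False) selects the PegaOCR option, while the same file in a cloud deployment would go to DPS.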
The pyExtractText activity returns text that can then be fed into Pega NLP. DPS (Document Processing Service) is not yet released, but you can contact @grabm, our product owner, for early access.