What libraries do you see as being SOTA? Fitz? Tika? My hope is that computer vi...

notacop31337 · on Nov 18, 2022

To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.

Your second comment rings true, and in my opinion we are there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely say it's there now. I threw stuff at it that previously would just come out as trash, and it handled it fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).

Cost is the kicker for me: $15 per 1,000 pages adds up fairly quickly at any sort of scale!
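To put a number on "adds up", here is a back-of-the-envelope sketch in Python. It assumes a flat rate of $15 per 1,000 pages as quoted above; actual pricing may have tiers or differ by feature, and the function name is made up for illustration.

```python
def ocr_cost_usd(pages: int, usd_per_1000_pages: float = 15.0) -> float:
    """Estimate OCR spend at a flat per-1,000-page rate (assumed, not official pricing)."""
    return pages / 1000 * usd_per_1000_pages


if __name__ == "__main__":
    # Print the estimate for a few hypothetical corpus sizes.
    for pages in (1_000, 100_000, 1_000_000):
        print(f"{pages:>9,} pages -> ${ocr_cost_usd(pages):,.2f}")
```

At that rate a million pages comes to roughly $15,000, which is where self-hosted options start to look attractive.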

999900000999 · on Nov 18, 2022

OCR is built into Adobe's PDF reader; the issue is that it's $15 a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.

mythrwy · on Nov 18, 2022

There is the Python library OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/) that works really well. I have found its results comparable to Adobe's in accuracy.

I believe it uses tesseract, ghostscript and some other libraries.
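For anyone who wants to try it, a minimal sketch of driving the `ocrmypdf` command line from Python might look like the following. It assumes `ocrmypdf` is installed and on the PATH; the helper names and file names are hypothetical. `-l` selects the Tesseract language and `--skip-text` leaves pages that already have a text layer alone.

```python
import subprocess


def build_ocrmypdf_cmd(src: str, dst: str, language: str = "eng",
                       skip_text: bool = True) -> list[str]:
    """Build an ocrmypdf command line (helper name is illustrative only)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # Do not re-OCR pages that already contain text.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str) -> None:
    """Run ocrmypdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_ocrmypdf_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names; show the command rather than running it.
    print(" ".join(build_ocrmypdf_cmd("scan.pdf", "scan_ocr.pdf")))
```

The output PDF keeps the original page images and adds a searchable text layer on top.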

Speaking of ghostscript, one way to deal with problematic PDFs is to print them to file and deal with the result instead.
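A rough sketch of that "print to file" trick, assuming the Ghostscript binary is available as `gs` (on Windows it is usually `gswin64c`). The `pdfwrite` device re-emits the document as a fresh PDF, which often normalizes odd files; the helper names and file names here are hypothetical.

```python
import subprocess


def build_gs_rewrite_cmd(src: str, dst: str, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command that re-writes src as a new PDF at dst."""
    # -o sets the output file and implies batch, no-pause operation.
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", src]


def rewrite_pdf(src: str, dst: str) -> None:
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_gs_rewrite_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names; show the command rather than running it.
    print(" ".join(build_gs_rewrite_cmd("problem.pdf", "clean.pdf")))
```

The rewritten file can then be passed to whatever extraction or OCR step was choking on the original.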

999900000999 · on Nov 18, 2022

Do any open source apps integrate this?

I'd love to just be able to search a PDF document for a string and get a list of results.
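Something like the sketch below is what I have in mind. It assumes the PDF already has a text layer (for example after an OCR pass) and that the `pypdf` package is used to pull text per page; the function names are made up, and the search itself is just a case-insensitive match over page strings.

```python
import re


def search_pages(page_texts, needle, context=30):
    """Return (page_number, snippet) for each case-insensitive match."""
    if not needle:
        return []
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    hits = []
    for page_no, text in enumerate(page_texts, start=1):
        for m in pattern.finditer(text):
            # Keep a little surrounding text so each hit is readable.
            lo = max(0, m.start() - context)
            hi = m.end() + context
            hits.append((page_no, text[lo:hi].strip()))
    return hits


def load_page_texts(path):
    """Extract text per page with pypdf (assumed installed; third-party)."""
    from pypdf import PdfReader
    return [page.extract_text() or "" for page in PdfReader(path).pages]


if __name__ == "__main__":
    # Stand-in page texts; with a real file use load_page_texts("report.pdf").
    pages = ["Revenue rose this quarter.", "Nothing here.", "Net revenue fell."]
    for page_no, snippet in search_pages(pages, "revenue", context=10):
        print(page_no, snippet)
```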