Looking at the code, this converts PDF pages to images, then transcribes each im...

firesteelrain · 2025-08-17T22:58:49 1755471529

There is a very popular Python module called ocrmypdf. I used it to help my HOA OCR its old PDFs.

No LLMs required.
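For anyone who wants to script it, here is a minimal sketch that drives the ocrmypdf command-line tool from Python. The helper names (`build_ocrmypdf_cmd`, `ocr_pdf`) and the file names are made up for illustration. The `-l` and `--skip-text` options are ocrmypdf's language and "leave pages that already have text alone" flags.

```python
import shutil
import subprocess


def build_ocrmypdf_cmd(src, dst, language="eng", skip_text=True):
    """Build the argv list for the ocrmypdf CLI (hypothetical helper; runs nothing)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # --skip-text leaves pages that already contain a text layer untouched.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf(src, dst, **kwargs):
    """Run ocrmypdf if it is on PATH; return True only if it exited with code 0."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed; the caller decides what to do next
    result = subprocess.run(build_ocrmypdf_cmd(src, dst, **kwargs))
    return result.returncode == 0
```

Calling `ocr_pdf("minutes-scan.pdf", "minutes-searchable.pdf")` would write a searchable copy next to the original, assuming ocrmypdf and Tesseract are installed.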

dreamcompiler · 2025-08-19T04:17:13 1755577033

20 years ago I tried in vain to get my HOA to use the virtual printer for PDF documents so they'd be searchable. The capability was built in to both Mac and Windows even way back then.

No luck. They just could not grasp it. So they kept using their process of printing out the file on paper and then scanning it back in as a PDF image file.

I finally quit trying. Now of course they've seen the light and are painstakingly OCRing all that old stuff.

firesteelrain · 2025-08-19T11:44:59 1755603899

Ouch! I am on the BOD, so as an IT/engineering professional I can influence things better.

cess11 · 2025-08-18T07:43:58 1755503038

It's nice, I've used it as a fallback text extraction method in an ETL flow that chugged through tens of thousands of corporate and legal PDF files.
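A rough sketch of what a fallback chain like that can look like, not the actual ETL code. The function name, the `min_chars` threshold, and the stand-in extractors are all assumptions for illustration. In practice the first entry would wrap a direct text extractor and the last one an OCR pass.

```python
def extract_with_fallback(path, extractors, min_chars=20):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor that yields at least
    `min_chars` non-whitespace-trimmed characters, or (None, "") if none do.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A broken or encrypted file should not stop the whole batch.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Stand-in extractors: the "direct" one finds no text layer (as with a
# scanned PDF), so the chain falls through to the "ocr" one.
chain = [
    ("direct", lambda path: "   "),
    ("ocr", lambda path: "Minutes of the annual board meeting"),
]
method, text = extract_with_fallback("scan.pdf", chain)
```

The length threshold is a crude signal. A page with only a header or page number in its text layer would need a smarter check.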

westurner · 2025-08-17T23:21:46 1755472906

Shell: GNU parallel, pdftotext

Python: PyPDF2, pdfminer.six, GROBID, PyMuPDF; pytesseract (C++)

paperetl is built on grobid: https://github.com/neuml/paperetl

annotateai: https://github.com/neuml/annotateai :

> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.

pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is :

> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added

Hypothesis is built on the W3C Web Annotations spec.

dokieli implements W3C Web Annotations and many other Linked Data Specs: https://github.com/dokieli/dokieli :

> Implements versioning and has the notion of immutable resources.

> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).

A dokieli document interface to LLMs would be basically the anti-PDF.

Rust crates: rayon handles parallel processing, pdf-rs, tesseract (C++)

pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...
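The shell pairing of GNU parallel and pdftotext mentioned at the top of this comment can be approximated in Python. This is a minimal sketch that assumes poppler's `pdftotext` binary is on PATH. The helper names and the worker count are made up for illustration, and `-layout` is pdftotext's option for keeping the physical layout.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdftotext_cmd(pdf, layout=True):
    """Build the argv list that converts `pdf` to a .txt file beside it."""
    pdf = Path(pdf)
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the original physical layout
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def convert_all(pdfs, workers=4):
    """Run pdftotext on many files concurrently; return the exit codes in order."""
    def run(pdf):
        return subprocess.run(pdftotext_cmd(pdf)).returncode

    # Threads are enough here because the real work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, pdfs))
```

Something like `convert_all(Path("docs").glob("*.pdf"))` would then process a folder, with a nonzero exit code marking each file that failed.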

moritonal · 2025-08-17T23:30:17 1755473417

I imagine part of the issue is how many PDFs are just a series of images anyway.

enjaydee · 2025-08-17T23:46:02 1755474362

Saw this tweet the other day that helped me understand just how crazy PDF parsing can be

https://threadreaderapp.com/thread/1955355127818358929.html

constantinum · 2025-08-18T02:42:58 1755484978

There are a few other reasons why PDF parsing is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

ethan_smith · 2025-08-18T04:55:32 1755492932

Image-based extraction often preserves layout and handles PDFs with embedded fonts, scanned content, or security restrictions better than direct text extraction methods.