from Fractal Reader Development and Operation Diary
Extracting clean text from PDFs
- This seems quite troublesome. I couldn't find an off-the-shelf library that extracts text cleanly from PDFs.
- It seems like using the code around renderTextLayer in pdf.js could enable text extraction (takker)
- /takker/Explanation_of_text_embedding_process_in_PDF.js-viewer#63e9fa0d1280f0000016ae82
- It seems challenging to extract text neatly in the correct order from documents like two-column papers where the text is not neatly arranged (blu3mo)
- Ah, indeed (takker)
- If it’s okay to focus only on neatly aligned columns in PDFs, you could write a program to detect the number of columns and extract text accordingly.
- It would be easiest if text extraction could be done with a multimodal model.
- Trial and Error Memo on Extracting Text from PDFs | Kan Hatakeyama
- PyMuPDF seems to offer good accuracy.
- It seems text can be cleanly selected even in a PDF viewer.
- I think this is limited to cases where both the text information and the order of the text are embedded in the PDF (takker).
- If it were a web browser, it should be constructing <span> elements in order according to the text sequence.
- Selecting text in the order of the spans allows for clean text extraction.
- Of course, it’s powerless against PDFs where pages are embedded as images.
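
The column-detection idea mentioned above could look roughly like the sketch below. It is not code from any of the tools discussed. It assumes each text block has already been extracted as a tuple (x0, y0, x1, y1, text) with the origin at the top-left of the page. The names split_columns and extract_in_reading_order and the min_gap_ratio threshold are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Assumed input shape: (x0, y0, x1, y1, text), origin at the top-left of the page.
Block = Tuple[float, float, float, float, str]


def split_columns(blocks: List[Block], page_width: float,
                  min_gap_ratio: float = 0.03) -> List[Tuple[float, float]]:
    """Return (left, right) x-ranges of columns, found by merging the
    horizontal extents of blocks and splitting at sufficiently wide gaps."""
    if not blocks:
        return []
    min_gap = page_width * min_gap_ratio  # hypothetical threshold
    spans = sorted((b[0], b[2]) for b in blocks)
    columns = [[spans[0][0], spans[0][1]]]
    for x0, x1 in spans[1:]:
        if x0 - columns[-1][1] > min_gap:
            # A wide empty vertical strip: start a new column.
            columns.append([x0, x1])
        else:
            # Overlapping or nearly touching: extend the current column.
            columns[-1][1] = max(columns[-1][1], x1)
    return [(left, right) for left, right in columns]


def extract_in_reading_order(blocks: List[Block], page_width: float) -> str:
    """Join block texts column by column (left to right), top to bottom."""
    ordered: List[str] = []
    for left, right in split_columns(blocks, page_width):
        in_col = [b for b in blocks if left <= b[0] and b[2] <= right]
        in_col.sort(key=lambda b: (b[1], b[0]))
        ordered.extend(b[4] for b in in_col)
    return "\n".join(ordered)


if __name__ == "__main__":
    sample = [
        (50, 100, 280, 120, "left 1"),
        (320, 100, 550, 120, "right 1"),
        (50, 130, 280, 150, "left 2"),
        (320, 130, 550, 150, "right 2"),
    ]
    print(extract_in_reading_order(sample, page_width=600))
```

This only works for neatly aligned columns: a full-width block such as a title bridges the gap and collapses the whole page into a single column. As far as I know, PyMuPDF's page.get_text("blocks") returns tuples whose first five fields are x0, y0, x1, y1, text, so its output could be trimmed to this shape, but that wiring is not shown here.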