LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Extract Tables from PDFs (alien.top)

submitted 2 years ago by G_S_7_wiz@alien.top to c/localllama@poweruser.forum

I am working on a project where I have to extract tables from PDFs, usually financial reports that contain many tables (both simple tables and tables with merged cells) as well as graphs.
I have tried the following libraries without great results:
Nougat, PyMuPDF (fitz), PyPDF2, pdfplumber, PDFMiner, Camelot, Tabula, pdfquery

What other OCR tools, LLMs, or other approaches would you recommend trying next? Thanks in advance!

vec1nu@alien.top 1 point 2 years ago

I've had good results using https://github.com/DevashishPrasad/CascadeTabNet

Chaosdrifer@alien.top 1 point 2 years ago

You might want to look into LlamaIndex's SEC Insights repo (https://github.com/run-llama/sec-insights). They do a lot of parsing on financial documents.