this post was submitted on 27 Nov 2023

Does anyone know of a robust open-source library for extracting tables from PDFs? Even an OCR library is fine.

P.S. I have already tried Tabula, Camelot, img2table, unstructured.io, and most of the document loaders in LangChain. None of them is even 95% robust.

simion314@alien.top 1 points 9 months ago

Are your PDFs random documents from users? If so, that will be a problem, since PDFs can be structured in many different ways depending on the tool that created them. If all the PDFs are alike, e.g. created by the same tool, then you may have a chance: I would inspect the PDF layout to see whether it is consistent, and then you can probably get the data out with a PDF library (maybe you could use parts of pdf.js from Mozilla).
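
To make the "inspect the layout" idea concrete, here is a rough Python sketch. It assumes you have already pulled positioned words out of the PDF with some library (for example, pdfplumber's `page.extract_words()` returns dicts with `text`, `x0`, and `top` keys, and pdf.js exposes similar position data). The function names and tolerance values are made up for illustration, not part of any library.

```python
from typing import Dict, List

# Assumed input shape: {"text": str, "x0": float, "top": float}
Word = Dict[str, object]


def group_into_rows(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Group words whose vertical positions are within y_tol of the
    first word of the current row, then order each row left to right."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(r, key=lambda w: w["x0"]) for r in rows]


def column_starts(rows: List[List[Word]], x_tol: float = 5.0) -> List[float]:
    """Collect the distinct left edges seen across all rows.
    Left edges closer than x_tol to the last kept edge are merged."""
    starts: List[float] = []
    for x in sorted(w["x0"] for r in rows for w in r):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def rows_as_text(rows: List[List[Word]]) -> List[List[str]]:
    """Drop the coordinates and keep only the cell text."""
    return [[str(w["text"]) for w in r] for r in rows]
```

If `column_starts` gives roughly the same values for every document from the same tool, the layout is probably consistent enough to hard-code the column boundaries. If the values jump around between files, you are in the "random documents from users" case and this approach will not hold up.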