this post was submitted on 27 Nov 2023

LocalLLaMA

Does anyone know of a robust open-source library for extracting tables from PDFs? Even an OCR library is fine.

P.S. I have already tried tabula, camelot, img2table, unstructured.io, and most of the document loaders in LangChain; none of them are even 95% robust.

[–] happy_dreamer10@alien.top 1 points 9 months ago (1 children)

Can it extract tables automatically? Including ones with merged cells?

[–] Kimononono@alien.top 1 points 9 months ago

I haven’t used it in a while, but I remember it being able to extract my headers, which were in merged cells. It’s fairly high-level, so it’s worth a try.