this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Does anyone know a robust open-source library for extracting tables from PDFs? Even an OCR library is fine.

P.S. I have already tried tabula, camelot, img2table, unstructured.io, and most of the document loaders in LangChain; none of them are even 95% robust.

top 9 comments
[–] Kimononono@alien.top 1 points 9 months ago (1 children)
[–] happy_dreamer10@alien.top 1 points 9 months ago (1 children)

Can it extract tables automatically? Even ones with merged cells?

[–] Kimononono@alien.top 1 points 9 months ago

I haven’t used it in a while, but I remember it being able to extract my headers, which were in merged cells. It’s fairly high-level, so it’s worth a try.

[–] Dry_Long3157@alien.top 1 points 9 months ago (1 children)

nougat by Facebook is your best bet.

[–] happy_dreamer10@alien.top 1 points 9 months ago (1 children)

Thanks, will check it out. Have you tried it?

[–] Dry_Long3157@alien.top 1 points 9 months ago

Yup, it's the best I've tried for tables and math formulas.

[–] simion314@alien.top 1 points 9 months ago

Are your PDFs random documents from users? If so, that will be a problem, since PDFs can be structured in many different ways depending on the tool that produced them. If all the PDFs are alike, e.g. created by the same tool, then you may have a chance: I would inspect the PDF layout, check whether it is consistent, and then try to pull the data out with a PDF library (you could maybe reuse parts of pdf.js from Mozilla).

[–] arthurwolf@alien.top 1 points 9 months ago (1 children)

It's a long shot, but I think if you took DeepPanel (see GitHub) and, instead of training it on comic book panels, set up a training dataset of PDF tables, it would generate the same kind of masks/heatmaps it generates for comic book panels, but for PDF tables (this gives you an image showing where the "table lines" are, with all text and other random content removed, so you only have to process the table lines).

Then from there, you could scan the image vertically first, averaging the pixels of each horizontal line of the heatmap to detect where the rule lines are, and cut the table into rows. Once you have the rows, you do the same thing on each row to get the columns/cells.
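The scanning-and-averaging step above can be sketched with plain NumPy; `content_spans` and the 0.5 threshold are illustrative names and values, not anything from DeepPanel:

```python
import numpy as np

def content_spans(mask, axis=0, thresh=0.5):
    """Return (start, end) index pairs of content bands along `axis`.

    `mask` is a 2-D binary array where 1 marks detected table-line
    pixels (the thresholded heatmap). Averaging across the other axis
    gives a 1-D profile; positions where the profile exceeds `thresh`
    are rule lines, and the gaps between them are the rows (axis=0)
    or columns (axis=1).
    """
    profile = mask.mean(axis=1 - axis)
    is_line = profile >= thresh
    spans, start = [], None
    for i, line in enumerate(is_line):
        if not line and start is None:
            start = i                    # gap between rule lines begins
        elif line and start is not None:
            spans.append((start, i))     # gap ends at the next rule line
            start = None
    if start is not None:
        spans.append((start, len(is_line)))
    return spans
```

Call it once with `axis=0` on the whole mask to get the rows, then with `axis=1` on each row crop to get the cells within that row.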

I do this for comic book panels and it works very well, I see no reason why it wouldn't work for PDF tables.

It's a lot of work but I'm fairly certain it'd work.

Then once you have the cells, it's just a matter of OCR (you could maybe even try llava for that; I suspect it might work).

Tell me if you need help with this, or want more details about how I did it for comic books and how I would do it for PDF tables.

[–] happy_dreamer10@alien.top 1 points 9 months ago

Thanks :) I don't want to go through a training process; currently I'm converting the PDFs to LaTeX format, which is working pretty well.