this post was submitted on 01 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

I have some old engineering textbooks and want to try taking pictures of the pages, extracting the text with a vision model, and using that data to fine-tune an LLM. I may need to fine-tune the vision model first so that it parses the text into Markdown. My main question is which base vision model to use, especially given how dense the text is; these models are not well documented in terms of what input resolutions they support. Nougat? BakLLaVA? Tesseract? I'd appreciate advice on a good starting point so I don't burn too much time going down the wrong path. A rough sketch of the kind of pipeline I mean follows the summary below.

In summary:

  • Goal is to extract text from pictures of textbook pages into markdown format.
  • Photos will be normal ~12MP images captured with my phone camera, one page per photo.
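
For context, here is roughly what I have in mind with Nougat, which is available in Hugging Face transformers as a VisionEncoderDecoderModel. This is a minimal sketch, assuming the facebook/nougat-base checkpoint and a local photo named page.jpg; I haven't validated it on dense textbook pages:

```python
from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Assumption: the facebook/nougat-base checkpoint (a -small variant also exists).
processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One ~12MP phone photo of a single page; the processor resizes the
# image to the model's expected input resolution internally.
image = Image.open("page.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

outputs = model.generate(
    pixel_values,
    min_length=1,
    max_new_tokens=3584,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Decode and clean the output into (roughly) Markdown.
text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
text = processor.post_process_generation(text, fix_markdown=True)
print(text)
```

For comparison, Tesseract (via pytesseract) would just be `pytesseract.image_to_string(image)`, which produces plain text with no Markdown or layout awareness, so a Nougat-style model seems more aligned with the Markdown goal. One caveat: Nougat was trained on rendered PDF pages rather than phone photos, so how well it handles camera images is exactly the kind of thing I'm asking about.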
Gullible_Response_54@alien.top · 10 months ago