this post was submitted on 01 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

I have some old engineering textbooks and want to try taking pictures of the pages, extracting the text with a vision model, and using that data to fine-tune an LLM. I may need to fine-tune the vision model first so that it parses the text into Markdown. My main question is which base vision model to use, especially given how dense the text is; these models are not well documented in terms of what input resolutions they support. Nougat? BakLLaVA? Tesseract? I'd appreciate advice on a good starting point so I don't burn too much time going down the wrong path. A rough sketch of the kind of pipeline I mean follows the summary below.

In summary:

  • Goal is to extract text from pictures of textbook pages into markdown format.
  • Photos will be normal ~12MP images captured with my phone camera, one page per photo.
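
For context, here is roughly what I have in mind with Nougat, which is available in Hugging Face transformers as a VisionEncoderDecoderModel. This is a minimal sketch, assuming the facebook/nougat-base checkpoint and a local photo named page.jpg; I haven't validated it on dense textbook pages:

```python
from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Assumption: the facebook/nougat-base checkpoint (a -small variant also exists).
processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One ~12MP phone photo of a single page; the processor resizes the
# image to the model's expected input resolution internally.
image = Image.open("page.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

outputs = model.generate(
    pixel_values,
    min_length=1,
    max_new_tokens=3584,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Decode and clean the output into (roughly) Markdown.
text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
text = processor.post_process_generation(text, fix_markdown=True)
print(text)
```

For comparison, Tesseract (via pytesseract) would just be `pytesseract.image_to_string(image)`, which produces plain text with no Markdown or layout awareness, so a Nougat-style model seems more aligned with the Markdown goal. One caveat: Nougat was trained on rendered PDF pages rather than phone photos, so how well it handles camera images is exactly the kind of thing I'm asking about.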
Gullible_Response_54@alien.top · 10 months ago