Smerfj

joined 1 year ago

Chunking and storing structured data and vectors for RAG in c/localllama@poweruser.forum

Smerfj@alien.top 1 points 1 year ago

Thanks for the pointers. Since I aim to use local models eventually, I'll take any efficiency I can squeeze out.

Chunking and storing structured data and vectors for RAG (alien.top)

submitted 1 year ago by Smerfj@alien.top to c/localllama@poweruser.forum


TL;DR: Is there an example someone can point me to for RAG with highly structured documents, where the agent returns conversation along with cross-references to document paragraphs or sections? Input = a long text document (~500-1000 pages); output = Q/A with references to the document paragraph, page, or another simple cross-reference.

I've been looking into RAG in my (extremely limited) spare time for a few months now, but I'm getting hung up on vector databases. That may be because my use case revolves around highly structured specification documents, where I want to be able to recover section and paragraph references in a Q/A session with a RAG assistant.

Most off-the-shelf solutions don't seem to care what your data looks like and just provide a black-box solution for chunking and embedding, like taking a single HTML link to a website as the source and having it magically work. This confuses me because LangChain has a great learning path that puts quite a bit of focus on proper data chunking and vector database structuring, yet literally every example treats the chunking and vector store step as an afterthought. I don't like doing something I don't understand, so I've been focused more on creating a database for my data that makes sense in my brain.

I have successfully created a local vector database (SQLite) with SBERT that returns paragraph numbers from a similarity search, but I haven't yet bridged that to feeding those results into an LLM.
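To make the setup concrete, here is a minimal sketch of that kind of store, not my actual code. It assumes a single hypothetical `chunks` table that keeps the section, paragraph number, and page alongside each paragraph's text and vector. The `toy_embed` function is a bag-of-words stand-in so the sketch runs without any model; in practice an SBERT encoder would go there. All table, column, and function names are made up for illustration.

```python
import json
import math
import sqlite3
from collections import Counter


def toy_embed(text):
    """Stand-in for an SBERT encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(paragraphs):
    """paragraphs: iterable of (section, paragraph, page, text) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks ("
        "section TEXT, paragraph TEXT, page INTEGER, text TEXT, vector TEXT)"
    )
    for section, paragraph, page, text in paragraphs:
        conn.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
            (section, paragraph, page, text, json.dumps(toy_embed(text))),
        )
    conn.commit()
    return conn


def search(conn, query, k=2):
    """Return the k most similar chunks, each with its cross-reference fields."""
    q = toy_embed(query)
    rows = conn.execute(
        "SELECT section, paragraph, page, text, vector FROM chunks"
    ).fetchall()
    scored = []
    for section, paragraph, page, text, vector in rows:
        hit = {"section": section, "paragraph": paragraph, "page": page, "text": text}
        scored.append((cosine(q, json.loads(vector)), hit))
    # Sort on the score only, so ties never try to compare the dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:k]]
```

The point of the layout is that the paragraph number and page travel with every chunk, so whatever the search returns already carries its own cross-reference.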

Am I overthinking this? Can the off-the-shelf RAG solutions handle the paragraph numbers without me explicitly cramming them into a database structure? Or am I on the right path, and should I continue with the database that makes sense to me and keep figuring out how to implement the LLM step after the vector search?
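My rough guess at what the missing LLM step could look like is sketched below. It assumes each search hit is a dict with hypothetical `paragraph`, `page`, and `text` keys, and it only builds the prompt string; no model is called, and the prompt wording is just one possibility, not a tested recipe.

```python
def format_reference(hit):
    """Render a hit's cross-reference as a bracketed tag, e.g. [para 3.2.1, p. 41]."""
    return f"[para {hit['paragraph']}, p. {hit['page']}]"


def build_prompt(question, hits):
    """Assemble a prompt that places each excerpt next to its reference tag."""
    context = "\n".join(f"{format_reference(h)} {h['text']}" for h in hits)
    return (
        "Answer the question using only the excerpts below. "
        "After each claim, cite the bracketed reference it came from.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string would be passed to whichever local model is in use. Because the reference tags sit inside the context, the model can copy them into its answer instead of having to infer paragraph numbers on its own.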

I started by looking at LlamaIndex, then LangChain, and now AutoGen. But my spare time is limited enough that I haven't implemented anything with any of them, only a (successful) SBERT similarity search that didn't use any of these. If someone has an example for structured documents where the Q/A provides cross-references, I'd really appreciate it.