this post was submitted on 08 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

TL;DR: Is there an example someone can point me to of RAG over highly structured documents, where the agent returns conversation along with cross-references to document paragraphs or sections? Input: a long text document (~500-1000 pages); output: Q&A with references to the document paragraph, page, or some other simple cross-reference.

I've been looking into RAG in my (extremely limited) spare time for a few months now, but I'm getting hung up on vector databases. That may be because my use case revolves around highly structured specification documents, where I want to be able to recover section and paragraph references in a Q&A session with a RAG assistant.

Most off-the-shelf solutions seem not to care what your data looks like and just provide a black-box pipeline for chunking and embedding: you hand over a single HTML link to a website and magically it works. This confuses me, because LangChain has a great learning path with quite a bit of focus on proper data chunking and vector database structuring, yet literally every example treats the chunking and vector-store step as an afterthought. I don't like doing things I don't understand, so I've focused on creating a database for my data that makes sense in my brain.
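For what it's worth, structure-aware chunking for a spec doesn't have to be a black box. Here's a minimal sketch (all names and the regex are my own assumptions, tailored to specs where each paragraph starts with a number like "3.2.1") that keeps the paragraph number as metadata on every chunk, so it can be cited back later:

```python
import re

# Toy spec text; the "N.N.N requirement" layout is an assumption that
# matches many specification formats, adjust the regex to your documents.
doc = """3.2.1 The widget shall withstand 50 N of force.
3.2.2 The widget shall be painted blue.
7.4.1 All fasteners shall be torqued per table 7-1."""

def chunk_by_paragraph(text):
    # Split on numbered-paragraph headings and keep the number as metadata,
    # so every chunk carries its own cross-reference.
    chunks = []
    for line in text.splitlines():
        m = re.match(r"^(\d+(?:\.\d+)*)\s+(.*)", line)
        if m:
            chunks.append({"ref": m.group(1), "text": m.group(2)})
    return chunks

chunks = chunk_by_paragraph(doc)
print(chunks[0])  # {'ref': '3.2.1', 'text': 'The widget shall withstand 50 N of force.'}
```

The point is just that the cross-reference travels with the chunk from the start, instead of being reconstructed afterwards.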

I have successfully created a local vector database (SQLite) with SBERT that returns paragraph numbers from a similarity search, but I haven't bridged that to feeding those results into an LLM.
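The retrieval half of that setup boils down to something like this sketch. The vectors and metadata are hand-made toys here (an assumption, so it runs without a model download); in the real setup they'd come from `SentenceTransformer.encode()` and the SQLite table:

```python
import math

# Stand-in for rows pulled from the SQLite table: each row keeps the
# paragraph number and page alongside its (toy) embedding vector.
index = [
    {"para": "3.2.1", "page": 14, "vec": [0.9, 0.1, 0.0]},
    {"para": "3.2.2", "page": 14, "vec": [0.1, 0.9, 0.0]},
    {"para": "7.4.1", "page": 88, "vec": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, k=2):
    # Rank rows by cosine similarity to the query embedding and return
    # the cross-reference metadata, not just the raw text.
    ranked = sorted(index, key=lambda row: cosine(query_vec, row["vec"]), reverse=True)
    return [(row["para"], row["page"]) for row in ranked[:k]]

print(search([0.8, 0.2, 0.0]))  # [('3.2.1', 14), ('3.2.2', 14)]
```

Because the paragraph number is just another column riding along with the vector, nothing about the similarity search itself needs to know the document structure.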

Am I overthinking this? Can the off-the-shelf RAG solutions handle the paragraph numbers without me explicitly cramming them into a database structure? Or am I on the right path, and should I continue with the database that makes sense to me and keep figuring out how to implement the LLM step after the vector search?

I started looking at LlamaIndex, then LangChain, now AutoGen. But my spare time is limited enough that I haven't implemented anything with any of these, only a (successful) SBERT similarity search that didn't use any of them. If someone has an example for structured documents where the Q&A provides cross-references, I'd really appreciate it.
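The missing "bridge" step can be framework-free: take the top hits from the similarity search (paragraph text plus its reference) and pack them into a grounded prompt for whatever chat LLM you use. All names below are illustrative, not from any specific library:

```python
# Top hits from the similarity search: (paragraph reference, paragraph text).
hits = [
    ("3.2.1", "The widget shall withstand 50 N of force."),
    ("3.2.2", "The widget shall be painted blue."),
]

def build_prompt(question, hits):
    # Label each excerpt with its bracketed paragraph number so the model
    # can cite it, giving the cross-references the Q&A should return.
    context = "\n".join(f"[{ref}] {text}" for ref, text in hits)
    return (
        "Answer using only the excerpts below. "
        "Cite the bracketed paragraph number for every claim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How much force must the widget withstand?", hits))
```

The resulting string goes to the LLM as-is; since the model is told to cite the bracketed numbers, its answer comes back with the same paragraph references the vector search produced.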

[–] AdamDhahabi@alien.top 1 points 10 months ago

Yesterday I tried GPT4All, and it references context by outputting 3 passages from my local documents; I could click on each of them and read the passage. But their implementation currently uses only a simple retrieval algorithm; embedding-based semantic search is still on their roadmap.

https://preview.redd.it/dnoqmk4olazb1.png?width=1807&format=png&auto=webp&s=cdd1f17a2ea20100504c275094e52b61a6e054f7