Chunking and storing structured data and vectors for RAG (alien.top)

submitted 10 months ago by Smerfj@alien.top to c/localllama@poweruser.forum

TL;DR: Is there an example someone can point me to for RAG with highly structured documents, where the agent returns its answer along with cross-references to document paragraphs or sections? Input = a long text document (~500-1000 pages); output = Q/A with references to the document paragraph, page, or some other simple cross-reference.

I've been looking into RAG in my (extremely limited) spare time for a few months now, but I'm getting hung up on vector databases. That may be because my use case revolves around highly structured specification documents, and I want to be able to recover section and paragraph references in a QA session with a RAG assistant.

Most off-the-shelf solutions don't seem to care what your data looks like and just provide a black-box solution for chunking and embedding, like taking a single HTML link to a website as the source and magically it works. This confuses me because LangChain has a great learning path that puts quite a bit of focus on proper data chunking and vector database structuring, yet literally every example treats the chunking and vector store step as an afterthought. I don't like doing something I don't understand, so I've been focused more on creating a database for my data that makes sense in my brain.

I have successfully created a local vector database (sqlite) with SBERT that returns paragraph numbers with a similarity search but I haven't bridged that to feeding those results into an LLM.
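For concreteness, here is a minimal sketch of that kind of store: one sqlite row per paragraph, carrying its section, paragraph number, and page alongside the vector, so a similarity search returns the reference together with the text. Everything in it is an assumption for illustration: the table layout, the `build_db` and `search` names, and especially `embed`, which is a toy bag-of-words stand-in for an SBERT encoder so the example runs with only the standard library.

```python
import json
import math
import re
import sqlite3
from collections import Counter


def embed(text):
    """Toy stand-in for an SBERT encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_db(chunks):
    """Store (section, paragraph, page, text) tuples with their vectors."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE chunks "
        "(section TEXT, paragraph INTEGER, page INTEGER, text TEXT, vector TEXT)"
    )
    for section, paragraph, page, text in chunks:
        db.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
            (section, paragraph, page, text, json.dumps(embed(text))),
        )
    db.commit()
    return db


def search(db, query, k=2):
    """Return the top-k rows as (score, section, paragraph, page, text)."""
    q = embed(query)
    scored = []
    rows = db.execute("SELECT section, paragraph, page, text, vector FROM chunks")
    for section, paragraph, page, text, vector in rows:
        score = cosine(q, Counter(json.loads(vector)))
        scored.append((score, section, paragraph, page, text))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]
```

The brute-force scan over every row is only reasonable for a small corpus; the point of the sketch is that the reference columns live in the same row as the vector, so nothing has to be reconstructed after the search.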

Am I thinking too hard about this? Are the off-the-shelf RAG solutions able to handle the paragraph numbers without me explicitly cramming them into a database structure? Or am I on the right path, and should I continue with the database that makes sense to me and keep figuring out how to implement the LLM step after the vector search?
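To make the "LLM step" concrete, one possible shape for it is sketched below: put each retrieved paragraph into the prompt with a bracketed reference tag, ask the model to cite those tags, and then check that every tag in the answer matches a paragraph that was actually retrieved. This is only a sketch under assumptions: the dict keys (`section`, `paragraph`, `page`, `text`), the tag format, and the function names are all hypothetical, and the call to the model itself is left out because it depends on whichever backend is used.

```python
import re


def format_ref(hit):
    """Render a retrieved paragraph's location as a bracketed tag."""
    return f"[{hit['section']} para {hit['paragraph']}, p. {hit['page']}]"


def build_prompt(question, hits):
    """Assemble a prompt whose excerpts each start with their reference tag."""
    context = "\n".join(f"{format_ref(h)} {h['text']}" for h in hits)
    return (
        "Answer the question using only the excerpts below. After each claim, "
        "cite the excerpt's bracketed reference exactly as written.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def unknown_refs(answer, hits):
    """List bracketed tags in the answer that match no retrieved paragraph."""
    allowed = {format_ref(h) for h in hits}
    return [ref for ref in re.findall(r"\[[^\]]+\]", answer) if ref not in allowed]
```

The string returned by `build_prompt` would go to the model as-is, and a non-empty result from `unknown_refs` flags an answer whose citations should not be trusted.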

I started by looking at LlamaIndex, then LangChain, and now AutoGen. But my spare time is limited enough that I haven't implemented anything with any of them, only a (successful) SBERT similarity search that didn't use any of these. If someone has an example for structured documents where the Q/A provides cross-references, I'd really appreciate it.

[–] Hey_You_Asked@alien.top 1 points 10 months ago

RemindMe! 1 week