this post was submitted on 26 Nov 2023

LocalLLaMA: community to discuss Llama, the family of large language models created by Meta AI.

Hi,

I'm trying to understand all the stuff you're talking about. I have no ambitions of actually implementing anything. And I'm rather a beginner in the field.

So, a few questions about retrieval-augmented generation (RAG):

I think I understand that RAG means that the shell around the LLM proper (say, the ChatGPT web app) uses your prompt to search a vector database that stores embeddings (vectors in a high-dimensional semantic ("latent") space), retrieves the most relevant embeddings (encoded chunks of documents), and feeds them into the LLM as a (user-invisible) part of the prompt.
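If I sketch my mental model in code, just to check my own understanding, it would look roughly like this (sentence-transformers is just one embedding library picked as an example, and the LLM call at the end is a placeholder, not a real API):

```python
# Rough sketch of my mental model of the flow above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Index time: store each chunk's text alongside its embedding.
chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Shipping normally takes three to five business days.",
    "Returns are accepted within 30 days of purchase.",
]
chunk_vecs = model.encode(chunks, convert_to_tensor=True)

# Query time: embed the question and rank chunks by cosine similarity.
question = "How long is the warranty?"
q_vec = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(q_vec, chunk_vecs)[0]
ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)

# The LLM only ever sees plain text: the retrieved chunks are prepended to the prompt.
context = "\n\n".join(text for text, _ in ranked[:2])
prompt = f"Answer using this context:\n\n{context}\n\nQuestion: {question}"
# answer = call_whatever_llm(prompt)   # placeholder for the actual model call
```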

  1. Why do we need embeddings here? We could use a regular text search, say in Solr, and stuff the prompt with human-readable documents. Is it just because embeddings compress the documents? Is it because the transformer's encoder makes embeddings out of it anyway, so you can skip that step? On the other hand, having the documents user-readable (and usable for regular search in other applications) would be a plus, wouldn't it?

  2. If we get back embeddings from the database, we cannot simply prepend the result to the prompt, can we? Because embeddings are something different from user input, they would need to "skip" the encoder part, right? Or can LLMs handle embeddings in the user prompt, as they seem to be able to handle base64 sometimes? I'm quite confused here, because all those introductory articles seem to say that the retrieval result is prepended to the prompt, but that is only a conceptual view, isn't it?

  3. If embeddings need to "skip" part of the LLM, doesn't that mean that a RAG-enabled system cannot be a mere wrapper around a closed LLM, but that the LLM needs to change its implementation/architecture, if only slightly?

  4. What exactly is so difficult? I'm always reading about how RAG is highly difficult, with chunking and tuning and so on. Is it difficult because the relevance search task is difficult, so that it is just as hard as regular text relevance search with result snippets and faceting and so on, or is there "more" difficulty? Especially re: chunking, what is the conceptual difference between chunking in the vector database and breaking up documents in regular search?

Thank you for taking some time out of your Sunday to answer "stupid" questions!

top 7 comments
[–] frenchguy@alien.top 1 points 11 months ago

Re: 1, we need embeddings because a regular search may not find all relevant documents. Regular text search looks for documents in the corpus that contain the search terms, i.e., the terms used by the user in their question, which may be quite different from the terms present in the documents.

Fuzzy search extends that, synonyms extend it further, as do other "advanced" search and indexing techniques, but still, we don't want to miss any potential loose match.

What we want in the context of RAG is to cast a net as wide as possible. The problem is the context window. If the context window was infinite we would send the entire corpus with each request (and indeed, in the case of a small corpus this is sometimes done).

But for a large corpus, the goal is to find the largest possible set of relevant documents (or document chunks) that will fit in the context window. Embeddings are the best solution for this.
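As a rough illustration of that "fill the window" step (the similarity scores, the budget, and the word-count-as-token-count shortcut are all made up for the example):

```python
# Greedily pack the highest-scoring chunks until the context budget is spent.
# `ranked_chunks` is assumed to be (text, similarity) pairs from the vector search;
# the word count is a crude stand-in for a real tokenizer.

def pack_context(ranked_chunks, budget_tokens=2000):
    selected, used = [], 0
    for text, score in sorted(ranked_chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())      # rough token estimate
        if used + cost > budget_tokens:
            continue                  # this chunk no longer fits; try smaller ones
        selected.append(text)
        used += cost
    return "\n\n".join(selected)

ranked_chunks = [("a long chunk about warranties ...", 0.82),
                 ("a chunk about shipping times ...", 0.41)]
context = pack_context(ranked_chunks, budget_tokens=100)
```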

[–] InevitablePressure63@alien.top 1 points 11 months ago (1 children)

First I need to clarify that in RAG the input to the model is the original natural-language documents, not vectors. Embeddings are just one method to achieve retrieval; you can use any method you like, including string-based search, or mix multiple approaches. Either way, it is just a way to enrich your prompt with relevant context: it changes the prompt, not the model architecture.
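For example, here is a crude sketch of mixing a string-based score with a vector score (the scoring functions and the 50/50 weighting are purely illustrative, not a recommended recipe):

```python
# Combine a simple keyword-overlap score with a precomputed vector-similarity score.

def keyword_score(query: str, doc: str) -> float:
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_rank(query, docs, vector_scores, alpha=0.5):
    """vector_scores: one cosine similarity per doc, from whatever vector search you use."""
    scored = [
        (doc, alpha * keyword_score(query, doc) + (1 - alpha) * v)
        for doc, v in zip(docs, vector_scores)
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

docs = ["warranty terms and conditions", "shipping times and costs"]
print(hybrid_rank("how long is the warranty", docs, vector_scores=[0.8, 0.3]))
```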

[–] InevitablePressure63@alien.top 1 points 11 months ago
  1. Yes, because the relevance search itself is difficult. Different chunking strategies can lead to different accuracy, and there are many papers discussing new methods to improve it. The retrieved result also has to be in a form the model can use well to get the best performance.
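For instance, one simple chunking strategy is fixed-size windows with some overlap, so that a fact split across a boundary still shows up whole in at least one chunk (the sizes here are arbitrary):

```python
# Fixed-size chunking with overlap; sizes are arbitrary example values.

def chunk_words(text: str, size: int = 200, overlap: int = 50):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

doc = "word " * 500
print(len(chunk_words(doc)))   # 3 overlapping 200-word chunks for this toy input
```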
[–] Inevitable-Highway85@alien.top 1 points 11 months ago

I recommend the courses from learn.deeplearning.ai; I learned the basics of RAG there.

[–] MLRS99@alien.top 1 points 11 months ago (1 children)

Any ideas for getting better-quality responses with document search/retrieval?

I've done some experiments with LangChain and Embedchain, and I must say that the quality of answers when querying with embedded documents is lacking.

Do I need to use better prompting? If so, what kind of technique is best?

Often I get responses like "found nothing", etc.

[–] __SlimeQ__@alien.top 1 points 11 months ago (1 children)

Let me start off by saying I haven't gotten this right yet. But still...

When AutoGPT went viral, the thing everyone started talking about was vector DBs and how they can magically extend the context window. This was not a very well-informed idea, and implementations have been lacking.

It turns out that merely finding similar messages in the history and dumping them into the context is not enough. While this may sometimes give you a valuable nugget, most of the time it will just fill the context with repetitive garbage.

What you really need for this to work imo, is a structured narrative around finding the data, reading it, and reporting the data. LLMs respond extremely poorly to random, disconnected dialogue. They don't know what to do with it. So for one thing, you'll need a reasonable amount of pre-context for each data point so that the bot can even understand what's being talked about. But now this is prohibitively long, 4 or 5 matches on your search and your context is probably full. So you'll need to do some summarizing before squeezing it into the live conversation, which means your request takes 2x longer, at a minimum, and then you need to weave that into your chat context in as natural a way as possible.
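Roughly the shape I have in mind, as a sketch only (llm() stands in for whatever model you're calling, and the prompts are made up):

```python
# Two-pass idea: summarize each retrieved hit (with its pre-context) against the
# question first, then weave the compressed notes into the live chat context.

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for whatever model/API you're using

def summarize_hit(question: str, hit_with_precontext: str) -> str:
    return llm(
        f"Question: {question}\n\nPassage:\n{hit_with_precontext}\n\n"
        "Summarize only what in this passage is relevant to the question."
    )

def answer(question: str, retrieved_hits: list[str], chat_history: str) -> str:
    notes = [summarize_hit(question, h) for h in retrieved_hits]  # first pass (slow)
    woven = (
        f"{chat_history}\n\n"
        "(Notes recalled from earlier material:\n- " + "\n- ".join(notes) + ")\n\n"
        f"User: {question}\nAssistant:"
    )
    return llm(woven)  # second pass: the actual reply
```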

Honestly, RAG as a task is so weird that I no longer expect any general model to be capable of it. Especially not 7B/13B. Even GPT-4 can only barely do it. I think with a very clever dataset somebody could make an effective RAG LoRA, but I've yet to see it.

[–] soba-yokai@alien.top 1 points 11 months ago

Have you seen the Self-RAG architecture that came out recently? I'm curious what you'd think of it.