SatoshiNotMe


There are many great resources to learn about LLMs, but I haven't seen any good LLM quizzes. I made a quiz on LLM basics for absolute beginners:

https://docs.google.com/forms/d/e/1FAIpQLScbWN3qwqeIc0b1cCRqm7y8dP4hUQE6WySmqcTVxyVxruwdoA/viewform

(It's a Google Form, and emails are not collected.)

If you're already deep in LLMs, this is not for you. For anyone starting to learn about LLMs, this might be a good way to test your understanding.

They are multiple-choice and true/false questions. I've tried to create nuanced questions that test how solid your understanding is. They can trip up those whose conceptual LLM knowledge is shaky, and trying to answer them well can help clarify your understanding. Hope this is useful for LLM beginners.

(Incidentally, I made this as a class assignment for an "Intro to LLMs" guest lecture I gave recently)

[–] SatoshiNotMe@alien.top 1 points 9 months ago

That was exactly my thought! In Langroid (the agent-oriented LLM framework from ex-CMU/UW-Madison researchers), we call it Relevance Extraction: given a passage and a query, use the LLM to extract only the portions relevant to the query. In a RAG pipeline where you optimistically retrieve the top-k chunks (to improve recall), the chunks could be large and hence contain irrelevant/distracting text. We do relevance extraction from these k chunks concurrently: https://github.com/langroid/langroid/blob/main/langroid/agent/special/doc_chat_agent.py#L801

One thing often missed here is the unnecessary cost (latency and token cost) of parroting out verbatim text from the context. In Langroid we use a numbering trick to mitigate this: pre-annotate the passage sentences with numbers, and ask the LLM to simply specify the relevant sentence numbers. We have an elegant implementation of this in our RelevanceExtractorAgent using tools/function-calling.
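A stripped-down sketch of the numbering trick (hypothetical prompt wording and helper names, not Langroid's actual implementation):

```python
# Hypothetical sketch: annotate sentences with indices, ask the LLM for the
# relevant indices only, then map them back to verbatim sentences.
import re

def number_sentences(passage: str) -> tuple[str, list[str]]:
    """Naive sentence split + numbering (the real code is more careful)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))
    return numbered, sentences

def extraction_prompt(numbered_passage: str, query: str) -> str:
    return (
        "Below is a numbered passage and a query.\n"
        "Reply ONLY with the numbers of the sentences relevant to the query, "
        "comma-separated (e.g. '2,5'), or 'NONE'.\n\n"
        f"PASSAGE:\n{numbered_passage}\n\nQUERY: {query}"
    )

def sentences_from_reply(reply: str, sentences: list[str]) -> list[str]:
    """Map the LLM's '2,5'-style reply back to the original sentences."""
    idxs = [int(tok) for tok in re.findall(r"\d+", reply)]
    return [sentences[i - 1] for i in idxs if 1 <= i <= len(sentences)]
```

The LLM's output is then just a handful of digits instead of whole sentences, which is where the latency and cost savings come from.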

Here's a post I wrote comparing Langroid's method with LangChain's naive equivalent of relevance extraction, `LLMChainExtractor.compress_documents`, and no surprise, Langroid's method is far faster and cheaper:
https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/

If I had the time, the next steps would have been: 1. give it a fancy name, 2. post it on arXiv with a bunch of experiments. But I'd rather get on with building 😄

[–] SatoshiNotMe@alien.top 1 points 9 months ago (1 children)

You mean we don't need to use llama-cpp-python anymore to serve this at an OpenAI-compatible endpoint?

[–] SatoshiNotMe@alien.top 1 points 10 months ago

A bit related. I think all the tools mentioned here are for using an existing UI.

But what if you want to easily roll your own, preferably in Python? I know of some options:

- Streamlit https://streamlit.io
- Gradio https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks
- Panel https://www.anaconda.com/blog/how-to-build-your-own-panel-ai-chatbots
- Reflex (formerly Pynecone) https://github.com/reflex-dev/reflex-chat https://news.ycombinator.com/item?id=35136827
- Solara https://news.ycombinator.com/item?id=38196008 https://github.com/widgetti/wanderlust

I like Streamlit (simple, but not very versatile), and Reflex seems to have a richer set of features.
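To give a sense of how little code the simplest route takes, here's a minimal Streamlit chat sketch (a hypothetical echo bot standing in for a real LLM call; assumes streamlit >= 1.24 for the chat elements):

```python
# minimal_chat.py -- hypothetical, minimal Streamlit chat UI; run with:
#   streamlit run minimal_chat.py
import streamlit as st

st.title("Roll-your-own chat UI")

# keep chat history across Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# replay the history so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Say something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # swap this echo for your LLM / agent call
    reply = f"You said: {prompt}"
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```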

My questions: which of these do people like to use the most? And are the tools mentioned by OP also good for rolling your own UI on top of your own software?

 

(Also posted on r/ML.) I know this is the LocalLlama forum, but there has been some interest in the OpenAI Assistants API, so here goes. And of course a similar API service could be implemented using open/local models, so it's not entirely irrelevant!

Given the OpenAI Assistants API released last week, a natural next question was — how can we have several assistants work together on a task?

This was a perfect fit for the Langroid multi-agent framework (which already works with the completions API and any other local/remote LLM).

For those interested in the details of how to work with this API, I wanted to share how we built near-complete support for all Assistant features into the Langroid agent framework:

https://github.com/langroid/langroid/blob/main/langroid/agent/openai_assistant.py

We created an OpenAIAssistant class derived from ChatAgent. In Langroid you wrap a ChatAgent in a Task object to enable a multi-agent interaction loop, and now the same can be done with an OpenAIAssistant object.

I made a Colab notebook which gradually builds up from simple examples to a two-agent system for structured information extraction from a document:

https://colab.research.google.com/drive/190Tk7t4AdY1P9F_NlZ33-YEoGnHweQQ0

Our implementation supports function-calling and tools (retrieval/RAG, code interpreter). For the code interpreter, we capture the code logs and display them in the interaction.

We leverage persistent threads and assistants by caching their ids, keyed on username + machine + org, so that a later session can resume a previous thread + assistant. This is perhaps a simplistic implementation; I'm sure there are better ideas here.
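The id-caching idea itself is nothing Assistants-specific; a minimal sketch of the kind of thing I mean (file-based cache keyed on username + machine + org; the details here are assumptions, not Langroid's actual code):

```python
# Hypothetical sketch: cache assistant/thread ids so a later session can resume them.
import getpass
import hashlib
import json
import socket
from pathlib import Path
from typing import Optional

CACHE_FILE = Path.home() / ".assistant_id_cache.json"

def cache_key(org: str) -> str:
    """Stable key derived from username + machine + org."""
    raw = f"{getpass.getuser()}|{socket.gethostname()}|{org}"
    return hashlib.sha256(raw.encode()).hexdigest()

def load_ids(org: str) -> Optional[dict]:
    """Return {'assistant_id': ..., 'thread_id': ...} if this user/machine/org was seen before."""
    if CACHE_FILE.exists():
        cache = json.loads(CACHE_FILE.read_text())
        return cache.get(cache_key(org))
    return None

def save_ids(org: str, assistant_id: str, thread_id: str) -> None:
    """Persist the ids for the next session."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    cache[cache_key(org)] = {"assistant_id": assistant_id, "thread_id": thread_id}
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```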

A key feature that is currently disabled is caching: this is turned off because storing Assistant responses in threads is not allowed by the API.

In any case, hope this is useful to some folks, as I’ve seen a lot of questions about this API in various forums.

[–] SatoshiNotMe@alien.top 1 points 10 months ago

Langroid has a DocChatAgent, you can see an example script here -

https://github.com/langroid/langroid-examples/blob/main/examples/docqa/chat.py

Every generated answer is accompanied by a Source (doc link or local path) and an Extract (the first few and last few words of the reference; I avoid quoting the whole sentence to save on token costs).
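(The Extract is just a cheap truncation; a hypothetical sketch of the idea, not the actual code:)

```python
# Hypothetical sketch: quote only the first and last few words of the supporting
# sentence to keep token costs down.
def shorten_extract(sentence: str, n_words: int = 3) -> str:
    words = sentence.split()
    if len(words) <= 2 * n_words:
        return sentence
    return " ".join(words[:n_words]) + " ... " + " ".join(words[-n_words:])

# shorten_extract("Giraffes are tall animals that eat mostly leaves from trees")
# -> "Giraffes are tall ... leaves from trees"
```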

There are other variants of RAG scripts in that same folder, like multi-agent RAG (doc-chat-2.py), where a master agent delegates smaller questions to a retrieval agent and asks in different ways if it can't answer, etc. There's also doc-chat-multi-llm.py, where the master agent can be powered by GPT-4 and the RAG agent by a local LLM (after all, it only needs to do extraction and summarization).

[–] SatoshiNotMe@alien.top 1 points 10 months ago

> intuitively it seems like you might be able to avoid calling a model at all b/c shouldn't the relevant sentences just be closer to the search

Not really, as I mention in my reply to u/jsfour above: embeddings will give you similarity to the query, whereas an LLM can identify relevance to answering the query. Specifically, embeddings won't be able to resolve cross-references (e.g. "Giraffes are tall. They eat mostly leaves."), and won't be able to zoom in on answers -- e.g. the President Biden question I mention there.

 

I came across this interesting problem in RAG, what I call Relevance Extraction.

After retrieving relevant documents (or chunks), these chunks are often large and may contain several portions irrelevant to the query at hand. Stuffing the entire chunk into an LLM prompt impacts token cost as well as response accuracy (distracting the LLM with irrelevant text), and can also cause you to bump into context-length limits.

So a critical step in most pipelines is Relevance Extraction: use the LLM to extract verbatim only the portions relevant to the query. This is known by other names, e.g. LangChain calls it Contextual Compression, and the RECOMP paper calls it Extractive Compression.
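Concretely, the straightforward version is just a prompt along these lines (hypothetical wording, not any library's actual prompt):

```python
# Hypothetical "parrot" baseline: ask the LLM to copy the relevant text out verbatim.
def parrot_prompt(passage: str, query: str) -> str:
    return (
        "From the passage below, copy out VERBATIM only the sentences that are "
        "relevant to the query. If none are relevant, reply 'NONE'.\n\n"
        f"PASSAGE:\n{passage}\n\nQUERY: {query}"
    )
```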

Thinking about how best to do this, I realized it is highly inefficient to simply ask the LLM to "parrot" out the relevant portions of the text: this is obviously slow, consumes valuable token-generation space, can cause you to bump into context-length limits, and of course is expensive (e.g. for GPT-4 we know generation is 6c/1k tokens vs an input cost of 3c/1k tokens).

I realized the best way (or at least a good way) to do this is to number the sentences and have the LLM simply return the relevant sentence numbers. Langroid's multi-agent + function-calling architecture allows an elegant implementation of this in the RelevanceExtractorAgent: the agent annotates the docs with sentence numbers and instructs the LLM to pick out the sentence numbers relevant to the query (rather than whole sentences) via a function call (SegmentExtractTool); the agent's function handler then interprets this message and pulls out the indicated sentences by number. To extract from a set of passages, Langroid automatically does this async and concurrently, so latencies in practice are much, much lower than with the sentence-parroting approach.
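The concurrency piece is plain asyncio; here's a hypothetical sketch (the names and the sentence splitting are placeholders, not Langroid's actual code):

```python
# Hypothetical sketch: run relevance extraction over the k retrieved chunks concurrently.
import asyncio

async def llm_pick_sentence_numbers(numbered_chunk: str, query: str) -> list[int]:
    # stand-in for the async LLM call + SegmentExtractTool handling; returns e.g. [2, 5]
    return [1]

async def extract_relevant(chunk: str, query: str) -> list[str]:
    sentences = chunk.split(". ")  # placeholder sentence split
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))
    idxs = await llm_pick_sentence_numbers(numbered, query)
    return [sentences[i - 1] for i in idxs if 1 <= i <= len(sentences)]

async def extract_from_chunks(chunks: list[str], query: str) -> list[list[str]]:
    # one concurrent LLM call per chunk, so wall-clock time is roughly the slowest call
    return await asyncio.gather(*(extract_relevant(c, query) for c in chunks))

# results = asyncio.run(extract_from_chunks(top_k_chunks, query))
```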

[FD -- I am the lead dev of Langroid]

I thought this numbering idea was fairly obvious in theory, so I looked at LangChain's equivalent, LLMChainExtractor.compress_documents (they call this Contextual Compression), and was surprised to see it uses the simple "parrot" method, i.e. the LLM writes out whole sentences verbatim from its input. I thought it would be interesting to compare Langroid vs LangChain; you can see the comparison in this Colab.

On the specific example in the notebook, with GPT-4, the Langroid numbering approach is 22x faster (LangChain takes 145 secs vs under 7 secs for Langroid) and 36% cheaper (~900 output tokens with LangChain vs ~40 with Langroid) than LangChain's parrot method (I promise this name is not inspired by their logo :)

I wonder if anyone has thoughts on relevance extraction, or other approaches. At the very least, I hope Langroid's implementation is useful to you -- you can use DocChatAgent.get_verbatim_extracts(query, docs) as part of your pipeline, regardless of whether you are using Langroid for your entire system or not.