[R] System 2 Attention (is something you might need too) (alien.top)

submitted 2 years ago by APaperADay@alien.top to c/machinelearning@academy.garden


Paper: https://arxiv.org/abs/2311.11829

Abstract:

Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next token generations. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinion or irrelevant information, QA, math word problems and longform generation, where S2A increases factuality and objectivity, and decreases sycophancy.
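The two-step flow the abstract describes (regenerate the context, then answer from the regenerated context only) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the prompt wording, the `s2a_answer` name, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompts are not reproduced here.
REWRITE_PROMPT = (
    "Rewrite the following text, keeping only the parts that are relevant "
    "and unbiased for answering the question in it. Do not answer it.\n\n"
    "Text:\n{context}"
)
ANSWER_PROMPT = "{context}\n\nAnswer the question above."


def s2a_answer(context: str, llm: Callable[[str], str]) -> str:
    """Two-pass sketch: regenerate the context, then answer from it alone."""
    # Pass 1: ask the model to regenerate the context without distractors.
    cleaned = llm(REWRITE_PROMPT.format(context=context))
    # Pass 2: answer using only the regenerated context; the original
    # context is deliberately not passed along.
    return llm(ANSWER_PROMPT.format(context=cleaned))
```

The point of the second call seeing only the regenerated text is that whatever was dropped in the first pass cannot influence the final answer.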

[–] KakaTraining@alien.top 1 points 2 years ago (3 children)

The method in the paper is indeed simple and effective: removing irrelevant information through prompting. But is it necessary to dress up this simple method with a fancy neuroscience term?

[–] SnooHesitations8849@alien.top 1 points 2 years ago

I hate that part and, at the same time, enjoy it.

[–] reverendCappuccino@alien.top 1 points 2 years ago

Well, it's more of a psychological term, and "attention" is already there to illustrate the intended meaning of a dot product. The analogy holds up, so why doubt the validity of using "System 2 Attention" rather than that of using "attention" at all?

[–] SatoshiNotMe@alien.top 1 points 2 years ago

That was exactly my thought! In Langroid (the agent-oriented LLM framework from ex-CMU/UW-Madison researchers), we call it Relevance Extraction — given a passage and a query, use the LLM to extract only the portions relevant to the query. In a RAG pipeline where you optimistically retrieve top k chunks (to improve recall), the chunks could be large and hence contain irrelevant/distracting text. We concurrently do relevance extraction from these k chunks: https://github.com/langroid/langroid/blob/main/langroid/agent/special/doc_chat_agent.py#L801
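As a rough illustration of running extraction over the k retrieved chunks concurrently, here is a sketch using `asyncio.gather`. It is not Langroid's code: `extract_from_chunks` and the `extract` coroutine (assumed to take a query and a chunk and return the relevant text, or an empty string) are hypothetical names for this example.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature: (query, chunk) -> relevant text, "" if nothing relevant.
Extractor = Callable[[str, str], Awaitable[str]]


async def extract_from_chunks(
    query: str, chunks: List[str], extract: Extractor
) -> List[str]:
    # Launch one extraction per chunk and wait for all of them;
    # gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(extract(query, c) for c in chunks))
    # Drop chunks from which nothing relevant was extracted.
    return [r for r in results if r.strip()]
```

With k independent LLM calls in flight at once, the wall-clock cost is closer to one call than to k sequential calls, assuming the backend allows that much parallelism.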
One thing often missed here is the unnecessary cost (latency and token cost) of parroting out verbatim text from the context. In Langroid we use a numbering trick to mitigate this: pre-annotate the passage sentences with numbers, and ask the LLM to simply specify the relevant sentence numbers. We have an elegant implementation of this in our RelevanceExtractorAgent using tools/function-calling.
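To make the numbering trick concrete, here is a small standalone sketch. It is not the RelevanceExtractorAgent code: the `[n]` tag format, the naive sentence splitter, the `"1,3-5"` reply format, and all function names are assumptions chosen for illustration.

```python
import re
from typing import List


def split_sentences(passage: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def annotate(sentences: List[str]) -> str:
    # Prefix each sentence with a 1-based tag such as [1]; this is what
    # the LLM would be shown instead of the raw passage.
    return " ".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))


def parse_numbers(spec: str, n: int) -> List[int]:
    # Accept replies like "1,3-5"; skip malformed parts and clamp to 1..n.
    picked = set()
    for part in spec.split(","):
        m = re.fullmatch(r"\s*(\d+)\s*(?:-\s*(\d+)\s*)?", part)
        if not m:
            continue
        lo = int(m.group(1))
        hi = int(m.group(2)) if m.group(2) else lo
        picked.update(range(max(lo, 1), min(hi, n) + 1))
    return sorted(picked)


def extract(sentences: List[str], spec: str) -> str:
    # Rebuild the relevant text locally from the numbers the LLM returned.
    return " ".join(sentences[i - 1] for i in parse_numbers(spec, len(sentences)))
```

The LLM's output is then a handful of digits rather than whole sentences, and the verbatim text is recovered on the client side by index lookup.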

Here's a post I wrote comparing Langroid's method with LangChain's naive equivalent of relevance extraction, `LLMChainExtractor.compress`; no surprise, Langroid's method is far faster and cheaper:
https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/

If I had the time, the next steps would have been: 1. give it a fancy name, 2. post on arxiv with a bunch of experiments, but I'd rather get on with building 😄