I've been hearing Q* = Q-learning + A* (search algorithm).
Trying to make some sense of it, so let me know what I missed or got wrong.
Here's what I know: it's supposed to improve language model decoding.
-
Q-learning is a form of model-free reinforcement learning where an agent learns to maximize a cumulative reward. When applied to language models, the actions could be the selection of tokens, with the reward being the effectiveness of the generated response; see the first sketch after this list.
-
A* is an informed search algorithm, or a best-first search, which uses heuristics to estimate the best path to the goal. In language generation, the goal could be the most coherent and contextually relevant completion (chat response); see the second sketch below.
-
Beam Search in Decoding: this method, already used in LLMs, looks at a set of candidate next sequences instead of just the single most likely next token (third sketch below).
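To make the Q-learning part concrete, here's a minimal tabular Q-learning update. This is just a toy sketch (the names like q_update are mine, and nothing here is LLM-specific):

```python
from collections import defaultdict

# Toy tabular Q-learning: Q[(state, action)] estimates the cumulative future
# reward of taking `action` in `state` and acting well afterwards.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.99   # learning rate, discount factor

def q_update(state, action, reward, next_state, next_actions):
    # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

In the LLM framing from this post, the "state" would be the tokens generated so far and the "action" the next token; what the reward actually is would be the speculative part.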
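And here's the A* side in its classic graph-search form, again just a sketch (neighbors, cost, and the heuristic h are whatever the problem supplies):

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, cost, h):
    # Classic A*: expand the node with the lowest f = g + h, where g is the
    # cost accumulated so far and h is a heuristic estimate of cost to the goal.
    tie = count()   # tie-breaker so the heap never has to compare nodes directly
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in best_g and best_g[node] <= g:
            continue
        best_g[node] = g
        for nxt in neighbors(node):
            g2 = g + cost(node, nxt)
            heapq.heappush(frontier, (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None
```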
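For comparison, this is roughly what beam search over tokens looks like. The log_probs(seq) call is a stand-in I made up for "ask the model for candidate next tokens and their log probabilities":

```python
def beam_search(log_probs, start_token, beam_width=4, max_len=20, eos=None):
    # Keep the `beam_width` highest-scoring partial sequences at each step,
    # scored by summed log probability, instead of committing to one greedy token.
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if eos is not None and seq[-1] == eos:
                candidates.append((score, seq))        # finished beam carries over
                continue
            for token, logp in log_probs(seq):         # candidate tokens + log probs
                candidates.append((score + logp, seq + [token]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams
```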
In a hypothetical Q* approach:
-
Informed Token Selection: It could use heuristics, based on context and language understanding, to guide the selection of token sequences.
-
Maximizing Future Reward: Like Q-learning, it would aim to maximize a future reward, potentially based on coherence, relevance, or user engagement with the generated text.
-
Beyond Simple Probability Multiplication: Rather than merely multiplying probabilities of token sequences, it could evaluate sequences based on a combined heuristic and reward-based framework (rough sketch at the end of the post).
In theory this could lead to more effective, contextually relevant text generation, especially in scenarios that require a balance between creativity and specific guidelines or objectives.
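Putting the pieces above together, the speculation could look something like scoring candidate continuations by log probability plus a learned value/heuristic term, instead of log probability alone. To be clear, this is purely my illustration of the idea; the value_estimate function and the mixing weight lam are made up, not anything confirmed about Q*:

```python
def guided_step(log_probs, value_estimate, beams, beam_width=4, lam=1.0):
    # One beam-search step where candidates are ranked by summed log probability
    # plus a learned value / heuristic estimate of the partial sequence, i.e. the
    # "maximize future reward" / A*-style h term, rather than probability alone.
    candidates = []
    for logp_sum, seq in beams:
        for token, logp in log_probs(seq):
            new_seq = seq + [token]
            new_logp = logp_sum + logp
            score = new_logp + lam * value_estimate(new_seq)
            candidates.append((score, new_logp, new_seq))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(lp, seq) for _, lp, seq in candidates[:beam_width]]
```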
Credit to u/mrleibniz: https://www.youtube.com/watch?v=PtAIh9KSnjo&t=3755s&pp=2AGrHZACAQ%3D%3D