Very cool paper.
Great question, curious about the answer myself.
I think it’s pretty cool that just iteratively reusing an LLM without additional training, i.e. chaining prompts, improves quality across most of these methods. I see quite a few papers like this (e.g. System 2 Attention).
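To make that concrete, here’s a rough sketch of the two-pass pattern from the S2A paper (call_llm is a hypothetical wrapper around whatever completion API you use, not a real client):

    # Hypothetical wrapper around a completion API; plug in your own client.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def s2a_answer(context: str, question: str) -> str:
        # Pass 1: have the model rewrite the context, keeping only what is
        # relevant to the question (the System 2 Attention trick).
        cleaned = call_llm(
            "Rewrite this context, keeping only the parts relevant to the "
            f"question.\nContext: {context}\nQuestion: {question}"
        )
        # Pass 2: answer from the cleaned context alone.
        return call_llm(f"Context: {cleaned}\nQuestion: {question}\nAnswer:")

No gradients anywhere, just two forward passes chained together.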
The Promptbreeder paper has some benchmarking of these methods & proposes an interesting evolutionary prompting strategy.
But like you, I’ve been looking and waiting for the papers that specifically explore fine-tuning the model “nodes”, perhaps with LoRA, or with a meta-network or hypernetwork.
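The LoRA half of that is at least easy to picture. A bare-bones adapter layer looks something like this (a sketch in PyTorch; r and alpha follow the LoRA paper’s convention, everything else is illustrative):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Freeze a base Linear and learn a low-rank update B @ A, so the
        # effective weight becomes W + (alpha / r) * B @ A.
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # only the adapter trains
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

You could imagine giving each prompt-chain “node” its own adapter like this and training them jointly.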
Yeah, so largely I think you’ve hit the nail on the head, but just in case you don’t know: the fervour is over a deliberately leaked project name, “Q*”, and the suggestion that it precipitated the OpenAI board drama. Now, is this probably a tactic to keep prices high so stock sells at the $65B valuation OAI had prior to the drama? Sure.
But it’s still fun to speculate.
If you don’t know how many of the 3000 events are detectable by an expert, how do you know your 60% classifier isn’t better than an expert already?
OK, so full speculation: this project could be an implementation of Q-learning (i.e. model-free reinforcement learning) on an internal GPT model. This would no doubt be an agent model.
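For reference, the textbook tabular update is tiny; an LLM-scale version would swap the table for a network, but the rule is the same:

    from collections import defaultdict

    # Q maps state -> {action: value}; Q-learning's one-step update is
    #   Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q = defaultdict(lambda: defaultdict(float))

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        best_next = max(Q[s_next][a2] for a2 in actions)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])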
Other evidence? The * suggests a graph-search algorithm in the vein of A*, which obviously plays a huge role in RL exploration; but GPT models are also already doing their own graph traversal via beam search for next-token prediction.
Are they perhaps hooking up an RL-trained model to replace their beam search?
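To make the graph-traversal framing concrete, beam search is just pruned best-first expansion over token sequences. A toy sketch (next_token_logprobs is a hypothetical stand-in for one forward pass of the model):

    def beam_search(next_token_logprobs, start, width=4, steps=10):
        beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
        for _ in range(steps):
            candidates = []
            for seq, score in beams:
                for tok, lp in next_token_logprobs(seq):
                    candidates.append((seq + [tok], score + lp))
            # keep only the `width` best partial sequences
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
        return beams

In this picture, an RL-trained value model would replace the raw log-prob pruning with learned estimates of which branches are worth expanding.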
Well, now I feel almost obligated to click: is the “deep dive” part of the title completely misleading, or is the post really just a LoRA explanation?
I’d add a few others to this list, but I largely agree with the premise that we focus too much on attention. We lavish praise on the Transformer, yet so much extra machinery has to go into it to make it work even a little bit; and now papers are coming out claiming ConvNets scale at the same rate, and the RetNet paper claims you can swap out attention altogether.
Obv. the issue is “emergence” (a terrible term, but I mean non-linear training performance) and the sheer cost of testing permutations of LLM architecture at scale. To what extent has the ML community become a victim of sunk cost?
Elron's track record with product announcements is so woeful that, statistically speaking, you'd be better off assuming Grok has zero chance of being open-sourced now that it's been announced.