this post was submitted on 30 Nov 2023
Machine Learning
In general? Because deep RL is a skyscraper made of sticks and glue. Nothing, and I mean nothing, is actually guaranteed to work or has any kind of theoretical foundation at all. There are guarantees for toy problems, but everything past that is the wild West. In practice it's janky in a way no other field of ML is.
The standard way of learning value functions is to use the Temporal Difference update. Except we've known since the 1990s that this doesn't really work -- sometimes the solutions diverge, and there's no known way of ensuring the neural net weights won't all go to infinity. In practice this means that authors will frequently do multiple runs and only report the runs where the weights don't explode. Even if your weights don't explode, in general policy classes are not expressive enough to learn the optimal max-entropy policy, and even if they are, the loss isn't convex. It's possible to learn the right value function and still not be able to recover the optimal policy.
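To make the divergence concrete, here is a minimal sketch of the well-known two-state textbook example with linear function approximation. It is not taken from the comment above, and the function name, step size, and discount are assumptions chosen for illustration. The two states have features 1 and 2, the reward is 0, and only the transition from the first state to the second is ever updated (as can happen off-policy). Each update then multiplies the weight by 1 + alpha * (2 * gamma - 1), which grows without bound whenever gamma > 0.5.

```python
def semi_gradient_td_weights(gamma=0.9, alpha=0.1, w0=1.0, steps=100):
    """Repeat the semi-gradient TD(0) update on one fixed transition.

    Assumed setup: value(s) = 1 * w, value(s') = 2 * w, reward = 0.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: r + gamma * v(s') - v(s)
        td_error = 0.0 + gamma * (2.0 * w) - 1.0 * w
        # Semi-gradient step: the gradient of v(s) with respect to w is the feature, 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


history = semi_gradient_td_weights()
print(history[0], history[-1])  # the weight keeps growing at this discount
```

With a discount below 0.5 the same loop shrinks the weight toward zero instead, so whether the update is stable depends on the interaction between the features, the discount, and which transitions get sampled.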
And even granted that those don't cause issues, you have to have an exploration strategy. Exploration is far and away the hardest problem in machine learning. You have to reason about the expected value of places where you don't have data to make estimates. And even when you do have estimates, none of your data is i.i.d. It's basically impossible to do any kind of normal statistics to solve exploration. If you look into the literature on exploration and online learning, you'll instead find some incredibly unusual math, most frequently involving an algorithm called Mirror Descent that does gradient descent in non-Euclidean geometry. But even that's really only usable for toy problems right now. The only viable strategy for real problems is trial and error.
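For a feel of what Mirror Descent looks like in its simplest form, here is a hedged sketch of the exponentiated-gradient (Hedge) update, which is mirror descent on the probability simplex with negative entropy as the mirror map. This is a full-information toy, not an exploration algorithm for RL; the loss values, learning rate, and function names are made up for illustration.

```python
import math


def hedge_step(weights, losses, eta):
    """One entropic mirror descent step: multiplicative update, then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def run_hedge(loss_rounds, eta=0.5):
    """Start from the uniform distribution and apply one step per round of losses."""
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n
    for losses in loss_rounds:
        weights = hedge_step(weights, losses, eta)
    return weights


# Hypothetical losses for three actions, repeated for 50 rounds.
loss_rounds = [[1.0, 0.2, 0.7]] * 50
final_weights = run_hedge(loss_rounds)
print(final_weights)  # mass concentrates on the lowest-loss action
```

The "non-Euclidean" part is that the step is taken in the log-weight space and mapped back to the simplex by normalization, rather than subtracting a gradient and projecting.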
Model-based RL has been looking a little more stable over the last year. DreamerV3 and TD-MPC2 claim to be able to train on hundreds of tasks with no per-task hyperparameter tuning, and report smooth loss curves that scale predictably.
Have to wait and see if it pans out though.
I think this is overstating the contribution of these kinds of works. They still learn a Q-function via Mean-Squared Bellman Error, which means they're subject to the same kind of instability in the value function as DDPG. They use a maximum entropy exploration method on the policy, which doesn't come with exploration efficiency guarantees (at least not ones that are anywhere near optimal). The issue is that RL is extremely implementation-dependent. You can correctly implement an algorithm that got great results in a paper and have it still crash and burn.
At a basic level, the issue is that we just don't have sound theory for extending RL to continuous non-linear MDPs. You can try stuff, but it's all engineers' algorithms, not mathematicians' algorithms -- you have no idea if or when it'll all break down, and if it does all break down, they're not gonna tell you that in the paper. Fundamentally we need theoretical work showing how to correctly solve these kinds of problems, and that's a problem that these experimentally-focused papers are not attempting to address.
Progress requires directly addressing these issues. In my opinion, that's most likely to come through theoretically-driven work. For the value-divergence problem, that means Gradient Temporal Difference algorithms and their practical extensions (such as TD with Regularized Corrections). For exploration, that means using insights from online learning, like best-of-both-worlds algorithms that give a clear "exploration objective" that policies can optimize.
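As a rough illustration of the gradient-TD idea, here is a sketch of the linear TDC update (TD with gradient correction), one member of the Gradient Temporal Difference family. It is run on an assumed toy problem: a single transition with features 1 and 2 and reward 0, where the plain semi-gradient TD update multiplies the weight by 1 + alpha * (2 * gamma - 1) per step and so blows up for gamma > 0.5. The step sizes and names are assumptions, and the extra regularization that TD with Regularized Corrections applies to the secondary weights is not shown.

```python
def tdc_on_single_transition(gamma=0.9, alpha=0.1, beta=0.5, theta0=1.0, steps=200):
    """Linear TDC on one repeated transition (assumed toy setup).

    theta is the value weight; w is the secondary weight that estimates
    the expected TD error given the current features.
    """
    phi, phi_next, reward = 1.0, 2.0, 0.0
    theta, w = theta0, 0.0
    for _ in range(steps):
        td_error = reward + gamma * phi_next * theta - phi * theta
        # Correction term that plain semi-gradient TD leaves out.
        correction = gamma * phi_next * (phi * w)
        new_theta = theta + alpha * (td_error * phi - correction)
        # Secondary weights track the TD error, typically with a larger step size.
        w = w + beta * (td_error - phi * w) * phi
        theta = new_theta
    return theta


theta_final = tdc_on_single_transition()
print(theta_final)  # shrinks toward 0, the fixed point for this transition
```

In this toy, stability depends on the secondary step size being large enough relative to the primary one, which is part of why these methods are harder to tune than the plain TD update.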
Active inference is looking like a viable alternative, but it's not that mature yet.
Damn, no wonder my pet RL project barely ever worked