As another commenter has pointed out, (2) is an active area of research; it's much easier to experiment with sampling at decoding time because the model is typically fixed, so there's no retraining involved.
For your example, I believe nucleus sampling would address it, since the probability of the correct token should be very high (although I've only read cursory summaries, not the paper/implementation in depth).
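For reference, the core idea of nucleus (top-p) sampling is: sort the token distribution, keep the smallest prefix whose cumulative probability reaches p, renormalize, and sample from that set. A minimal PyTorch sketch of my understanding (function name and the p=0.9 default are mine, not from the paper):

```python
import torch

def nucleus_sample(logits, p=0.9):
    # logits: 1-D tensor of vocabulary logits for a single decoding step
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    # Cumulative mass of the sorted distribution
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens where the mass *before* them already reaches p
    # (this always keeps at least the top token)
    outside_nucleus = cumulative - sorted_probs >= p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()
    # Sample from the renormalized nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```

When the model is confident (your case), the nucleus collapses to one or two tokens, so the "correct" token almost always wins; when it's uncertain, the nucleus widens and you get diversity.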
They're probably working on causal inference. When you mention causal inference, I naturally think of causal graphs and linear models (and occasionally random forests), so maybe that's where people get the distinction? One thing in this domain I've worked on (at a medium-sized tech company) is notifications:
Say we want to send exactly x notifications per user per day. We train a model to predict P(DAU | sent k notifications that day) and then send the notifications that give the highest predicted P(DAU) uplift.
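In rough sklearn terms, one reading of that setup looks like the sketch below (model choice, names, and the max_k cap are all illustrative, not the production code): append the send count k as a feature, then score P(DAU) at each candidate k and take the one with the highest uplift over sending nothing.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_dau_model(X_features, k_sent, was_dau):
    # Model P(DAU | user features, k) by appending k as an input column.
    # X_features: (n, d) user features, k_sent: (n,), was_dau: (n,) binary
    X = np.column_stack([X_features, k_sent])
    model = GradientBoostingClassifier()
    model.fit(X, was_dau)
    return model

def best_k(model, user_features, max_k=5):
    # Score P(DAU) for each candidate send count 0..max_k and pick the k
    # with the highest predicted uplift relative to sending nothing.
    ks = np.arange(max_k + 1)
    X = np.column_stack([np.tile(user_features, (len(ks), 1)), ks])
    p_dau = model.predict_proba(X)[:, 1]
    uplift = p_dau - p_dau[0]
    return ks[np.argmax(uplift)], uplift
```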
Some people would probably call this "Causal ML". I didn't think about confounders or causal graphs a single time while working on it, so I wouldn't say I was doing causal inference; I'd just say I was doing ML (though hmm, maybe I should update my resume to say "Causal ML"...).