The numbers appear to have OpenAI's fingerprints on them. I don't know whether they come from an AI-risk-mitigation perspective or from laying foundations for competitive barriers. Probably a mix of both.
At 30 trillion tokens, 10^26 float ops caps you at ~550 billion parameters (using float ops = 6 * N * D). Does this indirectly leak anything about OpenAI's current scaling? At 10 trillion tokens, it's ~1.7 trillion parameters. Bigger vocabularies can stretch this limit a bit.
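For anyone who wants to sanity-check the arithmetic, here's a minimal sketch. The only inputs are the 10^26 budget and the standard C ≈ 6 * N * D approximation for dense-transformer training compute; the token counts are the illustrative ones above, not anything known about OpenAI's actual runs.

```python
# Back-of-the-envelope parameter cap from a training-compute budget,
# assuming C ≈ 6 * N * D (N = parameters, D = training tokens).
COMPUTE_BUDGET = 1e26  # the reporting threshold, in float ops

def max_params(tokens: float, compute: float = COMPUTE_BUDGET) -> float:
    """Largest dense model trainable within `compute` on `tokens` tokens."""
    return compute / (6 * tokens)

for tokens in (30e12, 10e12):
    print(f"{tokens / 1e12:.0f}T tokens -> ~{max_params(tokens) / 1e9:.0f}B params")

# 30T tokens -> ~556B params
# 10T tokens -> ~1667B params (~1.7T)
```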
This is highly interesting and unintuitive. Have you written down the details of your approach anywhere? Why did you interleave in the manner you did?
Have you tested on GSM8K or DROP? Something I noticed in the recent HFLB update is that a lot of high-flying Mistral merges scored poorly on those two benchmarks. DROP scores in particular plummeted.