Research papers have also observed diminishing returns as models grow.
Hell, maybe even GPT-4 was hit by this, and that's why it's reportedly not a single giant language model but a mixture-of-experts design: eight ~220B models, each trained for subtasks.
But I think even this architecture will run into issues, and that it's more of a crutch. I mean, eventually each of these subtask models grows too large and needs to be split as well, but at some point the field each expert covers becomes too small and niche, and that sounds like the end of that road to me.
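For anyone unfamiliar with the mixture-of-experts idea the rumor describes: a small gating network scores the experts for each input and only the top few actually run, which is how parameter count can scale without per-token compute scaling with it. Here's a minimal toy sketch in plain numpy (all names and shapes are made up for illustration; real experts would be full transformer blocks, not linear maps):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 8 "experts", each just a linear map here. In the rumored
# GPT-4 design, each expert would be a ~220B-parameter model instead.
NUM_EXPERTS, DIM, TOP_K = 8, 16, 2
experts = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate_weights = rng.normal(size=(DIM, NUM_EXPERTS))  # the routing network

def moe_forward(x):
    """Route input x to the top-k experts and mix their outputs."""
    logits = x @ gate_weights           # score every expert for this input
    top = np.argsort(logits)[-TOP_K:]   # keep only the k highest-scoring
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                # softmax over the chosen experts
    # Only the selected experts run -- the rest of the parameters sit idle,
    # which is the whole point of sparse routing.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

y = moe_forward(rng.normal(size=DIM))
print(y.shape)  # (16,)
```

The splitting worry above maps onto this directly: adding experts only helps while the gate can still carve the input space into meaningfully distinct regions for them.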