[–] cstein123@alien.top 1 points 10 months ago (1 children)

Exactly the answer I was looking for, thank you!


Can LLMs stack more layers than the largest current models, or is depth a bottleneck? Is it because the gradients can't propagate properly back to the beginning of the network, or because inference would be too slow?
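For intuition on the gradient part of my question, here's a toy scalar sketch (my own, not from any paper): with plain stacked layers the gradient is a product of per-layer factors and can vanish as depth grows, while a residual connection adds an identity path that keeps the gradient from collapsing. The function names and numbers are just illustrative.

```python
# Toy model of gradient flow through a deep stack of scalar "layers".
# Plain layer:    f(x) = w * x      -> gradient through n layers is w**n
# Residual layer: f(x) = x + w * x  -> gradient through n layers is (1 + w)**n
# With small w, the plain product vanishes, but the residual product
# stays on the order of 1 because of the identity path.

def plain_grad(w: float, n: int) -> float:
    """Gradient magnitude after n plain layers with weight w."""
    return w ** n

def residual_grad(w: float, n: int) -> float:
    """Gradient magnitude after n residual layers with weight w."""
    return (1 + w) ** n

w, n = 0.01, 96  # 96 layers, e.g. roughly GPT-3 depth
print(plain_grad(w, n))     # effectively zero: the signal never reaches layer 0
print(residual_grad(w, n))  # order 1: the identity path preserves the gradient
```

Obviously real transformers have matrix Jacobians, LayerNorm, etc., but this is the basic reason residual connections (and pre-norm placement) matter for training very deep stacks.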

If anyone could point me to a paper on how transformers scale with depth (layer count), I would love to read it!