ron_krugman

joined 2 years ago

Fitting 70B models in a 4gb GPU, The whole model, no quants or distil or anything! in c/localllama@poweruser.forum

ron_krugman@alien.top 1 points 2 years ago

That doesn't make much of a difference. You still have to transfer the whole model to the GPU for every single inference step. The GPU only saves you time if you can load the model (or parts of it) once and then run lots of inference steps on it.
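
To put rough numbers on that, here's a back-of-the-envelope sketch. Every figure in it is an assumption picked for illustration, not a measurement: 2 bytes per parameter (fp16), about 25 GB/s of host-to-GPU bandwidth, and 50 ms of GPU compute per step. The function name `seconds_per_step` is made up for this example.

```python
def seconds_per_step(model_bytes, link_bytes_per_s, compute_s, steps, reload_every_step):
    """Average wall-clock time per inference step under a simple additive model."""
    transfer_s = model_bytes / link_bytes_per_s  # time to push all weights to the GPU once
    if reload_every_step:
        # Weights are streamed to the GPU again for every step.
        total_s = steps * (transfer_s + compute_s)
    else:
        # Weights are loaded once and reused for all steps.
        total_s = transfer_s + steps * compute_s
    return total_s / steps


# All values below are illustrative assumptions.
MODEL_BYTES = 70e9 * 2   # 70B parameters at 2 bytes each (fp16)
LINK_BYTES_PER_S = 25e9  # assumed host-to-GPU transfer rate
COMPUTE_S = 0.05         # assumed GPU compute time per step
STEPS = 100

streamed = seconds_per_step(MODEL_BYTES, LINK_BYTES_PER_S, COMPUTE_S, STEPS, True)
resident = seconds_per_step(MODEL_BYTES, LINK_BYTES_PER_S, COMPUTE_S, STEPS, False)

print(f"reload every step: {streamed:.3f} s/step")
print(f"load once:         {resident:.3f} s/step")
```

With these made-up numbers the transfer dominates completely when you reload every step, and the compute time barely matters. The "load once" case obviously assumes the weights fit in GPU memory, which is exactly what a 4 GB card can't do for a 70B model.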
