[–] ShitGobbler69@alien.top 1 points 11 months ago (1 children)

FYI if all you're using it for is benchmarking (not interactive chat), you can probably do it in way less VRAM: load one layer into VRAM, run the entire set of input tokens through it, keep that output, load the next layer into VRAM, and repeat.
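
A toy sketch of the idea, using numpy matmuls in place of real transformer layers and a plain copy standing in for a CPU→GPU transfer (names like `load_to_vram` are made up for illustration; a real version would move tensors with something like `tensor.to("cuda")` and would also need to handle attention/KV state, which this skips):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_tokens = 8, 4, 16

# All layer weights live on the "host" (a list standing in for CPU RAM / disk).
host_weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def load_to_vram(w):
    # Hypothetical stand-in for copying one layer's weights into VRAM.
    return w.copy()

x = rng.standard_normal((n_tokens, d))  # activations for the whole input batch

# Stream the model: one layer resident at a time, all tokens pushed through it,
# only the inter-layer activations kept in memory.
acts = x
for w in host_weights:
    w_gpu = load_to_vram(w)             # load just this layer
    acts = np.maximum(acts @ w_gpu, 0)  # run every token through it
    del w_gpu                           # free the "VRAM" before the next layer

# Reference: identical computation with all weights resident at once.
ref = x
for w in host_weights:
    ref = np.maximum(ref @ w, 0)

print(np.allclose(acts, ref))  # → True: same result, peak weight memory = 1 layer
```

Peak weight memory drops from all layers to a single layer; the cost is the repeated host-to-VRAM transfer per layer, which is fine for a one-shot benchmark run but would make interactive token-by-token generation painfully slow.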