FYI, if all you're using it for is benchmarking (not interactive chat), you can probably do it in far less VRAM: load one layer into VRAM, run the entire set of input tokens through it, save the outputs, load the next layer into VRAM, and repeat.
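A minimal sketch of the idea, with toy stand-ins: `load_layer` and `unload_layer` are hypothetical placeholders for whatever your framework uses to move weights in and out of VRAM, and the "layers" here are just arithmetic. The point is the loop structure: only one layer is resident at a time, and the full batch of activations is streamed through it before the next layer is loaded.

```python
def load_layer(i):
    # Placeholder: in practice this would copy layer i's weights into VRAM.
    return lambda x: x * 2 + i  # toy stand-in for the layer's computation

def unload_layer(layer):
    # Placeholder: in practice this would free the VRAM the layer occupied.
    del layer

def forward_layerwise(inputs, num_layers):
    activations = list(inputs)  # full batch of activations, kept in host RAM
    for i in range(num_layers):
        layer = load_layer(i)   # only this one layer occupies VRAM
        activations = [layer(a) for a in activations]  # whole batch through it
        unload_layer(layer)     # free VRAM before loading the next layer
    return activations

print(forward_layerwise([1.0, 2.0], 3))  # → [12.0, 20.0]
```

The trade-off is throughput: you pay the full weight-transfer cost once per layer per batch instead of once total, so it only makes sense for offline batch evaluation, not latency-sensitive interactive use.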