Would a service like RunPod work for you? It sells GPU power by the hour instead of by the token.
Hf?
Huggingface
You could rent a GPU from RunPod or another cloud provider.
Memory requirements:
34B model memory requirements (inference)
Sequence length vs. weight bit precision:

Seq Len | 4-bit  | 6-bit  | 8-bit  | 16-bit
--------|--------|--------|--------|-------
512     | 15.9GB | 23.8GB | 31.8GB | 63.6GB
1024    | 16.0GB | 23.9GB | 31.9GB | 63.8GB
2048    | 16.1GB | 24.1GB | 32.2GB | 64.3GB
4096    | 16.3GB | 24.5GB | 32.7GB | 65.3GB
8192    | 16.8GB | 25.2GB | 33.7GB | 67.3GB
16384   | 17.8GB | 26.7GB | 35.7GB | 71.3GB
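These numbers are roughly just parameters × bits ÷ 8 for the weights, plus a KV cache that grows with sequence length. A minimal sketch that approximately reproduces the table (the 0.48 MB/token fp16 KV-cache figure is back-solved from the table above, not an architectural fact; the real value depends on layer count, heads, and GQA, and this assumes the KV cache is quantized to the same precision as the weights):

```python
def infer_vram_gb(params_b=34, bits=4, seq_len=512, kv_mb_per_token_fp16=0.48):
    # Weights: params * (bits / 8) bytes, converted to GB.
    weights_gb = params_b * 1e9 * bits / 8 / 1024**3
    # KV cache: assumed per-token cost at fp16, scaled by the bit precision.
    kv_gb = seq_len * kv_mb_per_token_fp16 * (bits / 16) / 1024
    return weights_gb + kv_gb

for seq in (512, 2048, 8192, 16384):
    print(f"{seq:>5} tokens @ 4-bit: {infer_vram_gb(bits=4, seq_len=seq):.1f}GB")
```

Running this prints 15.9, 16.1, 16.8, and 17.8GB, matching the 4-bit column.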
Replicate charges $0.000575/sec for an Nvidia A40 (48GB VRAM).
The startup time makes Replicate nearly unusable for me. Only popular models stay in memory; less-used models shut down, and you have to wait for a cold start before the first inference.
> $0.000575/sec
That works out to about $2.07 per hour. On https://runpod.io you could get an A40 for $0.79/hr, and for a 34B model 24GB of VRAM is more than enough, so you could get an A5000 for around $0.44/hr.
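Converting per-second billing to hourly is just rate × 3600. A quick comparison using the prices quoted in this thread (November 2023; cloud prices drift constantly):

```python
# Per-second vs. per-hour GPU pricing, using the numbers quoted above.
offers = {
    "Replicate A40 (48GB)": 0.000575 * 3600,  # per-second rate -> hourly
    "RunPod A40 (48GB)":    0.79,             # hourly rate as listed
    "RunPod A5000 (24GB)":  0.44,
}
for name, usd_per_hr in sorted(offers.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${usd_per_hr:.2f}/hr")
```

So per-second billing only wins if your jobs are short enough that you would otherwise pay for idle hours.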