this post was submitted on 24 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I'm currently trying to figure out where it's cheapest to host these models and use them.

I realized that a lot of the finetunes are not available on common LLM API sites. I want to use Nous Capybara 34B, for example, but the only provider that offered it charged $20/million tokens, which seemed quite high considering that I see Llama 70B for around $0.70/million tokens.

So are there any sites where I could host custom finetunes and get rates similar to the ones mentioned?

top 7 comments
[–] Kimononono@alien.top 1 points 11 months ago

Would a service like RunPod work for you? It sells GPU power by the hour instead of by the token.
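To compare hourly rental with per-token API pricing, you have to assume a throughput. A minimal sketch in Python; the hourly rate and tokens/sec below are illustrative assumptions, not quotes from any provider:

# Rough comparison of hourly GPU rental vs. per-token API pricing.
# All numbers are illustrative assumptions, not provider quotes.
gpu_hourly_rate = 0.79      # $/hr for a rented GPU (assumed)
throughput_tps = 30.0       # sustained tokens/sec (assumed)
api_price_per_mtok = 20.0   # $/million tokens, as quoted in the post

effective_per_mtok = gpu_hourly_rate / (throughput_tps * 3600 / 1e6)
break_even_tps = gpu_hourly_rate * 1e6 / (api_price_per_mtok * 3600)
print(f"Rental: ${effective_per_mtok:.2f}/Mtok vs API: ${api_price_per_mtok:.2f}/Mtok")
print(f"Rental wins above {break_even_tps:.1f} tokens/sec")

Under these assumptions, renting works out to about $7.31/Mtok, and anything above roughly 11 tokens/sec of sustained throughput beats the $20/Mtok API rate.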

[–] m98789@alien.top 1 points 11 months ago (1 children)
[–] dwoodwoo@alien.top 1 points 11 months ago

Huggingface

[–] andrewlapp@alien.top 1 points 11 months ago

You might rent a GPU from RunPod or another cloud provider.

Memory requirements for inference with a 34B model, by sequence length and bit precision:

Seq Len |      4-bit |      6-bit |      8-bit |     16-bit
-----------------------------------------------------------
    512 |     15.9GB |     23.8GB |     31.8GB |     63.6GB
   1024 |     16.0GB |     23.9GB |     31.9GB |     63.8GB
   2048 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
   4096 |     16.3GB |     24.5GB |     32.7GB |     65.3GB
   8192 |     16.8GB |     25.2GB |     33.7GB |     67.3GB
  16384 |     17.8GB |     26.7GB |     35.7GB |     71.3GB
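
For a rough sanity check of numbers like these, you can estimate inference memory as weights plus KV cache. A back-of-envelope sketch in Python; the layer count, KV heads, and head dimension are assumed values for a generic 34B model and won't reproduce the table above exactly:

# Back-of-envelope inference memory: weight storage + KV cache.
# Model shape is an assumed generic 34B config, not a specific model.
def infer_memory_gib(params_b, bits, seq_len,
                     n_layers=48, n_kv_heads=8, head_dim=128):
    weight_bytes = params_b * 1e9 * bits / 8
    # KV cache: 2 tensors (K and V) per layer, per KV head, per position.
    # (In practice the cache often stays at 16-bit even with quantized
    # weights; the weight precision is reused here for simplicity.)
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bits / 8
    return (weight_bytes + kv_bytes) / 2**30

for seq in (512, 2048, 8192, 16384):
    print(f"34B @ 4-bit, seq {seq:5d}: {infer_memory_gib(34, 4, seq):.1f} GiB")
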
[–] AntoItaly@alien.top 1 points 11 months ago (2 children)

Replicate: $0.000575/sec for an Nvidia A40 (48 GB VRAM).

[–] yahma@alien.top 1 points 11 months ago

The startup time makes Replicate nearly unusable for me. Only popular models stay in memory; less-used models shut down, and you have to wait for a cold start before the first inference.

[–] No_Baseball_7130@alien.top 1 points 11 months ago

$0.000575/sec

That is nearly $2.07 per hour. On https://runpod.io you could get an A40 for $0.79/hr. For a 34B model, 24 GB of VRAM is more than enough, so you could get an A5000 for around $0.44/hr.
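
For anyone checking the math, the per-second to per-hour conversion with the prices quoted in this thread:

# Per-second -> per-hour, using the prices quoted above.
replicate_a40 = 0.000575 * 3600   # = $2.07/hr for Replicate's A40
runpod_a40 = 0.79                 # $/hr quoted for an A40 on RunPod
runpod_a5000 = 0.44               # $/hr quoted for an A5000 on RunPod
print(f"Replicate A40: ${replicate_a40:.2f}/hr, "
      f"RunPod A40: ${runpod_a40:.2f}/hr, RunPod A5000: ${runpod_a5000:.2f}/hr")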