this post was submitted on 30 Nov 2023

Machine Learning

1 readers
1 users here now

Community Rules:

founded 1 year ago
MODERATORS
 

Hi everyone,

We've recently experimented with deploying the CodeLlama-34B model and wanted to share our key findings for those interested:

  • Best Performance: quantized 4-bit GPTQ CodeLlama-Python-34B served with vLLM (a minimal serving sketch follows the list below).
  • Results: average latency as low as 3.51 sec, average generation speed of 58.40 tokens/sec, and a cold-start time of 21.8 sec on our platform, using an Nvidia A100 GPU.

CodeLlama-34B

  • Other Libraries Tested: Hugging Face Transformers pipeline, AutoGPTQ, and Text Generation Inference.
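
For anyone who wants to try a similar setup, here's a minimal sketch of loading a 4-bit GPTQ checkpoint with vLLM and getting a rough latency/throughput estimate. The model repo name, prompt, and sampling parameters are placeholders for illustration, not necessarily the exact configuration behind the numbers above.

```python
# Minimal sketch: serve a 4-bit GPTQ CodeLlama checkpoint with vLLM and
# measure rough generation throughput. Repo name, prompt, and sampling
# settings are illustrative placeholders, not the benchmark configuration.
import time

from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint; quantization="gptq" tells vLLM to use
# the GPTQ kernels for the 4-bit weights.
llm = LLM(
    model="TheBloke/CodeLlama-34B-Python-GPTQ",  # placeholder GPTQ repo
    quantization="gptq",
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=256)
prompts = ["def fibonacci(n: int) -> int:"]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Rough throughput: generated tokens divided by wall-clock time.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Latency: {elapsed:.2f} s, ~{generated_tokens / elapsed:.1f} tokens/s")
print(outputs[0].outputs[0].text)
```

Note this only measures steady-state generation; cold-start time depends on how long the weights take to download/load on the serving platform, so it has to be measured around process startup rather than around `generate()`.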

Keen to hear your experiences and learnings in similar deployments!

[–] Striped_Orangutan@alien.top 1 points 11 months ago (1 children)

Thanks for sharing this. Have you used exllama 2 as well?

[–] Tiny_Cut_8440@alien.top 1 points 11 months ago

Not yet, made a note. Will add it when I update the tutorial.