this post was submitted on 30 Nov 2023

Machine Learning

1 readers
1 users here now

Community Rules:

founded 1 year ago
MODERATORS
 

Hi everyone,

We've recently experimented with deploying the CodeLlama-34B model and wanted to share our key findings for those interested:

  • Best Performance: quantized 4-bit GPTQ CodeLlama-Python-34B served with vLLM (a minimal serving sketch follows the list below).
  • Results: average latency as low as 3.51 sec, average generation speed of 58.40 tokens/sec, and a cold-start time of 21.8 sec on our platform, using an Nvidia A100 GPU.

CodeLlama-34B

  • Other Libraries Tested: Hugging Face Transformers pipeline, AutoGPTQ, and Text Generation Inference.
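
For anyone who wants to try a similar setup, here's a minimal sketch of loading a 4-bit GPTQ checkpoint with vLLM and getting a rough latency/throughput estimate. The model repo name, prompt, and sampling parameters are placeholders for illustration, not necessarily the exact configuration behind the numbers above.

```python
# Minimal sketch: serve a 4-bit GPTQ CodeLlama checkpoint with vLLM and
# measure rough generation throughput. Repo name, prompt, and sampling
# settings are illustrative placeholders, not the benchmark configuration.
import time

from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint; quantization="gptq" tells vLLM to use
# the GPTQ kernels for the 4-bit weights.
llm = LLM(
    model="TheBloke/CodeLlama-34B-Python-GPTQ",  # placeholder GPTQ repo
    quantization="gptq",
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=256)
prompts = ["def fibonacci(n: int) -> int:"]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Rough throughput: generated tokens divided by wall-clock time.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Latency: {elapsed:.2f} s, ~{generated_tokens / elapsed:.1f} tokens/s")
print(outputs[0].outputs[0].text)
```

Note this only measures steady-state generation; cold-start time depends on how long the weights take to download/load on the serving platform, so it has to be measured around process startup rather than around `generate()`.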

Keen to hear your experiences and learnings in similar deployments!

[–] Striped_Orangutan@alien.top 1 points 11 months ago (1 children)

Thanks for sharing this. Have you used exllama 2 as well?

[–] Tiny_Cut_8440@alien.top 1 points 11 months ago

Not yet, made a note. Will add it when I update the tutorial.