this post was submitted on 21 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.


I'm using vLLM because it's a drop-in replacement for the ChatGPT API. If there is something else compatible with the ChatGPT (OpenAI) API, let me know.

Problem 1: I cannot get anything larger than a 7B to run in vLLM. I'm sure my parameters are wrong, but I cannot find any documentation.

python3 -m vllm.entrypoints.openai.api_server --model /home/h/Mistral-7B-finetuned-orca-dpo-v2-AWQ --quantization awq --dtype auto --max-model-len 5000
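
For reference, this is roughly the invocation I'd expect for a 13B AWQ model on a single 24 GB GPU. The extra flags and values below are untested guesses on my part: --gpu-memory-utilization caps how much VRAM vLLM reserves, and a smaller --max-model-len shrinks the KV cache.

# untested sketch: smaller context window + explicit VRAM cap for a 13B AWQ model
python3 -m vllm.entrypoints.openai.api_server \
    --model /home/h/CodeLlama-13B-Python-AWQ \
    --quantization awq \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95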

Problem 2: Mistral-7B-finetuned-orca-dpo-v2-AWQ is the only model I've gotten up and running that gives responses that make sense. However, there is a prompt being appended to everything I send to it:

### Human: Got any creative ideas for a 10 year old’s birthday?
### Assistant: Of course! Here are some creative ideas for a 10-year-old's birthday party: ... [It goes on quite a bit.]

Either because of that or for other reasons, it is not answering very basic questions. There are several threads about this on GitHub, but I was unable to find any actionable information in them.
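
One workaround would be to bypass the chat endpoint and send the model's expected template yourself through /v1/completions, which passes the prompt through verbatim. A minimal sketch, assuming vLLM's default port 8000 and the ### Human / ### Assistant format shown above (the question, max_tokens, and stop values are placeholders):

# untested sketch: raw completion with an explicit prompt template
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/home/h/Mistral-7B-finetuned-orca-dpo-v2-AWQ",
        "prompt": "### Human: What is the capital of France?\n### Assistant:",
        "max_tokens": 128,
        "stop": ["### Human:"]
    }'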

Problem 3: CodeLlama-13B-Python-AWQ just blasted a bunch of hashtags and gobbledygook back at me. It has the same prompt problem, too.
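
CodeLlama-13B-Python is a base completion model rather than a chat model, so a chat-style template could itself be producing the garbage. A minimal sketch of querying it as a plain completion instead (same assumed server and port; the prompt and sampling values are placeholders):

# untested sketch: code completion without any chat template
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/home/h/CodeLlama-13B-Python-AWQ",
        "prompt": "def fibonacci(n):\n    ",
        "max_tokens": 128,
        "temperature": 0.2
    }'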

I am running this on an Ubuntu Server VM (16 cores / 48 GB RAM) right now so it doesn't take up any VRAM, but I can switch to Windows if necessary.

1 comment
ThisGonBHard@alien.top, 11 months ago

I think oobabooga's API is compatible with the ChatGPT API.

And you can get a 70B model running there with ExLlamaV2 instead of a 7B (22 t/s for me on a 4090), but you are limited to a lower context size with the 70B.
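
A minimal sketch of hitting text-generation-webui's OpenAI-compatible endpoint, assuming it was launched with the --api flag and is listening on its default API port 5000 (the payload values are placeholders):

# untested sketch: OpenAI-style chat request against oobabooga's API
curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
    }'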