this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


Hi. I have LLaMA2-13B-Tiefighter-exl2_5bpw and (probably the same model as) LLaMA2-13B-Tiefighter.Q5_K_M.

I run it on a 1080 Ti and an old Threadripper with 64 GB of quad-channel DDR4-3466. I use oobabooga (for GGUF and exl2) and LM Studio. I have the 531.68 Nvidia driver (so I receive an OOM error, not RAM swapping, when VRAM overflows).

1st question: I've read that exl2 consumes less VRAM and runs faster than GGUF. I tried loading it in oobabooga (ExLlamaV2_HF) and it fits in my 11 GB of VRAM (consumes ~10 GB), but it produces only 2.5 t/s, while GGUF (llama.cpp backend) with 35 layers offloaded to the GPU gets 4.5 t/s. Why? Am I missing some important setting?
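As a side note on picking the offload count: a rough back-of-envelope estimate of how many layers fit in VRAM can be sketched like this (the numbers are assumptions, not measured: LLaMA2-13B has 40 transformer layers, Q5_K_M averages roughly 5.5 bits per weight, and ~3 GB is reserved for KV cache, CUDA context, and scratch buffers):

```python
def layers_that_fit(vram_gb: float, model_params_b: float = 13.0,
                    n_layers: int = 40, bits_per_weight: float = 5.5,
                    reserve_gb: float = 3.0) -> int:
    """Rough estimate of how many transformer layers fit on the GPU.

    All figures are assumptions: a 13B-parameter model with 40 layers,
    ~5.5 bits/weight for a Q5_K_M quant, and ~3 GB reserved for the
    KV cache, CUDA context, and scratch buffers.
    """
    model_gb = model_params_b * 1e9 * bits_per_weight / 8 / 1e9
    per_layer_gb = model_gb / n_layers
    usable_gb = vram_gb - reserve_gb
    return max(0, min(n_layers, int(usable_gb / per_layer_gb)))

print(layers_that_fit(11.0))  # 1080 Ti: 11 GB -> 35
```

Under these assumed figures the estimate lands right around the 35 layers used above; in practice you'd still nudge the value up or down while watching actual VRAM usage.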

2nd question: In LM Studio (llama.cpp backend?) with the same settings and the same 35 layers offloaded to the GPU, I get only 2.3 t/s. Why? Same backend, same GGUF, same sampling and context settings.

tntdeez@alien.top · 11 months ago

exl2 processes most things in FP16, which the 1080 Ti, being from the Pascal era, is very slow at. GGUF/llama.cpp, on the other hand, can use an FP32 pathway when required for the older cards; that's why it's quicker on them.
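The size of that Pascal penalty can be illustrated with published spec-sheet numbers (assumptions taken from NVIDIA's Pascal specs rather than measured here: the 1080 Ti's GP102 chip peaks at about 11.3 TFLOPS in FP32, and runs native FP16 at only 1/64 of that rate):

```python
# Rough peak-throughput comparison for a GTX 1080 Ti (GP102, Pascal).
# Both figures are assumptions from NVIDIA's published specs:
# ~11.3 TFLOPS FP32 peak, and a 1/64 native FP16 rate on GP102.
FP32_TFLOPS = 11.3
FP16_RATIO = 1 / 64  # consumer Pascal's crippled half-precision rate

fp16_tflops = FP32_TFLOPS * FP16_RATIO
print(f"FP32 peak: {FP32_TFLOPS:.1f} TFLOPS")
print(f"FP16 peak: {fp16_tflops:.2f} TFLOPS")  # ~0.18 TFLOPS
```

So an FP16-heavy path like exl2 is fighting roughly a 64x deficit in peak compute on this card, which is why llama.cpp's FP32 fallback comes out ahead despite FP16 being the faster choice on newer GPUs.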