overview for donotdrugs

GTX 4070ti and 32gb RAM run Llama 13b? in c/localllama@poweruser.forum

[–] donotdrugs@alien.top 1 points 1 year ago

I seem to read conflicting opinions on this.

This is probably due to the fact that most people don't use the original version of llama 13b and instead use quantized versions. The original model needs about 26 GB for its weights alone at 16-bit precision (13 billion parameters at 2 bytes each), far more than the 12 GB of VRAM on a 4070 Ti, but the 4-bit quantized versions of llama 13b fit in less than 10 GB of VRAM.
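To make the size difference concrete, here is a rough back-of-the-envelope sketch. It assumes the weights dominate memory use and that every parameter takes the same number of bits; the function name `weight_memory_gb` is just made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # the "13b" in llama 13b

for bits in (16, 8, 4):
    # 16 bits = original precision, 8 and 4 bits = typical quantized variants
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Treat these numbers as lower bounds: real quantized files are somewhat larger because they store extra scaling data, and running the model needs additional memory for the context on top of the weights.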

Quantization works by storing each parameter as a lower-precision integer. So instead of 13 billion parameters at 16-bit precision, quantized models have 13 billion parameters at just 8-bit or even 4-bit precision. This drastically reduces model size while retaining most of the performance.
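As a toy illustration of the idea, the sketch below rounds a list of float weights to small signed integers using one shared scale factor and then converts them back. This is a deliberately simplified scheme, not the actual format llama.cpp uses (as far as I know the real formats quantize in small blocks, each with its own scale), and the helper names `quantize` and `dequantize` are made up for this example.

```python
import random


def quantize(weights, bits):
    """Map floats to signed integers of the given bit width (absmax scaling)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


random.seed(0)
# Stand-in "weights": small random values, purely for demonstration
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: max abs error {max_err:.5f} (scale {scale:.5f})")
```

The rounding error per weight is at most half the scale, so fewer bits means a coarser grid and a larger error. That is the trade-off between model size and quality.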

You can download the quantized models from Hugging Face. User TheBloke has uploaded quantized versions of pretty much every model in existence. You can find llama2 13b here: https://huggingface.co/TheBloke/Llama-2-13B-GGML. There is a table with all the available versions as well as recommendations on which version to use.

To run these models you need to get llama.cpp. It's a framework/program for running these kinds of models.

PHIND V7: Red Flags in c/localllama@poweruser.forum

[–] donotdrugs@alien.top 1 points 1 year ago

Yeah, and it baffles me how many people, even in the tech community, take LLM output as hard fact.

I am going to buy H100s. There are too many options. in c/localllama@poweruser.forum

[–] donotdrugs@alien.top 1 points 1 year ago

OP isn't buying them for his personal setup

Tbh I don't really see how this explains anything. Sure, OP won't go bankrupt buying them for the company, but I'm 99% certain that it's still a bad financial decision.

FACT SHEET: President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence | The White House in c/localllama@poweruser.forum

[–] donotdrugs@alien.top 1 points 1 year ago (1 children)

Mistral made a good start with that