DarthNebo

joined 1 year ago
[–] DarthNebo@alien.top 1 points 11 months ago

Have you tried with FP4 & RAM offloading combined?
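For reference, a minimal sketch of what FP4 + RAM offloading could look like with transformers + bitsandbytes; the model id and the memory split are assumptions, swap in your own:

```python
# Sketch: load a model in FP4 and let layers that don't fit in VRAM spill to CPU RAM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",              # FP4 quantisation ("nf4" is the alternative)
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-v0.1"      # assumed model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                       # offloads overflow layers to CPU RAM
    max_memory={0: "6GiB", "cpu": "24GiB"},  # assumed VRAM/RAM budget
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```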

[–] DarthNebo@alien.top 1 points 11 months ago

Most likely it plugs in multiple retrievers to figure out the best candidates or linked content chunks to derive an answer
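Something like this rough sketch, where the `search(query, k)` interface on each retriever is a hypothetical stand-in:

```python
# Sketch: query several retrievers, merge by score, keep the top chunks
def merge_retrievers(retrievers, query, k=5):
    candidates = []
    for retriever in retrievers:
        candidates.extend(retriever.search(query, k=k))  # assumed interface
    # De-duplicate by chunk text, keeping the best score seen for each chunk
    best = {}
    for chunk, score in candidates:
        if chunk not in best or score > best[chunk]:
            best[chunk] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]
```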

[–] DarthNebo@alien.top 1 points 11 months ago

If you have a GPU, I'd suggest setting up a TGI container with the correct model. If no GPU is available, use the server example (server.cpp) in the llama.cpp repository & simply invoke it from your GUI.
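Invoking the llama.cpp server from your GUI code can be as small as this sketch; it assumes the server is already running on its default local port 8080:

```python
# Sketch: call the llama.cpp server's /completion endpoint from Python
import requests

def complete(prompt: str, n_predict: int = 128) -> str:
    resp = requests.post(
        "http://127.0.0.1:8080/completion",   # assumed host/port
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]

print(complete("Q: What is continuous batching?\nA:"))
```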

[–] DarthNebo@alien.top 1 points 11 months ago

I'm looking to build one called OtherBrain; it should work with local files as well & will live in the menu bar on macOS. Since Apple Silicon can do ~20 tok/s, all companies with employees who have newer Macs should just leverage this amazing local inference device for daily workflows

[–] DarthNebo@alien.top 1 points 11 months ago

I feel like LLMs should only do basic calculations themselves; for everything else they should invoke a tool for verification

[–] DarthNebo@alien.top 1 points 1 year ago (1 children)

Use langchain with tools
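A minimal sketch of the classic LangChain tools/agent pattern; the OpenAI backend and the toy calculator tool are assumptions (and a real setup would use a safe math parser, not eval):

```python
# Sketch: register a verification tool and let a ReAct-style agent call it
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI

def calculator(expression: str) -> str:
    # Toy tool for illustration only; eval is unsafe for untrusted input
    return str(eval(expression))

tools = [
    Tool(
        name="calculator",
        func=calculator,
        description="Evaluates arithmetic expressions and returns the result",
    )
]

llm = OpenAI(temperature=0)  # assumed backend; any LangChain-compatible LLM works
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
print(agent.run("What is 1234 * 5678?"))
```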

[–] DarthNebo@alien.top 1 points 1 year ago

The attention layers get replaced with FlashAttention-2, and there's KV caching as well, so you get way better batch-1 & batch-N results with continuous batching for every request
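You can try the same two ingredients directly in plain transformers (not TGI itself, just an illustration); the model id is an assumption and `attn_implementation="flash_attention_2"` needs a recent transformers plus the flash-attn package:

```python
# Sketch: FlashAttention-2 attention + KV cache during generation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires flash-attn installed
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Continuous batching means", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, use_cache=True)  # KV cache on
print(tokenizer.decode(out[0], skip_special_tokens=True))
```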

[–] DarthNebo@alien.top 1 points 1 year ago (3 children)

Run this with TGI or vLLM
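For vLLM, the offline API is roughly this; the model id and sampling settings are assumptions:

```python
# Sketch: batch generation with vLLM's offline LLM API
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")   # assumed model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarise continuous batching in one sentence.",
    "What does FlashAttention speed up?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```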

[–] DarthNebo@alien.top 1 points 1 year ago

You should look into continuous batching, as most of your parallel requests are running at batch size 1 & heavily underutilising the VRAM & overall throughput that would otherwise be easily possible.
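A rough sketch of what that looks like from the client side: fire requests concurrently at a TGI endpoint and let the server's continuous batching pack them together (host/port and payload values are assumptions):

```python
# Sketch: many concurrent requests so the server can batch them itself
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://127.0.0.1:8080/generate"   # assumed TGI address

def generate(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

prompts = [f"Write a one-line summary of topic {i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(generate, prompts))
print(len(results), "completions")
```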

[–] DarthNebo@alien.top 1 points 1 year ago

I'd rather stick with a pydantic declaration via LangChain than something that needs to be so hand-written
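A minimal sketch of the pydantic-declaration approach with LangChain's PydanticOutputParser; the schema is made up, and depending on your langchain/pydantic versions you may need the pydantic v1 compatibility import:

```python
# Sketch: declare the output schema once, let the parser build format instructions
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    vendor: str = Field(description="Name of the vendor")
    total: float = Field(description="Total amount due")

parser = PydanticOutputParser(pydantic_object=Invoice)

prompt = PromptTemplate(
    template="Extract the invoice details.\n{format_instructions}\n{text}",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# llm = ...  # any LangChain-compatible LLM
# result = parser.parse(llm(prompt.format(text="ACME Corp, total $42.50")))
```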

[–] DarthNebo@alien.top 1 points 1 year ago

The way to do this is to generate a bunch of hypothetical questions from the FAQ and index these in the vDB.

Then, for the user prompt, do a two-stage inference with a very small ctx size that only determines whether the user is asking a question related to items specifically mentioned in the FAQ. Then retrieve the relevant FAQ section or source document accordingly, only if the score is within a threshold — see the sketch below.
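A rough sketch of that idea, with sentence-transformers and plain cosine similarity standing in for the vector DB; the embedding model, FAQ content, and the 0.75 threshold are all assumptions, and the small-context "is this an FAQ question?" stage is approximated here by the similarity score alone:

```python
# Sketch: index hypothetical FAQ questions, gate retrieval behind a score threshold
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

faq = {
    "How do I reset my password?": "FAQ section 2: Resetting your password",
    "What is the refund policy?": "FAQ section 5: Refunds",
}
# Offline: embed the hypothetical questions generated from the FAQ
questions = list(faq.keys())
question_vecs = embedder.encode(questions, convert_to_tensor=True)

def retrieve_faq(user_prompt: str, threshold: float = 0.75):
    query_vec = embedder.encode(user_prompt, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, question_vecs)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return None          # not close enough to anything in the FAQ
    # Hand the matched FAQ section to the main inference call
    return faq[questions[best]]

print(retrieve_faq("how can I change my password?"))
```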
