DarthNebo

joined 1 year ago
[–] DarthNebo@alien.top 1 points 11 months ago

Have you tried with FP4 & RAM offloading combined?
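For reference, a minimal sketch of what FP4 + RAM offloading could look like with transformers + bitsandbytes; the model id and the memory split are assumptions, swap in your own:

```python
# Sketch: load a model in FP4 and let layers that don't fit in VRAM spill to CPU RAM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",              # FP4 quantisation ("nf4" is the alternative)
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-v0.1"      # assumed model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                       # offloads overflow layers to CPU RAM
    max_memory={0: "6GiB", "cpu": "24GiB"},  # assumed VRAM/RAM budget
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```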

[–] DarthNebo@alien.top 1 points 11 months ago

Most likely it plugs in multiple retrievers to figure out the best candidates or linked content chunks to derive an answer
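Something like this rough sketch, where the `search(query, k)` interface on each retriever is a hypothetical stand-in:

```python
# Sketch: query several retrievers, merge by score, keep the top chunks
def merge_retrievers(retrievers, query, k=5):
    candidates = []
    for retriever in retrievers:
        candidates.extend(retriever.search(query, k=k))  # assumed interface
    # De-duplicate by chunk text, keeping the best score seen for each chunk
    best = {}
    for chunk, score in candidates:
        if chunk not in best or score > best[chunk]:
            best[chunk] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]
```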

[–] DarthNebo@alien.top 1 points 11 months ago

If you have a GPU, I'd suggest setting up a TGI container with the correct model. If no GPU is available, use the server example (server.cpp) in the llama.cpp repository & simply invoke it from your GUI.
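Invoking the llama.cpp server from your GUI code can be as small as this sketch; it assumes the server is already running on its default local port 8080:

```python
# Sketch: call the llama.cpp server's /completion endpoint from Python
import requests

def complete(prompt: str, n_predict: int = 128) -> str:
    resp = requests.post(
        "http://127.0.0.1:8080/completion",   # assumed host/port
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]

print(complete("Q: What is continuous batching?\nA:"))
```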

[–] DarthNebo@alien.top 1 points 11 months ago

I'm looking to build one called OtherBrain; it should work with local files as well & will live in the menu bar on macOS. Since Apple Silicon can do ~20 tok/s, all companies with employees who have newer Macs should just leverage this amazing local inference device for daily workflows

[–] DarthNebo@alien.top 1 points 11 months ago

I feel like LLMs should only do basic calculations themselves; for everything else they should invoke a tool for verification

[–] DarthNebo@alien.top 1 points 1 year ago (1 children)

Use langchain with tools
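A minimal sketch of the classic LangChain tools/agent pattern; the OpenAI backend and the toy calculator tool are assumptions (and a real setup would use a safe math parser, not eval):

```python
# Sketch: register a verification tool and let a ReAct-style agent call it
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI

def calculator(expression: str) -> str:
    # Toy tool for illustration only; eval is unsafe for untrusted input
    return str(eval(expression))

tools = [
    Tool(
        name="calculator",
        func=calculator,
        description="Evaluates arithmetic expressions and returns the result",
    )
]

llm = OpenAI(temperature=0)  # assumed backend; any LangChain-compatible LLM works
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
print(agent.run("What is 1234 * 5678?"))
```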

[–] DarthNebo@alien.top 1 points 1 year ago

The attention layers get replaced with FlashAttention-2, and there's KV caching as well, so you get way better batch-1 & batch-N results with continuous batching for every request
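You can try the same two ingredients directly in plain transformers (not TGI itself, just an illustration); the model id is an assumption and `attn_implementation="flash_attention_2"` needs a recent transformers plus the flash-attn package:

```python
# Sketch: FlashAttention-2 attention + KV cache during generation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires flash-attn installed
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Continuous batching means", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, use_cache=True)  # KV cache on
print(tokenizer.decode(out[0], skip_special_tokens=True))
```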

[–] DarthNebo@alien.top 1 points 1 year ago (3 children)

Run this with TGI or vLLM
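For vLLM, the offline API is roughly this; the model id and sampling settings are assumptions:

```python
# Sketch: batch generation with vLLM's offline LLM API
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")   # assumed model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarise continuous batching in one sentence.",
    "What does FlashAttention speed up?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```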

[–] DarthNebo@alien.top 1 points 1 year ago

You should look into continuous batching, as most of your parallel requests are running at batch size 1 & heavily underutilising the VRAM & overall throughput that would otherwise be easily possible.
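A rough sketch of what that looks like from the client side: fire requests concurrently at a TGI endpoint and let the server's continuous batching pack them together (host/port and payload values are assumptions):

```python
# Sketch: many concurrent requests so the server can batch them itself
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://127.0.0.1:8080/generate"   # assumed TGI address

def generate(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

prompts = [f"Write a one-line summary of topic {i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(generate, prompts))
print(len(results), "completions")
```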

[–] DarthNebo@alien.top 1 points 1 year ago

I'd rather stick with a pydantic declaration via LangChain than something that needs to be so hand-written
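A minimal sketch of the pydantic-declaration approach with LangChain's PydanticOutputParser; the schema is made up, and depending on your langchain/pydantic versions you may need the pydantic v1 compatibility import:

```python
# Sketch: declare the output schema once, let the parser build format instructions
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    vendor: str = Field(description="Name of the vendor")
    total: float = Field(description="Total amount due")

parser = PydanticOutputParser(pydantic_object=Invoice)

prompt = PromptTemplate(
    template="Extract the invoice details.\n{format_instructions}\n{text}",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# llm = ...  # any LangChain-compatible LLM
# result = parser.parse(llm(prompt.format(text="ACME Corp, total $42.50")))
```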

[–] DarthNebo@alien.top 1 points 1 year ago

The way to do this is to generate a bunch of hypothetical questions from the FAQ and index these in the vDB.

Then, for the user prompt, do a two-stage inference with a very small ctx size that only determines whether the user is asking a question related to items specifically mentioned in the FAQ. Then retrieve the relevant FAQ section or source document accordingly, only if the score is within a threshold — see the sketch below.
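A rough sketch of that idea, with sentence-transformers and plain cosine similarity standing in for the vector DB; the embedding model, FAQ content, and the 0.75 threshold are all assumptions, and the small-context "is this an FAQ question?" stage is approximated here by the similarity score alone:

```python
# Sketch: index hypothetical FAQ questions, gate retrieval behind a score threshold
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

faq = {
    "How do I reset my password?": "FAQ section 2: Resetting your password",
    "What is the refund policy?": "FAQ section 5: Refunds",
}
# Offline: embed the hypothetical questions generated from the FAQ
questions = list(faq.keys())
question_vecs = embedder.encode(questions, convert_to_tensor=True)

def retrieve_faq(user_prompt: str, threshold: float = 0.75):
    query_vec = embedder.encode(user_prompt, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, question_vecs)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return None          # not close enough to anything in the FAQ
    # Hand the matched FAQ section to the main inference call
    return faq[questions[best]]

print(retrieve_faq("how can I change my password?"))
```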
