Most likely it plugs in multiple retrievers to figure out the best candidates or linked content chunks to derive an answer
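If you want to play with the same idea, here's a rough sketch of merging candidates from multiple retrievers; the two search functions are toy placeholders (with hard-coded results) for whatever dense and keyword backends you'd actually wire up:

```python
# Merge candidates from multiple retrievers, keeping the best score per chunk.
# dense_search / keyword_search are toy placeholders, not any specific library.

def dense_search(query: str, k: int = 5) -> list[tuple[str, float]]:
    # Stand-in for a vector-store similarity search returning (chunk, score).
    return [("Chunk about pricing tiers", 0.82), ("Chunk about SSO setup", 0.41)]

def keyword_search(query: str, k: int = 5) -> list[tuple[str, float]]:
    # Stand-in for a BM25/keyword index returning (chunk, score).
    return [("Chunk about SSO setup", 0.77), ("Chunk about API limits", 0.35)]

def retrieve(query: str, k: int = 5) -> list[str]:
    # Keep the best score seen for each chunk, then take the top-k overall.
    merged: dict[str, float] = {}
    for chunk, score in dense_search(query, k) + keyword_search(query, k):
        merged[chunk] = max(score, merged.get(chunk, 0.0))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("how do I set up SSO?"))
```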
If you have a GPU, I'd suggest setting up a TGI container with the correct model. If no GPU is available, use the server.cpp example in the llama.cpp repository & simply invoke it from your GUI.
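Once either server is up, invoking it from the GUI is just an HTTP call. Minimal sketch below; the endpoints and JSON fields (/completion for the llama.cpp server, /generate for TGI) are what I'd expect from current versions, so double-check against the one you're actually running:

```python
# Calling a local inference server over HTTP from a GUI/app.
# Endpoints and field names can differ across versions; verify before relying on them.
import requests

def ask_llamacpp(prompt: str, host: str = "http://localhost:8080") -> str:
    # llama.cpp server example: POST /completion
    r = requests.post(f"{host}/completion",
                      json={"prompt": prompt, "n_predict": 256})
    r.raise_for_status()
    return r.json()["content"]

def ask_tgi(prompt: str, host: str = "http://localhost:8081") -> str:
    # Text Generation Inference: POST /generate
    r = requests.post(f"{host}/generate",
                      json={"inputs": prompt,
                            "parameters": {"max_new_tokens": 256}})
    r.raise_for_status()
    return r.json()["generated_text"]

print(ask_llamacpp("Summarise this note in one line: ..."))
```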
I'm looking to build one called OtherBrain; it should work with local files as well & will live in the menu bar on macOS. Since Apple Silicon can do ~20 tok/s, all companies with employees who have newer Macs should just leverage this amazing local inference device for daily workflows
I feel like LLMs should just do basic calculations themselves, & for everything else they should invoke a tool for verification
Use LangChain with tools
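Something like the sketch below, assuming the classic initialize_agent/@tool interface (LangChain has shuffled these APIs between releases, so adjust imports for your version); the calculator tool and its eval-based body are just for illustration:

```python
from langchain.agents import initialize_agent, AgentType
from langchain.llms import OpenAI  # swap in whatever LLM wrapper you use
from langchain.tools import tool

@tool
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression."""
    # eval() is fine for a toy demo, not for untrusted input.
    return str(eval(expression))

llm = OpenAI(temperature=0)
agent = initialize_agent(
    [calculator], llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
agent.run("What is 17 * 23 + 4, and is it larger than 400?")
```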
The attention layers get replaced with FlashAttention-2, and there's KV caching as well, so you get way better batch-1 & batch-N results, with continuous batching for every request
Run this with TGI or vLLM
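For the vLLM route, the offline API is only a few lines (the model name here is just an example), and continuous batching plus paged KV caching are handled for you:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # example checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain KV caching in one paragraph.",
    "What does continuous batching buy you over static batching?",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```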
You should look into continuous batching, as most of your parallel requests are batch size 1 & heavily under-utilising the VRAM & the overall throughput that would otherwise have been easily possible.
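Concretely, that means firing your requests concurrently and letting the server pack them together, instead of looping over one-at-a-time calls. Rough sketch against a TGI-style /generate endpoint (adjust host and fields for whatever server you run):

```python
# Fire requests concurrently so a continuously-batching server (TGI/vLLM)
# can batch them, instead of sequential batch-1 calls.
# The /generate endpoint shape is assumed from TGI; adjust for your server.
from concurrent.futures import ThreadPoolExecutor
import requests

def generate(prompt: str) -> str:
    r = requests.post("http://localhost:8081/generate",
                      json={"inputs": prompt,
                            "parameters": {"max_new_tokens": 128}})
    r.raise_for_status()
    return r.json()["generated_text"]

prompts = [f"Summarise document {i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(generate, prompts))
```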
I'd rather stick with Pydantic declarations via LangChain than something that needs to be so hand-written
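For reference, the Pydantic route looks roughly like this with PydanticOutputParser; the Ticket schema is made up, and exact import paths vary with your LangChain/Pydantic versions:

```python
from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser

class Ticket(BaseModel):  # made-up schema for illustration
    title: str = Field(description="short summary of the issue")
    priority: str = Field(description="low, medium or high")

parser = PydanticOutputParser(pydantic_object=Ticket)

# The parser injects the JSON schema into the prompt, then validates the reply.
prompt = (
    "Extract a support ticket from this email.\n"
    f"{parser.get_format_instructions()}\n"
    "Email: the login page has been returning 500s for every user since 9am."
)
# reply = llm.predict(prompt)      # whatever LLM call you already have
# ticket = parser.parse(reply)     # -> Ticket(title=..., priority=...)
```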
The way to do this is to generate a bunch of hypothetical questions from the FAQ, then index these in the vector DB
Then, for the user prompt, do a two-stage inference with a very small context size which only determines whether the user is asking a question related to items specifically mentioned in the FAQ. You can then retrieve the relevant FAQ section or source document accordingly, but only if the score is within a threshold
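A rough sketch of that flow, assuming sentence-transformers for the embeddings; the example hypothetical questions, the threshold value, and the stubbed yes/no stage-one call are all placeholders to swap for your own vDB and LLM calls:

```python
# Two-stage FAQ retrieval sketch. Embedding model, threshold and the example
# hypothetical questions are assumptions, not the original setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Offline step: hypothetical questions generated from the FAQ, each mapped to
# the section it came from, then embedded and indexed.
hypothetical_questions = {
    "How do I reset my password?": "faq#account-reset",
    "Can I get a refund after 30 days?": "faq#refund-policy",
}
questions = list(hypothetical_questions)
question_vecs = model.encode(questions, normalize_embeddings=True)

def is_faq_question(user_prompt: str) -> bool:
    # Stage 1: a tiny-context LLM call that only answers yes/no.
    # Placeholder returning True; wire this to your model in practice.
    return True

def retrieve_faq_section(user_prompt: str, threshold: float = 0.6) -> str | None:
    # Stage 2: nearest hypothetical question, kept only if it clears the threshold.
    if not is_faq_question(user_prompt):
        return None
    vec = model.encode([user_prompt], normalize_embeddings=True)[0]
    scores = question_vecs @ vec  # cosine similarity, since vectors are normalized
    best = int(np.argmax(scores))
    return hypothetical_questions[questions[best]] if scores[best] >= threshold else None

print(retrieve_faq_section("I forgot my password, what do I do?"))
```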
Have you tried with FP4 & RAM offloading combined?
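If not, a rough sketch with transformers + bitsandbytes + accelerate is below; the checkpoint and memory caps are examples, and whether 4-bit layers can actually spill to CPU RAM depends on your bitsandbytes version:

```python
# 4-bit (FP4) load with CPU RAM offload. Checkpoint name and memory limits are
# examples only; offload support for 4-bit layers varies by bitsandbytes version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",           # "nf4" is the other option
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-v0.1"   # example checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",                   # lets accelerate place layers on GPU/CPU
    max_memory={0: "6GiB", "cpu": "24GiB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```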