I'm hosting Goliath-120B with a much better quant (4.5bpw exl2, needs 3x3090) and it's scary; it feels alive sometimes. Also, with exllamav2 it runs at about the same speed as a 70B model.
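For reference, loading an exl2 quant split across three 3090s looks roughly like this with the exllamav2 Python API. This is a minimal sketch; the model path, per-GPU memory split, and sampling settings are my assumptions, not from the post:

```python
# Minimal sketch, assuming the exllamav2 Python API.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Goliath-120B-exl2-4.5bpw"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
# 120B params at 4.5 bits/weight is roughly 68 GB of weights; spread the
# weights plus KV cache across three 24 GB RTX 3090s.
model.load(gpu_split=[20, 23, 23])  # GB reserved per GPU; tune to your setup

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("The model wrote back:", settings, 128))
```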
Because Llama-2-70B is similar or better in most metrics, and it's small enough not to need distributed inference.
LLMs on neuroengine.ai should support way more than 400 words; I don't know the exact limit.
Check Panchovix's repo on Hugging Face.