LMDeploy is another one: https://github.com/InternLM/lmdeploy
- exllamav2 (CUDA/ROCm, explicitly targeting low VRAM usage): https://github.com/turboderp/exllamav2
- MLC-LLM (extremely fast Vulkan/Metal/WebGPU): https://github.com/mlc-ai/mlc-llm
- S-LoRA (vLLM extension for hosting many LoRAs at once): https://github.com/S-LoRA/S-LoRA
- Intel LLM Runtime (possibly the fastest CPU-only inference): https://github.com/intel/intel-extension-for-transformers
- Optimum Habana (Intel's alternative to Nvidia in the cloud): https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation
- EasyLM (TPU training): https://github.com/young-geng/EasyLM
There's more I'm sure I'm forgetting, not to speak of all the "general" machine learning compilers out there. I would recommend checking out TVM, MLIR, Triton, AITemplate, Hidet, and projects associated with them (like MLC, PyTorch-MLIR, Mojo, and torch.compile backends in PyTorch).
https://github.com/merrymercy/awesome-tensor-compilers#open-source-projects
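To make the torch.compile mention concrete, here's a minimal sketch, assuming PyTorch >= 2.0 (the toy model is just a placeholder, not taken from any project above):

```python
# Minimal sketch of torch.compile with an explicit backend, assuming
# PyTorch >= 2.0. The toy model is a placeholder, not from any project above.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# "inductor" is the default backend; on GPU it lowers ops to Triton kernels.
compiled = torch.compile(model, backend="inductor")

x = torch.randn(8, 1024)
out = compiled(x)  # first call triggers compilation; later calls reuse it
print(out.shape)   # torch.Size([8, 1024])
```

Swapping the `backend` string is how the other compiler stacks plug into PyTorch, which is why that list above keeps growing.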
Do you have any idea why MLC isn't a more used format? It seems so much faster than GGUF or ExLlama architectures, yet everyone defaults to those
That's an excellent question.
I'm going insane with all of these options. Someone really needs to do a comparison of them.
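A rough starting point for such a comparison, assuming each engine is already running behind an OpenAI-compatible server (vLLM, LMDeploy, and MLC-LLM all ship one); the ports and model name below are placeholders, not real endpoints:

```python
# Rough single-request throughput comparison across locally hosted engines,
# assuming each exposes an OpenAI-compatible /v1/completions endpoint.
import time
import requests

BACKENDS = {
    "vllm": "http://localhost:8000/v1/completions",
    "lmdeploy": "http://localhost:8001/v1/completions",
    "mlc-llm": "http://localhost:8002/v1/completions",
}

PROMPT = "Explain the difference between throughput and latency."

def tokens_per_second(url: str, max_tokens: int = 256) -> float:
    """Time one completion request and report generated tokens per second."""
    start = time.perf_counter()
    resp = requests.post(url, json={
        "model": "default",  # placeholder; each server names models differently
        "prompt": PROMPT,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=300)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    usage = resp.json().get("usage", {})
    return usage.get("completion_tokens", max_tokens) / elapsed

for name, url in BACKENDS.items():
    try:
        print(f"{name}: {tokens_per_second(url):.1f} tok/s")
    except requests.RequestException as err:
        print(f"{name}: unreachable ({err})")
```

This only captures single-stream decode speed; batched throughput and prompt-length scaling are where these engines differ most, so a real comparison would sweep those too.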