this post was submitted on 12 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Use case is that I want to create a service based on Mistral 7B that will serve an internal office of 8-10 users.

I’ve been looking at modal.com, and runpod. Are there any other recommendations?

top 15 comments
[–] Belnak@alien.top 1 points 1 year ago
[–] navrajchohan@alien.top 1 points 1 year ago
[–] sshh12@alien.top 1 points 1 year ago

Huge fan of Modal; I have been using them for a couple of serverless LLM and diffusion models. It can definitely be on the costly side, but I like that the cost scales directly with requests, and setup is trivial.

recent project with modal: https://github.com/sshh12/llm-chat-web-ui/tree/main/modal

[–] Ok-Goal@alien.top 1 points 1 year ago

In our internal lab office, we're using https://ollama.ai/ with https://github.com/ollama-webui/ollama-webui to locally host LLMs; the Docker Compose setup provided by the ollama-webui team worked like a charm for us.
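For anyone curious what talking to Ollama looks like once it is up, here is a minimal Python sketch against its local HTTP API. Assumptions not stated in this thread: Ollama is running on its default port 11434, the `mistral` model has already been pulled (`ollama pull mistral`), and the helper names below are mine.

```python
import json
import urllib.request

# Default endpoint for a locally running Ollama instance (assumption:
# stock install, nothing rebound to another port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "mistral") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint.

    stream=False asks for a single JSON response instead of a
    newline-delimited stream of tokens.
    """
    return {"model": model, "prompt": prompt, "stream": False}


def ask(prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_generate_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

For an 8-10 person office this kind of single-box setup is often enough, since requests are infrequent and can queue behind one GPU.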

[–] clxyder@alien.top 1 points 1 year ago (1 children)

Do you have hardware to serve the API or do you want to run this from the cloud?

[–] decruz007@alien.top 1 points 1 year ago

Looking at cloud as an option. Don’t really have hardware now.

[–] dazld@alien.top 1 points 1 year ago

Did you think about running it on a local M1 Mac mini? Ollama uses the Mac GPU out of the box.

[–] m0dE@alien.top 1 points 1 year ago

fullmetal.ai

[–] apepkuss@alien.top 1 points 1 year ago (1 children)

WebAssembly based open source LLM inference (API service and local hosting): https://github.com/second-state/llama-utils

[–] RustyLanguage@alien.top 1 points 1 year ago

Hmm, cool. It seems the inference app is only a few MBs in size.

[–] ImNewHereBoys@alien.top 1 points 1 year ago (1 children)

Just curious. What are you using it for?

[–] decruz007@alien.top 1 points 1 year ago

Knowledge base, general GPT use, interaction with our CMS to add or update data.

[–] carlosglz11@alien.top 1 points 1 year ago

Let us know what you end up going with, OP! I'm interested in something like this as well…

[–] DreamGenX@alien.top 1 points 1 year ago

I can recommend vLLM. It also offers an OpenAI-compatible API server, if you want that.
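As a sketch of what the OpenAI-compatible route looks like: assuming a vLLM server was started with something like `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1` (which listens on port 8000 by default), any plain HTTP client can hit the `/v1/chat/completions` endpoint. The helper names below are illustrative, not from this thread.

```python
import json
import urllib.request

# Assumption: vLLM's OpenAI-compatible server is running locally on
# its default port 8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(
    user_msg: str,
    model: str = "mistralai/Mistral-7B-Instruct-v0.1",
) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }


def chat(user_msg: str) -> str:
    """Send one chat turn to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI's, existing tooling (official SDKs, LangChain, etc.) can usually be pointed at the vLLM server just by changing the base URL.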

[–] openLLM4All@alien.top 1 points 1 year ago

I noticed TheBloke was using Massed Compute to quantize models. I've been poking around and using their hardware a bit more.