this post was submitted on 12 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Use case is that I want to create a service based on Mistral 7B that will serve an internal office of 8-10 users.

I’ve been looking at modal.com, and runpod. Are there any other recommendations?

top 15 comments
[–] Belnak@alien.top 1 points 1 year ago
[–] navrajchohan@alien.top 1 points 1 year ago
[–] sshh12@alien.top 1 points 1 year ago

Huge fan of Modal; I have been using them for a couple of serverless LLM and diffusion models. It can definitely be on the costly side, but I like that the cost scales directly with requests, and setup is trivial.

recent project with modal: https://github.com/sshh12/llm-chat-web-ui/tree/main/modal

[–] Ok-Goal@alien.top 1 points 1 year ago

In our internal lab office, we're using https://ollama.ai/ with https://github.com/ollama-webui/ollama-webui to locally host LLMs; the Docker Compose setup provided by the ollama-webui team worked like a charm for us.
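For anyone curious what talking to Ollama looks like once it is up, here is a minimal Python sketch against its local HTTP API. Assumptions not stated in this thread: Ollama is running on its default port 11434, the `mistral` model has already been pulled (`ollama pull mistral`), and the helper names below are mine.

```python
import json
import urllib.request

# Default endpoint for a locally running Ollama instance (assumption:
# stock install, nothing rebound to another port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "mistral") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint.

    stream=False asks for a single JSON response instead of a
    newline-delimited stream of tokens.
    """
    return {"model": model, "prompt": prompt, "stream": False}


def ask(prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_generate_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

For an 8-10 person office this kind of single-box setup is often enough, since requests are infrequent and can queue behind one GPU.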

[–] clxyder@alien.top 1 points 1 year ago (1 children)

Do you have hardware to serve the API or do you want to run this from the cloud?

[–] decruz007@alien.top 1 points 1 year ago

Looking at cloud as an option. Don’t really have hardware now.

[–] dazld@alien.top 1 points 1 year ago

Did you think about running it on a local M1 Mac mini? Ollama uses the Mac GPU out of the box.

[–] m0dE@alien.top 1 points 1 year ago

fullmetal.ai

[–] apepkuss@alien.top 1 points 1 year ago (1 children)

WebAssembly based open source LLM inference (API service and local hosting): https://github.com/second-state/llama-utils

[–] RustyLanguage@alien.top 1 points 1 year ago

Hmm, cool. It seems the inference app is only a few MBs in size.

[–] ImNewHereBoys@alien.top 1 points 1 year ago (1 children)

Just curious. What are you using it for?

[–] decruz007@alien.top 1 points 1 year ago

Knowledge base, general GPT use, interaction with our CMS to add or update data.

[–] carlosglz11@alien.top 1 points 1 year ago

Let us know what you end up going with, OP! I'm interested in something like this as well…

[–] DreamGenX@alien.top 1 points 1 year ago

I can recommend vLLM. It also offers an OpenAI-compatible API server, if you want that.
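As a sketch of what the OpenAI-compatible route looks like: assuming a vLLM server was started with something like `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1` (which listens on port 8000 by default), any plain HTTP client can hit the `/v1/chat/completions` endpoint. The helper names below are illustrative, not from this thread.

```python
import json
import urllib.request

# Assumption: vLLM's OpenAI-compatible server is running locally on
# its default port 8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(
    user_msg: str,
    model: str = "mistralai/Mistral-7B-Instruct-v0.1",
) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }


def chat(user_msg: str) -> str:
    """Send one chat turn to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI's, existing tooling (official SDKs, LangChain, etc.) can usually be pointed at the vLLM server just by changing the base URL.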

[–] openLLM4All@alien.top 1 points 1 year ago

I noticed TheBloke was using Massed Compute to quantize models. I've been poking around and using their hardware a bit more.