this post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


So I was looking over the recent merges to llama.cpp’s server and saw that they’ve more or less brought it in line with OpenAI-style APIs – natively – obviating the need for e.g. api_like_OAI.py, or one of the bindings/wrappers like llama-cpp-python (+ooba), koboldcpp, etc. (not that those and others don’t provide great/useful platforms for a wide variety of local LLM shenanigans).

As of a couple days ago (can't find the exact merge/build), it seems as if they’ve implemented – essentially – the old ‘simple-proxy-for-tavern’ functionality (for lack of a better way to describe it) but *natively*.

As in, you can connect SillyTavern (and numerous other clients, notably hugging face chat-ui — *with local web search*) without a layer of python in between. Or, I guess, you’re trading the python layer for a pile of node (typically) but just above bare metal (if we consider compiled cpp to be ‘bare metal’ in 2023 ;).

Anyway, it’s *fast* — or at least not apparently any slower than it needs to be? Similar prompt-processing (pp) and generation times to main and the server’s own skeletal JS UI in the front-ends I’ve tried.

It seems like ggerganov and co. are getting serious about the server side of llama.cpp, perhaps even over/above ‘main’ or the notion of a pure lib/api. You love to see it. apache/httpd vibes 😈

Couple links:

https://github.com/ggerganov/llama.cpp/pull/4198

https://github.com/ggerganov/llama.cpp/issues/4216

But seriously just try it! /models, /v1, /completion are all there now as native endpoints (compiled in C++ with all the gpu features + other goodies). Boo-ya!
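For a taste of the native endpoint, here’s a minimal stdlib-only sketch that builds a request against `/completion` (port 8080 is the server’s default, and `prompt`/`n_predict` are the field names from the server README at the time — double-check against your build):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # default ./server port; adjust if you changed it

def completion_request(prompt, n_predict=64):
    """Build a request for llama.cpp server's native /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With ./server running, this would print the generated text:
# with urllib.request.urlopen(completion_request("Building a website is")) as r:
#     print(json.loads(r.read())["content"])
```

No Python layer on the server side, of course — this is just one way to poke at it from a client.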

top 18 comments
[–] aseichter2007@alien.top 1 points 9 months ago

I'm pretty sure that makes it compatible with Clipboard Conqueror too!

[–] dirkson@alien.top 1 points 9 months ago (1 children)

I can't seem to get it to work. Sillytavern asks for "/v1/completions", which doesn't seem to be provided by the llama.cpp api.

[–] Jelegend@alien.top 1 points 9 months ago (2 children)

I am using it as http://localhost:8000/v1/completions and it is working perfectly

[–] Spasmochi@alien.top 1 points 9 months ago

Did you use one of the example servers or just execute the default one at ./server?
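For reference, launching the stock binary looks like this (flag names per the server README of that era; the model path and `-ngl` value are placeholders for your own setup):

```shell
# run the stock server binary (no Python layer involved);
# -c sets context size, -ngl offloads layers to GPU if built with GPU support
./server \
  -m models/mistral-7b-instruct.Q4_K_M.gguf \
  -c 4096 \
  -ngl 35 \
  --host 127.0.0.1 --port 8080
```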

[–] KrazyKirby99999@alien.top 1 points 9 months ago

Docker or native?

[–] herozorro@alien.top 1 points 9 months ago

will this speed up the ollama project?

[–] Water-cage@alien.top 1 points 9 months ago

Thanks for the heads up, I’ll give it a try tomorrow

[–] SatoshiNotMe@alien.top 1 points 9 months ago (1 children)

You mean we don’t need to use llama-cpp-python anymore to serve this at an OAI-like endpoint?

[–] reallmconnoisseur@alien.top 1 points 9 months ago

Correct. You run the llama.cpp server and, inside your code/GUI/whatever, set the OpenAI API base URL to the server's endpoint.
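Concretely, any OpenAI-style client works once its base URL points at the server. A stdlib-only sketch of the same `/v1/chat/completions` call (port is the server default; the server reportedly ignores the model name and API key, so placeholders should be fine, but verify with your build):

```python
import json
import urllib.request

OPENAI_BASE = "http://localhost:8080/v1"  # point your client's base URL here

def chat_request(messages, model="local"):
    """Build an OpenAI-style /v1/chat/completions request for the local server."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{OPENAI_BASE}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-no-key-required",  # dummy key
        },
    )

# With ./server running:
# msgs = [{"role": "user", "content": "Hi"}]
# with urllib.request.urlopen(chat_request(msgs)) as r:
#     print(json.loads(r.read())["choices"][0]["message"]["content"])
```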

[–] Inkbot_dev@alien.top 1 points 9 months ago (1 children)

I really, really, hope they add support for chat_templates for the chat/completion endpoint: https://huggingface.co/docs/transformers/chat_templating
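For anyone unfamiliar: a chat template just turns a role-tagged message list into the exact prompt string the model was trained on. Real templates are Jinja strings shipped in the model's tokenizer_config.json; here's a hand-rolled sketch of the ChatML flavor (one common format) to show what's at stake:

```python
def apply_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages as a ChatML prompt string.
    This mirrors what a Jinja chat_template does for ChatML models."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

print(apply_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]))
```

Get the wrapping tokens wrong (or hard-code the wrong format) and quality tanks — hence the plea for reading the model's own template.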

[–] Inkbot_dev@alien.top 1 points 9 months ago

Well, fingers crossed my plea for actually supporting chat templates works. Partial support is equal to no support in this case.

https://github.com/ggerganov/llama.cpp/issues/4216#issuecomment-1829944957

[–] edwios@alien.top 1 points 9 months ago (1 children)

I've changed from llama-cpp-python[server] to the llama.cpp server — working great with OAI API calls, except multimodal, which wasn't working. Patched it with one line and voilà, works like a charm!

[–] sleeper-2@alien.top 1 points 9 months ago (1 children)

send a PR with your patch!

[–] edwios@alien.top 1 points 9 months ago (1 children)

Already raised an issue, couldn’t create a PR as I’m with my phone only. Solution also included:

https://github.com/ggerganov/llama.cpp/issues/4245

[–] sleeper-2@alien.top 1 points 9 months ago

sweet, ty 😎!

[–] sleeper-2@alien.top 1 points 9 months ago (2 children)

huge fan of server.cpp too! I actually embed a universal binary (created with lipo) in my macOS app (FreeChat) and use it as an LLM backend running on localhost. Seeing how quickly it improves makes me very happy about this architecture choice.

I just saw the improvements issue today. Pretty excited about the possibility of getting chat template functionality since currently all of that complexity has to live in my client.

Also, TIL about the batching stuff. I'm going to try getting multiple responses using that.
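(For anyone else curious, parallel decoding on the server is opt-in via flags — names as of the late-2023 builds, so double-check `--help`: `-np N` reserves N parallel slots and `-cb` turns on continuous batching.)

```shell
# serve up to 4 concurrent requests with continuous batching;
# the context (-c) is split across slots, so size it accordingly
./server -m models/model.gguf -c 8192 -np 4 -cb
```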

[–] Inkbot_dev@alien.top 1 points 9 months ago

It's not looking so great that they'll actually support the feature — they'd rather hard-code templates into the cpp, ignoring what the model was defined with if it doesn't match.

I made my case for it, but there seems to be resistance to doing it at all... there may be an option to load a Python Jinja script from cpp if the dependencies exist, and fall back to the hard-coded impl if not, but people seem very resistant to anything of the sort. And the cpp Jinja port seems to be too heavyweight for their tastes...

[–] Gorefindal@alien.top 1 points 9 months ago

*Love* FreeChat!