this post was submitted on 01 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

I’m fascinated by the whole ecosystem popping up around llama and local LLMs. I’m also curious what everyone here is up to with the models they are running.

Why are you interested in running local models? What are you doing with them?

Secondarily, how are you running your models? Are you truly running them on local hardware, or on a cloud service?

top 28 comments
[–] Cerus@alien.top 1 points 1 year ago

Gamified RP backend for an interface I'm making with the Godot game engine, running 13B EXL2 models on a 16GB 4060TI, hooked up via the streaming WS API in Ooba.

Out of all the silly hobbies I've poked around with, this has been by far the most interesting.
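
For anyone curious what that hookup looks like under the hood, here's a minimal sketch of the client side (written in Python rather than GDScript for brevity, and assuming Ooba's legacy streaming websocket endpoint at ws://localhost:5005/api/v1/stream; your port and payload fields may differ):

```python
# Hypothetical sketch: stream tokens from text-generation-webui's legacy
# websocket API. Endpoint, port, and payload fields are assumptions; check
# how you launched Ooba's API extension.
import asyncio
import json

import websockets

URI = "ws://localhost:5005/api/v1/stream"  # assumed default streaming port


async def stream(prompt: str) -> None:
    request = {
        "prompt": prompt,
        "max_new_tokens": 250,
        "temperature": 0.7,
    }
    async with websockets.connect(URI) as ws:
        await ws.send(json.dumps(request))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("event") == "text_stream":
                # In the Godot project this is where tokens get pushed to the UI.
                print(msg["text"], end="", flush=True)
            elif msg.get("event") == "stream_end":
                break


asyncio.run(stream("The party approaches the goblin market..."))
```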

[–] AppleBottmBeans@alien.top 1 points 1 year ago (1 children)

mostly asking it perverted questions with sexual overtones

[–] ttkciar@alien.top 1 points 1 year ago

I don't know why you're getting downvoted. By my best reckoning, about two-thirds of this sub's regulars use LLM inference for smut.

It's not one of my use-cases, but to each their own, and it's undeniably helping advance the state of the art (much as the online porn industry helped advance web development).

[–] SomeOddCodeGuy@alien.top 1 points 1 year ago (6 children)

Trying to get a better understanding of how prompts work in relation to fine-tunes, and trying to see if any of them are actually reliable enough to be used in a "production" type environment.

My end goals are basically

  • A reliable AI assistant that I know is safe, secure and private. Any information about myself, my household or my proprietary ideas won't be saved on some company's server to be reviewed and trained upon. I don't want to ask sensitive questions about stuff like taxes or healthcare or whatnot, just to have some person review it and have it end up in a model.
  • Eventually create a fine-tuned coding model for the languages I care about. Right now that's all Python, and ChatGPT is OK, but they keep accidentally breaking it while trying to put up more guardrails against people doing crazy stuff. One day it's great at JavaScript, the next it's terribad. I need consistency, and I've realized that with proprietary models I don't get that. A model in my home? I do.
  • Eventually create an IoT service across my home that is managed (with tight constraints) by an AI. Lots of guardrails. I don't trust generative AI to not set my thermostat to 150 degrees lol.
  • Tinker with these things while they're still new so that I can know how it works under the hood, so that when AI becomes more mainstream I'll have a leg up, since my field (development) feels like it's right there with artists on the chopping block when AI gets better lol
  • I'm putting together some guides and tutorials to help others get into open source AI too. The more folks who can tinker with it, the better.
  • Finally, I'm creating an AI assistant prompt card that will produce an assistant that won't lie to me or hallucinate as much, and will speak in more natural language while still having the knowledge it needs to answer my questions well. I'm trying model after model looking for the right one to accomplish this. So far, XWin 70b using Vicuna instruction templates has been fantastic for this.

A lot of it comes down to just wanting to learn, but a big piece of it is that I have consistency, stability and privacy when running an LLM at home.

As for how I run it? Ho ho ho... a bit overkill, since as a developer I have a lot of hardware available to me.

  • M2 Ultra Mac Studio 192GB - main inference machine. It has 147GB of VRAM available. This acts as a headless server that I connect to from any device in my house. My main AI assistant runs off of this.
  • My main desktop is an RTX 4090 Windows box, so I run phind-codellama on it most of the time. If I need to extend the context window then I swap the M2 Ultra to Phind so I can do 100,000 token context... but otherwise it's so darn fast on the 4090 running q4 that I use that mostly.
  • A MacBook Pro that runs a little Mistral 7b on it. It also acts as a server when I'm not on it, allowing my Windows machine to have all 3 models running at once.

I usually connect the Mistral to Continue.Dev in Visual Studio Code.

[–] Aperturebanana@alien.top 1 points 1 year ago (1 children)

Literally such a cool, well-written post. But boy, your gear is so much pricier than a ChatGPT subscription.

[–] thetaFAANG@alien.top 1 points 1 year ago (1 children)

tax deductible if you use your imagination

and you get to play with gear you already wanted

and you get experience for super high paying jobs

just comes down to fitting it within your budget to begin with

[–] Infamous_Charge2666@alien.top 1 points 1 year ago

To tax-deduct anything you have to earn, and most users here are students (undergrads/master's/PhDs) who don't earn enough to deduct 10k in PC hardware.

The best way is to ask your program (if a PhD) for sponsorship, or, if an undergrad, to apply for scholarships.

[–] simcop2387@alien.top 1 points 1 year ago

A reliable AI assistant that I know is safe, secure and private. Any information about myself, my household or my proprietary ideas won't be saved on some company's server to be reviewed and trained upon. I don't want to ask sensitive questions about stuff like taxes or healthcare or whatnot, just to have some person review it and it end up in a model

I'm slowly working on a change to Home Assistant (https://www.home-assistant.io/) to take the OpenAI conversation addon that they have and make it support connecting to any base URL. Along with that I'm going to make some more addons for other inference servers (particularly koboldcpp, exllamav2, and text-gen-webui) so that, with all their new voice work this year, I can plug things in and have a conversation with my smart home and other data that I provide it.
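
The core of that change is pretty small; here's a rough sketch of the idea (the host, port, and model name are made up, and any koboldcpp / text-gen-webui / exllamav2 server exposing an OpenAI-compatible endpoint should slot in the same way):

```python
# Minimal sketch: point an OpenAI-style client at a local inference server
# instead of api.openai.com. URL and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:5000/v1",  # local server, not OpenAI
    api_key="not-needed-locally",            # most local servers ignore this
)

response = client.chat.completions.create(
    model="local-model",  # local servers typically ignore or loosely match this
    messages=[
        {"role": "system", "content": "You are a smart-home assistant."},
        {"role": "user", "content": "Which lights are still on downstairs?"},
    ],
)
print(response.choices[0].message.content)
```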

[–] Aperturebanana@alien.top 1 points 1 year ago

I just checked out continue.dev and thank god for you, what a cool thing! Any way to connect GPT-4 via an API to Visual Studio Code?

[–] LostGoatOnHill@alien.top 1 points 1 year ago

Love your post and ambitions, very inspiring. Looking to do similar, with family engaging assistant connecting to home automation and private data. Look forward to seeing more on what you build, anywhere in particular you share aside from here?

[–] taxis-asocial@alien.top 1 points 1 year ago

least wealthy /r/LocalLLaMa user

[–] hugganao@alien.top 1 points 1 year ago

My main desktop is an RTX 4090 windows box, so I run phind-codellama on it most of the time. If I need to extend the context window then I swap the M2 Ultra to phind so I can do 100,000 token context... but otherwise its so darn fast on the 4090 running q4 that I use that mostly.

Are you running exllama on Phind for the 4090? Was there a reason you'd need to run it on the M2 Ultra when switching to 100k context?

Also, I didn't know Mistral could do coding tasks. How is it?

[–] Finnegans_Father@alien.top 1 points 1 year ago

"You wouldn't understand, Jim. It's a secret."

"Wait - it's a secret? Or I wouldn't understand it?"

"You wouldn't understand, Jim. It's a secret."

[–] rebleed@alien.top 1 points 1 year ago

My biggest complaint about GPT3.5/4 is how every response is a single-serving response that is conclusive. This type of endlessly closed-end dialogue doesn’t reflect real human communication or thought. Although prompt-engineering can get around this problem, you also have to struggle against the censorship and strong bias towards being indefinite. By indefinite, I mean that it often refuses to state plainly what is true and what is false. For example, ask GPT how many penguins died in car accidents last year. The correct answer is “Zero.” But you’ll get something like “it’s highly unlikely that…” Try to get it to output “zero” and you’ll find it isn’t so easy.

Add all this up, and you start to realize that there are certain things that GPT isn’t suited for. In my use-case, that is creative writing. A model that is both atomically conclusive and stubbornly indefinite is somewhat useless in writing text that is inconclusive and definite.

[–] avvyie@alien.top 1 points 1 year ago

I have mistral-7b-openorca.Q5_K_M.gguf currently running in a Proxmox Debian container with 8 CPU cores and 8 GB RAM, using llama-cpp-python. Speed is slightly slower than what we get on Bing Chat, but it's absolutely usable/fine for a personal, local assistant. I coded it with the llama.cpp Python bindings and exposed a chat UI at a local URL using the Gradio Python lib. This has been very useful so far as an AI assistant for big/small random requests from phones, PCs and laptops at home. I also use it from outside via Cloudflare Tunnels (on a separate network which I use to expose services).

I also have a similar setup using llama.cpp (compiled for AMD GPU) on a slightly more powerful Linux system, where I've created a script-based launcher for a different model. I call the script via a shell alias, "summon-{modelname}", and the model is ready to serve my questions directly from the command line.
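
For reference, a stripped-down sketch of that kind of setup (the model path, thread count, and port are assumptions, not my exact config): llama-cpp-python for inference and Gradio for the chat UI.

```python
# Rough sketch: serve a local GGUF model behind a Gradio chat UI.
import gradio as gr
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-openorca.Q5_K_M.gguf",  # assumed local path
    n_ctx=4096,
    n_threads=8,  # matches the 8-core container described above
)


def chat(message, history):
    # Gradio's default ChatInterface passes history as [user, assistant] pairs.
    messages = [{"role": "system", "content": "You are a helpful local assistant."}]
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    out = llm.create_chat_completion(messages=messages, max_tokens=512)
    return out["choices"][0]["message"]["content"]


# 0.0.0.0 so phones and laptops on the LAN (or a Cloudflare tunnel) can reach it.
gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)
```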

[–] fab_space@alien.top 1 points 1 year ago

I'm a newbie and I have no GPU at home, but I dig into the topic at many layers: experimenting with paid and free APIs and all the home-lab tools this gold rush is producing. It's quite engaging.

After several iterations (2 months of full free time spent on that), I'm back to creating proper datasets.

Datasets are the most engaging part: build them well and any model can then fit them flawlessly later on. Data interpretation is king.

Please prove me wrong on this; it's yet another chance to learn about the topic.

In my experiments, since I deliver blacklists, I'm playing with a flow like this:

  1. get data in an ethical way
  2. process the data
  3. train on the data
  4. rank domains

The application part can be:

  • browser extension
  • blacklist improvements
  • domain rankings

What's required in this context is a properly made dataset more than the best GPU-powered model 💻

Sorry if this is quite off topic 🦙

PS: I'm not a dev and I use GPT to code some. The best iteration and lesson learned:

k now please make this function parallel, up to 32 concurrent

and that's how I learned concurrent processing in Python 🐍🙏
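
That one prompt tends to produce something like the sketch below (check_domain here is a hypothetical stand-in for whatever the original serial function did):

```python
# Sketch: fan a per-domain check out over up to 32 concurrent workers.
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_domain(domain: str) -> tuple[str, bool]:
    # ... fetch / classify the domain here (placeholder logic) ...
    return domain, domain.endswith(".bad-example")


def check_all(domains: list[str]) -> dict[str, bool]:
    results: dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = {pool.submit(check_domain, d): d for d in domains}
        for future in as_completed(futures):
            domain, flagged = future.result()
            results[domain] = flagged
    return results


print(check_all(["example.com", "phishy.bad-example"]))
```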

[–] GreenockScatman@alien.top 1 points 1 year ago

Goblin smut

[–] 1dayHappy_1daySad@alien.top 1 points 1 year ago

1 - Horny stuff

2 - Waiting for a model smart enough that it can be fed JSON at a fixed interval, providing it with information about the environment, like a simplified version of our brain getting a constant stream of info. That object will sometimes contain user input and sometimes not (user input as in the regular questions we ask these models). Some models can keep up with this for a bit but eventually lose track. If anyone has done anything like this or has any tips/suggestions, I'll happily accept them.
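
In case it helps to picture it, a rough sketch of the loop I mean (the endpoint, field names, and interval are assumptions; any local OpenAI-compatible server such as llama.cpp's server should work):

```python
# Sketch: every tick, package environment state as JSON and feed it to a
# local model, optionally attaching user input. All names are placeholders.
import json
import time

import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server

history = [{"role": "system",
            "content": "You receive periodic world-state JSON. React briefly."}]


def tick(state: dict, user_input: str | None) -> str:
    payload = {"environment": state, "user_input": user_input}
    history.append({"role": "user", "content": json.dumps(payload)})
    resp = requests.post(ENDPOINT, json={"messages": history, "max_tokens": 256})
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


while True:
    print(tick({"time": time.time(), "room_temp_c": 21.5}, user_input=None))
    time.sleep(60)  # the fixed interval
```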

[–] softwareweaver@alien.top 1 points 1 year ago

We are using Local Models to power Fusion Quill's AI Word Processor. Fusion Quill is a Windows App on the Microsoft Store.

We are currently using the Mistral 7B model to do various tasks like Summarization, Expand Content, etc. See our YouTube Video
https://youtu.be/883IoDlRzpM

We expect local AI models to continue to evolve, but we also support OpenAI's ChatGPT API and vLLM APIs.

Regards,
Ash
FusionQuill.AI

[–] levraimonamibob@alien.top 1 points 1 year ago

It's porn. It's always porn.

Sensitive data

[–] GoofAckYoorsElf@alien.top 1 points 1 year ago

Quite a lot of the stuff that commercial/corporate models won't let me do, and which I wouldn't do on them even if they let me. Private stuff. Yes, NSFW can of course be a part of it.

Furthermore, things where I think the commercial/corporate models are too expensive (no, I have not checked my power bill yet...).

[–] tylerjdunn@alien.top 1 points 1 year ago

Writing code with Continue and Code Llama (or one of the fine-tunes) :)

[–] IONaut@alien.top 1 points 1 year ago

Part of it is trying to find models that will work in my own projects so I don't have to rely on OpenAI or some other API. And I'm also interested in the lightweight end of things that may run well on disconnected devices, like robots or wearable assistants in isolated places.

If I'm developing something with AutoGen or MemGPT that uses a lot of API calls, testing using a local LLM is free.
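
For example, pointing AutoGen at a local OpenAI-compatible server is mostly a config change; a hedged sketch (the URL and model name are placeholders, and the exact config keys shift a bit between pyautogen versions):

```python
# Sketch: run an AutoGen two-agent chat against a local server so iterating
# on the orchestration costs nothing. Endpoint details are assumptions.
import autogen

config_list = [{
    "model": "local-model",                  # whatever your server reports
    "base_url": "http://localhost:5000/v1",  # local OpenAI-compatible endpoint
    "api_key": "not-needed",
}]

assistant = autogen.AssistantAgent(
    "assistant",
    llm_config={"config_list": config_list, "temperature": 0.2},
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

user_proxy.initiate_chat(assistant, message="Outline a plan for a weekly report bot.")
```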

Not to mention the company I work for is interested in using AI but would rather not send customer data to any of these big companies.

[–] AnOnlineHandle@alien.top 1 points 1 year ago

I've tried training LoRAs on my own writing with mixed success.
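
For anyone wanting to try the same, the usual recipe looks roughly like the sketch below (the base model, hyperparameters, and dataset file are placeholders, and you'd likely want 4-bit loading or a smaller base to fit consumer VRAM):

```python
# Sketch: LoRA fine-tune of a causal LM on a plain-text file of your own
# writing, via transformers + peft. Everything here is a placeholder setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach a small LoRA adapter instead of training the full model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

data = load_dataset("text", data_files="my_writing.txt")["train"]
data = data.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```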

[–] ChangeIsHard_@alien.top 1 points 1 year ago

First, out of principle, because I don't want OpenAI to use my inputs for RLHF to then replace me.

Second, I want much more freedom for experimentation, and I can't have that with a cloud API where I have to constantly worry about how many tokens I consume, which translates to $$$

[–] Equal-Bug1591@alien.top 1 points 1 year ago

I haven't started yet, but I come here to read and learn every day. I want to see if it's possible to develop a single-board computer that can run a usable local model, plus voice, all without any wireless connection.

Really curious what hardware would be needed to run the model plus the voice recognition and synthesis. I can do all the HW design, PCB routing, low-level C, asm, Verilog, etc. But there's still so much to understand about modern AI tech. Really exciting to learn new things and have a hobby project outside of work.

[–] GoalSquasher@alien.top 1 points 1 year ago

Making the perfect Hank Hill SillyTavern character card. I just want him to sell me propane.