this post was submitted on 28 May 2026

16 points (69.0% liked)

Local LLM agents (lemmy.world)

submitted 3 weeks ago by Kkk2237pl@lemmy.world to c/programming@programming.dev

Has anyone in an organization tried using self-hosted LLMs for agentic programming?

I'm curious whether it makes any sense. My organization spends a fortune on tokens from US companies. I want to recommend something…

[–] eleijeep@piefed.social 29 points 3 weeks ago (1 children)

My organization spends fortune on tokens

Perhaps recommend that they spend the money on hiring competent staff instead.

[–] adhdsergio@lemmy.world 9 points 3 weeks ago (1 children)

Do you realistically think that's gonna work?

[–] noxypaws@pawb.social 9 points 3 weeks ago (1 children)

no, there's no possible way that hiring professional software engineers will result in software engineering being performed.

[–] adhdsergio@lemmy.world 3 points 3 weeks ago

Well WE know that, sure

[–] 87Six@lemmy.zip 11 points 3 weeks ago (2 children)

Models running within the constraints of a dev machine have no chance

If you want this, you need a company AI server with enough performance to support the entire team at once, and it will probably still be worse than using a cloud one. Though it MIGHT pay for itself in... a while.

[–] Venat0r@lemmy.world 19 points 3 weeks ago (1 children)

it MIGHT pay for itself in... a while

considering all the cloud ones are currently running at a loss, and hardware prices are way inflated: I doubt that.

[–] 87Six@lemmy.zip 6 points 3 weeks ago

If you think long term as a company that uses AI that's the way to go anyway, your own AI server.

But alas, nobody cares about the long term, because the cunts at the top of the AI stack always make sure to make things so volatile that the little guy can never survive past the short term.

The solution? Oh just pay the big corporations to be dependent on them and not build your own thing. Surely that will help.

You have to realize that, by your own words, AI subscription prices will skyrocket eventually. So the cost analysis of your own AI server has to take that into account too, not just the current prices and current upfront cost.
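A rough sketch of that cost analysis in Python, for anyone who wants to plug in their own numbers. Everything here is an assumption for illustration: the function name, the flat monthly running cost, and the idea that cloud spend grows by a fixed percentage each month. None of the figures come from this thread.

```python
def months_to_break_even(hardware_cost, monthly_running_cost,
                         monthly_cloud_spend, monthly_cloud_growth=0.0,
                         max_months=120):
    """Return the first month in which cumulative cloud spend reaches the
    cumulative cost of self-hosting, or None if that never happens
    within max_months."""
    cumulative_cloud = 0.0
    cumulative_local = float(hardware_cost)  # upfront cost is paid at month 0
    cloud = float(monthly_cloud_spend)
    for month in range(1, max_months + 1):
        cumulative_cloud += cloud
        cumulative_local += monthly_running_cost  # power, hosting, maintenance
        if cumulative_cloud >= cumulative_local:
            return month
        cloud *= 1.0 + monthly_cloud_growth  # assumed price growth per month
    return None


# Hypothetical inputs: 60k of hardware, 1k/month to run, 6k/month cloud bill.
flat = months_to_break_even(60_000, 1_000, 6_000)
rising = months_to_break_even(60_000, 1_000, 6_000, monthly_cloud_growth=0.05)
print(flat, rising)
```

The point of the growth parameter is the one made above: if cloud prices rise, the break-even month moves earlier, so a comparison at today's prices alone understates the case for owning hardware. It says nothing about quality differences between the two options.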

[–] Kkk2237pl@lemmy.world 3 points 3 weeks ago (3 children)

How about qwen 3.6 and MacBook with 64GB ram?

I thought about that AI server, but idk how to calculate how long it would take to pay for itself.

[–] DaTingGoBrrr@lemmy.ml 3 points 3 weeks ago

I am running qwen 3.5 locally using llama.cpp on 8gb of VRAM and 16 gigs of RAM. It works well enough with a 4B to 9B parameter model along with quantization and MTP. More optimizations are on the way with turboquant and possibly other tech.

It's just there to assist me, not do all the work, so I am happy as long as I can self host it.

I can't say how well my specs would work in a professional setting but for personal use a MacBook should be sufficient in my opinion.
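For a sense of why quantization matters on hardware like this, here is a back-of-the-envelope estimate of weight size. It is only a sketch: the function name is made up, it counts the weights alone, and real quantized files usually come out somewhat larger because they store scaling data alongside the weights. KV cache and runtime buffers come on top.

```python
def quantized_weight_gib(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# A 9B-parameter model at 16 bits vs. a nominal 4 bits per weight.
print(round(quantized_weight_gib(9, 16), 2))
print(round(quantized_weight_gib(9, 4), 2))
```

By this estimate a 9B model drops from roughly 17 GiB at 16 bits to a bit over 4 GiB at 4 bits, which is the difference between not fitting and fitting in 8 GB of VRAM with some room left for context.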

[–] 87Six@lemmy.zip 3 points 3 weeks ago (1 children)

I mean... RAM? Don't you need mass VRAM for this kind of thing? Or are they shared on Mac?

idk how to calculate how long it would take to pay for itself

You don't... Not in this industry. You guess and hope it goes in your favor.

No calculations matter if the market can jump or drop by 300% in a few months... And that applies to programming, hardware prices, AI subscription prices, regulations between countries when Trump is in office...

[–] SeductiveTortoise@piefed.social 8 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Apple's unified memory is shared across CPU, GPU and NPU. You can assign a lot of memory to running local models, and the bandwidth is good, depending on the model.

AMD has something similar with their something something AI CPUs and they go up to 128GB at the moment. Apple can be way faster though. And you were able to buy a Mac Studio with 512GB back when RAM wasn't worth more than unicorn pee. For... I guess 10k though.

[–] 87Six@lemmy.zip 3 points 3 weeks ago (1 children)

Apple unified memory shares

That's cool asf.

Apple engineers with better leadership could change the fucking world... But instead they're used to screw over their own user base.

If my GPU starts falling back to RAM my game fps drops to 1 lol.

[–] PoY@lemmygrad.ml 1 points 3 weeks ago (1 children)

it's shared, sure, but the bandwidth is crap compared to a dedicated nvidia card. the performance will suffer, even though it allows you to run larger quants
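The bandwidth point can be made concrete with a common rule of thumb: when generating text, a dense model has to read roughly all of its weights from memory for every token, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes exactly that and nothing more. The bandwidth figures are placeholders, not specs for any real chip or card.

```python
def decode_tokens_per_second_upper_bound(bandwidth_gb_per_s, model_size_gb):
    """Ceiling on generation speed for a dense model that streams all of
    its weights from memory once per generated token."""
    return bandwidth_gb_per_s / model_size_gb


# Placeholder bandwidths for illustration only, with a 20 GB quantized model.
for label, bandwidth in [("unified memory (hypothetical)", 400),
                         ("dedicated GPU (hypothetical)", 1000)]:
    print(label, decode_tokens_per_second_upper_bound(bandwidth, 20))
```

Real throughput lands below this ceiling, and prompt processing is limited by compute rather than bandwidth, so treat it as a way to compare machines, not a prediction.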

[–] 87Six@lemmy.zip 1 points 2 weeks ago

Oh...

[–] MagicShel@lemmy.zip 1 points 3 weeks ago

I run this setup with 36GB (32+4). Local LLMs can be really effective BUT you are constrained by context size in a way you aren't on cloud services.

Cline supports running a local model through LM Studio, but in my experience, when I feed it any significant task it just can't read and hold the context needed to build components for enterprise-scale applications.

I use Claude to write a lot of one-off utility scripts. With a maximum window of 1M tokens I can hit 30+% context just writing Python scripts. That context goes to API contracts, development standards, existing reusable modules, and sometimes the code/documentation of the services I'm going to be calling.

My MacBook can't handle 300k token contexts. 30k seems doable. I should see how it handles my utility script folder...
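To see why 300k tokens is so much harder than 30k locally, here is a rough KV cache estimate for a standard transformer with grouped-query attention. The layer count, KV head count and head size below are a hypothetical model shape picked for illustration, not the shape of any model named in this thread, and models that use sliding-window attention or a quantized cache will need less.

```python
def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB for a standard attention stack.
    The leading 2 accounts for storing both keys and values."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * bytes_per_value * context_tokens)
    return total_bytes / 2**30


# Hypothetical shape: 64 layers, 8 KV heads, head size 128, 16-bit cache.
for tokens in (30_000, 300_000):
    print(tokens, round(kv_cache_gib(tokens, 64, 8, 128), 1))
```

With this shape the cache costs 256 KiB per token, so it grows linearly from about 7 GiB at 30k tokens to about 73 GiB at 300k, on top of the weights themselves.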

Anyway that's still no Claude, but if you need a cheaper model and you can afford for developers to spend time on it before ultimately deciding they need to spend for Claude or Codex or Gemini, then running a local model on a beefy MacBook is 100% an option.

Stepping up from there to building a locally hosted LLM is probably the worst of all worlds. It will be a beefy CapEx, prone to saturation by all the users, and you will most likely still have to punt the hardest jobs to cloud AI. It can certainly be done and done well, but the best example I know runs on $250-500k worth of hardware (to service a pretty big number of users to be fair).

[–] eager_eagle@lemmy.world 9 points 3 weeks ago* (last edited 3 weeks ago) (2 children)

Qwen 3.6 and gemma4 models are the only ones my employer and I run locally that are usable for agentic programming sessions. They're less stable and slower than third-party services, even on much better hardware (as is the case at my employer). The best option, though, is to go with a provider hosting DeepSeek flash/pro if your privacy policy allows. It's going to be hard to beat their price.

[–] onlinepersona@programming.dev 2 points 3 weeks ago (1 children)

I thought those didn't support tool calling. Has that changed?

[–] eager_eagle@lemmy.world 4 points 3 weeks ago

they do

[–] adhdsergio@lemmy.world 1 points 3 weeks ago (1 children)

How many concurrent users and what hardware if i may ask?

[–] eager_eagle@lemmy.world 3 points 3 weeks ago* (last edited 3 weeks ago)

it's an h100, I think, no idea about how many users

in my personal setup i use quantized versions on a 3080, which is not great, so I still lean a lot on APIs

[–] FishFace@piefed.social 9 points 3 weeks ago (4 children)

As far as I understand, the only way to get anything resembling usable output for coding is with massive, expensive, laboriously hand-tuned models, not local ones.

[–] Jestzer@lemmy.world 6 points 3 weeks ago

^^^ This. Tragically, locally run LLMs don’t even hold a candle to “good” cloud-based LLMs like Claude Code.

[–] locuester@lemmy.zip 6 points 3 weeks ago

Qwen 3.6 27B dense is really good. Very usable coding output

[–] Kkk2237pl@lemmy.world 5 points 3 weeks ago (2 children)

I see that qwen 3.5 has pretty good performance and can be run on macbook with 64GB ram

[–] Penta@lemmy.world 13 points 3 weeks ago

Qwen 3.6 is even better

[–] SmoothLiquidation@lemmy.world 1 points 3 weeks ago

I have played with qwen3-coder:30b for my hobby stuff running on my M5 Max MacBook and it does alright. It is fast enough and I used Ollama tools to let it request files. I haven’t used anything like Claude Code to compare it to though, only a bit of the ChatGPT free tier stuff.

[–] irelephant@lemmy.dbzer0.com 3 points 3 weeks ago

Deepseek is pretty good the few times I tried it.

[–] spectrums_coherence@piefed.social 4 points 3 weeks ago

If you just want to avoid U.S. companies, you can try Mistral AI.

[–] HelloRoot@lemy.lol 4 points 3 weeks ago* (last edited 3 weeks ago)

GLM is pretty good in my experience. The company I currently freelance at runs it locally (in-house server room) for compliance reasons. But it needs very beefy hardware.

[–] Auster@thebrainbin.org 3 points 3 weeks ago

Even commercial, service-based LLMs demand a great effort of quality control, with their hallucinations and all. With local ones being mid in comparison from my experience, my suggestion would be that if your company does try implementing any, they should make sure to at least pay a nice extra to the QA folks.

[–] nark3d@thelemmy.club 3 points 3 weeks ago

There's a useful split lurking in this. For narrow agentic work like retrieval over internal docs, structured classification, test scaffolding, deterministic refactor passes, a self-hosted 30B-class model can be fine and the inference economics work out at team scale. For multi-step planning and the harder agent loops, the frontier gap still shows up in the number of retries and the time-to-correct-answer.

The honest test is to pick the prompt category that's costing you the most and benchmark something like Qwen 2.5 Coder 32B or DeepSeek V3 against whatever you're paying for now. If the gap is small you've found your candidate. If it isn't, you've at least costed the gap accurately rather than guessing at it.

The two costs people underestimate are the GPU box (plus a second one for the eval/staging path) and the maintenance overhead. Model picks go stale fast and someone on the team has to own that, or you end up shipping a Llama 3.1 stack into 2026 because nobody rebuilt the harness for whatever's current.
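A minimal sketch of the kind of benchmark described above, tracking the two things mentioned: retries and time to a correct answer. The `generate` and `check` callables are placeholders you would supply yourself (for example a wrapper around your local server or cloud API, and a test runner for the output). Nothing here is tied to a specific model or provider, and all names are made up for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class TaskResult:
    prompt: str
    solved: bool
    attempts: int
    seconds: float


def run_benchmark(generate, check, prompts, max_attempts=3):
    """Run each prompt until check() accepts the answer or attempts run out.
    generate(prompt) -> answer and check(prompt, answer) -> bool are
    caller-supplied placeholders."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        solved = False
        attempts = 0
        while attempts < max_attempts and not solved:
            attempts += 1
            solved = check(prompt, generate(prompt))
        elapsed = time.perf_counter() - start
        results.append(TaskResult(prompt, solved, attempts, elapsed))
    return results


def summarize(results):
    """Aggregate solve rate, mean attempts, and mean time for solved tasks."""
    n = len(results)
    solved = [r for r in results if r.solved]
    return {
        "solve_rate": len(solved) / n if n else 0.0,
        "mean_attempts": sum(r.attempts for r in results) / n if n else 0.0,
        "mean_seconds_when_solved": (
            sum(r.seconds for r in solved) / len(solved) if solved else None
        ),
    }
```

Run the same prompt set through both backends and compare the two summaries. If the local model needs noticeably more attempts per task, that shows up directly in `mean_attempts` and in the time figure, which is the gap you are trying to cost.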

[–] PoY@lemmygrad.ml 2 points 3 weeks ago

our CEO has been buying new hires desktop gaming machines for this reason. currently they don't have squat for graphics cards but once the rug is pulled from the cloud model pricing he said he'll spend the $10k per machine to put a 96gb vram card in people's machines to run shit locally

[–] mesamunefire@piefed.social 1 points 3 weeks ago

I've played around with a couple, mostly from Hugging Face. Some of the minimal models are halfway decent at SQL, and some specific ones are good with templates and HTML. You can string them up for agentic work without issue. I found the performance worse than generation tools for the same software tasks. It was neat to try though.