Selfhosted & AI (anarchist.nexus)

submitted 4 weeks ago by curbstickle@anarchist.nexus to c/selfhosted@lemmy.world


Yup, I'm posting another this week. Sorry.

This week I'm hoping we can wrangle a solution around AI and our selfhosted community. There are plenty of strong opinions (both pro and con), but one thing is for certain - there needs to be better disclosure in promo posts. Two options (that aren't mutually exclusive):

Any post about AI-focused, AI-developed, etc. software gets an [AI] tag. No, a [Not-AI] tag is not needed to accomplish this; that's kind of a "non-golfer" sort of tag.
A comment on every promo post requiring an AI disclosure response, if it's not detailed in the post itself. Specifics (generating docs for commands, translation, whole-boat vibe-coded this app, etc.) would be requested.

I will say that having disclosure and/or tagging would mean that comments that just say "slop" or "fuck ai" or whatever would be off topic at that point: that information is already provided, so it's just noise (and sometimes pretty uncivil; I've been light on that for now due to the need for a rule on this).

The [AI] tag would make it easy to filter out (or search for, if that's your thing), but there is a wildly different degree of AI use out there. Among the posts with a positive score, it's usually due to responsible AI use (translations, a snippet they had to do something obscure with, available to use with AI but doesn't require it, whatever), which is why I think the disclosure has a place as a benefit to everyone.

Please provide any input or alternative options on this, and I can then put it to a vote like the last one. Comments seem to be the best approach without involving something off-site, but if you have a better idea/option, please share.


[–] SuspiciousCarrot78@aussie.zone 2 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

I'm still impressed you got any MiMo to work at home, at 10 tok/s.

For those trying to visualise that -

https://mikeveerman.github.io/tokenspeed/?rate=10&mode=agent&think=10

Is it a constant 10 or does it (it must do, right?) drop off as context increases?

I imagine you must have compaction or something to mitigate that.

[–] brucethemoose@lemmy.world 2 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

It drops off, but not as much as you’d think.

MiMo uses 5:1 SWA, so its long-context compute doesn’t increase as catastrophically as older models. That, and most of the “slowness” comes from the MoE layers being on CPU (whereas the attention layers that get heavier at high context are all on the 3090).
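A rough way to see why a 5:1 SWA layout softens the long-context hit: count how many past positions each new token has to attend to, averaged over layers. This is only a sketch. It assumes "5:1" means five sliding-window layers for every full-attention layer, and the 4096-token window is a placeholder, not MiMo's actual setting.

```python
def attended_positions(context_len: int, window: int,
                       swa_layers: int = 5, global_layers: int = 1) -> float:
    """Average number of past positions one new token attends to, per layer.

    Sliding-window layers are capped at `window`; global layers see everything.
    """
    swa = min(context_len, window)
    full = context_len
    total_layers = swa_layers + global_layers
    return (swa_layers * swa + global_layers * full) / total_layers


WINDOW = 4096  # assumed window size, for illustration only

for ctx in (4_096, 32_768, 85_000):
    mixed = attended_positions(ctx, WINDOW)
    # Ratio of work versus a model where every layer uses full attention.
    print(f"ctx={ctx:>6}: avg positions/layer={mixed:>9.1f} "
          f"({mixed / ctx:.0%} of full attention)")
```

Under these assumptions the per-token attention work still grows with context, but only through the one-in-six global layers, which fits the "drops off, but not as much as you'd think" behaviour.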

That’s the beauty of these MoEs: they’re just the right size for the “compute-lite” parts to stay in CPU RAM.

I will measure it tomorrow. It is a constant ~9-10 TPS for short queries, but definitely slower near my current max context of 85K.

And do you mean prompt compaction? I don’t automate that; when I use that particular model, I tend to use it in Mikupad, aka “raw” notepad mode, and manipulate the context directly. This is so I can do things like chop out conversations, pick different tokens from the logprobs, or edit its own replies/thinking and continue mid reply.

I like manually handling this because, being a local model, prompts are cached. Streaming starts quickly if most of the prompt stays cached, which is actually a really nice advantage over APIs.
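To illustrate why manual context editing interacts with caching, here is a toy model of prefix reuse. It assumes the cache is only valid up to the first token that differs from the previous prompt; real inference servers may behave differently, and the function names are made up for this sketch.

```python
def shared_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading tokens that are identical in both prompts."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached: list[int], new: list[int]) -> int:
    """Tokens of the new prompt that cannot be served from the cached prefix."""
    return len(new) - shared_prefix_len(cached, new)


cached = list(range(1000))          # stand-in token IDs for the previous prompt

appended = cached + [7, 8, 9]       # continue the conversation at the end
edited_early = cached[:100] + [-1] + cached[101:]  # change one token near the start

print(tokens_to_reprocess(cached, appended))      # only the new tail
print(tokens_to_reprocess(cached, edited_early))  # everything after the edit
```

In this model, appending or editing near the end keeps almost all of the prompt cached, while an edit near the start forces most of it to be processed again.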

[–] SuspiciousCarrot78@aussie.zone 2 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Oh, it's a MoE? That makes sense.

If you're getting MiMo at -ctx 85K ... you're within spitting distance of SOTA. You can do real work with that.

I take it MiMo doesn't do the Qwen "hyperventilate into a paper bag" loop as --ctx increases. Qwen models seem to be really sensitive to that at lower quants.

I'm using 27B via OR API and I swear the diff providers use entirely diff quants. Sometimes you get a genius and other times a drooling mess.

[–] brucethemoose@lemmy.world 3 points 3 weeks ago* (last edited 3 weeks ago)

They 100% do. They’re probably serving “naive” FP8 via vLLM, which is worse than you’d think, especially if they flip on the awful FP8 KV cache.

In a local quant, you can stop quantized models from falling apart at higher CTX by leaving the attention layers at a higher-precision quantization. As an example, with MiMo 2.5, I have all the MoE MLP layers at IQ3_KT, the dense experts at Q6_K, but all the attention layers at Q8_0.
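A back-of-the-envelope sketch of why this kind of mix is cheap: the attention tensors are a small share of the weights, so keeping them at Q8_0 barely moves the average bits per weight. The bits-per-weight figures below are approximate, and the parameter shares are invented for illustration; they are not MiMo's real breakdown.

```python
import math

# Approximate bits per weight for each quant type (treat as ballpark values).
BPW = {"IQ3_KT": 3.125, "Q6_K": 6.5625, "Q8_0": 8.5}

# Hypothetical share of total parameters in each tensor group.
# These fractions are made up for the example and must sum to 1.
mix = [
    ("routed MoE MLP layers", 0.90, "IQ3_KT"),
    ("dense experts",         0.04, "Q6_K"),
    ("attention layers",      0.06, "Q8_0"),
]

assert math.isclose(sum(share for _, share, _ in mix), 1.0)


def average_bpw(groups) -> float:
    """Parameter-weighted average bits per weight across tensor groups."""
    return sum(share * BPW[quant] for _, share, quant in groups)


for name, share, quant in mix:
    print(f"{name:<24} {share:>5.0%} at {quant:<7} ({BPW[quant]} bpw)")

print(f"average: {average_bpw(mix):.3f} bpw")
```

With these made-up shares, the whole model lands only a little above the IQ3_KT baseline even though the attention layers sit at Q8_0.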

For Qwen 27B, I’m still experimenting, but leaning towards IQ4_KT for the MLPs, Q6_K for attention, and Q8_0 for the small, very sensitive KV heads. Or a similar scheme as an exl3 quant.

That being said, sometimes even unquantized models fall apart in certain long context scenarios because the max advertised context is a lie. You just have to test them and see, but Qwen has certainly done this in the past.