this post was submitted on 27 Jun 2026
Selfhosted
Oh, it's a MoE? That makes sense.
If you're getting MiMo at -ctx 85K ... you're within spitting distance of SOTA. You can do real work with that.
I take it MiMo doesn't do the Qwen "hyperventilate into a paper bag" loop as --ctx increases. Qwens seem to be really sensitive to that at lower quants.
I'm using 27B via OR API and I swear the different providers use entirely different quants. Sometimes you get a genius and other times a drooling mess.
They 100% do. They’re probably serving “naive” FP8 via vLLM, which is worse than you’d think, especially if they flip on the awful FP8 KV cache.
In a local quant, you can stop quantized models from falling apart at higher context by leaving the attention tensors at a higher-precision quantization. As an example, with MiMo 2.5, I have all the MoE MLP layers at IQ3_KT, the dense experts at Q6_K, but all the attention layers at Q8_0.
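A rough sketch of that kind of per-tensor scheme, written as an ordered rule table. The tensor-name patterns below are assumptions modelled on GGUF-style names (`attn_*`, `ffn_*_exps`) and may not match MiMo's real tensor names; `RULES`, `DEFAULT`, and `pick_quant` are hypothetical helpers for illustration, not part of any quantization tool.

```python
import re

# Ordered (pattern, quant type) rules; the first matching pattern wins.
# ASSUMPTION: tensor names look like GGUF-style "blk.N.<kind>.weight".
RULES = [
    (r"\.attn_", "Q8_0"),            # all attention tensors stay near-lossless
    (r"\.ffn_.*_exps\.", "IQ3_KT"),  # routed MoE expert MLPs, the bulk of the weights
    (r"\.ffn_", "Q6_K"),             # remaining dense / shared MLP tensors
]

# ASSUMPTION: fallback type for anything no rule matches (embeddings, output head).
DEFAULT = "Q6_K"


def pick_quant(tensor_name: str) -> str:
    """Return the quant type for a tensor name using the first matching rule."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return DEFAULT


if __name__ == "__main__":
    for name in (
        "blk.3.attn_q.weight",
        "blk.3.ffn_up_exps.weight",
        "blk.0.ffn_down.weight",
        "token_embd.weight",
    ):
        print(f"{name:28s} -> {pick_quant(name)}")
```

Rule order matters here: the expert pattern has to come before the generic `ffn_` pattern, otherwise the routed experts would be caught by the dense rule and end up at Q6_K.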
For Qwen 27B, I’m still experimenting, but leaning towards IQ4_KT for the MLPs, Q6_K for attention, and Q8_0 for the small, very sensitive KV heads. Or a similar scheme as an exl3 quant.
That being said, sometimes even unquantized models fall apart in certain long context scenarios because the max advertised context is a lie. You just have to test them and see, but Qwen has certainly done this in the past.
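One simple way to "test them and see" is a needle-in-a-haystack probe: bury one fact in filler text at different depths and prompt sizes, then check whether the model can still repeat it. This is only a minimal sketch; `generate` is a hypothetical callable (prompt in, reply text out) that you would wire to your own local server or API, and the filler sentence, the secret code, and the line-count sizing are arbitrary assumptions. Passing this probe does not prove a model is good at long context, but failing it is a clear warning sign.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside filler lines."""
    lines = [filler] * n_filler
    lines.insert(int(depth * n_filler), needle)
    question = "What is the secret code? Answer with the code only."
    return "\n".join(lines) + "\n\n" + question


def probe(
    generate: Callable[[str], str],  # HYPOTHETICAL: your own model call
    sizes: Iterable[int],            # number of filler lines per prompt
    depths: Iterable[float] = (0.1, 0.5, 0.9),
    code: str = "7491",
) -> Dict[Tuple[int, float], bool]:
    """Return {(size, depth): True if the reply contained the code}."""
    needle = f"The secret code is {code}."
    filler = "The sky was grey and nothing much happened that day."
    depths = tuple(depths)
    results = {}
    for n in sizes:
        for d in depths:
            reply = generate(build_prompt(needle, filler, n, d))
            results[(n, d)] = code in reply
    return results


if __name__ == "__main__":
    # Stand-in "model" that forgets everything past 10,000 characters.
    def fake_generate(prompt: str) -> str:
        return "7491" if len(prompt) < 10_000 else "no idea"

    for (size, depth), ok in probe(fake_generate, sizes=[10, 1000]).items():
        print(f"lines={size:5d} depth={depth:.1f} found={ok}")
```

Increase `sizes` until the prompt approaches the context length you actually plan to run, and compare the same probe across quants or providers.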