Huawei outperforms NVIDIA at the "cluster" level. These are mostly turnkey systems for datacenter units, and Huawei is promising a truck-container-scale cluster for its next generation with 30x the zettaflops of NVIDIA's Rubin cluster. China currently operates at around 50% of its electricity production capacity, so energy is extremely abundant and cheap, which makes the per-card performance deficit largely irrelevant.
To be fair, the raw FLOPs count doesn't tell the whole story. On a lot of workloads (including token generation during LLM inference), you're bound by memory bandwidth rather than throughput/FLOPs. On H100/H200, keeping the tensor cores fully occupied is surprisingly difficult, and that's with 3+ TB/s of memory bandwidth. And I believe those cards have much higher throughput than Ascend (at least at FP8; Ascend wins at FP4 since H100/H200 don't support it).
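The bandwidth-vs-compute point can be made concrete with back-of-envelope roofline arithmetic. This is only a sketch: the spec figures below are rough public numbers, and the "2 FLOPs per weight byte per batched token" decode model is a simplification.

```python
# Back-of-envelope roofline: is LLM decode compute- or bandwidth-bound?
# Spec numbers are rough public figures for an H100 SXM; treat them as assumptions.

peak_flops = 1979e12   # dense FP8 throughput, FLOP/s (approx.)
mem_bw     = 3.35e12   # HBM3 bandwidth, bytes/s (approx.)

# Ridge point: FLOPs you must do per byte moved to saturate the ALUs.
ridge = peak_flops / mem_bw
print(f"ridge point: {ridge:.0f} FLOPs/byte")

# Decode with batch size B: every weight byte is read once per step and does
# roughly 2*B FLOPs (one multiply-add per token in the batch, 8-bit weights).
for batch in (1, 8, 64, 512):
    intensity = 2 * batch
    bound = "compute" if intensity >= ridge else "bandwidth"
    print(f"batch {batch:4d}: {intensity:5d} FLOPs/byte -> {bound}-bound")
```

With those assumptions, single-stream decode sits two orders of magnitude below the ridge point, which is why the tensor cores idle unless you batch heavily.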
The Ascend 950PR units have far lower memory bandwidth, reportedly 1.4 TB/s. Compare that to Blackwell, which has something like 8 TB/s. I believe they're manufacturing their own kind of HBM, so that's still really impressive considering this is a fairly recent push into manufacturing accelerators. But I'm a bit skeptical it actually outperforms NVIDIA at scale.
Huawei's clusters have close to 4x the RAM of NVIDIA's, and TFLOPs matter most for training. Huawei has better interconnect technology than NVIDIA, though it's incompatible with H200s, so for China and friendly markets it's a much better package. Price/performance of the 910 vs a 5090 or 6000 Ada is much better at the single-card level. Power cost and availability in China give them much higher potential deployment rates. Chinese cloud rates tend to be lower than the same model on US clouds.
Yeah I can believe their interconnect is better, given their extensive history in networking.
W.r.t. TFLOPs, let me clarify what I meant. Even on traditionally compute-bound workloads (attention, etc.), on H200 it's actually surprisingly difficult to make full use of the card's throughput before hitting VRAM bandwidth limits. Tensor core throughput has grown a lot faster than bandwidth has.
I've never written a kernel for Huawei chips so I have no idea if they have the same problem. But this problem is there on many datacenter-class NVIDIA chips, which is why they keep introducing features (TMA, TMEM, etc.) to try and lower the time wasted waiting for memory.
It almost feels like the Trump administration is trying to help Chinese companies at this point.
So many things they're doing feel like this. A few things I could see being a conscious and planned deal, but most completely go against any interest they have, so it's clearly just an unbelievable level of incompetence, stupidity and ignorance.
I have a hard time thinking it's not deliberate. They're trying to protect US car companies by putting tariffs on Chinese car imports, but what that's doing is leading to complacency and a lack of competition. Ultimately harming the US and setting us back while allowing the wealthy shareholders to hoard wealth while not needing to worry about staying competitive.
Chinese companies are heavily incentivized to use Chinese chips instead of American since Trump blocked trade with China.
China used to parallel import the chips they needed, and even repackage them with more onboard RAM, making more powerful Nvidia solutions available in China than in the rest of the world.
But Trump's behavior towards China made the Chinese government decide to limit the use of American technologies for AI.
There was a point where Nvidia exports to China were basically at a standstill, because China forbade the purchase of a new cut-down Nvidia chip made for the Chinese market to circumvent American trade restrictions.
China is building their own complete stack now, replacing everything with Chinese technologies, right from the AI chips to the entire AI software framework.
So not only do Nvidia and other American companies lose hardware sales; the entire stack will be threatened by a Chinese alternative that will likely compete with American options on the international market in the future. If CUDA loses its current dominance, it will be easier for competitors to take market share from Nvidia.
Hopefully this will be good for consumers worldwide.
Please do. Europeans are eager to ditch the Americans. In the long run the Chinese seem to be a more reliable partner.
I would prefer it wasn't like this. Pax Americana seemed to work quite well for several decades; of course the USA served their own interests, but they also provided a somewhat stable world order with a decent degree of freedom.
Now they have abandoned the ideals of freedom, democracy and international law to serve their own interests exclusively, at immense cost to others, without regard for either law or decency. And of course that is not something it's sustainable to be an ally of.
I think the USA will soon find that without allies, their power isn't so great after all.
You are talking about freedom and peace, but only for the global north.
It wasn't like that for the rest of the world.
They were infiltrated by Epstein-class assets, and that fucked the system. Just look at the tax rates of these deep fucks today.
I think it's chicken-or-egg with the US government and dirty-money scumbags.
For a short period, rich fuckers were taxed enough. And there's still the fact that the growth of capitalist countries was sustained by colonized countries' resources.
Qwen is already the standard for actual pros as far as I can tell.
It's only the standard for people who self-host their LLMs and don't have $500k to throw at hardware for GLM-5.1 or similar models.
I have qwen3.6:27b on my local hardware and it’s way better than I expected. I’m excited for the rest of the 3.6 line as it comes out, if they can keep up that quality.
This story is also a nothingburger. Generally, yes, Nvidia will suffer once China's stack catches up (soon). By then, whatever bubble we are in will have normalized one way or the other.
In terms of actually deploying this model, it doesn't matter what hardware you're using. vLLM supports almost everything with SIMD-style hardware instructions.
More competition will make everyone happy except Nvidia shareholders.
Gemma4:26b is also worth trying. I find it runs much faster on my hardware
Been using Qwen 3.x for a while now for local LLM with search capability. The 3.5 and 3.6 ones are great and run very fast.
I got Qwen 3.5 running on a Steam Deck.
It ain't exactly blazing fast, but it does actually work.
(Reasonably fast if you go down to the 2B param model, I can get the 9B param variant working, though this makes Steam Decky very hot and bothered.)
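For anyone curious why the 2B model is comfortable while the 9B one pushes the Deck, here's a rough sizing sketch. It assumes ~4-bit quantized weights and a ~20% overhead fudge factor for KV cache and runtime; real usage varies a lot by backend and context length.

```python
# Rough memory estimate for a quantized local model.
# 4-bit weights + 20% overhead is an assumption, not any backend's spec.
def approx_mem_gb(params_billion, bits_per_weight=4, overhead=1.2):
    return params_billion * bits_per_weight / 8 * overhead

for p in (2, 9, 27):
    print(f"{p}B params @ 4-bit: ~{approx_mem_gb(p):.1f} GB")
```

On a 16 GB unified-memory device like the Steam Deck, that puts a 2B model at roughly 1 GB, a 9B model around 5 GB (feasible but leaving little headroom alongside the OS), and 27B out of reach.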
Yeah, you absolutely do not need Nvidia hardware to run an LLM, but we get blasted with their propaganda suggesting otherwise all the time in the English-speaking West.
Because if you don't need Nvidia, well, then, this whole AI bubble looks a lot more bubbly.
Take good care of your hw! It's not like 2 years ago when you could buy stuff off the shelf for reasonable prices. :D
My Steam Deck is my child.
Maybe if I can get it to run a 'good enough' LLM, and also a robotics kinematics suite...
I can just start building DOG, with a Steam Deck for a face, instead of a Combine scanner bot.
Gemma 4 seems nice for local usage, way faster than Qwen models.
I was able to run 27B Gemma on my PC, where 14B Qwen was too slow due to CPU offload.
Qwen 3.6 27B is probably the most powerful/efficient (for its size) model out there. Qwen also has a history of leveraging DeepSeek's work (DeepSeek creating small models with Qwen as the base), and Alibaba is the main hosting service for DeepSeek. Alibaba/Qwen are in talks to invest in DeepSeek at the moment.
Yeah. The 80b Coder-Next runs at about the same speed on my hw too. I don't know if it's any better than 3.6 27b.
What's left unsaid is that the software architecture is extremely interesting, and efficient.
Ironically, the Nvidia embargo was the best thing to ever happen to the Chinese labs (which Nvidia tried to tell the US govt). It forced them to get thrifty, unlike US labs which (allegedly) fill some GPU farms with busywork for the appearance of high utilization.
Sorry if this is a dumb question, but is this just for training or does DeepSeek v4 now require these chips to run?
I don't think they only run on these chips. There are some companies in the US that provide Deepseek V4 presumably running on standard Nvidia chips.
Well I’ve got three 512gb Mac Studios in an EXO cluster I’m gonna see how it works.
It works for inference, which open backends are generally good at. Most models are built in PyTorch, with a backend library underneath.
You can run it on CPU alone. Not surprising they’re building their own AI ecosystem
It's still matrix multiplication. Running it on a general purpose CPU is inefficient.
I mean, sure. You could also run it by drawing marks in sand. It doesn't make any sense to do either, though.
Not at scale. Even on the new architecture, one really needs some kind of accelerator to make it economical for servers.
Bitnet-like models might change the calculus, but no major trainer has tried that yet.
Is that why Jensen was telling people not to leave him?