brianolson 16 hours ago [-]
> Why aren’t these AI companies submitting to the TOP500 to show off their computing prowess?
my knowledge is 10+ years out of date, but once upon a time if they'd chosen to, Google could have had _several_ entries in the top 10 of the TOP500 list
It's just poker, they didn't want to tip their hand
davidmr 14 hours ago [-]
I’ve worked on several systems that had enough flop/s to make it in the top 5-10, but for which we never submitted benchmarks. Sometimes their backend network layout technically would make them several smaller clusters for an HPL run, sometimes it’s because the cluster is too heterogeneous to get a good benchmark result, and sometimes it’s because the employer wants to keep a low profile.
Most of the time, it’s just that it’s a hassle. It takes a while to prep and tune a big hero run for benchmarking, and if you spend a billion dollars on a cluster, it’s making you a lot more than that. Taking it down for a day or two stops the money printers.
fragmede 5 hours ago [-]
What programs were yours running to print money?
ziofill 15 hours ago [-]
Also, would those 550k Blackwell have good FP64 performance? How would one even compare them?
NitpickLawyer 9 hours ago [-]
Yeah, that's true. GPU flops are really impressive in fp16, and more recently fp8 and fp4. When the 40xx GPUs came out, Tim Dettmers had a really cool blog looking into the numbers, and for fp8 one 4090 GPU had enough flops to match the best supercomputer somewhere in the 2000s, with an 8x 4090 build being top for a few years more. It's insane. But it gets nowhere close on fp64, where most of the physics simulations and other usual supercomputer tasks shine.
fragmede 3 hours ago [-]
Other than weather simulations and nuclear explosions, what other supercomputer tasks are out there?
NitpickLawyer 2 hours ago [-]
Although probably underfunded compared to the two that you listed, cosmology is probably up there as well. Things like early universe and galaxy formation, megastructures, etc.
There's likely also some need for fusion plasma containment and other related simulations.
wanderingmind 5 hours ago [-]
My sense is you only submit if you are in the business of selling supercomputing clusters (IBM, Cray). If you are a consumer or build to consume internally, you would care less.
JumpCrisscross 11 hours ago [-]
Is there international value to these designations? As in, would it be worth it for the U.S. to pay a bonus to anyone who qualifies into the TOP500, to offset the cost of the run?
jandrewrogers 6 hours ago [-]
Most of the US systems in the TOP500 are funded by the US government. It isn’t considered a meaningful demonstration of capability by most people in the know.
iberator 15 hours ago [-]
Cloud computing is not a supercomputer. Different architecture, bandwidth, interconnectivity and latencies.
dgacmu 15 hours ago [-]
That's not nearly as true when you look at AI training clusters. They're basically supercomputers but without an FP64 focus.
(These are the systems to which GP was referring at Google.)
cynicalkane 15 hours ago [-]
Even before AI training clusters became important, Google has had an outstanding custom fabric (there's papers about it) together with the ability to tune NICs for their own cases, and "their own cases" meant nearly everything engineered within Google. Ethernet hardware has had low kernel latency and DMA for a long time; it's the rest of the stack that hurts. But as far back as the early 2010s (if not further back, that goes beyond my knowledge horizon), you could just make it not hurt, if you had the software engineers to do it.
mrlongroots 8 hours ago [-]
All that would not help you with an AI training cluster interconnect. See Amin Vahdat's keynote at HotInterconnects 2025. Everyone is building a fabric for this stuff from scratch (Google/Falcon, Amazon/EFA, Azure/MANA, Cornelis/CN5000, and obviously Mellanox).
jeffbee 15 hours ago [-]
I thought TPUs couldn't reasonably run LINPACK at all because TPUs do not acknowledge that FP64 exists.
I know Google wants to compare their stuff to El Capitan or whatever but the comparison does not seem valid to me.
wmf 15 hours ago [-]
Historically there have been a bunch of clusters on the Top 500 that weren't used for HPC. The tell is that they used Ethernet (this was before RoCE). It's less efficient but you can still get an OK Linpack score.
ls612 13 hours ago [-]
Why would the scientific computing people want to tip their hand? It’s an open secret that the main point of these mammoth FP64 compute machines is to simulate nuclear weapons detonations to comply with the CTBT. You’d think that crowd would really not be fans of broadcasting their capabilities.
kube-system 12 hours ago [-]
In adversarial scenarios, there are varying strategies in communicating one's capabilities, just as one might do in a poker game.
Sometimes you want to show off what you can do to dissuade others from fucking with you. Sometimes you want to undersell your capabilities to hide your true ability. Sometimes you want others to think you are underselling your capabilities when you are actually at a disadvantage.
mrlongroots 8 hours ago [-]
It is partly this and partly a funding vehicle for American next-gen computing. It is not that hard to estimate FP64 ballpark from a whole bunch of public statistics. And it takes a looot more than raw FLOPs to get a simulation working. And presumably a looot more to translate it into practice. And the openness makes it easier to talk to different vendors and not get in the way of them having all the H1Bs it takes to get these things to work.
Plus one thing I like to say is that if a bullet is flying towards you, you could know everything about the chemistry of the gunpowder and the composition of the alloy without it affecting what happens next.
dopa42365 11 hours ago [-]
What for? You only need to match the performance that existed in the 1950s. In the Soviet Union. Everything else is a lack of knowledge rather than computing power.
Also you should read the second sentence of the CTBT Wikipedia article to find out why it's not even in force (spoiler: US hasn't ratified it).
Onavo 11 hours ago [-]
At some point, you will get diminishing returns no? I don't think compute is the bottleneck right now for mechanical engineering if you don't count AI.
jandrewrogers 15 hours ago [-]
TOP500 hasn't been a particularly useful measure of practical computing power in modern systems for many years because what it measures isn't a significant bottleneck in most real systems. It has become a measure of how much money someone is willing to spend for bragging rights. (HPCG is better in that it is a bit more bandwidth focused but still pretty narrow.)
Most companies with huge systems don't participate.
adrian_b 7 hours ago [-]
This seems like a "sour grapes" comment.
The new Chinese supercomputer beats all US supercomputers also in HPCG, not only in Linpack.
What is remarkable is that this was done despite the US attempts of sabotaging HPC in China by "sanctions".
This uses custom CPUs designed in China, which implement an Armv9-A ISA with SME (scalable matrix extension) and which use fast HBM memory. These CPUs are fast enough that they do not need any GPUs for exceeding the throughput of the American supercomputers, which use GPUs. This is like in the Japanese Fugaku, which was the first to implement the Armv8-A ISA with SVE, but which now is rather old.
Like in all CPU-based supercomputers, for this new Chinese supercomputer it is much easier to reach a higher percentage of the theoretical maximum throughput, when solving any problem. So for most practical problems it will be faster than a GPU-based supercomputer that would have the same theoretical maximum throughput.
So this is a much more interesting supercomputer than those built by just buying some HPC racks from HPE (Cray). Because China was forbidden to buy the American equipment, they had to innovate and design their own. Eventually they made something better than what they could not buy.
jandrewrogers 6 hours ago [-]
It isn’t “sour grapes”, I remember when the HPC community largely abandoned these benchmarks two decades ago because they weren’t representative of anything real for most of them. The benchmark is a poor reflection of real workloads. There was a long period when the STREAM benchmark was the primary correlate with real-world performance for most HPC workloads but you can’t build a press release from that.
I don’t have a dog in this fight and I no longer work in HPC. Most modern workloads are severely bandwidth bound. The only aspect of the hardware that matters is bandwidth and that is not materially differentiated. The frontier is scheduler design, which is pure software and difficult computer science. HPC competitions avoid problems with a software solution because it isn’t in their interest as hardware manufacturers.
This result is impressive, sort of, but not in the way people are imagining. I was equally dismissive of the previous leader for the same reasons. For most applications, these benchmarks are legacy pageantry.
adrian_b 5 hours ago [-]
As I have mentioned, and as described in TFA, this supercomputer is also leading in memory bandwidth (4 TB/s per socket => correction, it is 8 TB/s per socket).
I agree with what you say about benchmarks, but that is precisely why the advantage of this supercomputer over the following American supercomputers will be even greater in more demanding workloads than Linpack and HPCG.
It has already shown this by having an advantage in HPCG of greater than 26% over the fastest US system, while in Linpack its advantage is of only 22%.
Thus its position in the top cannot be dismissed as insignificant, because it more likely underestimates than overestimates this system.
Also the programming effort for writing an efficient program will be lower than for the GPU-based US supercomputers.
jandrewrogers 5 hours ago [-]
I definitely appreciate the CPU-centric approach. It appeals to my biases and aligns with my technical perspective.
The memory bandwidth is something you can buy. Exotics were >1 TB/s over a decade ago, so 4 TB/s in 2026 is not that impressive. For all practical purposes, these CPUs are also still exotics, you can’t just buy them. I would be very surprised if the memory bandwidth of US exotics hasn’t improved over the last 10-15 years.
In any case, for real workloads scalability is mostly a software theory problem at this point and that is still a dark art without much literature.
adrian_b 5 hours ago [-]
Correction, I took the 4 TB/s from TFA, but the NextPlatform article clarifies that it is 4 TB/s per chiplet, but 8 TB/s per socket, so more impressive.
The only US-designed CPU "exotics" are the Intel Xeon Max CPU series, which use HBM like the Chinese CPUs, but which have a theoretical maximum throughput of only 1.6 TB/s per socket, i.e. 5 times slower than the new Chinese CPUs.
Moreover, the users of Intel Xeon Max complained that they cannot reach the theoretical memory bandwidth. I do not know if that was due to some bug that might have been solved later by Intel with a microcode update or a new mask set stepping.
The server CPUs with standard DIMMs, which will be launched by AMD and Intel next year, will have a memory bandwidth of around 1 TB/s per socket.
The AMD MI300 GPU used in the fastest US supercomputer has a memory throughput of 5.2 TB/s per socket, so lower than the 8 TB/s per socket of the Chinese CPU, which explains why the advantage of the Chinese system increases in the benchmarks more dependent on memory performance.
The latest AMD Instinct GPU, MI355X, increases the memory bandwidth to 8 TB/s, so equal to the Chinese CPU.
However, it may pass some time until someone will build such a big system with MI355X, though perhaps the existence of this new contender might prompt the US labs to upgrade their systems by replacing the older AMD GPUs with newer AMD GPUs.
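The ratios quoted in this comment can be checked with a few lines of arithmetic. This is only a sketch using the per-socket figures as stated above; the dictionary keys are illustrative labels, and the ~1 TB/s figure for next year's DIMM-based server CPUs is the comment's own estimate.

```python
# Memory bandwidth figures in TB/s, exactly as quoted in the comment above.
bandwidth_tb_s = {
    "Chinese CPU (per socket)": 8.0,
    "Intel Xeon Max (per socket)": 1.6,
    "AMD MI300 (per socket)": 5.2,
    "AMD MI355X": 8.0,
    "Next-gen DIMM server CPU (approx.)": 1.0,
}

# How many times more bandwidth the Chinese CPU has than each other part.
reference = bandwidth_tb_s["Chinese CPU (per socket)"]
ratios = {name: reference / bw for name, bw in bandwidth_tb_s.items()}

for name, ratio in ratios.items():
    print(f"{name}: {bandwidth_tb_s[name]:.1f} TB/s, ratio {ratio:.2f}x")
```

With these inputs the Xeon Max ratio is 5x, matching the "5 times slower" claim, and the MI300 ratio is roughly 1.5x.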
bee_rider 15 hours ago [-]
I wonder if there would have been an opportunity to generate some finer-grained benchmarks with something like BiCGStab+ILU (or maybe CG+incomplete cholesky). Instead of CG+Gauss Seidel. The pitch being, you might have made different memory vs compute trade-offs with designing your cluster, but you should be able to select a fill-in factor for the preconditioner to suit it.
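For readers unfamiliar with the solver family being discussed, here is a minimal sketch of preconditioned conjugate gradient. It is not HPCG: it assumes a tiny 1D Laplacian as the matrix and swaps in a Jacobi (diagonal) preconditioner for brevity, where HPCG uses Gauss-Seidel and the comment proposes ILU or incomplete Cholesky. The names `laplacian_matvec` and `pcg` are made up for illustration.

```python
import math

def laplacian_matvec(v):
    """Return A @ v for A = tridiag(-1, 2, -1), without storing A."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(matvec, b, inv_diag, tol=1e-10, max_iter=1000):
    """Preconditioned CG for an SPD system, with a Jacobi preconditioner.

    inv_diag holds 1 / A[i][i]; a stronger preconditioner (Gauss-Seidel,
    ILU with some fill-in) would replace the two lines that compute z.
    """
    x = [0.0] * len(b)
    r = list(b)                                   # residual b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # apply preconditioner
    p = list(z)
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        Ap = matvec(p)                            # the memory-bound step
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) < tol:
            return x, k
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

n = 5
b = laplacian_matvec([1.0] * n)   # right-hand side whose exact solution is all ones
x, iters = pcg(laplacian_matvec, b, [0.5] * n)
print(x, iters)
```

The tunable part in the proposal would be the preconditioner: more fill-in costs memory and bandwidth per iteration but should reduce the iteration count, which is the trade-off a cluster designer could then pick to suit the machine.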
jandrewrogers 6 hours ago [-]
I think you could build more representative benchmarks that capture capability better. The tension is that HPC companies are pure hardware companies and you need a lot of help from software to make your hardware look good. They don’t like that. Most of our software sucks at the scales they try to benchmark.
Ironically, the related Graph500 benchmarks reflect this better. Performance is dependent more on using the hardware better than better hardware per se.
> We think it is highly likely that these LX2 chiplets are etched using SMIC 7 nanometer processes at the N+3 refinement, and we base that on the fact that the chip only runs at 1.55 GHz. That is nowhere near the 3 GHz that SMIC can push with that process, but it is probably lower to get the memory and core speeds more balanced. [1]
Despite what it says at that link, it is more likely to be based on Armv9.3-A ISA, because it supports SME.
In the CPU cores designed by the Arm company, SME has been added only in the latest generation of Armv9.3-A CPUs, which was launched last year.
For each level of Armv9, there are many mandatory features and many optional features.
If the Chinese CPU does not implement all the mandatory Armv9.3-A features (and we do not know anything about this), then it will still be considered only an Armv9.2-A CPU, but even in that case it should be referred to as an Armv9.2-A + SME, in order to not confuse it with the Armv9.2-A CPUs that have been used for a few years in smartphones, laptops and mini-PCs and which do not have SME, so they cannot have a comparable performance.
chrisss395 12 hours ago [-]
I haven't kept up with the latest on supercomputing power, but I recall some years ago there being strong evidence that China had a couple of un-announced supercomputers that would have topped the charts. It makes me wonder what is publicly disclosed vs. actual.
Retr0id 14 hours ago [-]
Interesting to see PAC mentioned on the slide, I'd have assumed security features would be a waste of transistors on something so compute-optimized - but maybe they want to isolate workloads from each other?
fragmede 8 hours ago [-]
Yes, exactly. It means you can sell isolated work units. With cloud being, well, cloud, if you can do that, you then have a market, with bidders able to bid higher for more capacity right now, vs a lower bid for eventually.
Extremely impressive accomplishment considering they did this with Chinese interconnects and Chinese chips. This is a wake up call.
jandrewrogers 15 hours ago [-]
TOP500 can be done with inexpensive silicon. It is more about a willingness to aggregate enough hardware in one place. As a benchmark, it tells you almost nothing about computing power or scalability for other applications because it doesn't exercise the bottlenecks most high-scale applications have.
NitpickLawyer 9 hours ago [-]
> TOP500 can be done with inexpensive silicon.
Didn't the DoD at one point build a 1k+ PS3 cluster based on their multi-core chip and had a mini supercomputer CotS?
I remember Sony not liking that people were buying them for other things rather than gaming (iirc they were losing money on hardware at the time) so they bricked linux support soon after.
packetslave 8 hours ago [-]
> Didn't the DoD at one point build a 1k+ PS3 cluster based on their multi-core chip and had a mini supercomputer CotS?
The Air Force did (Condor) and it hit #33 on the 2010 Top500.
The last time when China had the fastest supercomputer, it was more than 20 times slower than this one and more than 8 times less efficient in energy consumption.
Moreover, its capability was overestimated by the Linpack benchmarks and in other workloads its performance was much less impressive.
For this system, it is the opposite situation. Its result in Top500 underestimates its capability. In other more demanding workloads, where the influence of the memory bandwidth and latency is stronger, its advantage over the US supercomputers is greater than in Top500.
numpad0 3 hours ago [-]
It's the same. Nobody was buying POWER9 or A64FX systems, and manufacturers of those are no factor to anything.
One could theoretically drive home with a ready-to-go rack of non-American and/or non-x86 supercomputer nodes at any point in time across the last few decades, sometimes even with non-NVIDIA/AMD massively parallel coprocessor cards. Nobody did.
If China (or any country) would _ship_ this alternative supercomputer hardware, only then could anything change.
echelon 16 hours ago [-]
We're too busy regulating the tech, not granting access to US engineers and companies, arguing against power and data centers, stopping skilled immigration.
This is absolutely going to bite us in the face in five to ten years.
2OEH8eoCRo0 16 hours ago [-]
Separate issue that has nothing to do with US manufacturing or HPC. I think our retreat from science funding and offshoring advanced manufacturing is a bigger issue.
charcircuit 1 hours ago [-]
Shouldn't AWS be considered the Number 1 super computer? The amount of compute, storage, memory, etc available is ginormous.
ziofill 15 hours ago [-]
> Two cores are disabled per cluster.
I’m sure there is a good reason for this, which is..?
jandrewrogers 15 hours ago [-]
It is likely that those cores are dedicated to unrelated management, monitoring, and administrative tasks. This is common and many workloads are throttled on bandwidth anyway. For the purposes of the benchmark, those cores are not participating in the workload.
brianolson 14 hours ago [-]
Yield. Some fraction of cores had a speck of dust or something, but at 38/40 good cores per chip they got economical yield
tjhei 14 hours ago [-]
And then even if some nodes had 40/40 "good" cores, it would make load balancing a lot more complicated if core counts vary. Easier to turn them off at the hardware level.
dist-epoch 14 hours ago [-]
Couldn't some chips have 40 good cores, while others have only 36? Do they all need to be exactly 38?
b33f 14 hours ago [-]
Why are they not using GPUs? is it use cases that don't suit GPUs or because of the limitations they are imposing on themselves to use SMIC domestic chips?
adrian_b 7 hours ago [-]
If you can avoid GPUs, that is preferable.
The reason is that with GPUs it is far more difficult to reach a great percentage of the maximum theoretical throughput. Most GPU programs reach only a very small fraction of what is theoretically possible, and in the best cases one may reach something like 50% to 60% of the maximum.
This CPU-based supercomputer has demonstrated reaching 80% of the theoretical maximum throughput, and this is typical for CPU-based supercomputers. It is much easier to write efficient programs for CPUs.
The new custom Chinese CPUs, which use SME, the Arm Scalable Matrix Extension, are fast enough that they have beaten all GPU-based supercomputers, so there was no need to use GPUs.
Moreover these CPUs use HBM for a very fast memory interface, so in the benchmarks that depend more on memory bandwidth they have an even greater advantage over the US GPU-based supercomputers. Thus there really was no point in using GPUs.
GPUs are necessary only when your CPUs are not good enough, which was not the case here.
In the recent past, the Japanese Fugaku used the same approach, of avoiding GPUs. At that time, their custom CPUs using the Armv8-A ISA with SVE were the first which used this ISA in HPC, but now that ISA variant is obsolete in comparison with the Armv9-A ISA with SME, which is implemented in these new custom Chinese CPUs.
saagarjha 46 minutes ago [-]
> in the best cases one may reach something like 50% to 60% of the maximum
If all you need to do is matmuls then you can definitely go past this
adrian_b 22 minutes ago [-]
You can go past this only on a matmul benchmark, which is seldom useful per se.
Linpack consists mostly of matmuls, but nonetheless there are additional operations that prevent GPUs from reaching the high utilization of over 80% that is normal for CPUs, so that a throughput over 50% is considered good at the scale of supercomputers.
At the scale of a supercomputer, the utilization factor is considerably less than for an individual GPU or CPU, because the big matrix is split in blocks and the matrix multiplications are computed on different boards and in different racks, then the results are assembled, so there is a communication overhead.
The former Intel Xeon Phi, with a large number of cores that were weak except for their vector execution units, resembled GPUs in failing to reach a high utilization on Linpack.
wmf 11 hours ago [-]
I suspect Chinese GPUs (e.g. Biren) are not mature.
amelius 12 hours ago [-]
GPUs are for graphics (the G in GPU). These systems are used for more general computations.
galaxy_quest 11 hours ago [-]
I’m not sure if I’m missing a joke, but that’s why we have general purpose computing on graphics processing units (GPGPU) which is why 8/10 of the top 10 machines have GPUs.
antonvs 9 hours ago [-]
GPUs were for graphics. Now, they're mainly used for machine learning training and inference. The big tech companies are spending eye-watering amounts for GPUs - hundreds of billions of dollars a year each. That's the reason that Nvidia's market cap is at $4.66 trillion.
Would the AI “GW-scale” clusters be able to run the Top500 benchmarks meaningfully? And what might be the outcome?
adrian_b 7 hours ago [-]
No.
The AI oriented GPUs or TPUs have either weak FP64 throughput or they may not support FP64 at all.
They can compete neither with CPUs nor with GPUs that have good FP64 support, like the AMD CDNA datacenter GPUs, which occupy all the top places among American supercomputers.
NVIDIA has stopped improving the FP64 throughput even in their "datacenter" GPUs, abandoning this nowadays smaller market to AMD.
The AMD CDNA GPUs can be used for both HPC and AI, so only an AI cluster based on them could have dual use, but most who want AI choose NVIDIA.
mrlongroots 8 hours ago [-]
> And what might be the outcome?
DoE compute budgets are ~10B USD across labs. AI training is a trillion-dollar workload. Different league.
wmf 15 hours ago [-]
Yes, they should score well on Linpack as long as they use Ozaki emulation.
adrian_b 7 hours ago [-]
No, that is too slow.
Most claims about the cost of emulating FP64 on GPUs are wrong, because they assume that only the significand of floating-point numbers must be extended.
In reality it is even more important to extend the exponent, because with the exponent of FP32 overflows would be much too frequent in scientific/technical computations to accomplish anything.
The minimum FP64 emulation on FP32-capable GPUs requires 3 numbers per emulated FP64, which may be 3 FP32 numbers, or the exponent may be an Int32, if that works better on the target GPU. An emulated FP64 operation is likely to be at least 20 times slower than a FP32 operation.
That is much faster than the 1:64 ratio provided in hardware by an NVIDIA GPU, but even on the fastest FP32 GPUs it is too slow to compete with CPUs, in a professional setting.
FP64 emulation on a GPU can be useful only in a home computer, which may have a rather weak CPU and increasing the FP64 throughput using the GPU can be done at no additional cost, so it can be worthwhile.
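To make the significand-versus-exponent point concrete, here is a small sketch that simulates FP32 on the CPU by rounding through `struct`. It shows Knuth's error-free TwoSum, the basic building block of "two-float" arithmetic, which recovers the bits an FP32 add throws away. It is not the Ozaki scheme mentioned upthread and not how any particular GPU library works; `f32` and `two_sum` are illustrative helper names. The last lines show the limitation argued above: a pair of FP32 values extends precision but still has the FP32 exponent range.

```python
import struct

FP32_MAX = (2.0 - 2.0**-23) * 2.0**127   # largest finite FP32 value, about 3.4e38

def f32(x):
    """Round an FP64 Python float to the nearest FP32 value.

    The input must lie within the FP32 range.
    """
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Knuth's TwoSum carried out in simulated FP32.

    Returns (s, err) where s is the rounded FP32 sum and a + b == s + err
    exactly, so the pair carries more significand bits than one FP32.
    """
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

# Significand extension: 2**-30 is lost by a plain FP32 add, but kept in err.
hi, lo = two_sum(1.0, 2.0**-30)
print(hi, lo)

# Exponent range is unchanged: 1e20 fits in FP32, but 1e20 * 1e20 = 1e40
# exceeds FP32_MAX, although it is an ordinary value in FP64.
product = 1e20 * 1e20
exceeds_fp32_range = product > FP32_MAX
print(exceeds_fp32_range)
```

Handling that overflow is why the comment counts a third word (an extended exponent) per emulated FP64 value, on top of the two significand words.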
dgellow 15 hours ago [-]
Just glad to see Hamburg mentioned :)
Hope you all didn’t suffer too much through the current heatwave
techsystems 16 hours ago [-]
Is it the first to reach 2 exaflops?
adrian_b 7 hours ago [-]
Yes.
The fastest US supercomputer, El Capitan at Lawrence Livermore National Laboratory, reaches only 1.809 exaflops, while this reaches 2.198, 22% higher.
On the HPCG benchmark, which is more strongly influenced by memory bandwidth, the advantage over the GPU-based El Capitan is even greater, at 26.4%.
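As a quick check of the Linpack percentage, using only the two exaflops figures quoted above (the variable names are illustrative):

```python
# Linpack results in exaflops, as quoted in the comment above.
el_capitan_eflops = 1.809
new_system_eflops = 2.198

# Relative advantage of the new system over El Capitan, in percent.
advantage_pct = (new_system_eflops / el_capitan_eflops - 1.0) * 100.0
print(f"{advantage_pct:.1f}%")
```

This comes out to about 21.5%, which rounds to the quoted 22%.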
amelius 14 hours ago [-]
How many tokens/s? :)
TacticalCoder 8 hours ago [-]
And you know it needs a well-conceived OS to be able to run it: it's got to be... Microsoft Windows right?
The OS powering 0% of the 500 supercomputers of the Top 500. But this time, it has to be Windows, right? Amirite?
Ah, no, just kidding: it's "Kylin OS". It used to be a BSD derivative and now it's just based on Linux.
I know, I know: "It's a heavily modified Linux". Whatever, it's not Windows and that makes me very happy.
https://arxiv.org/abs/2605.08633v1
https://www.servethehome.com/arm-cpus-take-number-1-in-lates...
https://www.top500.org/news/lineshine-debuts-no-1-top500-ent...
Based on the ARMv9.2.
[1] https://www.nextplatform.com/hpc/2026/06/25/a-deep-dive-on-c...
Deep link: https://top500.org/lists/top500/list/2026/06/
Didn't the DoD at one point build a 1k+ PS3 cluster based on their multi-core chip and had a mini supercomputer CotS?
I remember Sony not liking that people were buying them for other things rather than gaming (iirc they were losing money on hardware at the time) so they bricked linux support soon after.
The Air Force did (Condor) and it hit #33 on the 2010 Top500.
The last time when China had the fastest supercomputer, it was more than 20 times slower than this one and more than 8 times less efficient in energy consumption.
Moreover, its capability was overestimated by the Linpack benchmarks and in other workloads its performance was much less impressive.
For this system, the situation is the opposite. Its Top500 result underestimates its capability. In other, more demanding workloads, where memory bandwidth and latency matter more, its advantage over the US supercomputers is greater than in Top500.
One could theoretically drive home with a ready-to-go rack of non-American and/or non-x86 supercomputer nodes at any point in time across the last few decades, sometimes even with non-NVIDIA/AMD massively parallel coprocessor cards. Nobody did.
If China (or any country) were to _ship_ this alternative supercomputer hardware, only then could anything change.
This is absolutely going to bite us in the face in five to ten years.
I’m sure there is a good reason for this, which is..?
The reason is that with GPUs it is far more difficult to reach a great percentage of the maximum theoretical throughput. Most GPU programs reach only a very small fraction of what is theoretically possible, and in the best cases one may reach something like 50% to 60% of the maximum.
This CPU-based supercomputer has demonstrated reaching 80% of the theoretical maximum throughput, and this is typical for CPU-based supercomputers. It is much easier to write efficient programs for CPUs.
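To put numbers on that gap, here is a minimal sketch of the efficiency arithmetic. The 80% and the 50% to 60% figures are the ones quoted above; the peak value and the function names are hypothetical, chosen only for illustration.

```python
def hpl_efficiency(rmax, rpeak):
    """Fraction of the theoretical peak actually sustained on the benchmark."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return rmax / rpeak


def sustained(rpeak, efficiency):
    """Sustained throughput for a given theoretical peak and efficiency."""
    return rpeak * efficiency


# Two hypothetical machines with the same theoretical peak (arbitrary units).
PEAK = 1.0
cpu = sustained(PEAK, 0.80)   # 80% quoted above for the CPU-based system
gpu = sustained(PEAK, 0.55)   # midpoint of the 50% to 60% best case for GPUs

# Theoretical peak a GPU system at 55% would need to match the CPU system.
gpu_peak_needed = cpu / 0.55
```

Under these assumptions the GPU machine needs roughly 1.45 times the theoretical peak just to draw level on sustained throughput.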
The new custom Chinese CPUs, which use SME, the Arm Scalable Matrix Extension, are fast enough that they have beaten all GPU-based supercomputers, so there was no need to use GPUs.
Moreover, these CPUs use HBM for a very fast memory interface, so in the benchmarks that depend more on memory bandwidth they have an even greater advantage over the US GPU-based supercomputers. Thus there really was no point in using GPUs.
GPUs are necessary only when your CPUs are not good enough, which was not the case here.
In the recent past, the Japanese Fugaku used the same approach of avoiding GPUs. At that time, its custom CPUs, using the Armv8-A ISA with SVE, were the first to use this ISA in HPC, but that ISA variant is now obsolete compared with the Armv9-A ISA with SME, which is implemented in these new custom Chinese CPUs.
If all you need to do is matmuls then you can definitely go past this
Linpack consists mostly of matmuls, but there are nonetheless additional operations that prevent GPUs from reaching the utilization of over 80% that is normal for CPUs, so a throughput over 50% is considered good at the scale of supercomputers.
At the scale of a supercomputer, the utilization factor is considerably lower than for an individual GPU or CPU, because the big matrix is split into blocks and the matrix multiplications are computed on different boards and in different racks, then the results are assembled, so there is a communication overhead.
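The block splitting described above can be sketched in plain Python. This is a toy illustration, not HPL itself (which performs a distributed LU factorization); `blocked_matmul` and its `blocks_fetched` counter are hypothetical names, and the counter is only a crude stand-in for the communication a real cluster would pay.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, bs):
    """Multiply square matrices by splitting them into bs x bs blocks.

    Each (bi, bj) output block stands in for the work handed to one node.
    blocks_fetched counts the input blocks that would have to be delivered
    to the nodes, a rough proxy for communication volume.
    """
    n = len(a)
    assert n % bs == 0, "matrix size must be a multiple of the block size"
    c = [[0.0] * n for _ in range(n)]
    blocks_fetched = 0
    for bi in range(0, n, bs):
        for bj in range(0, n, bs):
            for bk in range(0, n, bs):
                blocks_fetched += 2  # one block of A and one block of B
                for i in range(bi, bi + bs):
                    for j in range(bj, bj + bs):
                        acc = 0.0
                        for p in range(bk, bk + bs):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc  # assemble partial results
    return c, blocks_fetched


# Small example with integer-valued entries so the comparison is exact.
A = [[float(i + j) for j in range(4)] for i in range(4)]
B = [[float(i * j + 1) for j in range(4)] for i in range(4)]
C, fetched = blocked_matmul(A, B, 2)
```

The result matches the unblocked product, but every output block needs several input blocks delivered to it, and on a real machine those deliveries cross boards and racks.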
The former Intel Xeon Phi, with a large number of cores that were weak except for their vector execution units, resembled GPUs in failing to reach a high utilization on Linpack.
The AI oriented GPUs or TPUs have either weak FP64 throughput or they may not support FP64 at all.
They can compete neither with CPUs nor with GPUs that have good FP64 support, like the AMD CDNA datacenter GPUs, which occupy all the top places among American supercomputers.
NVIDIA has stopped improving the FP64 throughput even in their "datacenter" GPUs, abandoning this nowadays smaller market to AMD.
The AMD CDNA GPUs can be used for both HPC and AI, so only an AI cluster based on them could have dual use, but most who want AI choose NVIDIA.
DoE compute budgets are ~10B USD across labs. AI training is a trillion-dollar workload. Different league.
Most claims about the cost of emulating FP64 on GPUs are wrong, because they assume that only the significand of floating-point numbers must be extended.
In reality it is even more important to extend the exponent, because with the exponent of FP32 overflows would be much too frequent in scientific/technical computations to accomplish anything.
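A minimal sketch of the exponent problem, assuming FP32 is modeled by rounding Python floats through `struct`. The helper names (`to_f32`, `f32_mul`, `ext_from_float`, `ext_mul`) are hypothetical, and the pair format below extends only the exponent; it keeps a single FP32 significand, so it is not a full FP64 emulation.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # Magnitude too large for binary32: IEEE 754 would give infinity.
        return math.copysign(math.inf, x)


def f32_mul(a, b):
    """Multiply two values as FP32 hardware would, rounding the result."""
    return to_f32(to_f32(a) * to_f32(b))


def ext_from_float(x):
    """Split x into an FP32 significand and a separate integer exponent."""
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    return to_f32(m), e


def ext_mul(x, y):
    """Multiply two (significand, exponent) pairs without overflowing."""
    m = f32_mul(x[0], y[0])      # magnitude stays below 1
    m, shift = math.frexp(m)     # renormalize the significand
    return m, x[1] + y[1] + shift


plain = f32_mul(1e20, 1e20)                              # overflows in FP32
m, e = ext_mul(ext_from_float(1e20), ext_from_float(1e20))
extended = math.ldexp(m, e)                              # evaluated in binary64
```

Here `plain` becomes infinity, because 1e40 is beyond the FP32 range, while the pair with a separate integer exponent still represents the product to FP32 precision.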
The minimum FP64 emulation on FP32-capable GPUs requires 3 numbers per emulated FP64, which may be 3 FP32 numbers, or the exponent may be an Int32, if that works better on the target GPU. An emulated FP64 operation is likely to be at least 20 times slower than an FP32 operation.
That is much faster than the 1:64 ratio provided in hardware by an NVIDIA GPU, but even on the fastest FP32 GPUs it is too slow to compete with CPUs, in a professional setting.
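As a back-of-the-envelope check of those two ratios, here is a small sketch. The FP32 rate is a hypothetical placeholder in arbitrary units; only the factors 20 and 64 come from the comment above.

```python
def fp64_rate(fp32_rate, slowdown):
    """FP64 throughput when each FP64 op costs `slowdown` FP32 ops."""
    return fp32_rate / slowdown


FP32_RATE = 64.0  # hypothetical FP32 throughput, arbitrary units

emulated = fp64_rate(FP32_RATE, 20)  # emulation: at least 20x slower than FP32
hardware = fp64_rate(FP32_RATE, 64)  # native FP64 at a 1:64 ratio

# Since 20x is a lower bound on the slowdown, this is an upper bound.
speedup_upper_bound = emulated / hardware
```

So emulation is at best about 3.2 times faster than the 1:64 hardware path, which is a real gain but nowhere near enough to close the gap to a CPU with full-rate FP64.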
FP64 emulation on a GPU is useful only in a home computer, which may have a rather weak CPU. There, raising FP64 throughput with the GPU comes at no additional cost, so it can be worthwhile.
The fastest US supercomputer, El Capitan at Lawrence Livermore National Laboratory, reaches only 1.809 exaflops, while this reaches 2.198, 22% higher.
On the HPCG benchmark, which is more strongly influenced by memory bandwidth, the advantage over the GPU-based El Capitan is even greater, at 26.4%.
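The 22% figure can be rechecked from the two exaflops numbers quoted above. This is only a sketch with a hypothetical helper name; the raw HPCG results are not given in this thread, so the 26.4% is not recomputed here.

```python
def relative_advantage(new, old):
    """How much larger `new` is than `old`, as a fraction of `old`."""
    return new / old - 1.0


EL_CAPITAN_EF = 1.809   # exaflops, as quoted above
NEW_SYSTEM_EF = 2.198   # exaflops, as quoted above

hpl_advantage = relative_advantage(NEW_SYSTEM_EF, EL_CAPITAN_EF)
hpl_advantage_percent = round(hpl_advantage * 100)
```

The unrounded value is about 21.5%, which rounds to the 22% stated above.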
The OS powering 0% of the 500 supercomputers of the Top 500. But this time, it has to be Windows, right? Amirite?
Ah, no, just kidding: it's "Kylin OS". It used to be a BSD derivative and now it's just based on Linux.
I know, I know: "It's a heavily modified Linux". Whatever, it's not Windows and that makes me very happy.