ARM is free to reorder stores and loads; this is called a weak memory model. So unless you tell the compiler explicitly what ordering you need, e.g. with C++ memory_order::acquire and memory_order::release, you might get invalid behavior. Heisenbugs in the worst case.
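To make that concrete, here's a minimal producer/consumer sketch (names and structure are my own, purely illustrative): with the release/acquire pair, the consumer is guaranteed to see the payload; drop down to relaxed ordering (or plain non-atomic accesses) and a weakly ordered CPU like ARM is allowed to show you the flag before the data.

    #include <atomic>
    #include <thread>

    int payload = 0;                    // plain data, not atomic
    std::atomic<bool> ready{false};     // synchronization flag

    void producer() {
        payload = 42;                                  // write the data...
        ready.store(true, std::memory_order_release);  // ...then publish: earlier stores can't sink below this
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // later loads can't hoist above this
        int v = payload;  // guaranteed to be 42 here; with relaxed/plain accesses it might not be on ARM
        (void)v;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }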
Also most software runs on ARM now and I don't think that has actually happened in practice.
At least in my house, ARM cores outnumber x86 cores by at least four to one. And I'm not even counting the 32-bit ARM cores in embedded devices.
There is a lot of space for memory ordering bugs to manifest in all those devices.
It's a fun perception. For the longest time, all the "serious" computers were used through networks and terminals and didn't even come with any ability to connect a monitor or a keyboard (although a serial terminal would work as the system console). I used to joke (usually looking at Unisys Windows-based big servers) that if the computer had VGA and PS/2 ports, it wasn't a computer but a toy. Those Unisys servers weren't toys, but you could run Pinball and Minesweeper directly on them, which kind of said otherwise.
I think we got used to such levels of platform bloat that we don't care if the UI toolkit these days is bigger than the entire operating system that runs 95% of the world's payment transactions.
That said, I've never actually run into one of these issues.
As you'd expect, Linux distribution support for big- and little-endian varies.
*a = 1;
*b = 'p';
both the compiler and the CPU can freely pick the order in which those two happen (or even execute them in parallel, or do half of one first, then the other, then the other half of the first, but I think those are hypothetical cases).

x86-64 will never do such a swap, but x86-64 compilers might.
If you write
*a = 1;
*b = 2;
, things might be different for the C compiler because a and b can alias. The hardware still is free to change that order, though.

Developed on Intel i860, then MIPS, and only then on x86, alongside Alpha.
First, I don't like making blind tradeoffs. If what I need (for whatever reason) is a really beefy ARM CPU, I'd like to know what the "Apple-less tax" costs me (if anything!)
Second, the status quo is that Apple Silicon is the undisputed king of ARM CPU performance, so it's the obvious benchmark to compare this thing against. Providing that context is just basic journalistic practice, even if just to say "but it's irrelevant because we can't use the hardware without the software".
The cores, yes, but you can get an AmpereOne with 192 ARM cores (or rent beefier machines from AWS and Azure). If you need to run macOS, then you are tied to Apple, but if all you want is ARM (for, say, emulated embedded hardware development), you have other options in the ARM ecosystem. I'm actually surprised Ampere maxes out at 192 cores when Intel Xeon 6+ has parts with 288 cores on a single socket (and that can go up to 4 sockets).
I wonder how many cores you'd need to make htop crash.
We know how Apple's hardware performs on native workloads. We know how it performs emulating x86 workloads (and why). Surely "... and this is how this hardware measures up against the other guys trying to achieve the exact same thing" is a relevant comparison? I can't be the only person who reads "reaching desktop performance" and wonders "you mean comparable to the M1, or to the M3 Ultra?"
You're not. IMHO it's a fairly obvious, narrow, and uncontroversial observation (and hence why it's the top comment). That said, I personally still enjoyed the back and forth, as one could imagine many others did. There can be value in the counterarguments from multiple other usernames, as this helps readers sharpen their reasoning toward the conclusion (even when the original premise stays intact).
The lack of others agreeing could be the result of many reasons. IMHO, a not insignificant one could be that the incentive structure skews heavily towards lurking, as HN rightfully disincentivizes "me too" type replies and not everyone always has something interesting to add.
2c not an epistemologist ymmv
If your metric is single-thread performance, yes, but on just about anything else Graviton 4 wins.
Individual benchmarks tell the bigger picture. These two are optimized for different use cases, with Apple heavily leaning towards low latency single thread throughput with low sustained power usage.
https://browser.geekbench.com/v6/cpu/compare/16833358?baseli...
EDIT: The M4 Max compares much more closely https://browser.geekbench.com/v6/cpu/compare/16834801?baseli...
https://browser.geekbench.com/v6/cpu/compare/16839304?baseli...
The M3 Ultra sacrifices a bunch of single-thread performance for not that much of a multithreaded gain:
https://browser.geekbench.com/v6/cpu/compare/16839654?baseli...
I am looking for a CPU.
I don't want to confront my users with "Please enter your Apple ID" or any other unexpected messages that I have no control over.
Is Apple M series an option for me?
For better or worse if you make a (high end) consumer CPU it will be judged against the M-series, just like if you make a high end phone it will be judged against the iPhone.
All he is saying: we currently have products in a similar product category (ARM-based desktop computers) that are widely used and have known benchmark scores (and general reviews), and it would make sense, if I published a new CPU for the same product category ("Reaching Desktop Performance" implies that), to compare it to the known alternatives.
In the end you can just run Asahi on your MacBook; the OS is not that relevant here. A comparison to MacBooks running Asahi Linux would be fine.
amelius, if anyone had specific requirements, it was you with your "systems for in-flight entertainment".
OP asked a very reasonable question for a very generic comparison to the 800-pound gorilla in the consumer CPU world in general, and ARM CPU world in particular.
If the article can reference AMD's Zen 5 cores and Intel's Lion/Sunny Cove, they could have made at least a brief reference to M-series CPUs. As a reader and potential buyer of any of them, I find it would have been a very useful comparison.
This is not possible with Apple parts.
That's what my example was about. It was only specific because I wanted to have a concrete example.
Talk about specifics, eh? Didn't you just argue against an article addressing "_their_" specific use case?
In a store people will ask "is this better than an Apple?".
And I'll tell you one more thing: when I was in the industry, picking computing parts to build products with, I did not form an opinion by reading internet reviews. I haven't met anyone who did.
"Starting with computers using macOS 28, Rosetta functionality will be available only for certain older, unmaintained games that rely on Intel-based frameworks."
https://support.apple.com/en-us/102527
And
"Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks."
https://developer.apple.com/documentation/apple-silicon/abou...
Colima is backed by qemu, not Rosetta, so if Rosetta disappeared tomorrow I don't think I'd notice. I'm sure it's "better" but when the competition is "good enough" it doesn't really matter.
I don't see how that's holding you back from using these tools for your work any more than using a Makita power tool with an LXT battery pack.
I learned to live with macOS, but I also like and use Gnome, which many Linux-only people hate. I tried most WMs on Linux, like Hyprland, Sway, i3, but none ever felt worth the config hassle when compared to the sane defaults of Gnome.
I have to admit that when I read this, my eyebrows went up so far that my hat moved.
Yes, Asahi exists, and props to the developers, but I don't think I'm alone in being unwilling to buy hardware from a manufacturer who obviously is not interested in supporting open operating systems.
So they don't actively help (or even make it easy by providing clear docs), but they do still do enough to enable really motivated people.
This is an industry blog, not a consumer oriented blog.
The real reason is probably because they are supported by patrons and can only get new equipment to review when people donate (either money or sometimes the hardware itself).
If you like what they do (as pretty much the last in-depth hardware reviewers), consider supporting them.
As in, they don't sell you the parts, they only sell you the entire product. If you don't want the entire package, the processors alone are irrelevant.
The tested machine is an nvidia GB10 which nvidia makes and sells as a whole unit and various vendors stick it in different devices to try to differentiate (although in the end they're all basically identical).
And yes, it is extremely weird for it to never mention the Apple chip, which has a little something to do with who they thank for lending them the device. The arbitrary claims for why they ignored the enormous, class-leading ARM processor in the space are not convincing.
I mean, the other claim that this is an "industry blog" and not a "consumer blog" was equally silly. It's basically for curious hobbyists. Zero industry insiders follow this to see about the core in the GB10. It's basically Anandtech.
A few years ago they were writing articles about Apple Silicon.
Running the SPEC integer and floating point benchmark suites takes all day, but it's hard to game a benchmark with that much depth.
It's a shame that nobody has been willing to offer that level of detail.
But I did some comparisons when I tested the same Dell GB10 hardware late last year: https://www.jeffgeerling.com/blog/2025/dells-version-dgx-spa...
Your “specifically ARM cores designed by and licensable from ARM Holdings” argument doesn’t hold any water.
And Qualcomm.
Anyway, here it is in GB10 form-
https://browser.geekbench.com/v6/cpu/14078585
And here is a comparable M5 in a laptop-
https://browser.geekbench.com/macs/macbook-pro-14-inch-2025
M5 has about a 32% per-core advantage, though the DGX obviously has a much richer power budget, so they tossed in 10 high-performance cores and 10 efficiency cores (versus the M5's 4 performance and 6 efficiency). Given the 10/10 vs 4/6 core layouts, I would expect the DGX to massively trounce the M5 on multicore, while it only marginally does.
Samsung used the same X925 core in their Exynos 2500 that they use on a flip phone. Mediatek put it in a couple of their chips as well.
"Reaching desktop" is always such a weird criteria though. It's kind of a meaningless bar.
"Daily driver" is probably a better term, but everyone's daily usage patterns will vary. I could do my day job with a VT100 emulator on a phone for example.
Only found this, which talks about performance-per-area (PPA) and performance-per-clock (I assume per cycle) (PPC): https://www.reddit.com/r/hardware/comments/1gvo28c/latest_ar...
https://www.androidauthority.com/arm-c1-cpu-mali-g1-gpu-deep...
To do this lookup in the first place, you pull a number of bits from the virtual/physical address you're looking up, which tells you what bucket to start at. The minimum page size determines how many bits you can use from these addresses to refer to unique buckets. If you don't have a lot of bits, then you can't count very high (6 bits = 2^6 = 64 buckets) -- so to increase the size of the cache, you need to instead increase the associativity, which makes latency worse. For L1 cache, you basically never want to make latency worse, so you are practically capped here.
Platforms like Apple Silicon instead set the minimum page size to 16k, so you get more bits to count buckets (8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192 KB, and L2 (for the same reasons) is 16 MB or more. x86 machines and software, for legacy reasons, are very much tied to a 4k page size, which puts something of a practical limit on the size of their downstream caches.
Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.
Otoh L1 sizes haven't increased since my first processor; those CPU designers probably know more than I do.
Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate from the L2's point of view. Duplicate the data and if somebody writes back invalidate all the other copies. The hardware already exists for doing this on any multicore system. Sure, you will end up duplicating data sometimes and it will actually be slower if you're actually writing to aliased locations. But is this happening often enough to be a problem compared to generally having a bigger cache?
It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which at least is what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.
(Of course, it only took a few years for this to be rectified with the byte-word extension, which became required by ~all "real software" that supported Alpha)
It's also one of the only architectures Windows NT supported that didn't have 4KB pages, along with Itanium. I've wondered how (or if?) it handled programs that expect 4KB pages, especially in the x86 translation subsystem.
It has some interesting conclusions, such as that it covers certain AVX512 gaps:
"AVX512 plugs many of the holes that SSE had, whilst SVE2 adds more complex operations (such as histogramming and bit permutation), and even introduces new ‘gaps’ (such as 32/64-bit element only COMPACT, no general vector byte left-shift, non-universal predication etc)."
And also that rusty x86 developers might face skill issues:
"Depending on your application, writing code for SVE2 can bring about new challenges. In particular, tailoring fixed-width problems and swizzling data around vectors may become much more difficult when the length is unknown."
This is an article testing shipping hardware you can buy today.
Won't ARM have validation silicon available to their licensees?
Honestly, Apple is the strange one because they never discuss CPUs until they are available to buy in a product; they don't need to bother.
If you want something more perf competitive, pick Dimensity, Exynos, or Snapdragon.
The successor of the X925 is the C1 Ultra, and even that was released 6 months ago, in September 2025, with the MediaTek 9500; GeekerWan even has a phone review they did with that chip last year.
So far I've only found various M cores online. It would be fun to have something to experiment with on a cheapish FPGA like the Kintex XC7-K480T, which may have enough resources for some in-order A core, and can be had for $50 or so.
Better to favor RISC-V implementations as much as possible.
But I don't know if there are already good modern-desktop-grade RISC-V implementations (in the US, SiFive is moving fast as far as I know)... and the hard part: accessing the latest and greatest silicon process from TSMC, aka ~5GHz.
Those markets are completely saturated, so at best adoption will be very slow unless something big happens: for instance, AMD adapting its best micro-architecture to RISC-V (mostly the ISA decoding), etc.
And if Valve starts to distribute a client with a strong RISC-V game compilation framework...
worked with their cores in $pastJob. I'd say their main products are flowery promises and long errata sheets.
It took ARM decades to get to where it is, and that involved a long stint in low-margin niche applications like embedded or appliances where x86 was poorly suited due to heat and power consumption.
I think you'll see ever more accelerating RISC-V adoption in China if the United States continues on its "cold war" style mentality about relations with them.
That said we're a long long way from Actually Existing RISC-V being at performance parity with ARM64, let alone x86.
The other massive point: RISC-V integrates a lot of what we know now about CPU design into a very elegant "sweet spot".
And it is not China-only; the best implementations are US, and RISC-V is a US/Berkeley initiative re-centered in Switzerland for "neutrality" reasons.
If good large RISC-V implementations do reach TSMC's silicon process (~5GHz), some markets won't even look at ARM or x86 anymore.
And there is the ultimate "standard ISA" point: assembly-written code then becomes very appropriate, hence a strong de-coupling from all those (very few) backdoor-injecting compilers.
On many of my personal projects, I don't bother anymore: I write RISC-V assembly which I run with a small x86_64 interpreter, plus a very simple pre-processor and assembler, aka SDK/toolchain complexity close to 0 compared to the other abominations.
And I think the main drawback is: big mistakes will be made, and you must account for them.
I feel these days, however, that for any comparison of performance the power envelope needs to be included (I realise this is dependent on the final chip).
While the parent article shows AMD Zen 5 having significantly better results in floating-point SPEC CPU2017, these benchmark results are still misleading, because in applications properly optimized for AVX-512 the difference between Zen 5 and Cortex-X925 would be much greater. I have no idea how SPEC has been compiled by the author of the article, but the floating-point results are not consistent with programs optimized for Zen 5.
One disadvantage of Cortex-X925 is having narrower vector instructions and registers, which requires more instructions for the same task and it is only partially compensated by the fact that Cortex-X925 can execute up to 6 128-bit instructions per clock cycle (vs. up to 4 vector instructions per clock cycle for Intel/AMD, but which are wider, 256-bit for Intel and up to 512-bit for Zen 5). This has been shown in the parent article.
The second disadvantage of Cortex-X925 is that it has an unbalanced microarchitecture for vector operations. For decades most CPUs with good vector performance had an equal throughput for fused multiply-add operations and for loads from the L1 cache memory. This is required to ensure that the execution units are fed all the time with operands in many applications.
However, Cortex-X925 can do at most 4 loads per cycle, while it can do 6 FMAs. Because of this lower load throughput, Cortex-X925 can reach its maximum FMA throughput far less often than the AMD or Intel CPUs. This is compounded by the fact that achieving better FMA-to-load ratios requires more storage space in the architectural vector registers, and Cortex-X925 is also disadvantaged there, having vector registers 4 times smaller than Zen 5's.
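To make the load-versus-FMA point concrete, a scalar sketch (kernel shapes are mine; in reality these would be vector FMAs, but the ratio argument is the same): a streaming kernel needs about 2 loads per FMA, so a 4-loads-per-cycle limit caps it near 2 FMAs per cycle no matter how many FMA pipes exist, while a register-blocked kernel reuses loaded values and can approach the FMA peak, at the cost of keeping many accumulators in registers, which is exactly where smaller vector registers hurt.

    #include <cstddef>

    // Streaming kernel: 2 loads (+1 store) per FMA. Load ports, not FMA pipes,
    // set the ceiling here.
    void axpy_like(float *y, const float *a, const float *x, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a[i] * x[i];
    }

    // 4x4 outer-product micro-kernel (the heart of a blocked GEMM): per k iteration,
    // 8 loads feed 16 FMAs, i.e. 2 FMAs per load, so the FMA pipes become the
    // bottleneck instead, but 16 accumulators must stay live in registers.
    void gemm_4x4(float C[4][4], const float *A, const float *B, std::size_t K) {
        float c[4][4] = {};
        for (std::size_t k = 0; k < K; ++k) {
            float a0 = A[4*k + 0], a1 = A[4*k + 1], a2 = A[4*k + 2], a3 = A[4*k + 3]; // 4 loads
            float b0 = B[4*k + 0], b1 = B[4*k + 1], b2 = B[4*k + 2], b3 = B[4*k + 3]; // 4 loads
            c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;   // 16 FMAs
            c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
            c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
            c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
        }
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                C[i][j] = c[i][j];
    }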
The arithmetic intensity of most SPECfp subtests is quite low. You see this wall because it ends up reaching bandwidth limitations long before running out of compute on cores with beefy SIMD.
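A quick roofline-style sanity check of that (all numbers below are placeholders I picked for illustration, not measurements of any chip in this thread): a kernel is bandwidth-bound whenever its arithmetic intensity, FLOPs per byte of memory traffic, falls below the machine's FLOPs-per-byte balance, and most SPECfp kernels sit far below it.

    #include <cstdio>

    int main() {
        // Illustrative machine numbers, not real specs for any CPU discussed here.
        double peak_gflops = 128.0;  // e.g. 32 FP64 FLOPs/cycle at 4 GHz, one core
        double bw_gbps     = 50.0;   // effective memory bandwidth seen by that core, GB/s
        double balance = peak_gflops / bw_gbps;  // FLOPs needed per byte to stay compute-bound

        // STREAM-triad-like kernel: a[i] = b[i] + s*c[i]  ->  2 FLOPs per 24 bytes moved.
        double intensity  = 2.0 / 24.0;
        double attainable = (intensity < balance) ? intensity * bw_gbps : peak_gflops;

        std::printf("machine balance : %.2f FLOPs/byte\n", balance);
        std::printf("kernel intensity: %.2f FLOPs/byte -> ~%.1f of %.0f GFLOP/s attainable\n",
                    intensity, attainable, peak_gflops);
        // Roughly 4 GFLOP/s out of 128: wider vectors or more FMA pipes change nothing here.
    }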
If there's space between the SIMD instructions, then double-pumping or even quad-pumping isn't very expensive (and with 6 SIMD ports, it might even be basically free).
What most people want is interactivity and fast web pages which doesn't have much to do with wide vector instructions (except possibly for optimized video decoding).
I'd love to have a Xeon 6, a big EPYC, or an AmpereOne (or a loaded IBM LinuxOne Express) as my daily driver, but that's just not something I can justify. It'd not be easy to come up with something for all this compute capacity to do. A reasonable GPU is a much better match for most of my workloads, which aren't even about pushing pixels anymore - iGPUs are enough these days - but multiplying matrices with embarrassingly low precision, so it can pretend to understand programming tasks.