Thursday, September 19, 2024

Arm’s Cortex X925 and A725 will bring some of the biggest year-over-year increases in performance yet


Arm designs pretty much all of the CPU cores that end up in Android smartphones, but this year, things are a bit different. With Qualcomm switching to its own Arm-based Oryon cores for the Snapdragon 8 Gen 4, a significant number of smartphones released next year won’t be powered by an Arm-designed core. We still expect MediaTek to keep using Arm’s cores, though, and Qualcomm could potentially release a non-flagship SoC using a combination of them in the future.


Nevertheless, this year, Arm is announcing a flagship Cortex-X925 core, a Cortex-A725 performance core, and a refreshed Cortex-A520 efficiency core. These cores are compliant with Armv9.2, and they form the basis of the company’s CSS for Client, previously known as Total Compute Solutions. CSS stands for Compute SubSystems, a name previously used for the company’s datacenter-focused Neoverse line. We’re also seeing an updated DynamIQ Shared Unit (DSU) and an improved GPU in the form of the Immortalis G925.


Arm is also, for the first time, packaging these cores as GDSII files to be provided to OEMs. A GDSII file is essentially a finished physical layout that can be handed to a fabricator like TSMC or Samsung for production, and it takes into account any specific quirks or features of that fabricator’s process. Arm says that a big perk of this is a faster time to market for companies using these cores, but those companies can still simply license the designs and do all of the leg work themselves if they wish, just like with previous Arm cores.

All three of these cores are improved in some way over last year’s models, with the Cortex X925 boasting the biggest gains of all.


Arm Cortex X925: “Making the best even better”

The biggest year-on-year performance increase of a Cortex X core

Source: Arm


Arm’s X series of cores diverged from its A series a number of years ago, with the philosophy being that an X core is a powerful core that is allowed to guzzle a bit more power when it needs it. Typically, chipset makers will only include one or two of these at most, as they’re power hungry despite their capabilities.

This year, that philosophy is still the same, but Arm is boasting big improvements in both AI performance and regular IPC. Arm’s reasoning is that most AI is still run on CPU cores: given the wide variety of AI hardware out there, it’s hard for developers to actually do their AI processing on built-in NPUs. The target frequency of the X925 is 3.6GHz and higher, but interestingly, Arm is talking about this being a scaling platform “for the next generation of AI PCs.”

Arm Cortex X925 Power Performance graph, showing peak performance in different power envelopes

Source: Arm


What this means, according to Arm, is that there is a 30% average uplift across multiple indicators in the CPU cluster. Arm says this core will “fundamentally change Cortex CPUs for years to come.” You’ll also get better performance at lower power draws.

As for how Arm is managing to make these improvements, it comes down to architectural changes across the board. For a start, in the front end, the branch predictor of the Cortex X925 can see twice as far into the future, and the accuracy of those branch predictions is significantly improved as well. It also has twice the bandwidth for instruction pre-fetching, which makes for some important gains.

Arm Cortex X925 core enhancements according to Arm

Source: Arm


Other changes include optimizations for decode and dispatch. As an example, it’s a 10-wide CPU, but there’s a higher effective usage of that width thanks to the removal of multiple processing constraints. There’s also an increase in vector bandwidth from 4×128 bits per cycle to 6×128 bits per cycle, and the number of integer ALU pipelines has increased. Arm also says that one of the benefits of providing a GDSII to partners is that it can take advantage of specific improvements a process node offers, meaning that its ALUs can do one- and two-cycle operations purely because of those node enhancements. Pretty much all of these changes are aimed at AI.

In the back end are some minor improvements as well, such as an increase in load pipelines from three to four and improvements to out-of-order processing. All of this adds up to the largest performance uplift in Cortex X history; at least, according to Arm.
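To put the vector bandwidth figures in perspective, here is a minimal back-of-the-envelope sketch. It assumes that “N×128 bits per cycle” means N vector pipes that are each 128 bits wide, and it applies Arm’s 3.6GHz target frequency as the clock. The function and constant names are made up for illustration, and the result is a theoretical ceiling, not a measured figure.

```python
# Back-of-the-envelope arithmetic for the vector bandwidth figures quoted above.
# Assumption: "N x 128 bits per cycle" is read as N pipes, each 128 bits wide.

LANE_BITS = 128   # width of one vector pipe, from the article
OLD_PIPES = 4     # previous generation, from the article
NEW_PIPES = 6     # Cortex X925, from the article
CLOCK_HZ = 3.6e9  # Arm's stated target frequency; shipping chips may differ


def vector_bits_per_cycle(pipes: int, lane_bits: int = LANE_BITS) -> int:
    """Peak vector bits processed per cycle."""
    return pipes * lane_bits


def peak_bytes_per_second(pipes: int, clock_hz: float = CLOCK_HZ) -> float:
    """Theoretical peak vector throughput in bytes per second."""
    return vector_bits_per_cycle(pipes) / 8 * clock_hz


old_bits = vector_bits_per_cycle(OLD_PIPES)
new_bits = vector_bits_per_cycle(NEW_PIPES)
uplift = new_bits / old_bits - 1  # relative increase in peak width

print(f"old: {old_bits} bits/cycle, new: {new_bits} bits/cycle")
print(f"peak uplift: {uplift:.0%}")
print(f"peak at 3.6GHz: {peak_bytes_per_second(NEW_PIPES) / 1e9:.1f} GB/s")
```

Under those assumptions, the peak goes from 512 to 768 bits per cycle, a 50% increase. Real workloads will see less than that, since they rarely keep every vector pipe busy on every cycle.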

Arm Cortex A725: “The workhorse for power-efficient AI throughput”

Arm is targeting gaming and AI with this core

Overview of the Arm Cortex A725 core

Source: Arm


Arm’s X series of cores are typically allowed to run a bit wild and consume as much power as they need, whereas the A series of cores typically aim to balance power consumption against performance. With the Cortex-A725, Arm promises a 35% more performance-efficient core, with increased performance at the same power as the A720 from last year.

As already mentioned, most AI workloads end up running on the CPU, and Arm is aiming the A725 at exactly those workloads. Arm claims a 25% increase in power efficiency, along with a number of optimizations:

  • Register file structure enhancements
  • Increased re-order buffer
  • Increased instruction issue queues
  • New 1MB L2 configuration

Arm Cortex A725 sustained performance graph

Source: Arm


To sum up the improvements of the A725, it can deliver the same performance as last year’s A720 at significantly lower power levels. This is where the bulk of work happens on most devices, so any improvements are welcome here.

Arm Cortex A520 Refresh: No need for major changes

Some minor improvements, that’s about it

Arm Cortex A520 Unmatched Power Efficiency changes

Source: Arm

Because of the GDSII implementation of these cores, Arm is promising some improvements for the A520 Refresh in its physical implementation. There’s not much else to really say here, aside from the fact that this low-power core will draw even less power. As for why this is a refresh rather than an all-new design, Arm’s reasoning is that improvements to the other cores can happen quickly, whereas the A5xx series tends to see smaller improvements that accumulate over time, so it makes more sense to pack two years of innovation into these cores every couple of years.


Arm also assures me that this is not a “fixed” schedule and that if there were anything significant to bring to these cores, it would arrive sooner. It’s just the nature of their development cycles that things work out this way.

DSU-120: Minor changes, but more of the same

The biggest change is L3 quick nap, which can save a lot of power

Arm Cortex DSU 120 Updated

Source: Arm

The DynamIQ Shared Unit, or DSU, integrates one or more cores with an L3 memory system, control logic, and external interfaces in order to form a multicore cluster. It’s essentially Arm’s fabric that allows all of these cores to communicate with each other and share resources, and as such, it’s a fairly important piece of the puzzle for any chipset maker looking to build a chip with Arm’s core designs.


DSU-120 launched last year with TCS23, and this year’s version is an update to that same DSU. There’s one major change though, and that’s L3 Quick Nap.

L3 Quick Nap in the DSU-120 is a sophisticated power-saving mode that runs independently, requiring no intervention from chipset manufacturers like MediaTek or Qualcomm to enable or tune. This feature ensures that the L3 cache enters a low-power state when not in use and wakes up automatically as needed. This seamless transition is key to reducing power consumption while maintaining high performance.

Even better is that it has a negligible impact on latency. The design ensures that the wake-up latency is so minimal it can be hidden within the processor’s pipeline stages. By the time access requests start arriving at the L3 cache, it has already been woken up and is fully operational.
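The idea of hiding wake-up latency inside the pipeline can be illustrated with a toy model. This is only a sketch of the general principle, not Arm’s actual mechanism: it assumes the wake-up starts at the same moment a request begins travelling toward the L3 cache, and every cycle count below is a hypothetical placeholder, since Arm has not published these numbers here.

```python
# Toy model of latency hiding. All cycle counts are hypothetical placeholders,
# not figures from Arm.

def exposed_wake_latency(wake_cycles: int, cycles_before_l3_access: int) -> int:
    """Extra cycles a request waits because the L3 is still waking up.

    Assumes the wake-up begins when the request starts its journey through
    the pipeline, so the two overlap. Only the part of the wake-up that
    outlasts the journey is visible to the request.
    """
    return max(0, wake_cycles - cycles_before_l3_access)


HYPOTHETICAL_WAKE_CYCLES = 4      # assumed time for the L3 to leave its nap state
HYPOTHETICAL_PIPELINE_CYCLES = 6  # assumed time for a request to reach the L3

hidden_case = exposed_wake_latency(HYPOTHETICAL_WAKE_CYCLES,
                                   HYPOTHETICAL_PIPELINE_CYCLES)
slow_wake_case = exposed_wake_latency(10, HYPOTHETICAL_PIPELINE_CYCLES)

print(f"exposed latency with a fast wake-up: {hidden_case} cycles")
print(f"exposed latency with a slow wake-up: {slow_wake_case} cycles")
```

In this model, a wake-up that finishes before the request arrives costs nothing extra, which matches Arm’s description of the L3 being fully operational by the time it is accessed. A slower wake-up would expose only the difference between the two.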

Where will the CSS for Client show up?

There are a few likely places

Arm Cortex CSS for Client 2024 AI PC

Source: Arm


We’re expecting that CSS for Client will show up in a few places, but the interesting thing is that Qualcomm’s flagship chips are now likely out of the question. With Arm talking about “AI PC” throughout its presentations, it’s entirely possible we see some of these cores even come to PCs later this year, and there are some big improvements over previous cores that will benefit PCs specifically.

Aside from that, it’s likely that MediaTek will continue to use off-the-shelf Arm cores, and maybe even some lower-tier Qualcomm chips will make use of them too. We’ll be waiting to see how these cores fare in chips later this year, but some of these improvements are massively exciting.
