The AMD Zen 5 Microarchitecture: Powering Ryzen AI 300 Series For Mobile and Ryzen 9000 for Desktop
by Gavin Bonshor on July 15, 2024 9:00 AM EST
AMD Ryzen AI 300: XDNA 2 NPU with Up To 50 TOPS
When it comes to the AMD Ryzen AI 300 series for notebooks and laptops, the second biggest advancement over the previous Ryzen 8040 series (Hawk Point) is the Neural Processing Unit (NPU). AMD, through their acquisition of Xilinx (announced in 2020 and completed in 2022), jump-started their NPU development by integrating Xilinx's existing technology, leading to AMD's initial XDNA architecture. With their latest iteration of the architecture, XDNA 2, AMD is further expanding on its capabilities as well as its performance. It also introduces support for Block Floating Point 16-bit (Block FP16) arithmetic as an alternative to traditional half-precision (FP16), which AMD claims combines the performance of 8-bit operations with the accuracy of 16-bit.
Looking at how the AMD XDNA architecture differs from the typical design of a multicore processor, the XDNA design pairs flexible compute with an adaptive memory hierarchy. Rather than relying on fixed compute resources and a static memory hierarchy, the XDNA (Ryzen AI) engine uses a grid of interconnected AI Engines (AIE). Each engine is architected to dynamically adapt its computation and memory resources to the task at hand, which is intended to improve scalability and efficiency.
Touching more on the tiled approach to the AIE, AMD calls this a spatial architecture. It is designed to be flexible, coupling a tiled dataflow structure with a programmable interconnect and flexible partitioning. The tiled dataflow structure enables deterministic performance without any cache misses and also enhances memory management. The programmable interconnect substantially decreases the demand for memory bandwidth, allowing resources to be allocated efficiently. The flexible partitioning design enables real-time performance while being able to accommodate different requirements from a variety of AI inferencing tasks, including real-time video and audio processing, as well as content-creation workflows.
The XDNA 2 architecture builds upon the preexisting XDNA architecture and adds even more AI engines to increase throughput. The AMD XDNA 2 implementation in Strix Point has 32 AI engine tiles, which is 12 more than the previous generation. Beyond adding more AI engine tiles, the XDNA 2 architecture also doubles the number of MACs per tile and offers 1.6x more on-chip memory than the previous generation.
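As a rough back-of-the-envelope illustration of those figures (a sketch only; it assumes the previous generation's 20 tiles implied by the 12-tile increase, and ignores clock speed and data-format changes):

```python
# Rough scaling sketch from the tile and MAC figures above (illustrative only).
xdna1_tiles = 32 - 12        # previous-generation tile count implied above
xdna2_tiles = 32             # Strix Point AI engine tiles
mac_per_tile_ratio = 2       # "double the number of MACs per tile"

raw_mac_scaling = (xdna2_tiles / xdna1_tiles) * mac_per_tile_ratio
print(f"Raw MAC scaling vs. XDNA 1: {raw_mac_scaling:.1f}x")   # -> 3.2x
# Anything beyond this in the headline throughput figures would have to come
# from clock speeds and data-format changes such as Block FP16.
```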
All told, AMD is claiming 50 TOPS of NPU performance, which is more than Intel's and Qualcomm's current offerings. The debate around the relevance of using TOPS to measure AI performance is a divisive one, and Microsoft set the ball rolling by setting the bar for Copilot+ at 40 TOPS.
The XDNA 2 architecture is not just about trying to outdo the competition on TOPS; it is also designed with power efficiency in mind. AMD claims that its XDNA 2 NPU provides 5x the compute capacity at double the power efficiency compared to the NPU used in the Ryzen 7040 Series. This is made possible through various design choices, including column-based power gating, which AMD says offers significantly better battery life while allowing the NPU to handle as many as eight concurrent spatial streams when multitasking.
One of the major feature additions with the XDNA 2 architecture is support for Block Floating Point (Block FP16). The simple way to explain it is that it offers the performance and speed of 8-bit operations, but employs additional tricks to bring the precision closer to that of 16-bit operations. Notably, this is done without further quantization or reducing the size of the data being processed.
As with other neural network precision optimizations, the purpose of Block FP16 is to cut down on the amount of computational work required; in this case, using 8-bit math without incurring the full drawbacks of stepping down from 16-bit math, namely the poorer results that come from reduced precision. Current-generation NPUs can already do native 8-bit processing (and 16-bit, for that matter), but this requires developers to either optimize (and quantize) their software for 8-bit processing, or take the speed hit of staying at 16-bit. AI is still a relatively young field, so software developers are still working to figure out just how much precision is enough (with that line seeming to repeatedly drop like a limbo bar), but the basic idea is that Block FP16 tries to let software developers have their cake and eat it, too.
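To make that trade-off concrete, here is a small, hypothetical NumPy comparison (not AMD's implementation) of what naive 8-bit quantization does to a tensor versus simply storing it as FP16:

```python
import numpy as np

# Toy illustration of the precision trade-off described above (not AMD's method).
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=1000).astype(np.float32)

# Naive symmetric INT8 quantization: one scale factor for the whole tensor.
scale = np.abs(weights).max() / 127
int8_weights = np.round(weights / scale).clip(-127, 127).astype(np.int8)
int8_error = np.abs(weights - int8_weights.astype(np.float32) * scale).mean()

# FP16 keeps a per-value exponent, so it tracks the originals far more closely.
fp16_error = np.abs(weights - weights.astype(np.float16).astype(np.float32)).mean()

print(f"mean abs error, INT8: {int8_error:.2e}")
print(f"mean abs error, FP16: {fp16_error:.2e}")
```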
With all of that said, from a technical perspective, Block FP16 (aka Microscaling) is not a new technique in and of itself. But AMD will be the first PC NPU vendor to support it, with Intel's forthcoming Lunar Lake set to join them. So while this is a new-to-AMD feature, it's not going to be a unique one.
As for how Block FP16 works, AMD's own material on the subject is relatively high-level, but we know from other sources that it's essentially a form of fixed-point 8-bit computation with an additional exponent. Specifically, Block FP16 shares a single exponent across a group of values, rather than each floating point value having its own exponent. For example, rather than an FP16 number having a sign bit, a 5-bit exponent, and a 10-bit significand, you have an 8-bit exponent that's shared by all of the numbers in the block, and then an 8-bit significand per value.
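As a minimal sketch of that layout (assuming a power-of-two shared scale per block of values; AMD hasn't published the exact encoding, so the block size and rounding here are illustrative):

```python
import numpy as np

def block_fp16_quantize(values, block_size=32):
    """Illustrative block floating point: one shared exponent per block,
    plus an 8-bit signed significand per value."""
    blocks = values.reshape(-1, block_size)
    # Pick each block's shared exponent so its largest magnitude fits in 8 bits.
    exponents = np.ceil(np.log2(np.abs(blocks).max(axis=1, keepdims=True) / 127 + 1e-30))
    significands = np.round(blocks / 2.0 ** exponents).clip(-127, 127).astype(np.int8)
    return significands, exponents   # a real format would store a biased 8-bit exponent

def block_fp16_dequantize(significands, exponents):
    return significands.astype(np.float32) * 2.0 ** exponents

rng = np.random.default_rng(0)
x = rng.normal(scale=0.02, size=256).astype(np.float32)
sig, exp = block_fp16_quantize(x)
x_hat = block_fp16_dequantize(sig, exp).reshape(-1)
print(f"mean abs error vs. FP32: {np.abs(x - x_hat).mean():.2e}")
```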
This essentially allows the processor to cheat by processing the unique significands as INT8 (or fixed-point 8-bit) numbers, while skipping all the work on the shared exponent. That is why Block FP16 performance largely matches INT8 performance: it's fundamentally 8-bit math. But by having a shared exponent, software authors can move the whole number range window for the computation to a specific range, one that would normally be outside of the range offered by the puny exponent of a true FP8 number.
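A similarly hedged sketch of why the math stays 8-bit: the inner dot product below is pure integer multiply-accumulate, with the two shared exponents applied only once at the end (again, this is illustrative, not AMD's actual datapath):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(scale=0.05, size=64).astype(np.float32)
b = rng.normal(scale=0.05, size=64).astype(np.float32)

def to_block(v):
    # One shared power-of-two exponent for the whole vector, 8-bit significands.
    exp = np.ceil(np.log2(np.abs(v).max() / 127 + 1e-30))
    return np.round(v / 2.0 ** exp).clip(-127, 127).astype(np.int8), exp

sig_a, exp_a = to_block(a)
sig_b, exp_b = to_block(b)

# The hot loop is INT8 multiply-accumulate (widened to INT32 to avoid overflow);
# the shared exponents only show up once, as a final scale factor.
acc = np.dot(sig_a.astype(np.int32), sig_b.astype(np.int32))
approx = float(acc) * 2.0 ** (exp_a + exp_b)

print(f"FP32 reference:  {np.dot(a, b):+.6f}")
print(f"Block FP16-like: {approx:+.6f}")
```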
Most AI applications require 16-bit precision, and Block FP16 addresses this requirement by bringing both high performance and high accuracy to the mobile market, at least from an AI standpoint. This makes Block FP16 a very important component for pushing AI technology forward, and one that AMD is pushing hard on.
Ultimately, the XDNA 2-based NPU in the Ryzen AI 300 series of mobile chips is really about processing AI workloads and running features such as Microsoft Copilot+ in a more power-efficient manner than running them on the integrated graphics. And by being able to deliver 8-bit performance with 16-bit precision, it gives developers one more lever to pull to get the most out of the hardware.
The AMD XDNA 2 architecture, which is set to debut with the Ryzen AI 300 series, is going to provide the key to unlocking the AI PC, or at least what Microsoft defines with their 40 TOPS requirement for Copilot+. By bringing Block FP16 into the equation, AMD brings (close to) 16-bit accuracy at 8-bit speed, making it more performant for some AI applications. Altogether, the integrated NPU is slated to offer up to 50 TOPS of compute performance.
AMD was the first x86 SoC vendor to include an NPU within their chips, and with the growing need for on-chip AI solutions to unlock many software features, they're expecting the hardware (and the die space it represents) to be put to good use. The XDNA 2 architecture ensures that AMD remains at the forefront, offering solid levels of performance and versatility for the mobile market.