AMD Ryzen AI 300: XDNA 2 NPU with Up To 50 TOPS

When it comes to the AMD Ryzen AI 300 series for notebooks, the second biggest advancement over the previous Ryzen 8040 series (Hawk Point) is the Neural Processing Unit (NPU). AMD jump-started its NPU development through its acquisition of Xilinx, announced in 2020, by integrating Xilinx's existing technology, which led to AMD's initial XDNA architecture. With the latest iteration of the architecture, XDNA 2, AMD is expanding both its capabilities and its performance. XDNA 2 also introduces support for Block Floating Point 16-bit arithmetic as an alternative to traditional half-precision (FP16), which AMD claims combines the performance of 8-bit math with the accuracy of 16-bit math.

Looking at how the AMD XDNA architecture differs from the typical design of a multicore processor, the XDNA design incorporates flexible compute with an adaptive memory hierarchy. Rather than relying on fixed compute or a static memory hierarchy, the XDNA (Ryzen AI) engine uses a grid of interconnected AI Engines (AIE). Each engine has been architected to adapt dynamically to the task at hand, in both its compute and memory resources, which is intended to improve scalability and efficiency.

AMD calls this tiled approach to the AIE a spatial architecture. It is designed to be flexible, coupling a tiled dataflow structure with a programmable interconnect and flexible partitioning. The tiled dataflow structure enables deterministic performance without any cache misses and also enhances memory management. The programmable interconnect substantially decreases the demand for memory bandwidth, which allows resources to be allocated efficiently. The flexible partitioning enables real-time performance while accommodating different requirements across a variety of AI inferencing tasks, from real-time video and audio processing to content-creation workflows.

The XDNA 2 architecture builds upon the preexisting XDNA architecture and adds even more AI engines to increase throughput. The AMD XDNA 2 implementation in Strix Point has 32 AI engine tiles, which is 12 more than the previous generation. Beyond the extra tiles, the XDNA 2 architecture also has double the number of MACs per tile and 1.6x more on-chip memory than the previous generation.

All told, AMD is claiming 50 TOPS of NPU performance, which is more than Intel's and Qualcomm's current offerings. Whether TOPS is a relevant way to measure AI performance remains a divisive question, but Microsoft set the ball rolling on that one by setting the bar for Copilot+ at 40 TOPS.

Beyond trying to outdo the competition on TOPS, the XDNA 2 architecture is also designed with power efficiency in mind. AMD claims that its XDNA 2 NPU provides 5x the compute capacity at double the power efficiency of the NPU used in the Ryzen 7040 series. This is made possible through various design choices, including column-based power gating, which AMD says offers significantly better battery life, along with the ability to handle as many as eight concurrent spatial streams when multitasking.
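The tile and MAC figures quoted above can be combined into a rough scaling estimate. The sketch below is back-of-the-envelope arithmetic only: it assumes throughput scales linearly with the total number of MACs at equal clocks, which is a simplification and not something AMD has stated. The function name `structural_scaling` is hypothetical.

```python
from fractions import Fraction

# Figures taken from the text: 32 tiles in XDNA 2, 12 more than before,
# double the MACs per tile, and a claimed 5x compute vs. the Ryzen 7040 NPU.
XDNA2_TILES = 32
XDNA1_TILES = XDNA2_TILES - 12
MACS_PER_TILE_RATIO = 2
CLAIMED_COMPUTE_RATIO = 5


def structural_scaling(tiles_new, tiles_old, macs_ratio):
    """Scaling from tile count and MACs per tile alone.

    Assumption: throughput is proportional to total MAC count at equal clocks.
    """
    return Fraction(tiles_new, tiles_old) * macs_ratio


scaling = structural_scaling(XDNA2_TILES, XDNA1_TILES, MACS_PER_TILE_RATIO)
# Whatever part of the 5x claim is not explained by tiles and MACs.
residual = Fraction(CLAIMED_COMPUTE_RATIO) / scaling

print(f"tiles x MACs scaling: {float(scaling)}x")
print(f"unexplained factor:   {float(residual)}x")
```

Under that assumption, the extra tiles and doubled MACs account for about 3.2x of the claimed 5x, leaving a factor of roughly 1.56x that the figures quoted here do not explain (it could come from clock speed or from how operations are counted, but the article does not say).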

One of the major feature additions in the XDNA 2 architecture is support for Block Floating Point (Block FP16). The simple way to explain it is that it offers the performance and speed of 8-bit operations, but employs additional tricks to try to bring the precision closer to that of 16-bit operations. Notably, this is also done without further quantization or reducing the data size being processed.

As with other neural network precision optimizations, the purpose of Block FP16 is to cut down on the amount of computational work required; in this case, using 8-bit math without incurring the full drawbacks of stepping down from 16-bit math, namely poorer results from the reduced precision. Current-generation NPUs can already do native 8-bit processing (and 16-bit, for that matter), but this requires developers to either optimize (and quantize) their software for 8-bit processing, or take the speed hit of staying at 16-bit. AI is still a relatively young field, so software developers are still working to figure out just how much precision is enough (with that line seeming to repeatedly drop like a limbo bar), but the basic idea is that this tries to let software developers have their cake and eat it, too.

With all of that said, from a technical perspective, Block FP16 (aka Microscaling) is not a new technique in and of itself. But AMD will be the first PC NPU vendor to support it, with Intel's forthcoming Lunar Lake set to join them. So while this is a new-to-AMD feature, it's not going to be a unique feature.

As for how Block FP16 works, AMD's own material on the subject is relatively high-level, but we know from other sources that it's essentially a form of fixed-point 8-bit computation with an additional exponent. Specifically, Block FP16 uses a shared exponent for all values in a block, rather than each floating point value having its own exponent. For example, rather than an FP16 number having a sign bit, a 5-bit exponent, and a 10-bit significand, you have an 8-bit exponent that's shared by all numbers in the block, and then an 8-bit significand for each number.
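To make the shared-exponent idea concrete, here is a minimal Python sketch of block floating-point encoding. It illustrates the general technique only and is not AMD's implementation: the function names `bfp_encode` and `bfp_decode` are hypothetical, and deriving the shared exponent from the largest magnitude in the block, as well as treating the 8-bit significand as a signed integer, are assumptions made for the example.

```python
import math


def bfp_encode(values, mantissa_bits=8):
    """Quantize a block of floats to signed integers that share one exponent.

    Assumption: the shared exponent is chosen so that the largest magnitude
    in the block fits in a signed `mantissa_bits`-bit integer.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(max_abs)
    shared_exp = e - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1  # 127 for 8 bits
    mantissas = []
    for v in values:
        q = round(v / 2.0 ** shared_exp)
        # Clamp in case rounding pushes the largest value past the limit.
        mantissas.append(max(-limit, min(limit, q)))
    return shared_exp, mantissas


def bfp_decode(shared_exp, mantissas):
    """Reconstruct approximate floats from integers and the shared exponent."""
    return [m * 2.0 ** shared_exp for m in mantissas]


block = [0.5, -0.25, 0.125, 0.75]
exp, ints = bfp_encode(block)
print(exp, ints)              # shared exponent and 8-bit integers
print(bfp_decode(exp, ints))  # values recovered from the block
```

One consequence of sharing the exponent is that values far smaller than the largest value in the block lose precision. In this sketch, encoding the block `[1.0, 0.001]` rounds the second value to zero.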

This essentially allows the processor to cheat by processing the unique significands as INT8 (or fixed-point 8-bit) numbers, while skipping all the work on the shared exponent. This is why Block FP16 performance largely matches INT8 performance: it's fundamentally 8-bit math. But by having a shared exponent, software authors can move the whole number range window for the computation to a specific range, one that would normally be outside of the range offered by the puny exponent of a true FP8 number.
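The following Python sketch shows why the arithmetic ends up being integer work. It is an illustration under assumptions rather than a description of the XDNA 2 hardware: the function name `bfp_dot` is hypothetical, and the two blocks are hand-picked integer significands with their shared exponents (block A represents 0.5, -0.25, 0.125, 0.75 and block B represents 1.0, 2.0, 3.0, 0.5).

```python
def bfp_dot(exp_a, mant_a, exp_b, mant_b):
    """Dot product of two blocks that each carry one shared exponent.

    The loop uses only integer multiply-accumulates; the exponents are
    combined once, after the loop, with a single addition.
    """
    acc = 0
    for a, b in zip(mant_a, mant_b):
        acc += a * b  # integer MAC on the 8-bit significands
    return acc * 2.0 ** (exp_a + exp_b)


# Block A: shared exponent -7, so 64 means 64 * 2**-7 = 0.5, and so on.
exp_a, mant_a = -7, [64, -32, 16, 96]
# Block B: shared exponent -5, so 32 means 32 * 2**-5 = 1.0, and so on.
exp_b, mant_b = -5, [32, 64, 96, 16]

result = bfp_dot(exp_a, mant_a, exp_b, mant_b)
print(result)
```

The per-element cost is the same as an INT8 dot product, and the exponent handling is amortized over the whole block, which is consistent with the claim that Block FP16 performance largely matches INT8 performance.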

Most AI applications require 16-bit precision, and Block FP16 addresses this requirement by simultaneously bringing high performance and high accuracy to the mobile market, at least from an AI standpoint. This makes Block FP16 a very important component for pushing forward AI technology, and it is something AMD is pushing hard on.

Ultimately, the XDNA 2-based NPU in the Ryzen AI 300 series of mobile chips is really about processing AI workloads and running features such as Microsoft Copilot+ in a more power-efficient manner than running them on the integrated graphics. And being able to deliver 8-bit performance with 16-bit precision gives developers one more lever to pull to get the most out of the hardware.

The AMD XDNA 2 architecture, which is set to debut with the Ryzen AI 300 series, is going to provide the key to unlocking the AI PC, or at least what Microsoft defines with their 40 TOPS requirement for Copilot+. By bringing Block FP16 into the equation, AMD brings (close to) 16-bit accuracy at 8-bit speed, making it more performant for some AI applications. Altogether, the integrated NPU is slated to offer up to 50 TOPS of compute performance.

AMD was the first x86 SoC vendor to include an NPU within their chips, and with the growing need for on-chip AI solutions to unlock many software features, they're expecting the hardware (and the die space it represents) to be put to good use. The XDNA 2 architecture ensures that AMD remains at the forefront, offering solid levels of performance and combined versatility for the mobile market.


43 Comments


  • HardwareDufus - Tuesday, July 16, 2024 - link

    I would have liked to see an interim RDNA 3.5 for discrete cards too... Maybe a Navi 30 that would have powered the never released Radeon RX 7950 XTX. Might have been a nice stop gap. Maybe it was just too optimized for lower energy and mobile.

    RX 8000 built on RDNA 4 looks to target low/middle of the market around Q1/Q2 2025 with no high end part (assuming Navi 48 is low, Navi 44 is middle).
  • Zoolook13 - Wednesday, July 17, 2024 - link

    What if it's just two sizes of chiplets, I haven't seen that AMD has said they are not targeting high end, we know they have been working on chiplet GPUs and now Nvidia is as well.
  • StormyParis - Friday, July 19, 2024 - link

    I'm on an i5-4570S for my main PC. Nothing wrong with it really, except video decoding. Been hankering after an upgrade for a few years: Zen4, then Zen4-G... now it seems I've got to wait for Zen5, even probably Zen5-G, but at some point I'll have to pull the trigger before MS takes Windows away from me ^^
    My one concern is USB reliability. I use USB a lot, and last time I dropped AMD because of USB issues. That was 11 yrs ago, but I've read in some places that USB still isn't rock-steady the way it is w/ Intel.
