AMD Ryzen AI 300: XDNA 2 NPU with Up To 50 TOPS

When it comes to the AMD Ryzen AI 300 series for notebooks and laptops, the second biggest advancement from the previous Ryzen 8040 series (Hawk Point) is the Neural Processing Unit (NPU). AMD, through their acquisition of Xilinx (completed in 2022), jump-started their NPU development by integrating Xilinx's existing technology, leading to AMD's initial XDNA architecture. With the latest iteration of the architecture, XDNA 2, AMD is further expanding on its capabilities as well as its performance. XDNA 2 also introduces support for Block Floating Point 16-bit arithmetic as an alternative to traditional half-precision (FP16), which AMD claims combines the performance of 8-bit operations with the accuracy of 16-bit.

Looking at how the AMD XDNA architecture differs from the typical design of a multicore processor, the XDNA design incorporates flexible compute resources with an adaptive memory hierarchy. Rather than a fixed compute model or a static memory hierarchy, the XDNA (Ryzen AI) engine uses a grid of interconnected AI Engines (AIEs). Each engine is architected to dynamically adapt its computation and memory resources to the task at hand, a design intended to improve scalability and efficiency.

Touching more on the tiled approach to the AIEs, AMD calls this a spatial architecture: it couples a tiled dataflow structure with a programmable interconnect and flexible partitioning. The tiled dataflow structure enables deterministic performance with no cache misses and also simplifies memory management. The programmable interconnect substantially decreases the demand for memory bandwidth, allowing resources to be allocated efficiently. And the flexible partitioning design enables real-time performance while accommodating a variety of AI inferencing tasks, from real-time video and audio processing to content-creation workflows.

The XDNA 2 architecture builds upon the preexisting XDNA architecture and adds even more AI engines to increase throughput. The AMD XDNA 2 implementation in Strix Point has 32 AI engine tiles, which is 12 more than the previous generation. Beyond the additional tiles, the XDNA 2 architecture also doubles the number of MACs per tile and offers 1.6x more on-chip memory than the previous generation.
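As a quick sanity check on those figures, here's a back-of-envelope sketch of how the generational increases compound. The tile count and MACs-per-tile figures come from AMD's disclosures; holding clock speeds constant is purely an assumption for illustration.

```python
# Back-of-envelope MAC throughput scaling from XDNA (Phoenix/Hawk Point)
# to XDNA 2 (Strix Point). Assumes clocks are unchanged, which AMD has
# not confirmed.
xdna1_tiles = 20           # 32 tiles in XDNA 2, minus the 12 added
xdna2_tiles = 32
macs_per_tile_factor = 2   # "double the number of MACs per tile"

raw_scaling = (xdna2_tiles / xdna1_tiles) * macs_per_tile_factor
print(f"Raw MAC throughput scaling: {raw_scaling:.1f}x")   # -> 3.2x
# Any remaining gap to AMD's claimed 5x compute uplift over the Ryzen
# 7040 series would have to come from clocks and data-type changes.
```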

All told, AMD is claiming 50 TOPS of NPU performance, which is more than Intel and Qualcomm's current offerings. Whether TOPS is even a relevant measure of AI performance remains a subject of debate, and Microsoft set the ball rolling on that one by setting the bar for Copilot+ at 40 TOPS.

The XDNA 2 architecture isn't just about trying to outdo the competition on TOPS; it is also designed with power efficiency in mind. AMD claims that its XDNA 2 NPU provides 5x the compute capacity at double the power efficiency compared to the NPU used in the Ryzen 7040 series. This is made possible through various design choices, including column-based power gating, which AMD says offers significantly better battery life while still allowing the NPU to handle as many as eight concurrent spatial streams when multitasking.

One of the major feature inclusions with the XDNA 2 architecture is support for Block Floating Point (Block FP16) data. The simple way to explain it is that it offers the performance and speed of 8-bit operations, but employs additional tricks to bring the precision closer to that of 16-bit operations. Notably, this is done without further quantization or reducing the size of the data being processed.

As with other neural network precision optimizations, the purpose of Block FP16 is to cut down on the amount of computational work required; in this case, using 8-bit math without incurring the full drawbacks of stepping down from 16-bit math – namely, poorer results from the reduced precision. Current-generation NPUs can already do native 8-bit processing (and 16-bit, for that matter), but this requires developers to either optimize (and quantize) their software for 8-bit processing, or take the speed hit of staying at 16-bit. AI is still a relatively young field, so software developers are still working to figure out just how much precision is enough (with that line seeming to repeatedly drop like a limbo bar), but the basic idea is that this tries to let software developers have their cake and eat it, too.

With all of that said, from a technical perspective, Block FP16 (aka Microscaling) is not a new technique in and of itself. But AMD will be the first PC NPU vendor to support it, with Intel's forthcoming Lunar Lake set to join them. So while this is a new-to-AMD feature, it's not going to be a unique feature.

As for how Block FP16 works, AMD's own material on the subject is relatively high-level, but we know from other sources that it's essentially a form of fixed-point 8-bit computation with an additional exponent. Specifically, Block FP16 uses a single exponent shared across a whole block of values, rather than each floating point value carrying its own exponent. So rather than each FP16 number having a sign bit, a 5-bit exponent, and a 10-bit significand, you have one 8-bit exponent that is shared among all of the numbers in a block, plus an 8-bit significand for each value.
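To make that layout concrete, below is a minimal Python sketch of a shared-exponent block format along those lines. The block size of 32, the rounding behavior, and the helper names (quantize_block_fp16, dequantize_block_fp16) are all illustrative assumptions; AMD has not published the exact format XDNA 2 uses.

```python
import numpy as np

def quantize_block_fp16(values, block_size=32):
    """Pack floats into blocks of int8 significands plus one shared
    exponent per block. Illustrative only; not AMD's actual format."""
    values = np.asarray(values, dtype=np.float32)
    n_blocks = -(-len(values) // block_size)            # ceiling division
    padded = np.zeros(n_blocks * block_size, dtype=np.float32)
    padded[: len(values)] = values
    blocks = padded.reshape(n_blocks, block_size)

    # Pick the smallest shared exponent that makes the largest value in
    # each block fit into a signed 8-bit significand. frexp() gives
    # max_abs = m * 2**e with m in [0.5, 1), so max_abs / 2**(e - 7)
    # lands in [64, 128).
    _, e = np.frexp(np.abs(blocks).max(axis=1))
    shared_exp = (e - 7).astype(np.int32)

    significands = np.round(blocks / 2.0 ** shared_exp[:, None])
    significands = np.clip(significands, -128, 127).astype(np.int8)
    return significands, shared_exp

def dequantize_block_fp16(significands, shared_exp):
    """Reconstruct approximate floats: value = significand * 2**exp."""
    return significands.astype(np.float32) * 2.0 ** shared_exp[:, None]

# Round-trip demo: the reconstruction error per value stays within half
# a step of the block's shared scale.
x = np.random.randn(64).astype(np.float32)
sig, exp = quantize_block_fp16(x)
x_hat = dequantize_block_fp16(sig, exp).ravel()[: len(x)]
print("max abs error:", np.abs(x - x_hat).max())
```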

This essentially allows the processor to cheat by processing the unique significands as INT8 (or fixed-point 8-bit) numbers, while skipping the per-value work on the shared exponent. That is why Block FP16 performance largely matches INT8 performance: it's fundamentally 8-bit math. But by having a shared exponent, software authors can move the whole number range window for the computation to a specific range – one that would normally be outside of the range offered by the puny exponent of a true FP8 number.
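And here's the payoff, continuing the sketch from above: a dot product over two such blocks runs entirely on integer multiply-accumulates, with the shared exponents applied in a single fixup at the end. Again, this illustrates the general block floating point technique under the same assumed format, not the actual XDNA 2 datapath.

```python
import numpy as np

def block_dot(sig_a, exp_a, sig_b, exp_b):
    """Dot product of two single-block vectors from the sketch above.

    The inner loop is pure 8-bit multiply-accumulate (widened to int32
    so the accumulator can't overflow), which is why Block FP16
    throughput can track INT8 throughput. The shared exponents cost
    just one floating point scale at the very end.
    """
    acc = int(np.dot(sig_a.astype(np.int32), sig_b.astype(np.int32)))
    return acc * 2.0 ** int(exp_a + exp_b)

# e.g. with the quantizer above, for the first block of each vector:
#   block_dot(sig_a[0], exp_a[0], sig_b[0], exp_b[0])
```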

Most AI applications expect 16-bit precision, and Block FP16 addresses this requirement by bringing both high performance and high accuracy to mobile AI workloads. This makes Block FP16 an important component for pushing AI technology forward, and it is something AMD is pushing hard on.

Ultimately, the XDNA 2-based NPU in the Ryzen AI 300 series of mobile chips is really about processing AI workloads and running features such as Microsoft Copilot+ in a more power-efficient manner than running them on the integrated graphics. And by being able to deliver 8-bit performance with near-16-bit precision, it gives developers one more lever to pull to get the most out of the hardware.

The AMD XDNA 2 architecture, which is set to debut with the Ryzen AI 300 series, is going to provide the key to unlocking the AI PC, or at least what Microsoft defines with their 40 TOPS requirement for Copilot+. By bringing Block FP16 into the equation, AMD brings (close to) 16-bit accuracy at 8-bit speed, making it more performant for some AI applications. Altogether, the integrated NPU is slated to offer up to 50 TOPS of compute performance.

AMD was the first x86 SoC vendor to include an NPU within their chips, and with the growing need for on-chip AI solutions to unlock many software features, they're expecting the hardware (and the die space it represents) to be put to good use. The XDNA 2 architecture ensures that AMD remains at the forefront, offering a solid combination of performance and versatility for the mobile market.
