Hexagon 780: A Whole New IP for AI & DSP

Every year Qualcomm likes to talk about its new Hexagon DSPs, with the last few generations also bringing to the table new Tensor Accelerators dedicated to ML inferencing. This year’s Snapdragon 888 also hypes up its new Hexagon 780 design, with the difference being that this time around the generational improvements are truly humongous.

The new Hexagon 780 accelerator IP truly deserves a large increment in its IP numbering scheme, as it’s essentially a ground-up redesign of the company’s existing DSP with scalar and vector execution engines, and the recent Tensor Accelerators. Previously all these execution engines were acting as discrete independent blocks within the Hexagon 600 series family, but that’s now changed in the new IP design.

The new IP block fuses together all the scalar, tensor, and vector capabilities into a single monolithic IP, vastly increasing the performance and power efficiency of workloads that make use of all the mixed capabilities of the design.

In terms of performance uplifts, scalar execution capabilities are said to be increased by 50%, while tensor execution throughput has doubled. The vector extension units seem to have remained the same this generation, but actual workload performance should still increase thanks to the new memory architecture of the IP block.

Qualcomm states that they’ve increased the on-chip SRAM dedicated to the block 16-fold, allowing larger machine-learning inference models to fit within the block’s memory and greatly accelerating their performance. This larger memory pool also allows for coherency between the scalar, vector and tensor units, which vastly speeds up workload handoffs between the different execution engines. I asked about the actual size of this new memory, but the company wouldn’t disclose any further details, just stating that it’s significant.

The company’s engineers were extremely hyped up about the new design, stating that the performance and flexibility of the new design is well beyond that of what other companies can achieve through disaggregated DSP and ML inference engines, sometimes even from different IP vendors.

The most important figure for the new design is the 3x performance per watt claim, which is just a massive generational improvement that you rarely see in the industry.

As is usual for Qualcomm, the company doesn’t actually state the per-block performance increases, but instead opts to showcase an aggregate computational throughput figure shared amongst all of the SoC’s IP blocks, including CPU, GPU, and the new Hexagon accelerator block. This new figure lands at 26 TOPS for the Snapdragon 888, which is 73% higher than the 15 TOPS figure of the Snapdragon 865. Given that we’ve seen significant changes in all IP blocks this generation, I won’t attempt a breakdown estimate as it’s likely going to be off the mark anyhow.
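The 73% figure follows directly from the two aggregate numbers quoted above. A minimal sketch of that arithmetic (the function and variable names are only illustrative):

```python
def percent_uplift(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, expressed in percent."""
    return (new / old - 1.0) * 100.0

# Aggregate SoC figures quoted in the text, in TOPS.
SNAPDRAGON_888_TOPS = 26.0
SNAPDRAGON_865_TOPS = 15.0

uplift_888_vs_865 = percent_uplift(SNAPDRAGON_888_TOPS, SNAPDRAGON_865_TOPS)
print(f"{uplift_888_vs_865:.1f}%")  # 73.3%
```

Since the figure is a sum across CPU, GPU and Hexagon, this uplift says nothing about how much any single block improved.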

The Adreno 660 - A 35% faster GPU

Amongst the improvements that lead up to that 26 TOPS figure is a vastly improved GPU in the form of the new Adreno 660.

Qualcomm still holds architectural details of their GPUs very close to their chest and thus doesn’t disclose very much about the new GPU design and what has actually changed, but one thing they did talk about is the addition of new mixed-precision dot product as well as FP16/FP32 wave matrix-multiply instructions, which allow the new GPU to increase AI performance by up to 43%.
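Qualcomm doesn’t say which precisions the new dot-product instructions combine, so the following is only a sketch of the general idea as it appears in other GPU instruction sets: narrow inputs are multiplied and summed into a wider accumulator. The 4-wide int8 inputs, the 32-bit accumulator and the function name are all assumptions made for illustration, not a description of the Adreno 660.

```python
def dot4_int8_acc32(a, b, acc=0):
    """Dot product of two 4-element int8 vectors added to a 32-bit accumulator.

    Hypothetical illustration of a mixed-precision dot product: inputs are
    8-bit, but products and the running sum are kept at 32-bit width.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-element vectors")
    for x in list(a) + list(b):
        if not -128 <= x <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")
    total = acc + sum(x * y for x, y in zip(a, b))
    # Wrap to the signed 32-bit range, as a fixed-width register would.
    return (total + 2**31) % 2**32 - 2**31

result = dot4_int8_acc32([127, -128, 5, 7], [127, 127, -3, 2])
print(result)  # -128
```

A single product here (127 × 127 = 16129) already overflows 8 bits, which is why the accumulator has to be wider than the inputs.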

We’re also seeing the addition of variable rate shading (VRS) to the Adreno GPU architecture, allowing for coarser pixel shading over larger pixel blocks for object and screen areas which don’t require as much detail, or where the loss of native-resolution shading wouldn’t be noticeable. This is also a major feature that’s being introduced in the consoles and new-generation PC GPUs, and it should bring greater performance uplifts for new gaming titles which take advantage of it. It’s great to see Qualcomm bringing this to the mobile space along with the rest of the industry.
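To get a feel for why coarser shading saves work, the sketch below counts pixel-shader invocations when one shading result is reused across a block of pixels. The render-target size and the 2x2 rate are hypothetical values chosen for illustration; nothing here reflects how Adreno actually implements VRS.

```python
import math

def shading_invocations(width: int, height: int, rate_x: int = 1, rate_y: int = 1) -> int:
    """Pixel-shader invocations when one result covers a rate_x x rate_y pixel block."""
    return math.ceil(width / rate_x) * math.ceil(height / rate_y)

WIDTH, HEIGHT = 2400, 1080  # hypothetical render target

full_rate = shading_invocations(WIDTH, HEIGHT)           # 1x1: every pixel shaded
coarse_rate = shading_invocations(WIDTH, HEIGHT, 2, 2)   # 2x2: one result per 4 pixels

print(full_rate, coarse_rate, coarse_rate / full_rate)
```

In practice a game only applies a coarse rate to some draws or screen regions, so the real saving sits somewhere between these two extremes.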

For graphics workloads, the new GPU is advertised as being able to increase performance by up to 35%, which is a major generational performance leap.

Such a performance jump would actually signify that Qualcomm may very well regain the gaming performance crown this generation, having lost it to Apple’s SoCs over the last two generations. Apple’s latest A14 has seen rather conservative gains on the GPU side this year, so a 35% performance gain over the Snapdragon 865 should very much allow the new Snapdragon 888 to retake the leadership position.

A 35% performance increase with a 20% power efficiency increase would indicate that the new SoC achieves the higher performance at the cost of slightly higher power consumption, but given the Snapdragon 865’s excellent power characteristics of below 4W, Qualcomm does have a little leeway to increase power this generation.
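The reasoning above can be made explicit. Assuming both claims apply at the same peak operating point (an assumption, since Qualcomm doesn’t specify this), power scales as performance divided by performance per watt:

```python
def implied_power_ratio(perf_gain: float, efficiency_gain: float) -> float:
    """Power ratio implied by a performance gain and a perf-per-watt gain.

    Both gains are fractions (0.35 means +35%). Assumes the two claims
    describe the same operating point, which the source does not confirm.
    """
    return (1.0 + perf_gain) / (1.0 + efficiency_gain)

power_ratio = implied_power_ratio(0.35, 0.20)
print(f"{power_ratio:.3f}")  # 1.125

# "Below 4W" for the Snapdragon 865 is treated here as a 4.0 W upper bound.
upper_bound_watts = 4.0 * power_ratio
print(f"{upper_bound_watts:.1f} W")  # 4.5 W
```

Under those assumptions the new GPU would draw roughly 12.5% more power at peak, so anything up to about 4.5W would be consistent with the two advertised figures.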

3200MHz LPDDR5

The new Snapdragon 888 moves from a hybrid memory controller to one that focuses on LPDDR5, and also raises the supported LPDDR5 frequency to 3200MHz (or LPDDR5-6400).
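The two figures in that sentence are the same speed expressed two ways: data moves on both clock edges, so 3200MHz corresponds to 6400 mega-transfers per second. The sketch below also derives a peak bandwidth, but note that the 64-bit total bus width is an assumption that does not come from this section.

```python
def transfer_rate_mts(clock_mhz: float) -> float:
    """Double data rate: two transfers per clock cycle, in MT/s."""
    return clock_mhz * 2

def peak_bandwidth_gbs(rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate and bus width."""
    return rate_mts * 1e6 * bus_width_bits / 8 / 1e9

ASSUMED_BUS_WIDTH_BITS = 64  # assumption, not stated in this section

lpddr5_rate = transfer_rate_mts(3200)
lpddr5_bandwidth = peak_bandwidth_gbs(lpddr5_rate, ASSUMED_BUS_WIDTH_BITS)
print(lpddr5_rate, lpddr5_bandwidth)
```

This is a theoretical ceiling; what the SoC can actually pull from memory depends on the controller and the rest of the memory subsystem.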

For the Snapdragon 865, Qualcomm was rather unenthusiastic about the LPDDR5 switch, saying that it didn’t bring all too great improvements to performance or power efficiency – something which we actually did test out, coming to the same conclusion in our review of the two OnePlus 8 phones, where the LPDDR4X variant ended up being no slower and seemingly actually more efficient to us. Apple this year also kept on using LPDDR4X on their A14 and M1 SoCs – suggesting that the benefits aren’t all that great.

For the Snapdragon 888 however, Qualcomm’s engineers seemed more upbeat about LPDDR5 and the new SoC actually being able to utilise the increased memory bandwidth this generation. Without going into details, the company also stated that they’ve improved the overall design of the memory subsystem, improving aspects such as latency.

As for the rest of the memory subsystem, Qualcomm still employs a 3MB system-level cache in front of the memory controllers, which all SoC IP blocks are able to take advantage of.


123 Comments


  • dudedud - Wednesday, December 2, 2020 - link

    If the Vivo (V2056A) GB scores are legit, this implementation of the X1 will be much closer to the A12 than to the A13.
  • abufrejoval - Thursday, December 3, 2020 - link

    I'd say they design their SoC to suit their usage. And that is a mixed bag full of compromises, matching what people are actually doing on these devices. Very little of that is HPC.

    The single X1 is for all that fat single-threaded desktop-class browser code out there, that only gets tolerable response times on a 4GHz Pentium 4, but hopefully won't run longer than a couple hundred milliseconds, because an X1 core simply can't run 24x7 on a mobile power budget.

    Mobile games better run on the efficiency cores mainly (apart from the GPU), with perhaps short bursts on the power cores, because otherwise not even an hour worth of game time may be possible on a single charge (or without burning your fingers).

    In short, don't expect all of these resources used at full capacity for any extended time. Instead these SoCs become a computing appliance farm with specialists for many different tasks, designed to do very little to nothing most of the time and as aesthetically pleasing inside as any SME server room that evolved with the business for 20 years.

    To ask for a revolutionary design on a new process from a different fab is perhaps asking just a little bit too much, especially when they need to sell another generation next year.

    For such a bad job I am seriously considering that the 888 may be enough of an upgrade over my current 855 to consider, once they sell these devices at reasonable prices (~€500) and with LineageOS support late 2021 or early 2022 with the 895's imminent arrival.

    Honestly I've stopped asking for more smartphone computing power since the 820, been perfectly happy with energy efficiency since the 835 and been waiting for a proper desktop mode since the first DeX on an 800.

    It's hard to sell more when the need doesn't really grow or you can have 500 Watts of desktop power any time you sit down for something serious.
  • eastcoast_pete - Wednesday, December 2, 2020 - link

    As for the choice of Samsung's 5 nm LPE for the manufacturing, I suspect it's not just TSMC's capacity that made QC go Sammy. My guess is that Samsung fabbed it for less - that simple. Sort of why NVIDIA chose Samsung's 8 nm for Ampere; they did it for less.
  • abufrejoval - Thursday, December 3, 2020 - link

    With China flexing its muscle over Taiwan, Korea may be more attractive in other ways, too.

    Then I wonder if the 5nm node on Samsung may actually be faster from start to finish with EUV replacing all those multi-masking and multi-patterning steps...
  • trivik12 - Wednesday, December 2, 2020 - link

    Thanks for the detailed article. Did Qualcomm go with Samsung because they will be taking process leadership in the near future, with the 1st GAAFET implementation in 2023? It would be good to see where Samsung 5nm is relative to TSMC.

    Looking at improvements from process and architecture, I feel even ARM is close to hitting the wall.
  • brucethemoose - Wednesday, December 2, 2020 - link

    IDK about that. The schedule is *far* from set in stone, with how extreme the physics are.
  • tkSteveFOX - Wednesday, December 2, 2020 - link

    I think ARM wants to keep everyone out of reach of Apple's performance dominance.
    A55 is vastly inferior to the last 3 gens of small cores Apple has used.
    X1 is decent but seeing just a single core in there isn't great.
    Not using 8MB of cache is purely driven by greed. It would make the chips cost more and that's why QC aren't doing it, even though there are substantial gains from using more cache on a CPU in heavy workloads.
    All these companies want to make money, so they cut costs of their chips and that's why Android will always be behind iOS.
    Apple gives you the most bleeding edge stuff without sacrificing on the chip.
  • id4andrei - Wednesday, December 2, 2020 - link

    That is because QC has to sell these chips to OEMs that have to be able to afford them. Apple doesn't have to sell to anyone. They have high margins to justify their investment in performance beyond the scope of Xiaomi, LG, Samsung.
  • melgross - Wednesday, December 2, 2020 - link

    ARM doesn’t care about that. They sell designs that are good enough to make them enough money to make a good profit and allow for further work. It’s up to the OEMs to make the changes allowed through the design license to make some improvements. Failing that, companies can get an architectural license as Apple and a few others do, which lets them design cores and subsystems from scratch.

    Both Qualcomm and Samsung tried that for a few years, but failed to come up with good designs. So they went back to licensing designs from ARM.
  • Silver5urfer - Wednesday, December 2, 2020 - link

    SD820 was not a failure. It was a full custom design which was not a disaster; it was superb and even had higher IPC than the 835, and that's where all this Kryo started. Then Qcomm moved all of their engineering architecture teams to Centriq, the ever famous most powerful ARM server processor, and axed it even after putting so much R&D into it along with Cloudflare marketing. Since then Qcomm never made any custom cores. Only Samsung did, with ambitious aims, but failed to optimize it for the smartphone.

    And in the end it doesn't even matter, because phones are going to be on parity with A14. Just looking at gaming performance and application performance tests, which are real world, shows the 865 is not even that far behind the A14 in some aspects. And Qcomm is putting money where it matters - GPU and 5G.
