Hexagon 780: A Whole New IP for AI & DSP

Every year Qualcomm likes to talk about its new Hexagon DSPs, with the last few generations also bringing new Tensor Accelerators dedicated to ML inferencing to the table. This year’s Snapdragon 888 also hypes up the new Hexagon 780 design, with the difference being that this time around the generational improvements are truly humongous.

The new Hexagon 780 accelerator IP truly deserves a large increment in its IP numbering scheme, as it’s essentially a ground-up redesign of the company’s existing DSP with scalar and vector execution engines, and the recent Tensor Accelerators. Previously all these execution engines were acting as discrete independent blocks within the Hexagon 600 series family, but that’s now changed in the new IP design.

The new IP block fuses together all the scalar, tensor, and vector capabilities into a single monolithic IP, vastly increasing the performance and power efficiency of workloads that make use of all the mixed capabilities of the design.

In terms of performance uplifts, scalar execution capabilities are said to be increased by 50%, while tensor execution throughput has doubled. The vector extension units seem to have remained the same this generation, but actual workload performance should still increase thanks to the new memory architecture of the IP block.

Qualcomm states that they’ve increased the on-chip SRAM dedicated to the block 16-fold, allowing larger machine-learning inference models to fit within the block’s memory and greatly accelerating their performance. This larger memory pool also allows for coherency between the scalar, vector and tensor units, vastly reducing workload handoff time between the different execution engines. I asked about the actual size of this new memory, but the company wouldn’t disclose any further details, just stating that it’s significant.

The company’s engineers were extremely hyped up about the new design, stating that its performance and flexibility are well beyond what other companies can achieve through disaggregated DSP and ML inference engines, which sometimes even come from different IP vendors.

The most important figure for the new design is the 3x performance per watt claim, which is just a massive generational improvement that you rarely see in the industry.

As is usual for Qualcomm, the company doesn’t actually state the per-block performance increases, but instead opts to showcase an aggregate computational throughput figure shared amongst all of the SoC’s IP blocks, including CPU, GPU, and the new Hexagon accelerator block. This new figure lands in at 26TOPs for the Snapdragon 888, which is 73% higher than the 15TOPs figure of the Snapdragon 865. Given that we’ve seen significant changes in all IP blocks this generation, I won’t attempt a breakdown estimate as it’s likely going to be wrong and off-the-mark anyhow.
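As a quick sanity check on the aggregate figures quoted above, the generational uplift can be recomputed directly from the two TOPs numbers. This is only arithmetic on Qualcomm's advertised totals; the function name is made up for illustration and nothing here says anything about the per-block breakdown.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Aggregate AI throughput figures as quoted by Qualcomm (TOPs).
s865_tops = 15.0
s888_tops = 26.0

uplift = percent_increase(s865_tops, s888_tops)
print(f"Snapdragon 888 vs 865: {uplift:.0f}% higher")  # prints "... 73% higher"
```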

The Adreno 660 - A 35% faster GPU

Amongst the improvements which lead up to that 26TOPs figure is a new vastly improved GPU in the form of the new Adreno 660.

Qualcomm still holds architectural details of their GPUs very close to their chest and thus doesn’t disclose very much about the new GPU design and what has actually changed, but one thing they did talk about is the addition of new mixed-precision dot product as well as FP16/FP32 wave matrix-multiply instructions, which allow the new GPU to increase AI performance by up to 43%.

We’re also seeing the addition of variable rate shading (VRS) to the Adreno GPU architecture, allowing for coarser pixel shading across larger pixel blocks for object and screen areas which don’t require as much detail, or where native-resolution shading wouldn’t be noticeable. This is also a major feature that’s being introduced in the consoles and new-generation PC GPUs, which should bring a greater performance uplift for new gaming titles which take advantage of it. It’s great to see Qualcomm bringing this to the mobile space along with the rest of the industry.
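To give a rough feel for why coarser shading saves work, here is a minimal sketch that counts pixel-shader invocations for a screen region at different shading rates. It is a simplified model, not a description of how the Adreno 660 implements VRS: the function name, the region size and the idea of applying one rate to a whole region are all assumptions made for illustration.

```python
import math


def shader_invocations(width: int, height: int, rate_x: int, rate_y: int) -> int:
    """Pixel-shader invocations needed for a region when each invocation
    covers a rate_x by rate_y block of pixels (1x1 is native shading)."""
    return math.ceil(width / rate_x) * math.ceil(height / rate_y)


# Hypothetical 1920x1080 region shaded at native and at coarser rates.
native = shader_invocations(1920, 1080, 1, 1)
coarse_2x2 = shader_invocations(1920, 1080, 2, 2)

print(native)               # 2073600
print(coarse_2x2)           # 518400
print(coarse_2x2 / native)  # 0.25
```

In practice only parts of a frame would be shaded coarsely, so the real savings depend on how much of the scene a game marks as low-detail.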

For graphics workloads, the new GPU is advertised as being able to increase performance by up to 35%, which is a very major generational performance leap.

Such a performance jump would actually signify that Qualcomm may very well regain the gaming performance crown this generation, having lost it to Apple’s SoCs over the last two generations. Apple’s latest A14 has seen rather conservative gains on the GPU side this year, so a 35% performance gain over the Snapdragon 865 should very much allow the new Snapdragon 888 to retake the leadership position.

A 35% performance increase with a 20% power efficiency increase would indicate that the new SoC achieves the higher performance at the cost of slightly higher power consumption, but given the Snapdragon 865’s excellent power characteristics of below 4W, Qualcomm does have a little leeway to increase power this generation.
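The power implication can be worked out with a couple of lines of arithmetic. The sketch below assumes that the "20% power efficiency increase" means 1.2x performance per watt at peak, which is an interpretation of the marketing figure rather than something Qualcomm spelled out; the function name is likewise just illustrative.

```python
def relative_power(perf_gain: float, efficiency_gain: float) -> float:
    """Power relative to the previous generation, given relative performance
    and relative performance-per-watt (power = performance / efficiency)."""
    return perf_gain / efficiency_gain


# Advertised Adreno 660 figures versus the Snapdragon 865's GPU.
ratio = relative_power(perf_gain=1.35, efficiency_gain=1.20)
print(f"Relative power: {ratio:.3f}x")  # prints "Relative power: 1.125x"
```

Under that assumption the new GPU would draw roughly 12.5% more power at peak than its predecessor.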

3200MHz LPDDR5

The new Snapdragon 888 moves from a hybrid memory controller to one that focuses on LPDDR5, and also increases the frequency support for new LPDDR5 to 3200MHz (or LPDDR5-6400).
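The relation between the 3200MHz clock and the LPDDR5-6400 name comes from the double data rate signalling: two transfers per clock. The sketch below shows that conversion and a theoretical peak bandwidth figure. Note that the 64-bit total bus width is an assumption that does not come from this article, and the function names are illustrative only.

```python
def data_rate_mtps(clock_mhz: float) -> float:
    """Double data rate: two transfers per clock cycle (MT/s)."""
    return clock_mhz * 2


def peak_bandwidth_gbps(rate_mtps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s for a given bus width."""
    return rate_mtps * bus_width_bits / 8 / 1000


rate = data_rate_mtps(3200)  # 6400 MT/s, hence "LPDDR5-6400"
# ASSUMPTION: 64-bit wide memory interface (not stated in the article).
bandwidth = peak_bandwidth_gbps(rate, bus_width_bits=64)
print(f"{rate:.0f} MT/s, {bandwidth:.1f} GB/s peak")  # 6400 MT/s, 51.2 GB/s peak
```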

For the Snapdragon 865 Qualcomm was rather unenthusiastic about the LPDDR5 switch, saying that it didn’t bring all too great improvements to performance or power efficiency – something we actually tested, coming to the same conclusion in our review of the two OnePlus 8 phones, where the LPDDR4X variant ended up being no slower and seemingly actually more efficient to us. Apple this year also kept on using LPDDR4X on their A14 and M1 SoCs – pointing out that the benefits aren’t all that great.

For the Snapdragon 888 however, Qualcomm’s engineers seemed more upbeat about LPDDR5 and the new SoC actually being able to utilise the increased memory bandwidth this generation. Without going into details, the company also stated that they’ve improved the overall design of the memory subsystem, improving aspects such as latency.

On the part of the memory subsystem, Qualcomm still employs a 3MB system-level-cache in front of the memory controllers, with the ability of all SoC IP blocks to take advantage of this cache.


123 Comments


  • Alistair - Wednesday, December 2, 2020

    Focus on Camera and AI, with tiny ~10 percent CPU performance improvements, and the exact same 4 little cores for the 4th year. I admit I'm a little disappointed. Last time I paid attention it was all about power and Hercules A88 cores.

    Hopefully we see good $700 flagships from Samsung this year instead of the way overpriced S20 series, maybe the integrated modem will help.
  • zeeBomb - Wednesday, December 2, 2020

    Awww yissss. Some of the little things as some people said that could be implemented..but hey I'm all in for the luckiest Snapdragon chipset yet.
  • Raqia - Wednesday, December 2, 2020

    Any thoughts on specific use cases for the hypervisor feature they enabled or any comments from them? It would be nice to just run Linux or Windows apps from my phone attached to a monitor and kb.
  • BedfordTim - Thursday, December 3, 2020

    Samsung Dex might help you.
  • abufrejoval - Thursday, December 3, 2020

    To me that is the most exciting feature of the 888 and one where I'm not sure that Google's Android will pick up on.

    Today mobile phones (and increasingly desktops) are under the control of the ecosystem vendor, all trust and cryptography tied to Apple or Google (or Microsoft). Of course device vendors also want a piece (DeX) but really it's the owner who should be in charge.

    Going forward the number of stakeholders can only increase, there will be governments with vested interests and specific compliance concerns, corporate employers etc.

    So the ability to run a flexible number of enclaves that can be guaranteed not to step or spy on each other will eventually become critical, but also allow to break the stranglehold that Apple and Google currently have on the device you own, but don't control.

    We already have enclaves inside SIM cards and baseband controllers, but they are completely physical, secure that way, but not flexible and affordable to multiply.

    So while I would love to have more details, know if this is like SEV/MKTME on x86 or even better, I don't see how Apple, Google or Microsoft (Pluton!) or even the NSA for that matter, can be motivated to hand the supreme power to you and me, while they can now play in a walled garden we oversee, even if we can't sniff inside.

    In my book I should be able to block conversations between enclaves and their cloud controls, while Apple is pushing the envelope in the opposite direction, hiding their device/command & control-center conversations from owners.

    IMHO that needs to be made painfully illegal, before all the others jump on that bandwagon.
  • ZolaIII - Wednesday, December 2, 2020

    This actually looks like a horrible SoC, which QC didn't bother to improve much. Regarding Samsung's 7nm node things are actually the other way around. First gen was based on HD lib and with EUV, which was both better & more efficient than TSMC 7nm without EUV but it cost more. The second gen is actually 7nm with UHD lib while TSMC 5 nm is a fully new node with around 2.2x higher density (to Samsung 7nm UHD) but sadly not significantly more power efficient while its cost went up more than double. On the other hand Samsung's cost went down as density went up by around 50% and thanks to maturity and good yields it's estimated at twice as many gates for the same price.
    I really wanted to see a new gen of Adreno's (for 4 SoC generations now) instead of that this is minor rework (couple new functions) in a same cluster configuration as it's predecessor. Now imagine 2x increase in logic with TSMC 5 nm process at 30% lower speed. I do think quoted 35% increase is in ideal conditions (utilising new stuff which we won't see in the next couple of years) while in reality we will see just a small insignificant increase.
    What's the use of a single high performance CPU core which outruns the rest by working 50% IPC? Sure each & every first core on any cluster will be the one to bear the burden of being where everything is started before workers are deployed but it only needs a couple % more capacity to balance that.
    Seems they didn't do anything to increase cache coherence and efficiency.
    DSP has seen a real improvement but I don't look at QC Hexagon as something good (stiff, proprietary, hard to get to & not flexible) to the point of thinking such things shouldn't exist. The biggest gain will be the integration of the 5G modem which will cut power consumption in half.

    All in all a rather bad job.
  • iphonebestgamephone - Wednesday, December 2, 2020

    Its good as long as i get the 25% and 35% improvements.
  • ZolaIII - Wednesday, December 2, 2020

    Thing is you will only get around 20~25% improvement in a single (X1) core while in full all CPU core utilisation that will sink to only a couple % (thanks to bus, topology and memory coherence bottlenecks). Early Geekbench results already confirmed this. It will be the same regarding GPU, it will get its 30% advantage but only when new features are used (which they won't be for a long time).
  • iphonebestgamephone - Wednesday, December 2, 2020

    Oh man, i forgot what they did with the 855, 45% cpu improvement over 845. But that was only on geekbench 4 single core. With maybe 15% on multi.
  • Raqia - Wednesday, December 2, 2020

    In addition to integrating the Modem, they also appear to be integrating WIFI (FastConnect 6900) and Bluetooth on die now. Prior year diagrams seem to have illustrated WIFI and Bluetooth off chip.

    "I really wanted to see a new gen of Adreno's (for 4 SoC generations now) instead of that this is minor rework (couple new functions) in a same cluster configuration as it's predecessor."

    Are you really sure about it being a minor rework? Variable rate shading may be possible or not give as much of a performance boost without some changes to the hardware. The rest is just marketing much like the moniker 888 is.
