5nm / 5LPE: What do we know?

Starting off with the biggest new change of this generation, both the Snapdragon 888 and the Exynos 2100 are manufactured on Samsung’s new 5nm process node, which is the biggest unknown in today’s comparison.

What’s important to remember is that although Samsung calls this node 5nm, its design and characteristics are more similar to those of their 7nm node. Key new characteristics of the node are the reintroduction of single diffusion breaks (SDB) on an EUV process node, as well as slight changes in the cell libraries of the process.

Advertised PPA Improvements of New Process Technologies
Data announced by companies during conference calls, press briefings and in press releases

                 7LPP vs 10LPE   6LPP vs 7LPP   5LPE vs 7LPP   3GAE vs 7LPP
Power            50%             lower          20%            50%
Performance      20%             ?              10%            35%
Area Reduction   40%             ~9%            <20%           40%

Per Samsung’s own numbers, the foundry claims that 5LPE is either 20% lower power than 7LPP, or 10% more performance. These are actually quite important figures to put into context, particularly when we’re comparing designs which are manufactured on TSMC’s process nodes.

In last year’s review of the Galaxy S20 series and the Exynos 990 and Snapdragon 865 SoCs, an important data-point that put things into context was Qualcomm’s Snapdragon 765 SoC, which was also manufactured on Samsung’s 7LPP node and featured Cortex-A76 cores. In that comparison we saw that the Exynos 990’s and the Snapdragon 765’s A76 cores behaved very similarly in terms of power consumption, however they fell behind TSMC-based Cortex-A76 cores by anywhere from 20% to 30%.

In that context, Samsung’s 5LPE process node improving power by 20% would mean they’d only be catching up with TSMC’s 7nm nodes.
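
To make that arithmetic explicit, here is a minimal sketch. It assumes that "fell behind by 20% to 30%" means the Samsung-built cores drew 20% to 30% more power than their TSMC counterparts at the same performance, which is my reading rather than a measured definition. The function name is purely illustrative.

```python
# Relative power of Samsung 7LPP A76 cores versus TSMC 7nm A76 cores (TSMC = 1.00).
# Assumption: "fell behind by 20% to 30%" is read as 20-30% higher power.
deficits = [0.20, 0.30]

# Samsung's advertised 5LPE power reduction versus 7LPP.
claimed_reduction = 0.20


def relative_power_after_shrink(deficit, reduction):
    """Power relative to TSMC 7nm after applying the claimed node reduction."""
    return (1.0 + deficit) * (1.0 - reduction)


for d in deficits:
    rel = relative_power_after_shrink(d, claimed_reduction)
    print(f"7LPP deficit {d:.0%} -> 5LPE at {rel:.2f}x TSMC 7nm power")
```

Under these assumptions the result lands between roughly 0.96x and 1.04x of TSMC 7nm power, which is parity rather than a lead.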

Cortex A55 @ 1.8GHz - SPEC2006 456.hmmer Power

An interesting comparison to make, and probably one of the rare ones we’re actually able to make today, is the one between the Cortex-A55 cores inside both the Snapdragon 865 and the new Snapdragon 888. Both SoCs feature the same IP cores, clock them at the same 1.8GHz frequency, and feature the same amount of L2 cache, with their only real difference being their process nodes.

Using SPEC’s 456.hmmer, a workload that primarily resides in the lower cache hierarchies and thus avoids any impact of the possibly different memory subsystems, we can see that both SoCs’ power consumption is indeed almost identical. Performance is also nearly identical, with a score of 6.84 for the new Snapdragon 888 versus 6.81 for the Snapdragon 865.
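
For a sense of scale, the gap between the two scores can be worked out directly. This is only a quick sketch using the two figures quoted above; the helper name is illustrative.

```python
def percent_difference(new, old):
    """Relative difference of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


score_sd888 = 6.84  # Snapdragon 888, 456.hmmer
score_sd865 = 6.81  # Snapdragon 865, 456.hmmer

delta = percent_difference(score_sd888, score_sd865)
print(f"Snapdragon 888 vs 865 on 456.hmmer: {delta:+.2f}%")
```

The difference comes out at well under one percent, so for practical purposes the two A55 clusters perform the same.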

So at least at first glance, our theory that Samsung’s 5LPE merely catches up with the power consumption and power efficiency of TSMC’s N7/N7P nodes seems to be valid, at least at these frequencies.

Further interesting data are the voltage curves of the CPUs on the Exynos 2100. I’ve extracted the frequency-voltage tables of both my devices, a regular S21 and an S21 Ultra, with the above curves being those of the better-binned chip inside the smaller S21.

Generationally, Samsung seems to have been able to greatly reduce voltages. The Cortex-A55 cores now only require 800mV at 2GHz, whilst those of last year’s Exynos 990 required over 1050mV in our review unit. Similarly, although the comparison isn’t apples-to-apples, the Cortex-A78 cores at 2.5GHz only require 862mV, while the Cortex-A76 cores of the previous generation also required 1050mV.
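
To give a rough idea of what such a voltage drop could mean for power, the sketch below applies the textbook CMOS approximation that dynamic power scales with C·V²·f. It assumes equal switched capacitance and equal frequency on both sides and ignores leakage. None of that is measured here, and the A78 and A76 are different designs, so treat the second figure as illustrative only.

```python
def dynamic_power_ratio(v_new_mv, v_old_mv, f_new_ghz=1.0, f_old_ghz=1.0):
    """Ratio of new to old dynamic power under P ~ C * V^2 * f.

    Assumes identical switched capacitance C on both sides (a simplification).
    """
    return (v_new_mv / v_old_mv) ** 2 * (f_new_ghz / f_old_ghz)


# Cortex-A55 at 2GHz: 800mV (Exynos 2100) versus 1050mV (Exynos 990).
a55 = dynamic_power_ratio(800, 1050)

# Cortex-A78 at 862mV versus Cortex-A76 at 1050mV; different cores, so the
# equal-capacitance assumption is especially loose here.
a78_vs_a76 = dynamic_power_ratio(862, 1050)

print(f"A55 dynamic power ratio:        {a55:.2f}")
print(f"A78 vs A76 dynamic power ratio: {a78_vs_a76:.2f}")
```

Since the Exynos 990 figure was "over 1050mV", the A55 ratio of about 0.58 is an upper bound under this simple model.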

What’s also very interesting to see is the voltage curves of the Cortex-X1 cores versus the Cortex-A78 cores: they’re both nigh identical to each other, which actually lines up with Arm’s claims that the new X1 cores have the same frequency capabilities as the A78 cores, only being larger and increasing their power consumption linearly in relation to their frequency.

Samsung’s frequency tables indicate that they had been testing the A55 up to 2.6GHz, and the X1 and A78 cores up to 3.2GHz. However, voltages here are considerably higher, and it’s also likely SLSI wouldn’t have been able to achieve similar chip yields.

Unfortunately, I wasn’t able to extract data from my Snapdragon 888 S21 Ultra, so I can’t tell exactly where it falls in terms of voltages compared to the Exynos 2100. One thing I can confirm as being quite different between the two SoCs is that Samsung does actually give the Exynos 2100’s Cortex-X1 core its own dedicated voltage rail and PMIC regulator, while the Snapdragon 888 shares the same voltage rail across the X1 and A78 cores. In theory, that could mean that in more mixed-thread workloads, the Exynos has the opportunity to be more power efficient than the Snapdragon 888.
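
The rail-sharing point can be illustrated with a toy model. The voltage table below is invented purely for illustration and is not extracted from either SoC. The sketch also assumes that cores on a shared rail can still run at independent frequencies while receiving the highest voltage any of them needs.

```python
# Hypothetical voltage requirements in mV per core type and frequency (GHz).
# These numbers are made up for illustration; they are not measured values.
VF_TABLE = {
    "X1": {1.0: 600, 2.0: 750, 2.9: 1000},
    "A78": {1.0: 600, 2.0: 750, 2.8: 950},
}


def rail_voltages(requests, shared):
    """Map each core to the voltage it is supplied with.

    `requests` maps core name -> requested frequency in GHz.
    On a shared rail every core gets the highest voltage any core needs.
    """
    needed = {core: VF_TABLE[core][freq] for core, freq in requests.items()}
    if shared:
        top = max(needed.values())
        return {core: top for core in needed}
    return needed


# X1 running flat out while the A78 cores idle along at a low frequency.
requests = {"X1": 2.9, "A78": 1.0}
print("shared rail:    ", rail_voltages(requests, shared=True))
print("dedicated rails:", rail_voltages(requests, shared=False))
```

In this toy case the shared rail forces the lightly loaded A78 cores to the X1's voltage, which is where a dedicated rail could, in theory, save power.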

Generally, the one thing I want people to take away here is that although Samsung calls this their 5nm node, it’s quite certain that it will not perform the same as TSMC’s 5nm node. Usually we don’t care about density all too much, however performance and power efficiency are critical aspects that affect the silicon and the end products’ user experience.


122 Comments

  • mohamad.zand - Thursday, June 17, 2021 - link

    Hi , thank you for your explanation
    Do you know how many transistors Snapdragon 888 and Exynos 2100 are?
    It is not written anywhere
  • Spunjji - Thursday, February 11, 2021 - link

    I'm not an expert by any means, but I think Samsung's biggest problem was always optimisation - they use lots of die area for computing resources but the memory interfaces aren't optimised well enough to feed the beast, and they kept trying to push clocks higher to compensate.

    The handy car analogy would be:
    Samsung - Dodge Viper. More cubes! More noise! More fuel! Grrr.
    Qualcomm / ARM - Honda Civic. Gets you there. Efficient and compact.
    Apple - Bugatti Veyron. Big engine, but well-engineered. Everything absolutely *sings*.
  • Shorty_ - Monday, February 15, 2021 - link

    you're right but you also don't really touch why Apple can do that and X86 designs can't. The issue is that uOP decoding on x86 is *awfully* slow and inefficient on power.

    This was explained to me as follows:

    Variable-length instructions are an utter nightmare to work with. I'll try to explain with regular words how a decoder handles variable length. Here's all the instructions coming in:

    x86: addmatrixdogchewspout
    ARM: dogcatputnetgotfin

    Now, ARM is fixed length (3-letters only), so if I'm decoding them, I just add a space between every 3 letters.
    ARM: dogcatputnetgotfin
    ARM decoded: dog cat put net got fin

    done. Now I can re-order them in a huge buffer, avoid dependencies, and fill my execution ports on the backend.

    x86 is variable length. This means I cannot reliably figure out where the spaces should go, so I have to try all of them and then throw out what doesn't work.
    Look at how much more work there is to do.

    x86: addmatrixdogchewspout
    reading frame 1 (n=3): addmatrixdogchewspout
    Partially decoded ops: add, , dog, , ,
    reading frame 2 (n=4): matrixchewspout
    Partially decoded ops: add, ,dog, chew, ,
    reading frame 3 (n=5): matrixspout
    Partially decoded ops: add, ,dog, chew, spout,
    reading frame 4 (n=6): matrix
    Partially decoded ops: add, matrix, dog, chew, spout,
    Fully Expanded Micro Ops: add, ma1, ma2, ma3, ma4, dog, ch1, ch2, ch3, sp1, sp2, sp3

    This is why most x86 cores only have a 3-4 wide frontend. Those decoders are massive, and extremely energy intensive. They cost a decent bit of transistor budget and a lot of thermal budget even at idle. And they have to process all the different lengths and then unpack them, like I showed above with "regular" words. They have excellent throughput because they expand instructions into a ton of micro-ops... BUT that expansion is inconsistent, and hilariously inefficient.

    This is why x86/64 cores require SMT for the best overall throughput -- the timing differences create plenty of room for other stuff to be executed while waiting on large instructions to expand. And with this example... we only stepped up to 6-byte instructions. x86 is 1-15 bytes so imagine how much longer the example would have been.

    Apple doesn't bother with SMT on their ARM core design, and instead goes for a massive reorder buffer, and only presents a single logical core to the programmer, because their 8-wide design can efficiently unpack instructions, and fit them in a massive 630μop reorder buffer, and fill the backend easily achieving high occupancy, even at low clock speeds. Effectively, a reorder buffer, if it's big enough, is better than SMT, because SMT requires programmer awareness / programmer effort, and not everything is parallelizable.
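
The word analogy above can be sketched in code. This is a toy model only: the `KNOWN_OPS` set is a hypothetical "instruction set" built from the words in the example, and the variable-length decoder here walks the stream sequentially rather than trying every reading frame in parallel as real hardware would. What it does show is that each boundary depends on the length of the previous op.

```python
def decode_fixed(stream, width=3):
    """Fixed-length decode: every boundary is known up front."""
    return [stream[i:i + width] for i in range(0, len(stream), width)]


# Hypothetical "instruction set" for the variable-length toy stream.
KNOWN_OPS = {"add", "matrix", "dog", "chew", "spout"}


def decode_variable(stream, known_ops=KNOWN_OPS):
    """Variable-length decode: each boundary depends on the previous op's length."""
    ops, pos = [], 0
    lengths = sorted({len(op) for op in known_ops})
    while pos < len(stream):
        # Try each candidate length until one yields a valid op.
        for n in lengths:
            if stream[pos:pos + n] in known_ops:
                ops.append(stream[pos:pos + n])
                pos += n
                break
        else:
            raise ValueError(f"no valid op at offset {pos}")
    return ops


print(decode_fixed("dogcatputnetgotfin"))
print(decode_variable("addmatrixdogchewspout"))
```

The fixed-length version can split all positions independently, while the variable-length version cannot know where op N+1 starts until op N has been sized.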
  • Karim Braija - Saturday, February 20, 2021 - link

    I'm not sure the SPECint2006 benchmark is really reliable. On top of that, it has been around for a long time and I don't think it is very dependable anymore, given that these are new, powerful processors. So I think it isn't very reliable and doesn't tell us anything precise. I don't think you should trust this benchmark 100%.
  • serendip - Monday, February 8, 2021 - link

    "Looking at all these results, it suddenly makes sense as to why Qualcomm launched another bin/refresh of the Snapdragon 865 in the form of the Snapdragon 870."

    So this means Qualcomm is hedging its bets by having two flagship chips on separate TSMC and Samsung processes? Hopefully the situation will improve once X1 cores get built on TSMC 5nm and there's more experience with integrating X1 + A78. All this also makes SD888 phones a bit pointless if you already have an SD865 device.
  • Bluetooth - Monday, February 8, 2021 - link

    Why would they skimp on the cache? Was the neural engine or something else with higher priority getting the silicon?
  • Kangal - Tuesday, February 9, 2021 - link

    I think Samsung was rushing, and it's usually easier to stamp out something that's smaller (cache takes a lot of silicon estate). Why they rushed was due to a switch from their M-cores to the X-core, and also internalising the 5G-radio.

    Here's the weird part, I actually think this time their Mongoose Cores would be competitive. Unlike Andrei, I estimated the Cortex-X1 was going to be a load of crap, and seems I was right. Having node parity with Qualcomm, the immature implementation that is the X1, and the further refined Mongoose core... it would've meant they would be quite competitive (better/same/worse) but that's not saying much after looking at Apple.

    How do I figure?
    The Mongoose core was a Cortex A57 alternative which was competitive against Cortex A72 cores. So it started as midcore (Cortex A72) and evolved into a highcore implementation as early as 2019 with the S9 when they began to get really wide, really fast, really hot/thirsty. Those are great for a Large Tablet or Ultrabook, but not good properties for a smaller handheld.

    There was a precedent for this, in the overclocked QSD 845 SoCs, 855+, and the subpar QSD 865 implementation. Heck, it goes all the way back to 2016 when MediaTek was designing 2+4+4 core chipsets (and they failed miserably as you would imagine). I think when consumers buy these, companies send orders, fabs design them, etc... they always forget about the software. This is what separates Apple from Qualcomm, and Qualcomm from the rest. You can either brute-force your way to the top, or try to do things more cost/thermal efficiently.
  • Andrei Frumusanu - Tuesday, February 9, 2021 - link

    > Unlike Andrei, I estimated the Cortex-X1 was going to be a load of crap, and seems I was right.

    The X1 *is* great, and far better than Samsung's custom cores.
  • Kangal - Wednesday, February 10, 2021 - link

    First of all, apologies for sounding crass.
    Also, you're a professional in this field, I'm merely an enthusiast (aka Armchair Expert) take what I say with a grain of salt. So if you correct me, I stand corrected.

    Nevertheless, I'm very unimpressed by big cores: Mongoose M5, to a lesser extent the Cortex-X1, and to a much Much much lesser extent the Firestorm. I do not think the X1 is great. Remember, the "middle cores" still haven't hit their limits, so it makes little sense to go even thirstier/hotter. Even if the power and thermal issues weren't so dire with these big-cores, the performance difference between the middle cores vs big cores is negligible, also there are no applications that are optimised for or demand the big cores. Apple's big-core implementation is much more optimised, they're smarter about thermals, and the performance delta between it and the middle-cores is substantial, hence why their implementation works and why it compares favourably to the X1/M5.

    I can see a future for big-cores. Yet, I think it might involve killing the little-cores (A53/A55), and replacing them with general-purpose cores that will be almost as efficient yet be able to perform much better to act as middle-cores. Otherwise latency is always going to be an issue when shifting work from one core to another then another. I suspect the Cortex-X2 will right many wrongs of the X1, and combined with a node jump, it should hopefully be a solid platform. Maybe similar to the 20nm-Cortex A57 versus the 16nm-Cortex A72 evolution we saw back in 2016. The vendors have little freedom when it comes to implementing the X1 cores, and I suspect things will ease up for X2, which could mean operating at reasonable levels.

    So even with the current (and future) drawbacks of big-cores, I think they could be a good addition for several reasons: application-specific optimisations, external dock. We might get a DeX implementation that's native to Android/AOSP, and combine that with an external dock that provides higher power delivery AND adequate active-cooling. I can see that as a boon for content creators and entertainment consumers alike. My eye is on emulation performance, perhaps this brute-force can help stabilise the weak Switch and PS2 emulation currently on Android (WiiU next?).
  • iphonebestgamephone - Monday, February 15, 2021 - link

    The improvements with the 888 in damonps2 and eggns are quite good. Check some vids on youtube.
