CPU Performance

For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
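As a rough illustration of how a differential like this can be derived from raw scores, here is a minimal sketch. The function name `relative_performance` and its `higher_is_better` flag are hypothetical, and the article does not state how time-based and score-based tests were normalised, so treating the two cases this way is an assumption.

```python
def relative_performance(smt_off: float, smt_on: float,
                         higher_is_better: bool = True) -> float:
    """Return the SMT-on result as a percentage of the SMT-off baseline.

    Assumption: score-style results (higher is better) use on/off,
    while time-style results (lower is better) use off/on, so that
    100% always means "same as baseline" and >100% means "faster".
    """
    if smt_off <= 0 or smt_on <= 0:
        raise ValueError("results must be positive")
    ratio = smt_on / smt_off if higher_is_better else smt_off / smt_on
    return round(ratio * 100.0, 1)


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not measured data.
    print(relative_performance(200.0, 250.0))                          # score-style test
    print(relative_performance(100.0, 80.0, higher_is_better=False))   # time-style test
```

With this convention a value below 100% means enabling SMT2 made the test slower, which is how the tables that follow should be read.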

Here are the single threaded results.

Single Threaded Tests: AMD Ryzen 9 5950X

AnandTech            SMT Off (Baseline)    SMT On
y-Cruncher           100%                  99.5%
Dwarf Fortress       100%                  99.9%
Dolphin 5.0          100%                  99.1%
CineBench R20        100%                  99.7%
Web Tests            100%                  99.1%
GeekBench (4+5)      100%                  100.8%
SPEC2006             100%                  101.2%
SPEC2017             100%                  99.2%

Interestingly enough, our single threaded performance was within roughly one percentage point across the stack (SPEC2006 being the largest deviation at +1.2%). Given that ST mode should arguably give more resources to each thread for consistency, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests: AMD Ryzen 9 5950X

AnandTech               SMT Off (Baseline)    SMT On
Agisoft Photoscan       100%                  98.2%
3D Particle Movement    100%                  165.7%
3DPM with AVX2          100%                  177.5%
y-Cruncher              100%                  94.5%
NAMD AVX2               100%                  106.6%
AIBench                 100%                  88.2%
Blender                 100%                  125.1%
Corona                  100%                  145.5%
POV-Ray                 100%                  115.4%
V-Ray                   100%                  126.0%
CineBench R20           100%                  118.6%
HandBrake 4K HEVC       100%                  107.9%
7-Zip Combined          100%                  133.9%
AES Crypto              100%                  104.9%
WinRAR                  100%                  111.9%
GeekBench (4+5)         100%                  109.3%

Here we have a number of different factors affecting the results.

We start with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a math synthetic benchmark, and AIBench is still an early-beta set of AI workloads for Windows, so both are quite far away from real world use cases.

Most of the rest of the benchmarks gain between roughly +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that stop both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, etc. Each benchmark is likely different.

The outliers are the two 3DPM tests (standard and AVX2) and Corona. All three gain 45% or more, with 3DPM at +66% and 3DPM with AVX2 at +77%. These tests are very light on cache and memory requirements, and put the increased Zen3 execution port distribution to good use. They are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
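The three groups described above can be reproduced from the multi-threaded table with a short sketch. The numbers are the SMT On column from that table; the cut-offs (-5% for a regression, +45% for an outlier) and the names `classify` and `group_results` are assumptions picked to match the grouping in the text, not thresholds stated in the testing.

```python
# SMT-on results from the multi-threaded table (SMT off = 100%).
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}


def classify(percent, regression_cutoff=-5.0, outlier_cutoff=45.0):
    """Bucket one result by its delta from the 100% baseline.

    The cut-offs are assumed values, not figures from the article.
    """
    delta = percent - 100.0
    if delta <= regression_cutoff:
        return "regression"
    if delta >= outlier_cutoff:
        return "outlier"
    return "typical"


def group_results(results):
    """Return benchmark names grouped by bucket, in table order."""
    groups = {"regression": [], "typical": [], "outlier": []}
    for name, percent in results.items():
        groups[classify(percent)].append(name)
    return groups


if __name__ == "__main__":
    for label, names in group_results(MT_RESULTS).items():
        print(f"{label}: {', '.join(names)}")
```

Note that with these assumed cut-offs, Agisoft Photoscan (-1.8%) and AES Crypto (+4.9%) land in the "typical" bucket even though they sit just outside the +5% to +35% band quoted in the text.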

Overall

In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.

Most real-world workloads see an uplift, averaging 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
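For readers who want to summarise the multi-threaded table themselves, here is a minimal sketch that computes the arithmetic and geometric means of the SMT On column. The article does not say which tests or which averaging method produced its quoted average, so this is not expected to reproduce that figure exactly; the function name `summarise` is hypothetical.

```python
from statistics import geometric_mean, mean

# SMT-on results from the multi-threaded table (SMT off = 100%).
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}


def summarise(results):
    """Return (arithmetic mean, geometric mean) of the percentages.

    The geometric mean is the usual choice for averaging ratios,
    because one very large gain distorts it less.
    """
    values = list(results.values())
    return mean(values), geometric_mean(values)


if __name__ == "__main__":
    arithmetic, geometric = summarise(MT_RESULTS)
    print(f"Arithmetic mean uplift: {arithmetic - 100:+.1f}%")
    print(f"Geometric mean uplift:  {geometric - 100:+.1f}%")
```

Taken over all 16 rows, the arithmetic mean comes out a little above +20%, and the geometric mean is lower still, so the choice of subset and method clearly matters for a headline number.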


126 Comments


  • quadibloc - Friday, December 4, 2020

    On the one hand, if one program does a lot of integer calculations, and the other does a lot of floating-point calculations, putting them on the same core would seem to make sense because they're using different execution resources. On the other hand, if you use two threads from the same program on the same core, then you may have less contention for that core's cache memory.
  • linuxgeex - Thursday, December 3, 2020

    One of the key ways in which SMT helps keep the execution resources of the core active, is cache misses. When one thread is waiting 100+ clocks on reads from DRAM, the other thread can happily keep running out of cache. Of course, this is a two-edged sword. Since the second thread is consuming cache, the first thread was more likely to have a cache miss. So for the very best per-thread latency you're best off to disable SMT. For the very best throughput, you're best off to enable SMT. It all depends on your workload. SMT gives you that flexibility. A core without SMT lacks that flexibility.
  • Dahak - Thursday, December 3, 2020

    Now I only glanced through the article, but would it make more sense to use a lower core count cpu to see the benefits of SMT as using a higher core count might mean it will use the real core over the smt cores?
  • Holliday75 - Thursday, December 3, 2020

    Starting to think the entire point of the article was this subject is so damn complicated and hard to quantify for the average user that there is no point in trying unless you are running workloads in a lab environment to find the best possible outcome for the work you plan on doing. Who is going to bother doing that unless you work for a company where it makes sense to do so?
  • 29a - Thursday, December 3, 2020

    I wonder what the charts would look like with a dual core SMT processor. I think the game tests would have a good chance of changing. I'm a little surprised the author didn't think to test CPUs with more limited resources.
  • Klaas - Thursday, December 3, 2020

    Nice article. And nice benchmarks.

    Too bad you could not include more real-life tests.
    I think that a processor with far fewer threads (like a 5600X)
    is easier to saturate with real-life software that actually uses threads.
    The cache size and memory channel bandwidth per core is quite different
    for those 'smaller' processors. That will probably result in different benchmark results...
    So it would be interesting to see what those processors will do, SMT ON versus SMT OFF.
    I don't think the end result will be different, but it could even be a bigger victory for SMT ON.

    Another interesting area is virtualization.
    And as already mentioned in more comments it is very important that the Operating Systems
    will assign threads to the right core or SMT-Core combinations.
    That is even more important in virtualization situations...
  • MDD1963 - Thursday, December 3, 2020

    Determining the usefulness of SMT with 16 cores on tap is not quite as relevant as when this experiment might be done with, say, a 5600X or 5800X....; naturally 16 cores without SMT might still be plenty (as even 8 non SMT cores on the 9700K proved)
  • thejaredhuang - Thursday, December 3, 2020

    This would be a better test on a CPU that doesn't have 16 base cores. If you could try it on a 4C/8T part I think the difference would be more pronounced.
  • dfstar - Thursday, December 3, 2020

    The basic benefit of SMT is it allows the processor to hide the impact of long latency instructions on average IPC, since it can switch to a new thread and execute those instructions. In this way it is similar to OOO (which leverages speculative execution to do the same) and also more flexible than fine-grained multi-threading. There is an overhead and cost (area/power) due to the duplicated structures in the core that will impact the perf/watt of pure single-threaded workloads, I don't think disabling SMT removes all this impact ...
  • GreenReaper - Thursday, December 3, 2020

    Perhaps not. But at the same time, it is likely that any non-SMT chip that has a SMT variant actually *is* a SMT chip, it is just disabled in firmware - either because it is broken on that chip, or because the non-SMT variant sold better.
