Launching the #CPUOverload Project: Testing Every x86 Desktop Processor since 2010
by Dr. Ian Cutress on July 20, 2020 1:30 PM EST

One of the most visited parts of the AnandTech website, aside from the reviews, is our benchmark database, Bench. Over the last decade we've placed in there as much benchmark data as we can for every sample we can get our hands on, with CPUs, GPUs, SSDs, laptops, and smartphones being our key categories. As the Senior CPU Editor here at AnandTech, one of my duties is to maintain the CPU section of Bench, making sure the benchmarks stay relevant and that the newest components are tested so the data is as up to date as possible. Today we are announcing the start of a major Bench project built around our new benchmark suite, with some very lofty goals.
What is Bench?
A number of our regular readers will know Bench. We place a link to it at the top of every page for easy access, although given the depth of content it holds, it remains an understated part of AnandTech. Bench is the centralized database where we place all of the benchmark data we gather for processors, graphics, storage, tablets, laptops, and smartphones. Internally, Bench has many uses, particularly when collating review data to generate our review graphs, rather than manually redrawing full data sets for each review or keeping datasets offline.
But the biggest benefit of Bench is the ability to either compare many products in one benchmark, or compare two products across all of our benchmark tests. For example, here are the first few results of our POV-Ray test.
At its heart, Bench is a comparison tool, and the ability to square two products off side by side can be vital when choosing which one to invest in. Rather than just comparing specifications, Bench provides real-world data, offering independent third-party verification of each data point. In contrast to the benchmarks provided by companies invested in selling you a product, we try to create benchmarks that actually mean something, rather than just listing synthetics.
The goal of Bench has always been retrospective comparison: measuring what the user has against what the user might be looking at purchasing. As a result of a decade of data, that 3-5 year generational gap of benchmark information can become vital to actually quantifying how much of an upgrade a user might get from the CPU alone. It all depends on which products already have benchmark data in the database, and whether the benchmarks are relevant to the user's workflow (web, office, rendering, gaming, workstation tests, and so on).
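As a rough illustration of quantifying that kind of upgrade, here is a minimal Python sketch that summarizes the jump as the geometric mean of per-benchmark score ratios. The scores, test names, and the geometric-mean choice are all illustrative assumptions for this sketch, not Bench's actual method:

```python
# Hypothetical sketch: summarizing an upgrade as the geometric mean
# of per-benchmark score ratios (all scores here are made up).
from math import prod

old_cpu = {"POV-Ray": 1200, "Web": 95, "Rendering": 340}   # older part
new_cpu = {"POV-Ray": 2100, "Web": 140, "Rendering": 610}  # candidate upgrade

ratios = [new_cpu[test] / old_cpu[test] for test in old_cpu]
uplift = prod(ratios) ** (1 / len(ratios))                 # geometric mean

print(f"~{(uplift - 1) * 100:.0f}% average uplift")        # ~67% for these made-up numbers
```

A geometric mean is the usual choice for averaging ratios, since it treats a 2x gain and a 0.5x regression symmetrically, where an arithmetic mean would not.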
Bench: The Beginning
Bench was originally started over a decade ago by AnandTech's founder, Anand. On the CPU side of the database, he worked with both AMD and Intel to obtain a reasonable number of the latest CPUs of the day, and then spent a good summer testing them all. This was back when Core 2 and Athlons were running the market, making for a number of interesting comparisons. The beauty of the Bench database is that all the data from the 30 or so processors Anand tested back then still exists today, covering the core benchmarks of interest to the industry and readership at the time.
With AMD and Intel providing the processors they did, testing every processor became a focal point for the data: it allowed users to search for their exact CPU, compare it to other models in the same family that differ on price, or compare the one they already have to a more modern component they were thinking of buying.
As the years have progressed, Bench has been updated with all the review samples we could obtain and find time to put through the benchmarks. When a new product family is launched, however, we rarely get to test every model - official sampling rarely goes beyond one or two of the high-end products, or, if we were lucky, maybe a few more. While we've never been able to test full processor stacks from top to bottom, we have typically been able to cover the highlights of a product range, which has still allowed users to perform general comparisons with the data, particularly those looking to upgrade their three-year-old components.
Two main factors have always inhibited the expansion of Bench.
Bench Problem #1: Actually Getting The Hardware
First, the act of sourcing the components can be a barrier to obtaining benchmark data: if we do not have the product, we cannot run the benchmarks! Intel and AMD (and VIA, back in the day) have had different structures for sampling their products, depending on how much they want to say, the release time frame, and the state of the market. Other factors can include the importance of certain processors to a company's financials, or the level of the relationship between us and the manufacturer. Intel and AMD will only work with review websites in any depth if the analysis is fair, and our readers (that's you) will only read the data if the analysis is unbiased as well.
When it comes down to basic media sampling strategies, companies typically take one of two routes. The technology industry runs on public relations (PR), and most companies will have an internal PR department while also outsourcing local PR to agencies that specialize in a given region. Depending on the product, sampling can occur either directly from the manufacturer or via the local PR team, and the sampling strategy will be pre-determined at a much higher level: how many media websites are to be sampled, how many samples will be distributed to each region, and so on. For example, if a product is going to be sampled via local PR only, there might be only 3-5 units for 15+ technology media outlets, requiring that the samples be passed around once they have been tested. Some big launches, or outlets with a strong relationship with the manufacturer, will be managed by the company's internal global PR team, where samples are provided in perpetuity: essentially long-term loans (which could be recalled).
For the x86 processor manufacturers, Intel and AMD are the players we work with. Of late, Intel's official media sampling policy provides the main high-end processor in advance of release, such as the i7-4770K or the i7-6700K. On rare occasions, one of the parts lower down the stack is provided at the same time, or made available for sampling after the launch date. For example, with the latest Comet Lake launch, we were sampled both the i9-10900K and the i5-10600K; however, these are both high-impact overclockable CPUs. This typically means that if there's an interesting processor further down the stack, such as an i3-K or a low-cost Pentium, then we have to work with other partners to get a sample (such as motherboard manufacturers, system integrators, or OEMs), or outright purchase it ourselves.
As for AMD's processors, as demonstrated over the last 4-5 years, the company does not often release a full family stack of CPUs at one time. Instead, processors are launched in batches, with AMD choosing to do two or three every few months. For example, AMD initially launched Ryzen with the three Ryzen 7 processors, followed by four Ryzen 5 processors a few weeks later, and finally two Ryzen 3 parts. With the past few generations from AMD, depending on how many processors are in the final stack, AnandTech has usually been sampled most of them - with 1st Gen Ryzen, we were sampled all of them. Previously, with the Richland and Trinity processors, only around half the stack was initially offered for review, with less chance of being sampled the lower-value parts, and some parts were offered through local PR teams a couple of months after launch. Even today, AMD still launches OEM parts for specific regions - it tends not to sample those to the press either, especially if the press are not in the region for that product.
With certain processors, manufacturers target media organizations that prioritize different elements of testing, which leads to an imbalance in which media get which CPUs. Most manufacturers will rate the media outlets they work with into tiers, with the top-tier ones getting earlier sampling or more access to the components. The reason for this is that if a company sampled everyone with everything every time, suddenly 5000 media outlets (and anyone who wants to start a component testing blog) would end up with 10-25 products on their doorstep every year, and it would be a mammoth task to organize (for little gain from the outlets with fewer readers).
The concept of tiering is not new - it depends on a media outlet's readership reach, its demographic, and its ability to understand the nuance of what is in its hands. AMD and Intel can't sample everyone with everything, and sometimes they have specific markets to target, which will also shift the focus of who gets which samples. A website focused on fanless HTPCs, for example, would not be a preferred sampling vector for workstation-class processors. At AnandTech, we cover a broad range of topics, have educated readers, and have been working with Intel and AMD for twenty years. On the whole, we generally do well when it comes to processor sampling, although there are still limits - going out and asking for a stack of next-generation Xeon Gold CPUs is unlikely to be as simple as overnight shipping.
Bench Problem #2: The March of Time
The second problem with the benchmark database is timing and benchmarks. This comes down to manpower - how many people are running the benchmarks - and the timeframe for which the benchmarks we run remain relevant to the segments of our readers interested in the hardware.
Take graphics card testing, for example: GPU drivers change monthly, and games are updated every few months (and the games people are playing also change). Keeping a healthy set of benchmark data requires retesting 5 graphics cards per GPU vendor generation, across 4-5 generations of GPU launches, from 3-4 different board partners, on 6-10 games every month at three different resolutions/settings per game (and testing each combination enough times to be statistically sound). That takes time, significant effort, and manpower, and I'm amazed Ryan has been able to do so much in the little time he has as Editor-in-Chief. Picking the highest numbers out of those ranges gives us 5 (GPUs) x 2 (vendors) x 5 (generations) x 4 (board partners) x 10 (games) x 3 (resolutions) x 4 (runs for statistical significance) results, which comes to 24,000 benchmark runs out of the door each month, in an ideal scenario. You could be halfway through when someone issues a driver update, rendering the rest of the data moot. It's not happening overnight, and arguably that could be work for at least one full-time employee, if not two.
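For anyone checking that arithmetic, a minimal Python sketch of the count, with the factor labels mirroring the upper bounds quoted above:

```python
# Back-of-the-envelope count of monthly GPU benchmark runs,
# using the upper bounds quoted in the text.
from math import prod

runs = prod([
    5,   # graphics cards per vendor generation
    2,   # GPU vendors
    5,   # generations of GPU launches
    4,   # board partners
    10,  # games
    3,   # resolutions/settings per game
    4,   # repeats for statistical significance
])

print(runs)  # 24000
```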
On the CPU side of the equation, the march of time is a little slower. While the number of CPUs to test can be higher (100+ consumer parts in the last few generations), the number of degrees of freedom is smaller, and our CPU benchmark refresh cycles can be longer. These parameters depend on OS updates and drivers, as with GPU testing, but it means that some benchmarks can still be relevant several years later on the same operating system base. 30-year-old legacy Fortran code still in use is likely to remain 30-year-old legacy Fortran code in the near future. Even benchmarks like CineBench R15 are still quoted today, despite the Cinema4D software on which it is based being several generations newer. CPU testing ultimately ends up limited by the gaming tests, and depends on which modern GPUs are used, what games are being tested, what resolutions are relevant, and when new benchmarks enter the fray.
When Ryan retests a GPU, he has a fixed OS and a system ready to go; he updates the drivers and puts the GPU back into the slot. Preparing a new CPU platform for new benchmarks, by contrast, means rebuilding the full system, reinstalling the OS, reinstalling the benchmark suite, and then testing. Still, with the right combination of hardware and tests, a good set of data can last 18 months or so without significant updates. The danger comes with a full benchmark refresh, which especially revolves around moving to a newer operating system. Because OS updates and scheduler changes affect results on the new operating system, none of the old data can be compared, and the full set of hardware has to be retested on the new OS with an updated benchmark suite.
With our new CPU Overload project (stylized as #CPUOverload in our article titles, because social media is cool?), the aim is to get around both of these major drawbacks.
What is #CPUOverload?
The seeds of this project were initially sown back in 2016. Despite having added our benchmark data to Bench for several years, I had always known our benchmark database was a popular tool, but I didn't really realize how much it was used - or, more precisely, how under-optimized it was - until I was recently given access to dig around in our back-end data.
Everyone shopping for a processor wants to know how good the one they're interested in is, and how much of a jump in performance they'll get from their old part. Reading reviews is all well and good, but for reasons of style and applicability, only a few processors are directly compared against a specific part in any given review; otherwise the review could be a hundred pages long. There have been many times where Ryan has asked me to scale back from 30,000 data points in a review!
It is also worth noting that reviews are often not updated with newer processor data, as that would create a factual disconnect with the written analysis underneath.
This is why Bench exists. We often link to Bench in each review and invite users to go there to compare other processors, or for legacy benchmarks and benchmark breakdowns that are not in the main review.
But for #CPUOverload, with the ongoing march of Windows 10 and the special features therein (such as Speed Shift support on Intel processors, new scheduler updates for ACPI 6.2, and the driver model to support DX12), it was getting to be time for us to update our CPU test suite. Our recent reviews were mostly criticized for still using older hardware, namely the GTX 1080s that I was able to procure, along with having some tests that didn't always scale with the CPU. (It is worth noting that alongside sourcing CPUs for testing, sourcing GPUs is somewhat harder - asking a vendor or the GPU manufacturer for two or three or more of the same GPU without a direct review is a tough ask.) The other angle is that in any given month I will get additional requests for specific CPU tests - users today would prefer to see their own workloads in action for comparison, rather than general synthetics, for obvious reasons.
There is also a personal concern about the user experience of Bench itself, which has not aged well since our last website layout update in 2013.
In all, the aims of CPU Overload are:
- Source all CPUs. Focus on ones that people actually use
- Retest CPUs on Windows 10 with new CPU tests
- Retest CPUs on Windows 10 with new Gaming tests
- Update the Bench interface
For the #CPUOverload project, we are testing under Windows 10 with a variety of new tests, including AI and SPEC, new gaming tests on the latest GPUs, and more relevant real-world benchmarks. But the heart of CPU Overload is this:
110 Comments
Smell This - Monday, July 20, 2020 - link
;- )
Oxford Guy - Monday, July 20, 2020 - link
"If there’s a CPU, old or new, you want to see tested, then please drop a comment below."• i7-3820. This one is especially interesting because it had roughly the same number of transistors as Piledriver on roughly the same node (Intel 32nm vs. GF 32 nm).
• 5775C
• 5675C (which matched or even outperformed the 5775C in some games due to thermal throttling)
• 5775C with TDP bypassed or increased if this is possible, to avoid the aforementioned throttling
• I would really, really like you to add Deserts of Kharak to your games test suite. It is the only game I know of that showed Piledriver beating Intel's chips. That unusual performance suggests it was possible to get more performance out of Piledriver if developers targeted that CPU for optimization, and/or the game's engine simply suited it particularly well.
• 8320E or 8370E at 4.7 GHz (non-turbo) with 2133 CAS 9-11-10 RAM, the optimal Piledriver setup. The 9590 was not the most performant of the FX line, likely because of the turbo. A straight overclock coupled with tuned RAM (not 1600 CAS 10 nonsense) makes a difference. 4.7 GHz is a realistic speed achievable with a large AIO or a small loop. If you want air cooling only, then drop to 4.5 GHz but keep the fast RAM. The point of testing this is to see what people were able to get in the real world from the AMD alternative during all the years they had to wait for Zen. Since we were stuck with Piledriver as the most performant alternative to Intel for so many years, it's worth including for historical context. The "E" models don't have to be used, but their lower leakage makes higher clocks less stressful on cooling than a 9000-series. 4.7 GHz was obtainable on a cheap motherboard like the Gigabyte UD3P, with strong airflow over the VRM sink.
• VIA's highest-performance model. If it won't work with Windows 10, then run the tests on it with 8.1. The thing is, though... VIA released an update fairly recently that should make it compatible with Windows 10. I saw YouTube footage of it gaming, in fact, with a discrete card. It really would be a refreshing thing to see VIA included, even though it's such a bit player.
• Lynnfield at 3 GHz.
• i7-9700K, of course.
Oxford Guy - Monday, July 20, 2020 - link
Regarding Deserts of Kharak... It may be that it took advantage of the extra cores. That would make it noteworthy also as an early example of a game that scaled to 8 threads.

Oxford Guy - Monday, July 20, 2020 - link
Also, the Chinese x86 CPU, the one based on Zen 1, would be very nice to have included.

Oxford Guy - Monday, July 20, 2020 - link
VIA CPUs tested with games as recently as 2019 (there was another video of the quad core, but I didn't find it today with a quick search):

https://www.youtube.com/watch?v=JPvKwqSMo-k
https://www.youtube.com/watch?v=Da0BkEW459E
The Zhaoxin KaiXian KX-U6880A would be nice to see included, not just the Chinese Zen 1 derivative.
Oxford Guy - Monday, July 20, 2020 - link
"due to thermal throttling"TDP throttling, to be more accurate. I suppose it could throttle due to current demand rather than temp.
axer1234 - Monday, July 20, 2020 - link
Honestly, I would love to know how different generations of processors perform today, especially the higher core counts: Prescott-era Pentium 4, Athlon II, Phenom X6, Core 2 Duo, Core 2 Quad, Nehalem, Sandy Bridge, Bulldozer, etc., with today's generation of workloads and offerings. In many scenarios, like Word, Excel, PowerPoint, and Photoshop, it all still works very well in many offices.
It's just the newer generation of applications slowing things down for almost the same work.
herefortheflops - Monday, July 20, 2020 - link
@Dr. Cutress,

As someone that has been dealing with similar or greater product-testing challenges and configuration complexity for the better part of a decade, I would like to commend you for your ambitious goals and efforts so far. Additionally, I could be of high value to your effort if you are willing to discuss. I have reviewed the Bench database in depth (as well as competing websites), and I have come to the conclusion that the AnandTech Bench data is of very limited usefulness at present, and would require some significant changes to the data being collected/reported and to the way things have been done up to this point. I do understand where the industry is going, the questions readers are going to be asking of the data, and the major comparisons that will be attempted with the data. Unfortunately, much of your effort may easily become irrelevant unless you proceed with some extreme caution to provide data with more utility. I also know methods to accomplish the desired result while reducing the size and cost of the task at hand. Reply by e-mail if you are interested in talking.
Best,
-A potential contributor to your effort.
Bensam123 - Tuesday, July 21, 2020 - link
Despite how impressive this is, one thing that hasn't been tackled is multiplayer performance, and it vastly changes recommendations for CPUs (it doesn't affect GPUs as much). It goes from recommending a 6-core chip hands down to trying to make a case for 4-core chips, still, in this day and age. I own a 3900x and 2800, and I can tell you hands down that Modern Warfare will gobble 70% of that 12-core chip, sometimes a bit more; that's equivalent to maxing out an 8-core of the same series. That vastly changes recommendations and data points. And it's not just Modern Warfare: Overwatch, Black Ops 3 (same engine as MW), and recently Hyper Scape will make use of those extra cores. I have a widget to monitor CPU utilization in the background, and I can check Task Manager. If I had a better video card, I'm positive it would suck down even more of those 12 cores (my GPU is running at 100% load according to MSI AB).
This is a huge deal, and while I understand - I get it - it's hard to reliably reproduce the same results in a multiplayer environment because it changes so much, and it's generally seen as taboo from a hardware benchmarking standpoint, it is vastly different than singleplayer workloads, to the point at which it requires completely different recommendations. Given how many people are making expensive hardware choices specifically because they play multiplayer games, I would even say most tech reviews in this day and age are irrelevant for CPU recommendations outside of the casual single-player gamer. GPU recommendations are still very much on par; CPU is not remotely.
I talk about this frequently on my stream, and it's why I still recommended the 1600 AF even when it was sitting at $105-125 - it's a steal if you play multiplayer games - while most people that either read benchmarking websites or run benchmarks themselves will start making a case for a 4-core Intel. 6 cores is a must at the very least in this day and age.
AnandTech, it's time to tread new ground and go into the uncharted area. Singleplayer results and multiplayer results are too different; you can't keep spinning the wheel and expect things to remain the same. You can verify this yourself just by running Task Manager in the background while playing one of the games I mentioned at the lowest settings. Regardless of being able to repeat those results exactly, you'll see it's definitely a multi-core landscape for newer multiplayer games.
Not even touched on in the article.
Bensam123 - Tuesday, July 21, 2020 - link
For clarification, the 70% figure is with SMT off.