Storage Benchmark Considerations

Storage benchmarks can be classified into three main categories, each of which serves a different purpose.

Application Benchmarks

Application benchmarks involve real programs and measures performance as perceived by the user. As compared to other benchmark types, a well-designed application benchmark provides the best assessment of overall system performance.  In principle, there's nothing really storage-specific here: you can pick any one component to swap out while holding the rest of the system configuration constant, and end up with a valid test.

An application benchmark can be as simple as performing a common task (eg. booting the OS, launching a game) while using a stopwatch. But that kind of testing is labor-intensive and error-prone. A good suite of automated application benchmarks is a lot of work to assemble. Well-known application benchmark suites include UL's PCMark and BapCo's SYSmark, which represent relatively light desktop usage. The tests don't occupy much disk space and don't perform all that much IO in aggregate. They are also almost completely devoid of the kinds of bursts of intense IO that are where fast storage pays off. As a result, these tests show very little difference in overall scores between the fastest and slowest storage configurations. This is an accurate assessment, but only for the kinds of storage-light workloads represented by these tests.

Additionally, application benchmarks have a fundamental limitation: they tell you almost nothing about why one drive ends up performing better or worse than another. This makes it impossible to reliably extrapolate application benchmark scores to dissimilar workloads. PCMark and SYSmark scores can't help you predict how your system will perform when you start launching lots of virtual machines and filling up the drive.

IO Traces

Storage IO traces are a live recording of the read and write operations performed by a real-world workload. Every read and write operation is logged along with its timestamp, the size of the data transfer and the disk sectors involved. This can add up to quite a bit of data, even when the contents of the data transfers are not logged (ie. the trace only specifies that a 1MB write was performed, but doesn't actually save a copy of that MB of data). IO traces can be analyzed in depth after the fact to see what kind of queue depths were involved, how much IO was sequential vs random, etc. Statistics can also be computed for a subset of the IO operations logged, such as analyzing read operations and write operations separately.

Recorded IO traces can also be played back, replicating the original IO patterns in a highly repeatable manner, and without needing to re-run the applications that originally generated the IO workload. This takes CPU, RAM, and GPU speed largely out of the equation, so it's easier to compare storage performance across different machines. Benchmarks built around storage trace playback focus specifically on storage performance rather than overall system performance/responsiveness, so their scores can exaggerate the performance differences between hardware configurations, as compared to using metrics that measure overall system performance.

Since trace-based storage benchmarks allow a much greater depth of analysis than application benchmark scores, they can provide insight into what specific performance characteristics enable one drive to provide better application-level performance than another.

Synthetic Benchmarks

Lastly, synthetic benchmarks for storage consist of IO patterns generated and measured by special-purpose tools: for example, using CrystalDiskMark to measure 4kB random read performance at a particular queue depth. In one sense, these test are the simplest to describe and understand (at least if the test configuration is relatively simple). However, their relevance to the real world is not so simple. It's easy to configure a synthetic benchmark that is reliable, repeatable, and says nothing useful about real-world performance. This is done deliberately by PR departments, but can also happen accidentally. And changes in SSD technology can make formerly-useful synthetic tests invalid for use on newer drives. On the other hand, sometimes we can use deliberately unrealistic synthetic tests to infer things about the internal workings of a SSD. It's also theoretically possible to configure a synthetic test to generate a workload that statistically resembles real-world IO patterns, with a varied mix of IO sizes, queue depths, and idle times—but it's usually easier to use a trace-based benchmark for this purpose.

Operating Systems

At least until Microsoft delivers DirectStorage for Windows, Linux is the best OS choice for pushing the performance limits of storage hardware with low software overhead. But more importantly, Linux offers much greater transparency and control over hardware, which is what allows us to perform highly automated testing including trying out all the different power management options supported (or not) by SSDs. The synthetic benchmarks in this test suite are all conducted using the Flexible I/O tester (Fio) version 3.25 running on Linux. This tool is very deserving of its name: the dizzying array of options allows it to be used to test all kinds of IO patterns, and this test suite only scratches the surface of what's possible.

Our trace-based tests (AnandTech Storage Bench, and PCMark 10's Storage tests) are Windows-based tests, and any application benchmarks that we make a regular part of our test suite will also be Windows-based. We're using Windows 10 version 20H2.

Testbed Hardware

Most of our new SSD test suite makes use of an AMD Ryzen-based desktop with relatively moderate specs, but providing the PCIe 4.0 support necessary for the latest generation of NVMe SSDs. Our synthetic and trace-based don't require much compute power, so this system gets away with a 6-core processor, B550 chipset, and no GPU—leaving both the PCIe x16 and M.2 slots connected directly to the CPU available for testing at PCIe 4.0 speeds. The boot drive is a Samsung 960 PRO in the M.2 slot that is connected through the B550 chipset and therefore limited to PCIe 3.0 speeds.

AnandTech 2021 Consumer SSD Testbed
CPU AMD Ryzen 5 3600X
Motherboard ASRock B550 Pro
BIOS L1.42
Memory 2x 16GB Kingston DDR4-3200
Software Linux 5.10, FIO 3.25
Windows 10 version 20H2

 

We also have a more high-end Ryzen desktop, provided by Western Digital along with their SN850 SSD. This one uses AMD's 16-core Ryzen 9 3950X, slightly faster RAM, and includes a Radeon RX 580 GPU. Our plan is to use this system for application benchmarks like PCMark 10 and SYSmark 25, but at the moment we are unable to get either one of those tests to reliably run to completion on this system when using recent builds of Windows 10. Once we can find a stable and reasonably up-to-date software and driver configuration for this system, we'll also try using it for some game loading time benchmarks. This system is also planned to be our new enterprise SSD testbed, taking over from our dual-socket Intel Skylake 2U server which is still overkill for most of our storage tests, but lacks PCIe 4.0 support.

AnandTech 2021 Consumer SSD Testbed - Application Benchmarks
CPU AMD Ryzen 9 3950X
Motherboard ASUS ROG Crosshair VIII Hero
Memory 2x 16GB Mushkin DDR4-3600
GPU AMD Radeon RX 580 8GB (XFX)

 

Neither of our new Ryzen testbeds is capable of using the lowest-power PCIe Active State Power Management (ASPM) modes that laptops rely on, so our idle power measurement tests remain on an older Intel Coffee Lake desktop, updated with the latest software, firmware and microcode.

Coffee Lake SSD Testbed for Idle Power
CPU Intel Core i7-8700K
Motherboard Gigabyte Aorus H370 Gaming 3 WiFi
BIOS F14d
Memory 2x 8GB Kingston DDR4-2666
Software Linux 5.10, FIO 3.25

 

Almost all components of these testbeds are run of the mill desktop hardware. To measure power consumption of individual drives, we also have some highly specialized equipment provided by Quarch Technology. Their HD Programmable Power Module is a power supply that provides simultaneous measurement of voltage and current on its 12V and 5/3.3V supply rails, with readings taken every 4 microseconds. The HD PPM is a bit larger than an optical disc drive, and feeds power to the SSD under test using any of several different power injection fixtures.

Quarch has also recently introduced the Power Analysis Module (PAM). This moves the measurement hardware (ADCs, etc.) onto the form factor-specific fixture itself, and relies on the host system to power the drive instead of the PAM serving as a power supply. We still get the same high-precision measurements, but the PAM is now a much smaller fanless box that just handles translation and buffering of the data stream.

The Challenges of SSD Benchmarking Trace Tests: AnandTech Storage Bench and PCMark 10
POST A COMMENT

70 Comments

View All Comments

  • nobozos - Tuesday, February 2, 2021 - link

    One thing that bothers me about benchmarks in general is that they often don't show the statistics normalized against the cost of the thing being measured. For example, I'd like to see iops/$, or GBs/$, or ???/$ in all your tables and charts. I think you've sometime done this in the past, but it should become a regular feature of every review. Reply
  • kepstin - Tuesday, February 2, 2021 - link

    Prices are so volatile in the market (and sometimes even regional) that a static number here doesn't make sense imo. The periodic roundups of recommended drives do take price and performance into account. Reply
  • KarlKastor - Tuesday, February 2, 2021 - link

    @Billy
    Thank you for the detailed test and the explanation of each procedure.

    There is one thing that I am missing in this test. How does a drive perform in heavy and light, if it is 80 or 90% full?
    Is it closer to a fresh drive or closer to full drive?
    Maybe you can run a drive in that precondition. Not as a general test, but just once to show how a drive behaves.
    Reply
  • Oxford Guy - Tuesday, February 2, 2021 - link

    Great article. I particularly agree with the use of 80% full because that's a lot more realistic than empty drive testing. In fact, I would skip empty drive testing and stick with 60% and 80% full tests.

    • Having three Samsung drives out of nine shown seems like an ad for Samsung, even if that wasn't the intention. That Samsung is a popular brand is not a good reason. OCZ used to be popular and the company's bad practices caught up with it.

    • Please test the Inland brand drives. People can find Samsung drive tests all over the Internet. I'm not saying don't test them, of course. I am asking that you provide significantly more added value to your SSD reviews by reviewing drives almost no one else reviews. For instance, I recently purchased the 2 TB Inland Performance Plus drive, which uses the Phison E18 controller. It should provide very good performance but reviews would help.

    Another issue with brands like Inland is firmware updates. Sandforce, the most infamously poor-quality SSD controller outfit, finally (they claimed) fixed a serious bug in their second-generation controller years ago and OCZ released yet another firmware update. Yet, other brands were sold using the controller and the OCZ tool wouldn't recognize them so they could be patched. Sandforce, of course, never bothered to provide a utility for patching these other brands' drives.

    This issue isn't so severe if the consumer just happened to have purchased a Sandforce drive from a vendor that sometimes makes the effort to create patches, like Intel. But, it's really inexcusable to have such a caveat emptor attitude that one doesn't make a strong effort to warn consumers about any risks involved in buying drives from less dominant brands. Phison, for instance, has reportedly been working on improving the firmware for the E18. Will Inland ever receive a patch? I haven't looked much into it but when I did a a few cursory searches about Inland and firmware patches over the years it seemed that it was the typical "off brand" situation — where the drives are stuck forever with their initial firmware.

    That's not such a severe problem if the firmware is decent to begin with (unlike OCZ, which, despite dozens of updates never fixed the Vertex 2 drive at all) — but it's something Anandtech should be and should have been raising awareness about. Your site covered OCZ's bait and switch tactics (when it switched 32-bit NAND in the Vertex 2 for 64-bit NAND, causing the drives to brick randomly — especially when put to sleep), which was great.

    But, unless I missed it I haven't seen any articles about the drawbacks of purchasing SSDs from smaller brands. And, why not put some pressure on the industry to stop enabling companies like Sandforce to not provide utilities to patch their drives (and utilities to un-brick them when they go into 'panic mode'). It was completely inexcusable — the industry silence around that. Sandforce made it much more important to brick the drive when there was a software glitch, no matter how minor, apparently to 'protect its IP'. Shouldn't the consumer's data be considered the priority? Well, they came out with a not-at-all-conflict-of-interest partnership with DriveSavers. That's right — you get the joy of a drive that will brick at any moment and then you can spend thousands to 'protect the vaunted Sandforce IP' and pad its pockets and DriveSavers'.

    The tech press is supposed to protect us from caveat emptor. So, please... start reviewing smaller brands, start providing a bigger picture than the latest from Samsung, and put more pressure on industry players (like Inland) to do the right things, like keeping their drives' firmware current.
    Reply
  • Oxford Guy - Tuesday, February 2, 2021 - link

    Speaking of bad practices, let's take a look at Samsung.

    1. The company breaks industry convention and intentionally confuses consumers by labeling QLC drives "MLC", and TLC drives as well. That's an example of fraud which is, unfortunately, legal.

    There should have been an article from every tech site condemning this. I don't recall seeing even one. You know, it's not too late, either!

    2. The company posted fantasy power consumption figures for drives like the 830 and the tech press and companies like Newegg dutifully posted those specs. Samsung sold a lot of drives based on word of mouth — about how amazingly efficient its drives were, based on those nonsensical power usage claims.

    3. The company released its planar TLC drives in such an under-engineered (half-baked) state that they had to be kludged into frequently rewriting stored data to keep their performance somewhat acceptable. The steady state performance of the 128 GB 840 drive earned particular, fully justified, scorn from HardOCP.
    Reply
  • Kristian Vättö - Tuesday, February 2, 2021 - link

    All SSDs with a Phison controllers are the same - designed and assembled by Phison. Sure, there are some FW differences as every customer can request customisations, but at a high-level an SSD with a Phison controller is a Phison SSD. None of the small brands produce their own SSDs, they simply work with Phison and other similar ODMs who offer turnkey solutions. Anyone can start their own brand if they have enough capital to meet the MOQ requirements.

    It was different 10 years ago when there were numerous incumbent controller and SSD vendors shipping new designs every 6-12 months ago. At that time, it was never sure what to expect and at AT we were more or less a validation partner even. Nowadays there are a few large factories pumping out stuff with different labels.
    Reply
  • Oxford Guy - Tuesday, February 2, 2021 - link

    The Sandforce 2200 controller was used by a bunch of different companies but to my knowledge it’s not possible to patch that bug if one owns one of the smaller brands’ drives. It’s unlikely enough to get OCZ’ utility to recognize its own drives, let alone another vendor’s.

    So, even if the controller is the same and even if the other hardware is standard, is there a standard utility that can be used with any drive made by any brand? Sandforce never seemed to bother to offer anything like that and there were a lot of different brands using its controllers.

    Also, even when a controller is standard the firmware may not be, as in the case of Intel’s Sandforce drives as far as I know.
    Reply
  • Oxford Guy - Tuesday, February 2, 2021 - link

    So my question remains: are all the Inland drives able to be firmware-updated and secure erased?

    Or, are such ‘small brand’ drives locked out of those things?
    Reply
  • rahvin - Tuesday, February 2, 2021 - link

    Why would they offer a tool when they can charge the OEM to produce a branded tool for those drives only?

    There's little incentive for an ODM to provide anything they aren't paid for and their customers aren't the retail buyers, it's the OEM's.
    Reply
  • Billy Tallis - Tuesday, February 2, 2021 - link

    Samsung's over-represented in this article mainly because they're one of the few companies still sampling new SATA drives for review, and I didn't want to have the SATA market segments represented by old 64-layer drives that you can no longer purchase.

    As for the Inland drives: I don't have any easy way to get samples of a large number of their drives. I strongly prefer not wasting time re-testing the same drive with a different brand's sticker. I do plan to soon have full results for E12+TLC, E12S+QLC, E16+TLC, E16+QLC drives in Bench, and I'll be getting an E18 sample soon. They won't all be from the same brand, but the results will be generally representative of the equivalents from other Phison-based brands.

    I also wish the smaller SSD brands did a better job of making firmware updates available. That is definitely a valid reason for preferring some brands over others. It's a little hard to evaluate vendors on the timeliness of their firmware update releases at product launch, and I've never made it a priority to systematically compare vendors on this post-launch.

    Part of why it's been a low priority has been because it seems like firmware updates are generally not as important these days. When a controller is first launched there are often a few updates to optimize performance, but those usually don't have a big impact on the overall standings of a drive. Firmware updates to fix critical bugs seem to be thankfully less common. And for users who really do care about making sure they've got the absolute latest firmware on their Phison drives, you can usually find a way to apply the update using a different vendor's tool—not ideal by any means, but it works.
    Reply

Log in

Don't have an account? Sign up now