Disclaimer June 25th: The benchmark figures in this review have been superseded by our second follow-up Milan review article, where we observe improved performance figures on a production platform compared to AMD’s reference system in this piece.

Section by Ian Cutress

The arrival of AMD’s 3rd Generation EPYC processor family, using the new Zen 3 core, has been hotly anticipated. The promise of a new processor core microarchitecture, updates to connectivity, and new security options, all while retaining platform compatibility, is a good measure of an enterprise platform update, but the One True Metric is platform performance. Seeing Zen 3 take outright per-core performance leadership in the consumer market back in November raised expectations for a similar slam-dunk in the enterprise market, and today we get to see those results.

AMD EPYC 7003: 64 Cores of Milan

The headline number that AMD is promoting with the new generation of hardware is an increase in raw performance throughput of +19%, due to enhancements in the new core design. On top of this, AMD has added new security features, optimizations for different memory configurations, and updated Infinity Fabric and connectivity performance.


3rd Gen EPYC

Anyone looking for the shorthand specifications of the new EPYC 7003 series, known by its codename Milan, will see great familiarity with the previous generation; however, this time around AMD is targeting several different design points.

Milan processors will offer up to 64 cores and 128 threads, using AMD’s latest Zen 3 cores. The processor is designed with eight chiplets of eight cores each, similar to Rome, but this time all eight cores in a chiplet share one L3 cache, effectively doubling the L3 available to each core and lowering overall cache latency. All processors will have 128 lanes of PCIe 4.0 and eight channels of memory, most models support dual-processor configurations, and new options for memory channel optimization are available. All Milan processors should be drop-in compatible with Rome-series platforms after a firmware update.
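To make that cache arrangement concrete, here is a minimal sketch (an illustration only, not our actual tooling, and assuming the hwloc 2.x C API on Linux) of how one can enumerate the L3 domains and the cores sharing them. On a 64-core Milan part it should report eight 32 MB L3 domains of eight cores each, where Rome would instead show sixteen 16 MB domains of four cores.

```c
/*
 * Minimal topology check using the hwloc 2.x C API (illustrative only).
 * It lists each L3 cache and the number of cores sharing it.
 *
 * Build: gcc ccx_topo.c -lhwloc -o ccx_topo
 */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n_l3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    for (int i = 0; i < n_l3; i++) {
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
        int cores = hwloc_get_nbobjs_inside_cpuset_by_type(topo, l3->cpuset,
                                                           HWLOC_OBJ_CORE);
        /* cache.size is in bytes; shift down to MB for readability */
        printf("L3 #%d: %llu MB shared by %d cores\n", i,
               (unsigned long long)(l3->attr->cache.size >> 20), cores);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```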

AMD EPYC: Generation on Generation

| AnandTech          | EPYC 7001     | EPYC 7002     | EPYC 7003     |
|--------------------|---------------|---------------|---------------|
| Codename           | Naples        | Rome          | Milan         |
| Microarchitecture  | Zen           | Zen 2         | Zen 3         |
| Core Manufacturing | 14nm          | 7nm           | 7nm           |
| Max Cores/Threads  | 32 / 64       | 64 / 128      | 64 / 128      |
| Core Complex       | 4C + 8MB      | 4C + 16MB     | 8C + 32MB     |
| Memory Support     | 8 x DDR4-2666 | 8 x DDR4-3200 | 8 x DDR4-3200 |
| Memory Capacity    | 2 TB          | 4 TB          | 4 TB          |
| PCIe               | 3.0 x128      | 4.0 x128      | 4.0 x128      |
| Security           | SME, SEV      | SME, SEV      | SME, SEV, SNP |
| Peak Power         | 180 W         | 240 W*        | 280 W         |

*Rome introduced 280 W parts for specific HPC customers mid-cycle

One of the highlights here is that the new generation offers 280 W models to all customers. The previous generation topped out at 240 W for general availability, with 280 W parts reserved for specific HPC customers; this time around, every customer can get those higher-power, higher-performance parts with the new core design.

This is exemplified by a direct top-of-stack processor comparison:

2P Top of Stack GA Offerings

| AnandTech        | EPYC 7001 | EPYC 7002 | EPYC 7003 | Intel Xeon |
|------------------|-----------|-----------|-----------|------------|
| Processor        | 7601      | 7742      | 7763      | 6258R      |
| uArch            | Zen       | Zen 2     | Zen 3     | Cascade    |
| Cores            | 32        | 64        | 64        | 28         |
| TDP              | 180 W     | 240 W     | 280 W     | 205 W      |
| Base Freq (MHz)  | 2200      | 2250      | 2450      | 2700       |
| Turbo Freq (MHz) | 3200      | 3400      | 3500      | 4000       |
| L3 Cache         | 64 MB     | 256 MB    | 256 MB    | 37.5 MB    |
| PCIe             | 3.0 x128  | 4.0 x128  | 4.0 x128  | 3.0 x48    |
| DDR4             | 8 x 2666  | 8 x 3200  | 8 x 3200  | 6 x 2933   |
| DRAM Cap         | 2 TB      | 4 TB      | 4 TB      | 1 TB       |
| Price            | $4200     | $6950     | $7890     | $3950      |

The new top processor for AMD is the EPYC 7763, a 64-core processor at 280 W TDP, offering a 2.45 GHz base frequency and a 3.50 GHz boost frequency. AMD claims that this processor offers +106% performance in industry benchmarks compared to Intel’s best 2P 28-core processor, the Xeon Gold 6258R, and +17% over its previous-generation 280 W part, the EPYC 7H12.

Peak Performance vs Per Core Performance

One of AMD’s angles with the new Milan generation is targeted performance metrics: the company is not simply going after ‘peak’ numbers, but also taking a wider view for customers that need high per-core performance, especially for software that is limited by, or licensed on, per-core performance. With that in mind, AMD’s F-series of ‘fast’ processors is now being crystallized in the stack.

AMD EPYC 7003 F-Series Processors

| F-Series   | Cores / Threads | Base Freq (MHz) | Turbo Freq (MHz) | L3 (MB)      | TDP (W) | Price |
|------------|-----------------|-----------------|------------------|--------------|---------|-------|
| EPYC 75F3  | 32 / 64         | 2950            | 4000             | 256 (8 x 32) | 280     | $4860 |
| EPYC 74F3  | 24 / 48         | 3200            | 4000             | 256 (8 x 32) | 240     | $2900 |
| EPYC 73F3  | 16 / 32         | 3500            | 4000             | 256 (8 x 32) | 240     | $3521 |
| EPYC 72F3  | 8 / 16          | 3700            | 4100             | 256 (8 x 32) | 180     | $2468 |
These processors have the highest single-threaded performance of anything in AMD’s offering, along with the full 256 MB of L3 cache, and in our results they post the best per-thread scores of any enterprise processor we have tested across x86 and Arm; more details are in the review. The F-series processors come at a slight premium over the rest of the stack.

AMD EPYC: The Tour of Italy

The first generation of EPYC was launched in June 2017. At that time, AMD was essentially a phoenix: rising from the ashes of its former Opteron business, and with a promise to return to high-performance compute with a new processor design philosophy.

At the time, the traditional enterprise customer base was not initially convinced. AMD’s last foray into the enterprise space with a paradigm-shifting processor core, while it had its successes, ultimately fell flat as the company fought to stay solvent. Opteron customers were left with no updates in sight, and the prospect of jumping onto an unknown platform from a company that had stung so many in the past was not an attractive one for many.

At the time, AMD put out a three-year roadmap, detailing its next generations and the path the company would take to challenge a competitor with 99% market share on both performance and breadth of offerings. These were seen as lofty goals, and many sat back, willing to watch others take the gamble.


1st Gen EPYC Launch

When the first-generation Naples launched, it offered some impressive performance numbers. It didn’t quite compete in all areas, and as with any new platform, there were teething issues to begin with. AMD kept the initial cycle to a few of its key OEM partners before slowly broadening out the ecosystem. Naples was the first EPYC platform, offering 128 lanes of PCIe 3.0 and extensive memory support, and it initially targeted storage-heavy or PCIe-heavy deployments.


2nd Gen EPYC Launch

The second generation, Rome, launched in August 2019 (+26 months) to a lot more fanfare. AMD’s new Zen 2 core was competitive in the consumer space, and a number of key design changes in the SoC layout (such as moving to a flat NUMA design) encouraged a number of skeptics to start evaluating the platform. Such was the interest that AMD even told us it had to be selective about which OEM platforms it would assist before the official launch. Rome’s performance was good, and it scored a few high-profile supercomputer wins, but perhaps more importantly it showcased that AMD was able to execute on the roadmap it laid out in June 2017.

That flat SoC architecture, along with the updated Zen 2 core (which borrowed some elements from Zen 3) and PCIe 4.0, allowed AMD to start competing on compute performance as well as IO, and AMD’s OEM partners have consistently advertised Rome processors as compute platforms, often replacing two Intel 28-core processors with a single AMD 64-core processor that also offers more memory support and more PCIe. This also allowed for greater compute density, and AMD was in a position to help drive software optimizations for its platform, extracting performance while moving toward parity on the edge cases its competitors were heavily optimized for. All the major hyperscalers evaluated and deployed AMD-based offerings for their customers as well as internally; AMD had, in effect, earned the industry’s stamp of approval.


3rd Gen EPYC CPU

And so today AMD continues that tour of Italy with a trip to Milan, some +19 months after Rome. The underlying SoC layout is the same as Rome, but there is higher performance on the table, with additional security and more configuration options. The hyperscalers have already had final hardware for six months for their deployments, and AMD is now in a position to enable more OEM platforms at launch. Milan is drop-in compatible with Rome, which certainly helps, but with Milan covering more optimization points, AMD believes it is in a better position than ever to target more of the market with high-performance processors and high per-core performance processors.

AMD sees the launch of Milan as the third step in the roadmap shown back in June 2017, and as validation of its ability not only to execute reliably for its customers, but also to deliver above-industry-standard performance gains.

The next stop on the tour of Italy is Genoa, set to use AMD’s upcoming Zen 4 microarchitecture. AMD has also said that Zen 5 is in the pipeline.

Competition

AMD is launching this new generation of Milan processors approximately 19 months after the launch of Rome. In that time we have seen the launch of both Amazon Graviton2 and Ampere Altra, built on Arm’s Neoverse N1 family of cores.

Milan Top-of-Stack Competition

| AnandTech        | EPYC 7003 | Amazon Graviton2 | Ampere Altra | Intel Xeon |
|------------------|-----------|------------------|--------------|------------|
| Platform         | Milan     | Graviton2        | QuickSilver  | Cascade    |
| Processor        | 7763      | Graviton2        | Q80-33       | 6258R      |
| uArch            | Zen 3     | N1               | N1           | Cascade    |
| Cores            | 64        | 64               | 80           | 28         |
| TDP              | 280 W     | ?                | 250 W        | 205 W      |
| Base Freq (MHz)  | 2450      | 2500             | 3300         | 2700       |
| Turbo Freq (MHz) | 3500      | 2500             | 3300         | 4000       |
| L3 Cache         | 256 MB    | 32 MB            | 32 MB        | 37.5 MB    |
| PCIe             | 4.0 x128  | ?                | 4.0 x128     | 3.0 x48    |
| DDR4             | 8 x 3200  | 8 x 3200         | 8 x 3200     | 6 x 2933   |
| DRAM Cap         | 4 TB      | ?                | 4 TB         | 1 TB       |
| Price            | $7890     | N/A              | $4050        | $3950      |

Intel, meanwhile, has divided its efforts between big-socket and small-socket configurations. For big sockets (4+) there is Cooper Lake, a Skylake derivative for select customers only. For smaller socket counts (1-2), Intel is set to launch its 10nm Ice Lake portfolio at some point this year, but it remains silent on exact dates. Until then, all we have to compare Milan against is Intel’s Cascade Lake Xeon Scalable platform, the same platform we compared Rome to.

Interesting times for sure.

This Review

For this review, AMD gave us remote access to several identical servers with different processor configurations. We focused our efforts on the top-of-stack EPYC 7763, a 280 W 64-core processor, the EPYC 7713, a 225 W 64-core processor, and the EPYC 75F3, a 280 W 32-core processor designed as the halo Milan processor for per-core performance.

On the next page we will go through AMD’s Milan processor stack and how it compares to Rome as well as to current Intel offerings. We then cover our test systems, our SoC structure testing (cache, core-to-core, bandwidth), processor power, and then our full benchmarks; a minimal sketch of the kind of core-to-core measurement involved follows the page list below.

  1. This Page, The Overview
  2. Milan Processor Offerings
  3. Test Bed Setups, Compiler Options
  4. Topology, Memory Subsystem and Latency
  5. Processor Power: Core vs IO
  6. SPEC: Multi-Thread Performance
  7. SPEC: Single-Thread Performance
  8. SPEC: Per Core Performance Win for 75F3
  9. SPECjbb MultiJVM: Java Performance
  10. Compilation and Compute Benchmarks
  11. Conclusions and End Remarks

These pages can be accessed by clicking the links, or by using the drop down menu below.
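The core-to-core figures in the topology section come from a cache-line ‘ping-pong’ between two pinned threads, with the synchronisation line first touched by the main thread (a point discussed further in the comments below). The following is a minimal sketch of that style of measurement, not our actual harness; it assumes Linux, pthreads and C11 atomics, and the core numbers and iteration count are purely illustrative.

```c
/*
 * Minimal sketch of a cache-line "ping-pong" latency measurement
 * (illustrative only, not the harness used for this review's results).
 * The synchronisation line is first touched by the main thread, and two
 * worker threads pinned to chosen cores then bounce ownership of it back
 * and forth. Timing includes thread start-up, so treat the output as a
 * rough figure.
 *
 * Build: gcc -O2 -pthread c2c_pingpong.c -o c2c_pingpong
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Alignas(64) atomic_int flag;      /* the shared cache line */

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pinger(void *arg) {          /* waits for 0, writes 1 */
    pin_to(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }
    return NULL;
}

static void *ponger(void *arg) {          /* waits for 1, writes 0 */
    pin_to(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int core_a = 0, core_b = 1;           /* pick the two cores to test */
    pthread_t a, b;
    struct timespec t0, t1;

    atomic_store(&flag, 0);               /* line first touched by main */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, pinger, &core_a);
    pthread_create(&b, NULL, ponger, &core_b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cores %d <-> %d: ~%.1f ns per round trip\n",
           core_a, core_b, ns / ITERS);
    return 0;
}
```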

Comments

  • mkbosmans - Tuesday, March 23, 2021 - link

    Even if you have a nice two-tiered approach implemented in your software, let's say MPI for the distributed memory parallelization on top of OpenMP for the shared memory parallelization, it often turns out to be faster to limit the shared memory threads to a single socket or NUMA domain. So in the case of a 2P EPYC configured as NPS4 you would have 8 MPI ranks per compute node.

    But of course there's plenty of software that has parallelization implemented using MPI only, so you would need a separate process for each core. This is often because of legacy reasons, with software that was originally targeting only a couple of cores. But with the MPI 3.0 shared memory extension, this can even today be a valid approach to well-performing hybrid (shared/distributed mem) code.
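A bare-bones sketch of the two-tier MPI + OpenMP structure described in the comment above: it assumes Open MPI and GCC, and the file name, rank/thread counts and launch flags are illustrative only, not anyone's production setup.

```c
/*
 * Hybrid MPI + OpenMP skeleton: one MPI rank per NUMA domain, with
 * OpenMP threads filling the cores inside that domain. On a 2P 64-core
 * EPYC in NPS4 (8 domains of 16 cores) this could be launched as, e.g.:
 *   mpirun -np 8 --map-by numa --bind-to numa -x OMP_NUM_THREADS=16 ./hybrid
 *
 * Build: mpicc -fopenmp hybrid.c -o hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;

    /* FUNNELED is enough when only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        /* Shared-memory work goes here, confined to one NUMA domain
           when the rank is bound as in the launch line above. */
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Distributed-memory communication (halo exchanges, reductions)
       would be issued between parallel regions by the master thread. */
    MPI_Finalize();
    return 0;
}
```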
  • mode_13h - Tuesday, March 23, 2021 - link

    Nice explanation. Thanks for following up!
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    This is vastly incorrect and misleading.

    The fact that I'm using a cache line spawned on a third main thread which does nothing with it is irrelevant to the real-world comparison because from the hardware perspective the CPU doesn't know which thread owns it - in the test the hardware just sees two cores using that cache line, the third main thread becomes completely irrelevant in the discussion.

    The thing that is guaranteed with the main starter thread allocating the synchronisation cache line is that it remains static across the measurements. One doesn't actually have control where this cache line ends up within the coherent domain of the whole CPU, it's going to end up in a specific L3 cache slice depending on the CPU's address hash positioning. The method here simply maintains that positioning to be always the same.

    There is no such thing as core-core latency because cores do not snoop each other directly, they go over the coherency domain which is the L3 or the interconnect. It's always core-to-cacheline-to-core, as anything else doesn't even exist from the hardware perspective.
  • mkbosmans - Saturday, March 20, 2021 - link

    The original thread may have nothing to do with it, but the NUMA domain where the cache line was originally allocated certainly does. How would you otherwise explain the difference between the first quadrant for socket 1 to socket 1 communication and the fourth quadrant for socket 2 to socket 2 communication?

    Your explanation about address hashing to determine the L3 cache slice maybe makes sense when talking about fixing the initial thread within an L3 domain, but not why you want that L3 domain fixed to the first one in the system, regardless of the placement of the two threads doing the ping-ponging.

    And about core-core latency, you are of course right, that is sloppy wording on my part. What I meant to convey is that roundtrip latency between core-cacheline-core and back is more relevant (at least for HPC applications) when the cacheline is local to one of the cores and not remote, possibly even on another socket than the two threads.
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    I don't get your point - don't look at the intra-remote socket figures then if that doesn't interest you - these systems are still able to work in a single NUMA node across both sockets, so it's still pretty valid in terms of how things work.

    I'm not fixing it to a given L3 in the system (except for that socket), binding a thread doesn't tell the hardware to somehow stick that cacheline there forever, software has zero say in that. As you see in the results it's able to move around between the different L3's and CCXs. Intel moves (or mirrors) it around between sockets and NUMA domains, so your premise there also isn't correct in that case, AMD currently can't because probably they don't have a way to decide most recent ownership between two remote CCXs.

    People may want to just look at the local socket numbers if they prioritise that, the test method here merely just exposes further more complicated scenarios which I find interesting as they showcase fundamental cache coherency differences between the platforms.
  • mkbosmans - Tuesday, March 23, 2021 - link

    For a quick overview of how cores are related to each other (with an allocation local to one of the cores), I like this way of visualizing it more:
    http://bosmans.ch/share/naples-core-latency.png
    Here you can for example clearly see how the four dies of the two sockets are connected pairwise.

    The plots from the article are interesting in that they show the vast difference between the cc protocols of AMD and Intel. And the numbers from the Naples plot I've linked can be mostly gotten from the more elaborate plots from the article, although it is not entirely clear to me how to exactly extend the data to form my style of plots. That's why I prefer to measure the data I'm interested in directly and plot that.
  • imaskar - Monday, March 29, 2021 - link

    Looking at the shares sinking, this pricing was a miss...
  • mode_13h - Tuesday, March 30, 2021 - link

    Prices are a lot easier to lower than to raise. And as long as they can sell all their production allocation, the price won't have been too high.
  • Zone98 - Friday, April 23, 2021 - link

    Great work! However I'm not getting why in the c2c matrix cores 62 and 74 wouldn't have a ~90ns latency as in the NW socket. Could you clarify how the test works?
  • node55 - Tuesday, April 27, 2021 - link

    Why are the cpus not consistent?

    Why do you switch between 7713 and 7763 on Milan and 7662 and 7742 on Rome?

    Why do you not have results for all the server CPUs? This confuses the comparison of e.g. 7662 vs 7713. (My current buying decision )
