At AMD’s Next Horizon event this week, the company disclosed for the first time the layout of its next-generation EPYC processor and its new Vega-based Radeon Instinct datacenter compute accelerators, along with a strong confidence that its execution on 7nm will be a big win. If there’s anyone at AMD prepared to talk about execution, it’s the Chief Technology Officer and SVP of Technology and Engineering, Mark Papermaster.

Matching and Exceeding Expectations

The goal of AMD’s event, held in the middle of the fourth quarter, was to put two elements of AMD’s strategy into perspective: first, its commitment to delivering a 7nm Vega-based product by the end of the year, as the company promised in early 2018; and second, to position its 7nm capabilities as some of the best in the industry by disclosing the layout of its next-generation enterprise processor, set to hit shelves in 2019. This is, of course, a far cry from AMD during the Bulldozer days – the excitement in the presentation room from AMD’s executive staff was palpable. It’s rare to see so much latent energy in what could have been a series of dull presentations, but AMD’s ability to execute and compete in this new Zen era is one that gets the adrenaline going up and down the employee chain.


Ian Cutress (AnandTech) and Mark Papermaster (AMD)

For us, the press, information is our currency. AMD showing off its new 7nm EPYC and 7nm Instinct designs was a nice touch to the events of the day; however, we always want more. Not only is AMD setting down its capabilities today, it is also setting expectations through its roadmap for investors and customers in the months and years ahead. We sat down with AMD’s CTO, Mark Papermaster, to see if we could squeeze out some of the finer details of AMD’s strategy and of the products from the morning sessions.

 

AMD’s Vision of the Datacenter

Ian Cutress: Is AMD all about the roadmaps now?

Mark Papermaster: What I’m happy about today is that it really reaffirms AMD’s execution to the industry. We set out an ambitious roadmap, and we are proving our mettle.

IC: When you say that you made decisions for this product 3-5 years ago, you weren’t joking, were you? Decisions you are making today are then for products set in 2021 and beyond?

MP: Absolutely correct. You really have to call the ball right in this industry. If you don’t, you create a gap – a significant multi-year gap.

IC: Forrest explained on the stage that the datacenter of today is very different to the datacenter ten years ago (or even 3-5 years ago). What decisions are you making today to predict the datacenter of the future?

MP: We believe we will be positioned very well – it all ties back to my opening comments on Moore’s Law. We all accept that traditional Moore’s Law is slowing down, and while process still matters, you have to be agile about how you put the pieces together, otherwise you cannot win. We built scalability into our first EPYC launch, and we leveraged our chiplet approach here to combine really small 7nm CPU dies with the tried and proven 14nm process for the IO die. That modularity only grows in importance going forward. We’ve stated our case as to how we believe it is possible to keep pace with traditional Moore’s Law growth despite the slowing of the process gains per node and the lengthening time between major semiconductor nodes. I think you’ll see others adopt what we’ve done with the chiplet approach, and I can tell you we are committed to it.

We’ve re-architected what we do to allow ourselves to be agile in putting our IPs together, and eventually this will be an ecosystem based on our IPs. That ecosystem will include others over time, because who can predict exactly which ML accelerators may be disruptive in the market, or which new memory configuration might provide an advantage? Those companies that are agile and adaptable will win, and this is the crux of AMD’s strategy. We will be able to drive the base CPU and GPU roadmap that keeps us at the forefront, and have the modularity and flexibility to adapt to workloads as they emerge.

 

Diving into Rome

IC: Can you confirm where the parts of Rome are manufactured?

MP: Chiplets on TSMC 7nm, the IO die is on GlobalFoundries 14nm.

IC: With all the memory controllers on the IO die we now have a unified memory design such that the latency from all cores to memory is more consistent?

MP: That’s a nice design – I commented on improved latency and bandwidth. Our chiplet architecture is a key enablement of those improvements.

IC: When you say improved latency, do you mean average latency or peak/best-case latency?

MP: We haven’t provided the specifications yet, but the architecture is aimed at providing a generational improvement in overall latency to memory. The architecture with the central IO chip provides a more uniform latency and it is more predictable.

IC: There are a lot of questions surrounding AMD’s cache hierarchy that you probably can’t answer, but can you at least state if the IO chip has a cache inside?

MP: All the details within the IO die and beyond what we said today will be forthcoming at our EPYC 2 launch.

IC: The IO die as shown in the presentation looked very symmetrical, almost modular in itself. Does that mean it can be cut into smaller versions?

MP: No details at this time.

IC: Do the chiplets communicate with each other directly, or is all communication through the IO die?

MP: What we have is an IF link from each CPU chiplet to the IO die.

IC: When one core wants to access the cache of another core, it could have two latencies: when both cores are on the same chiplet, and when the cores are on different chiplets. How is that managed with a potentially bifurcated latency?

MP: I think you’re trying to reconstruct the detailed diagrams that we’ll show you at the product announcement!

IC: Under the situation where we now have a uniform main memory architecture, for on-chip compared to chip-to-chip there is still a near and a far latency…

MP: I know exactly where you’re going and as always with AnandTech it’s the right question! I can honestly say that we’ll share this info with the full product announcement.

IC: One of the biggest changes when Zen came out was the op-cache, and AMD is saying it’s now larger.

MP: You know we made so many front-end changes on Zen, and we had ideas of even more we could do, but as you know, at some point we have to bring the product to market. So some ideas end up in the next generation, where we get the low-hanging fruit and also implement new technologies. The engineers had many more ideas after Zen, some of which made it into Zen 2, and we will continue to see these sorts of generational improvements. There are very nice enhancements in Zen 2, including improved efficiency at the front end to feed the revamped execution units.

IC: With the FP units now capable of doing 256-bit on their own, is there a frequency drop when 256-bit code is run, similar to when Intel runs AVX2?

MP: No, we don’t anticipate any frequency decrease. We leveraged 7nm. One of the things that 7nm enables for us is scale, in terms of cores and FP execution. It is a true doubling, because we didn’t only double the pipeline width, we also doubled the load-store and the data pipe into it.

IC: Now the Zen 2 core has two 256-bit FP pipes, can users perform AVX512-esque calculations?

MP: At the full launch we’ll share with you exact configurations and what customers want to deploy around that.

IC: AMD’s consumer side actively promotes technologies such as Precision Boost 2 and XFR2 to manage frequency and load with other variables in the mix. Are any of those implemented in Rome?

MP: Details to follow at the product announcement!

IC: Can you go into detail about the additional hardware Spectre mitigations?

MP: I simply call it a hardening. We shared the approaches that we took with microcode and software for Spectre (Zen is not affected by Meltdown or Foreshadow) – you can read our whitepapers on it – and this is about alleviating the microcode from having to intercept and take those actions. The idea is that if we harden that in hardware, it becomes more ubiquitous. More info to come on that, but we did want to include a mention of the improvements in the hardware mitigations.
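For context on what is being hardened, here is a sketch (ours, not AMD’s undisclosed mechanism) of the Spectre v1 bounds-check-bypass pattern and the software-side fix it currently requires, in the style of the Linux kernel’s `array_index_nospec`: a branchless mask that clamps an out-of-bounds index to zero even while the CPU is speculating past a mispredicted check. Hardware hardening aims to make such per-site software workarounds unnecessary.

```c
#include <stdint.h>
#include <stddef.h>

uint8_t table[256];

/* Branchless clamp: returns idx when idx < size, else 0 - even when
   executed speculatively past a mispredicted bounds check. Assumes
   arithmetic right shift of signed values (true of GCC/Clang/MSVC)
   and idx, size below INTPTR_MAX. */
static size_t index_nospec(size_t idx, size_t size) {
    intptr_t diff = (intptr_t)(idx - size);  /* negative iff idx < size */
    size_t mask = (size_t)(diff >> (sizeof(intptr_t) * 8 - 1));
    return idx & mask;                       /* idx, or 0 if out of bounds */
}

uint8_t read_safely(size_t idx, size_t size) {
    if (idx < size)                          /* the branch the CPU may mispredict */
        return table[index_nospec(idx, size)];
    return 0;
}
```

Because the clamp is pure data flow rather than a branch, it holds even on the speculative path, which is exactly the property a microcode or hardware mitigation has to provide more generally.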

 

The Rome Product

IC: Will Rome aim for a level of price parity with Naples?

MP: We haven’t announced our pricing focus. I think this is how to think about it: what we’ve targeted comes from a design point, and what we’ve shared with you today is what we are delivering. Today we said that for a given power envelope, for a given socket configuration, Rome will offer a doubling of core performance, or a quadrupling of floating point performance. Pricing will come when the products are closer to shipping, and we have performance targets per socket. We will always be price competitive too, especially in performance per socket.

IC: Where does Rome sit with CCIX support?

MP: We didn’t specifically announce those attributes beyond PCIe 4.0 today, but I can say we are a member of CCIX, as we are with Gen-Z. For any further detail there you will have to wait until launch. Any specific details about the speeds, feeds, and protocols are coming in 2019.

IC: There have been suggestions that because AMD is saying Rome is coming in 2019, that must mean Q4 2019.

MP: We’re not trying to imply any specific quarter or time frame in 2019. Today’s event was timed to launch our MI60 GPU on 7nm, which is imminent. We wanted to share with the industry how we’ve embraced 7nm, preview what’s coming very soon with MI60, and share our approach on the CPU side with Zen 2 and Rome. We’re not implying any particular time in 2019, but we’ll be forthcoming with that. Even though the GPU is backwards compatible with PCIe 3.0, it helps for a PCIe 4.0 GPU to have a PCIe 4.0 CPU to connect to!

IC: With Rome being up to 64 cores, are you seeing diminishing returns promoting dual socket configurations?

MP: No – there’s such an insatiable demand for compute, and it’s about TCO. Of course, we are ensuring that real value is delivered. We’re working with ISVs to make sure they can support the increased core count, as we’re blowing through what they envisioned would be the number of cores they support on a per-socket basis. All that work is well underway, and at launch we will have the ISV community with us.

IC: Part of the launch of Naples was the communication that users could replace a two socket system with one Naples. Now you can replace a 2S Naples with one socket Rome!

MP: It’s very simple in this industry – our view at AMD is that we’re going to deliver the best TCO (total cost of ownership) value we can. That’s the best strategy for us, to be the upstart coming back in and fighting our way to maximize our market share gain. This is a battle and we’re bringing every tool in the shed.

IC: We expect lower core counts of Rome to eventually exist. This could mean fewer cores enabled per chiplet or just fewer chiplets entirely?

MP: We didn’t announce details, but you can see the physical modularity given the layout shared today. What you’re seeing is eight 8-core clusters, so the easiest physical modularity is on a chiplet basis.

IC: AMD has already committed that Milan, the next generation after Rome, will have the same socket as Rome. Can you make the same commitment with Zen 4 that was shown on the roadmap slides?

MP: We’re certainly committed to that socket continuity through Milan, and we haven’t commented beyond that. Obviously at some point the industry transitions to PCIe 5.0 and DDR5 which will necessitate a socket change.

IC: So one might assume that an intercept might occur with Zen 4?

MP: No comment (!)

 

Revisiting Infinity Fabric

IC: One of the key aspects of AMD’s portfolio is the Infinity Fabric, and with Rome you have stated that AMD is now on its second-generation IF. Do you see an end to its ability to scale down in process node, but also to scale out to more chiplets and different IP?

MP: I don’t see an end, because IF is made up of both a Scalable Data Fabric (SDF) and a Scalable Control Fabric (SCF). The SCF is the key to giving the modularity, and that’s an architectural product. With our SDF we are very confident in the protocols we developed – they are based on the rich history we have with HyperTransport – and we are committed to improving bandwidth and latency every generation. IF is important when it applies to on-chip connectivity, but it can go chip to chip, like we did with EPYC, and also with Vega Radeon Instinct in connecting GPU to GPU. For chip-to-chip IF, you are also dependent on the package technology. We see tremendous improvements in package technology coming over the next five years.

IC: One of the key aspects of IF is that it has started becoming a big part of the power consumption on EPYC. What is AMD doing to manage that power?

MP: Fabrics going forward have to be developed with the same energy efficiency optimizations in mind as with our core engines, such as our CPUs and GPUs. When we started with IF in our CPUs, we started with tremendous energy efficiency for our high performance objectives. We are making generationally excellent performance per watt improvements in our core engines, and we have IF on the same track. Energy efficiency is required in any of the markets that we serve. Therefore we are driving every IP that we develop in the company to improve energy efficiency each and every generation.

IC: Should we expect companies that aren’t AMD to implement IF?

MP: We haven’t announced any plan to license that IP, as we are not in the IP licensing business. But I could certainly envision partnerships through our semi-custom business, working with our partners where the right business case makes sense for both parties.

IC: With chiplets connected via IF on Rome, if a customer wanted a semi-custom design with different IP, such as a GPU or an AI block or an FPGA, would that be possible? (Say for example, a console?)

MP: Our semi-custom group is wide open to talk to customers to brainstorm! What excites me about the chiplet approach is that I think it’s going to disrupt the industry. It’s going to change the way the industry dreams of different configurations. Some might be right, and I can guarantee that someone will conjure up ten other ones that we didn’t think of! Honestly I think it is a disruptive force that is just nascent, just starting right now.

IC: With IF on 7nm, it offers 100 GB/s GPU to GPU connectivity. One of your competitors has something similar which allows both GPU-GPU and CPU-GPU connectivity. Currently with Rome, PCIe 4.0 has been announced from CPU to GPU but not IF. What has AMD’s analysis been on that CPU to GPU link?

MP: We haven’t announced applying IF between the CPU and GPU, and while it is certainly feasible, it is really dependent on when workloads could truly leverage that protocol – when full coherency is required across both CPU and GPU. It is certainly feasible, but we haven’t announced it at this time.

 

AMD on Design and Manufacturing

IC: You mentioned on stage that AMD has leapfrogging design teams. How do you manage keeping positive aspects of the design if the teams are out of sync with each other?

MP: So to be clear, we have one architecture team but two implementation teams. Best practices of architecture are shared from generation to generation. The implementation teams all work under the same internal organization, and we’ve done everything to lower the barriers to sharing best practices and innovation between those teams. If you look at the industry, we often see companies end up with competing microprocessor design teams. I’m not a believer that that is the best way to get the best microprocessor. To get ahead you need parallel efforts, and you have to partition the implementation, but I’m a big believer that you come out way ahead when the teams brainstorm together on the best approaches and microarchitecture changes for performance improvement.

IC: On personnel, AMD has recently been shedding key talent to a major competitor, mostly in graphics. How is AMD approaching this alongside the inevitable competition in the GPU market which is set to follow in several years?

MP: We are constantly refreshing our talent – we have brought on tremendous talent year in and year out. AMD is recognized in the industry as an incredibly innovative company, and one that is on the rise as our share grows, so we are attracting great talent. From time to time we have folks leave, as with any company. We are still full speed ahead, and we are very excited about some of the folks that have come and knocked on our door to join our team.

IC: AMD has had a strong relationship with TSMC for many years, which is only getting stronger with the next-generation products on 7nm; however, you are now more sensitive to TSMC’s ability to drive the next manufacturing generation. Will the move to smaller chiplets help overcome potential issues with larger dies, or does this open up cooperation with Samsung, given that the chip sizes are more along the lines of what they are used to?

MP: First off, the march for high performance has brought us to Zen 2 and the ability to leverage multiple technology nodes. What we’re showing with Rome is a solution with two foundries and two different technology nodes. It gives you an idea of the flexibility we’ve built into our supply chain, and gives you an explicit example of how we can work with different partners to achieve a unified product goal. On the topic of Samsung, we know Samsung very well and have done work with them.

IC: You expected that when you came out with 7nm in 2018/2019 that the process technology AMD used would be at parity with the competition, and your major competitor would already have 10nm products on the shelf. Today you have stated that you will be ahead of that expectation. How does AMD realign future performance goals in a situation where you are no longer playing catch-up?

MP: We can’t ever be anything other than pressing performance up. It’s a great time to be a customer of high performance! Why? Because we have competition back! When competition is there, all the players will be running as fast as they can and that is what I can guarantee from AMD. We’re going to be pressing to add performance every generation and deliver that value to our customers’ workloads.

IC: You said 7nm is meant to be a long lived node, and variations of 7nm are expected from TSMC with EUV. What resources are AMD putting in place for a post-7nm world?

MP: We have a team that is constantly working on that – just as the CPU team looks ahead, we do the same with our foundry preparedness teams. This includes design options, and we work with the various supporting industries as well as with the foundries. We are always looking into future nodes across the industry.

Many thanks to Mark and his team for their time!

Comments

  • The Benjamins - Monday, November 12, 2018 - link

    My key takeaways are:
    IC: The IO die as showed in the presentation looked very symmetrical, almost modular in itself. Does that mean it can be cut into smaller versions?
    MP: No details at this time.

    Makes me think that TR 3000 will be a cut down of 4 or 6 channel ram with 32 or 48c (max).

    IC: With the FP units now capable of doing 256-bit on their own, is there a frequency drop when 256-bit code is run, similar to when Intel runs AVX2?
    MP: No, we don’t anticipate any frequency decrease. We leveraged 7nm. One of the things that 7nm enables us is scale in terms of cores and FP execution. It is a true doubling because we didn’t only double the pipeline with, but we also doubled the load-store and the data pipe into it.

    This sounds great looks like a strong move forward.

    The manufacture confirmation is nice to know.
  • FirstStrike - Monday, November 12, 2018 - link

    I suppose we got confirmation on the FP pipelines? Because AMD only vaguely said double load/store/throughput but not FPU width at the announcement.
  • wumpus - Sunday, November 18, 2018 - link

    I think they've left it open that they could have separate multiply and add units and not two multiply/add units. Either way, it should close a lot of the gap on benchmarks previously favorable to Intel.

    I still think that *most* of what works on AVX would work better on a GPU, although any double operations are likely to require a specialized GPU for non-crippled double operation.
  • kd_ - Monday, November 12, 2018 - link

    Excellent interview as usual. Also your efforts to get MP to give more details on the IO are appreciated, but it was always going to be a long shot to get more info on that :)
    I have one question: are the two 256-bit execution units confirmed? You mentioned that in one of your questions, but it wasn't verified. Any chance you can send a follow-up question on whether they will follow Intel's approach of 2x fused multiply/add 256-bit vector execution units, or do with their 256-bit implementation something analogous to their current 128-bit design, which has 2x add + 2x multiply execution units?
    Thanks.
  • GreenReaper - Monday, November 12, 2018 - link

    "At the full launch we’ll share with you exact configurations and what customers want to deploy around that." -- to me this reads as "it depends". For AVX-512 it's also a matter of "which ones do you want to implement?" Intel used this for market segmentation; AMD could do much the same. Possibly without so much cost, if they're not using 512-bit registers, but also not so much benefit.
  • kd_ - Monday, November 12, 2018 - link

    I'm mostly interested in their 256bit execution units as they are far more relevant but also the exact implementation will determine the expected peak avx512 performance. I would guess they followed the same approach as previously but the mention of 2 execution units means they followed intel's approach of fused multiply/add units otherwise it wouldn't make sense to have 2.
  • Rudde - Monday, November 12, 2018 - link

    From wikichip:
    "The FP has a single pipe for 128-bit load operations. In fact, the entire FP side is optimized for 128-bit operations. Zen supports all the latest instructions such as SSE and AVX1/2. The way 256-bit AVX was designed was so that they can be carried out as two independent 128-bit operations. Zen takes advantage of that by operating on those instructions as two operations; i.e., Zen splits up 256-bit operations into two µOPs so they are effectively half the throughput of their 128-bit operations counterparts. Likewise, stores are also done on 128-bit chunks, making 256-bit loads have an effective throughput of one store every two cycles. The pipes are fairly well balanced, therefore most operations will have at least two pipes to be scheduled on retaining the throughput of at least one such instruction each cycle. As implies, 256-bit operations will use up twice the resources to complete (i.e., 2x register, scheduler, and ports). This is a compromise AMD has taken which helps conserve die space and power. By contrast, Intel's competing design, Skylake, does have dedicated 256-bit circuitry. It's also worth noting that Intel's contemporary server class models have extended this further to incorporate dedicated 512-bit circuitry supporting AVX-512 with the highest performance models having a whole second dedicated AVX-512 unit."
  • kd_ - Monday, November 12, 2018 - link

    Yes. That is what happens with zen1. I am asking about zen2.
  • jospoortvliet - Tuesday, November 13, 2018 - link

    I guess the implication is they might have done the same thing but then for 256-512 bit. Or even - double up on 128 units, but I am not sure if they can split up 512bit operations in 4x 128bit... is that even feasible?
  • kd_ - Tuesday, November 13, 2018 - link

    No. And they don't with zen1. The software or compiler simply takes an alternative and less efficient path. But with multiple 256bit execution units they could support it directly.
