Hot Chips 2021 Live Blog: DPU + IPUs (Arm, NVIDIA, Intel)
by Dr. Ian Cutress on August 23, 2021 5:20 PM EDT
Posted in: SoCs, Intel, Arm, Trade Shows, NVIDIA, Neoverse, DPU, IPU, Hot Chips 33, SmartNIC
05:27PM EDT - Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned during Monday and Tuesday for our regular AnandTech Live Blogs.
05:30PM EDT - Just waiting for this session to start, should be a couple of minutes
05:30PM EDT - Arm up first with its Neoverse N2 cores
05:34PM EDT - Roadmap, objectives, core architecture, system architecture, performance, conclusions
05:34PM EDT - Second-generation infrastructure CPU, following N1
05:34PM EDT - 4-128 core designs
05:35PM EDT - 5G infrastructure to cloud data centers
05:35PM EDT - Arm sells IP and definitions
05:35PM EDT - SBSA/SBBR support
05:36PM EDT - Marvell is already using N2, up to 36 in an SoC
05:36PM EDT - High speed packet processing
05:36PM EDT - All about SPECint score with DDR5 and PCIe 5.0
05:37PM EDT - N2 with Arm v9
05:37PM EDT - Two generations of the Scalable Vector Extensions: SVE and SVE2
05:37PM EDT - BF16 support, INT8 matrix multiply
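(For context, this is what SVE's vector-length-agnostic model looks like in practice - a minimal sketch of my own using Arm C Language Extensions intrinsics, not code from the talk. The predicate masks the loop tail, so the same binary runs correctly on any SVE vector width.)

```c
/* Illustrative VLA loop: y[i] += a * x[i] using SVE ACLE intrinsics. */
#include <arm_sve.h>

void axpy(float a, const float *x, float *y, long n) {
    for (long i = 0; i < n; i += svcntw()) {    /* 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32(i, n);      /* predicate covers the tail */
        svfloat32_t vx = svld1_f32(pg, &x[i]);  /* predicated loads */
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_x(pg, vy, vx, a);      /* fused multiply-add */
        svst1_f32(pg, &y[i], vy);               /* predicated store */
    }
}
```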
05:38PM EDT - Side-channel security hardening; SHA3/SHA512, SM3/SM4 crypto extensions
05:38PM EDT - Persistent memory support
05:38PM EDT - memory partition and monitoring
05:39PM EDT - Gen on gen improvements with virtualization
05:39PM EDT - +40% IPC uplift
05:39PM EDT - Similar power/area to N1, maximizes perf/Watt
05:39PM EDT - an aggressive PPA trajectory
05:40PM EDT - 3.6 GHz max core frequency
05:40PM EDT - N1 on 7nm, vs N2 on 5nm
05:41PM EDT - uArch: most structures are bigger
05:42PM EDT - Fetch more per cycle on the front end - increase branch prediction accuracy
05:42PM EDT - Enhanced security to prevent side-channel attacks
05:43PM EDT - Bigger structures on the back end as well
05:44PM EDT - N2 has Correlated Miss Caching (CMC) prefetching
05:45PM EDT - Latency improvement on L2 as a result of CMC
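(A toy illustration of my own, not Arm's: a correlating prefetcher learns which miss address tends to follow which, so even an address-irregular but repeating pattern like a pointer chase can be anticipated on the next traversal.)

```c
/* Each n->next dereference is a potential cache miss; the miss sequence
 * is irregular in address terms but repeats across traversals, which is
 * exactly the correlation a CMC-style prefetcher can record and replay. */
#include <stddef.h>

struct node { struct node *next; long payload; };

long chase(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;
    return sum;
}
```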
05:45PM EDT - 32% IPC improvement at iso-frequency
05:46PM EDT - The 40% mentioned earlier was for SPEC2006
05:47PM EDT - Coherent Mesh Network - CMN-700 - chiplets and multi-socket
05:47PM EDT - Also CXL support
05:48PM EDT - Improvements over CMN-600: double the mesh links, 3x the cross-sectional bandwidth
05:48PM EDT - Programmable hot-spot re-routing
05:49PM EDT - Composable Datacenter SoCs - chiplets and IO dies and super home dies
05:51PM EDT - balancing memory requests
05:51PM EDT - control for capacity or bandwidth
05:52PM EDT - Cbusy - throttling outstanding transactions to the CPU - affects hardware prefetcher aggressiveness
05:53PM EDT - Cbusy and MPAM meant to work together
05:54PM EDT - Resulting in best performance
05:56PM EDT - Compared to the market with N2
05:56PM EDT - integer performance only
05:57PM EDT - 'Real world workload' numbers based on pre-silicon models
05:58PM EDT - Up to 256 cores of N2 should be fun
05:59PM EDT - hit the market in the next few months
05:59PM EDT - Q&A
05:59PM EDT - Q: The N1 vs N2 comparison is at iso-frequency - what frequency is used on slide 10? A: A range of power modes; quoted 2-2.5 GHz, which is what customers will use
06:01PM EDT - Q: How does Cbusy work in a heterogeneous multi-die system? A: All IPs will get the Cbusy information and throttle requests
06:03PM EDT - Q: Can MPAM do cache partitioning? Weighting? A: It can. It also supports fine-grain thresholds for control - you can tune based on capacity without over-partitioning
06:03PM EDT - Second talk of the session - NVIDIA DPU
06:04PM EDT - Idan Burstein, a co-author of NVMe-oF
06:04PM EDT - Architecture and platform use-cases
06:05PM EDT - Data center is going through a revolution
06:05PM EDT - Fully disaggregate your server between compute, memory, acceleration, storage, and software. Requires accelerated networking and DPUs to control it all
06:06PM EDT - 10-20x bandwidth deployed per server requires better networking
06:06PM EDT - a Datacenter infrastructure workload
06:08PM EDT - Moving infrastructure workloads to the CPU is a bad idea
06:08PM EDT - Need appropriate offload
06:08PM EDT - Data path acceleration needed
06:09PM EDT - Bluefield-2
06:09PM EDT - Roadmap
06:09PM EDT - Currently shipping BF-2, announced BF-3 with double bandwidth
06:09PM EDT - BF-4 is 4x BF-3
06:09PM EDT - BF-4 also uses NVIDIA AI
06:10PM EDT - 22 billion transistors
06:10PM EDT - PCIe 5.0 x32
06:10PM EDT - 400 Gb/s Crypto
06:10PM EDT - 300 equivalent x86 cores
06:10PM EDT - 16 cores of Arm A78
06:10PM EDT - DDR5-5600, 128-bit bus
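(For scale - my arithmetic, not a slide number: a 128-bit bus at DDR5-5600 works out to 5600 MT/s x 16 bytes ≈ 89.6 GB/s of peak memory bandwidth.)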
06:11PM EDT - supports 18M IOPS
06:11PM EDT - ConnectX-7
06:11PM EDT - DOCA Framework
06:11PM EDT - Program in DOCA on BF-2; it scales immediately to BF-3 and BF-4
06:12PM EDT - 3 different programmable engines
06:12PM EDT - 16x Arm A78 - server level processor
06:12PM EDT - 16 cores, 256 threads (SMT16?) datapath accelerator
06:12PM EDT - ASAP² - Accelerated Switching and Packet Processing programmable flow pipeline
06:13PM EDT - Bluefield-4X
06:13PM EDT - Bluefield-3X
06:13PM EDT - Not ASAP, ASAP-squared
06:15PM EDT - Isolated boot domain in RT OS
06:16PM EDT - PCIe tuned for DPU
06:17PM EDT - Differentiating between the datapaths - software defined networking stack
06:18PM EDT - accelerating the full path from host to network
06:18PM EDT - it says bare metal host - can it do virtual hosts?
06:19PM EDT - Encryption, Tunneling, NAT, Routing, QoS, Emulation
06:19PM EDT - 100G DPDK
06:19PM EDT - Million packets per second
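(For context, the software baseline here is a DPDK-style userspace poll loop like the sketch below - my illustration, assuming the port was already set up with the usual rte_eth_dev_configure/rte_eth_dev_start sequence. Every cycle spent per packet in this loop is what the DPU's offloads reclaim.)

```c
/* Minimal L2-forwarding-style loop: poll a burst of packets, send it back out. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

static void fwd_loop(uint16_t port) {
    struct rte_mbuf *bufs[BURST];
    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST);
        if (nb_rx == 0)
            continue;                        /* nothing arrived; poll again */
        uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);       /* drop anything unsent */
    }
}
```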
06:20PM EDT - vs AMD EPYC 7742 64-core
06:20PM EDT - This is Bluefield 2
06:20PM EDT - TCP flow with 100G IPsec
06:22PM EDT - Storage processing acceleration
06:22PM EDT - Data in-flight encryption
06:24PM EDT - NVMe over Fabric
06:27PM EDT - Cloud-native supercomputing with non-blocking MPI performance
06:27PM EDT - Accelerated FFT performance across multi-node HPC
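("Non-blocking" is the key word: the DPU progresses communication so the host can compute through it. A generic sketch of the overlap pattern - mine, with a placeholder do_local_work(), not NVIDIA's benchmark code:)

```c
#include <mpi.h>

static void do_local_work(void) { /* placeholder for overlapped compute */ }

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int peer, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);
    do_local_work();                   /* compute while transfers progress */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```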
06:28PM EDT - DPU isolates GeForce NOW - 10 million concurrent users
06:28PM EDT - +50% more users per server
06:28PM EDT - Push more concurrent users
06:29PM EDT - Bluefield 3X has onboard GPU
06:30PM EDT - Support for CUDA
06:30PM EDT - GPU + DPU + network connectivity, fully programmable on a single PCIe card
06:31PM EDT - Q&A Time
06:31PM EDT - Q: Why Cortex-A rather than N1/N2? A: The A78 was the highest-performing core available at the time
06:31PM EDT - Q: Add CXL to future Bluefield? A: Can't comment. See CXL as important
06:33PM EDT - Q: RT-OS cores? A: Designed internally, arch is RISC-V compatible
06:34PM EDT - Q: Can DPU accelerate RAID construction? A: Yes it can - trivial and complex
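(The 'trivial' case presumably means parity RAID: the core operation is a streaming XOR across data stripes, sketched below in plain C - illustrative only - which a DPU's data-path engines can run at line rate instead of burning host cycles.)

```c
/* RAID-5 style parity over three data buffers: P = D0 ^ D1 ^ D2. */
#include <stddef.h>
#include <stdint.h>

void raid5_parity(const uint8_t *d0, const uint8_t *d1,
                  const uint8_t *d2, uint8_t *p, size_t len) {
    for (size_t i = 0; i < len; i++)
        p[i] = d0[i] ^ d1[i] ^ d2[i];
}
```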
06:36PM EDT - That's the end, next up is Intel
06:36PM EDT - Bradley Burres
06:37PM EDT - Driving network across the datacenter for Intel
06:38PM EDT - Same five minute intro about IPUs as the previous talks
06:40PM EDT - IPU over time has been gaining more control to free up CPU resources. Move these workloads to the IPU = more performance!
06:41PM EDT - 'solving the infrastructure tax'
06:41PM EDT - Mount Evans
06:41PM EDT - Developed with a CSP
06:41PM EDT - Baidu or JD?
06:42PM EDT - 16 Neoverse N1 cores
06:42PM EDT - 200G Ethernet MAC
06:42PM EDT - PCIe 4.0 x16
06:42PM EDT - NVMe storage with Optane recognition
06:42PM EDT - Advanced crypto and compression acceleration
06:42PM EDT - Software, Hardware, Accelerator co-design
06:43PM EDT - solving the long-tail infrastructure issue
06:44PM EDT - Dataplane on the left, compute on the right
06:44PM EDT - Supports 4-socket systems with one Mount Evans
06:44PM EDT - RDMA and RoCEv2
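(For reference, a one-sided RDMA write through libibverbs looks like this sketch - my example, assuming the queue pair, memory registration, and the peer's remote address/rkey were exchanged at connection setup. The remote CPU never touches the transfer, which is the point of the offload.)

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
               uint32_t len, uint64_t raddr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_WRITE;        /* one-sided operation */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;    /* request a completion entry */
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey = rkey;
    return ibv_post_send(qp, &wr, &bad);  /* 0 on success */
}
```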
06:44PM EDT - QoS and telemetry up to 200 million packets per second
06:45PM EDT - Inline IPSec
06:46PM EDT - N1s at 3 GHz
06:46PM EDT - three channels of dual mode LPDDR4 - 102 GB/s bandwidth
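(That figure is consistent with three 64-bit channels of LPDDR4X-4267: 3 x 8 bytes x 4267 MT/s ≈ 102.4 GB/s peak - my arithmetic, and the channel width and speed bin are my assumption, not stated on the slide.)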
06:46PM EDT - Engines for crypto
06:47PM EDT - Intel didn't just glue assets together
06:49PM EDT - P4 programmable pipeline
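(P4 expresses the pipeline as match-action tables. As a rough C analogy of my own - real Mount Evans pipelines are written in P4 and compiled onto the hardware, not executed like this - a stage matches a header field and applies an action:)

```c
/* Toy exact-match stage: match destination IPv4, action = pick egress port. */
#include <stddef.h>
#include <stdint.h>

struct rule { uint32_t dst_ip; uint16_t egress_port; };

static const struct rule table[] = {
    { 0x0A000001u, 1 },   /* 10.0.0.1 -> port 1 */
    { 0x0A000002u, 2 },   /* 10.0.0.2 -> port 2 */
};

uint16_t match_action(uint32_t dst_ip) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].dst_ip == dst_ip)
            return table[i].egress_port;
    return 0;             /* default action: drop */
}
```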
06:51PM EDT - Most applications for the IPU are 'brownfield' - it has to drop into existing infrastructure
06:54PM EDT - Now talking system security with isolation and recovery independent of workloads and tenants
06:55PM EDT - QoS, uptime
06:56PM EDT - malicious driver detection
06:56PM EDT - Futureproofing
06:56PM EDT - Compliant with NSA standards and FIPS 140. Said something about 2030?
06:57PM EDT - More info at the Intel ON event
06:57PM EDT - Q&A Time
06:58PM EDT - Q: How does PPA compare for Arm vs IA (x86) cores? A: IP availability and schedule drove the choice of Arm
06:59PM EDT - Q: SBSA compliant? A: Yes
06:59PM EDT - Q: TDP? A: Works within the PCIe power envelope
07:00PM EDT - Q: Does it work with Sapphire Rapids, given both have crypto acceleration? A: Yes
07:00PM EDT - Q: Does Mount Evans replace server PCH? A: No, orthogonal
07:02PM EDT - Q: Is it specific to Xeon? A: Use with any CPU
07:02PM EDT - That's a wrap, time for a keynote!
name99 - Tuesday, August 24, 2021
I think a theory like that depends on the larger issue of "who ends up providing for the data center?"
- nV doesn't have the CPU to provide the sort of total package you have in mind.
- Intel and AMD both seem uninterested in truly grasping the implications of SoC and providing *total* packages. Or to put it differently, both have become so addicted to data center money that they see the low-end as something that can be cobbled together from high-end scraps; that might seem like a sensible decision, but it means they're missing the sorts of tight cross-device integration we've seen on SoCs for years.
- so who's left? AMZ (and Altera and whoever plans their own cloud chips) may well see things your way. They have the problem of limited resources, but they are willing to be sensible about buying what they don't have internally.
- almost everything I see here (and in the earlier session) I see already on Apple's SoC.
UMA is of course the consistent single virtual address space Intel talked about.
Likewise the tagging of every NoC transaction for various telemetry/performance monitoring/optimization purposes like ARM N2's MPAM.
As I've pointed out before, Apple's DMA fabric does most of the tasks of these accelerators, from crypto and compression to bulk movement to TCP offload, along with the matrix multiply accelerator that everyone now feels they need. (Sort is an interesting IBM accelerator that I haven't seen in Apple or anywhere else.)
So Apple seems closest to the vision you have of designing this stuff from the ground up as a unity, not as a collection of separate chips that hopefully work together. Of course Apple don't sell server or data center chips... But there's more than one way to skin a cat...
But to circle back to your point, as long as data center equipment is seen as extensions to AMD and INTC CPUs, the current situation will continue. So WILL data center equipment continue that way? Amazon is providing one track to shake things up. A second even more abstracted track could be Apple Cloud Services?
OreoCookie - Wednesday, August 25, 2021
Why would nVidia need its own CPU cores? They currently license ARM cores, and ARM has a plethora of cores for most imaginable scenarios. Plus, they are trying (perhaps unsuccessfully) to buy ARM.