The Neoverse V1 Microarchitecture: X1 with SVE?

Starting off with the new Neoverse V1, the design has a familiar origin but also a few distinct features that we see for the first time in an Arm CPU. As noted in the introduction, the V1 was designed at the same time as the Cortex-X1 by the same team at Arm’s Austin design centre, and the two microarchitectures share large similarities in their block structures.

What’s notable about the V1, in comparison to the X1 and of course the predecessor N1, is the fact that this is now an SVE-capable processor with two native 256b SIMD pipelines. It also introduces server-only features such as coherent L1I caches, bFloat16 execution capabilities, and a slew of distinct characteristics we’ll cover in just a bit.

The architectural features of the Neoverse V1 are probably the most complicated to describe: essentially, it’s a v8.4 baseline architecture which also pulls in v8.5 and v8.6 features for the HPC-oriented workloads the design is aimed at. Given that we talked about Armv9 only a month ago, this may seem a bit odd, but again we have to remember that the V1 was designed some time ago and that customers have had the IP for quite a while now, taping in or having already taped out V1 processors.

The big promise of the V1 is its extremely large performance jump over the N1, coming in at an IPC increase of +50%. This sounds large, and it is, but it’s also not all that surprising given that the microarchitecture is essentially two design generations newer than the N1, even though from an infrastructure product standpoint it’s only one generation newer.

From a high-level pipeline and microarchitecture standpoint, the Neoverse V1 is very similar to the X1. It’s still an extremely short pipeline design with a minimum of 11 stages, with Arm putting a lot of focus on this aspect of their microarchitectures to reduce branch misprediction penalties as much as possible. This aspect of the microarchitecture has remained relatively static over the last few iterations of the Austin family of designs starting with the A76, so Arm notes that the frequency capabilities of the V1 are essentially unchanged when compared to the N1, with performance boosts coming solely from increased IPC.
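To put the pipeline-depth argument into numbers, here is a minimal first-order sketch of how the misprediction penalty feeds into cycles per instruction. The base CPI, branch density and misprediction rate are hypothetical values chosen purely for illustration, and equating the penalty with the 11-stage minimum depth is a simplification on our part, not an Arm figure.

```python
def effective_cpi(base_cpi, branches_per_instr, mispredict_rate, penalty_cycles):
    """First-order model: base CPI plus the average stall cycles that
    branch mispredictions add to every instruction."""
    return base_cpi + branches_per_instr * mispredict_rate * penalty_cycles

# Hypothetical workload: one branch every 5 instructions, 2% mispredicted.
BASE_CPI = 0.5            # assumption, not a measured V1 figure
BRANCHES_PER_INSTR = 0.2  # assumption
MISPREDICT_RATE = 0.02    # assumption

# Penalty approximated by pipeline depth: 11 stages versus a hypothetical
# deeper 16-stage design used only as a point of comparison.
short_pipe = effective_cpi(BASE_CPI, BRANCHES_PER_INSTR, MISPREDICT_RATE, 11)
long_pipe = effective_cpi(BASE_CPI, BRANCHES_PER_INSTR, MISPREDICT_RATE, 16)

print(f"11-stage CPI: {short_pipe:.3f}, 16-stage CPI: {long_pipe:.3f}")
```

Under these assumptions the shorter pipeline loses fewer cycles per instruction to mispredictions, which is the effect Arm is optimising for.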

The V1 inherits a lot of the front-end improvements we’ve seen with the Cortex-A77 and Cortex-X1 generations. These include doubled bandwidth for the decoupled fetch unit, a much larger L2 BTB of up to 8K entries, and a rearranging and resizing of the lower-level BTBs: the L0 (nanoBTB) grows to 96 entries, and the L1 BTB (microBTB) is no longer present when compared to the Neoverse N1.

The V1, when compared to the N1, also adds structures that hadn’t been present in the older design, such as a macro-op (Mop) cache of up to 3K decoded instructions. The dispatch bandwidth from the Mop cache is 8-wide, while the actual instruction decoder this generation is 5-wide, much the same as on the X1.
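As a rough illustration of why the Mop cache matters, the sketch below blends the two supply paths by the fraction of cycles served from the Mop cache. The 8-wide and 5-wide figures come from the text above; the blending model, the function name and the treatment of macro-ops and instructions as equivalent units are our own simplifying assumptions.

```python
MOP_CACHE_WIDTH = 8  # macro-ops per cycle dispatched from the Mop cache
DECODE_WIDTH = 5     # instructions per cycle through the decoders

def frontend_supply(mop_hit_fraction):
    """Average ops per cycle when the given fraction of cycles is fed
    from the Mop cache and the remainder from the decoders."""
    if not 0.0 <= mop_hit_fraction <= 1.0:
        raise ValueError("mop_hit_fraction must be between 0 and 1")
    return (mop_hit_fraction * MOP_CACHE_WIDTH
            + (1.0 - mop_hit_fraction) * DECODE_WIDTH)

# Hypothetical hit fractions, chosen only to show the range.
for fraction in (0.0, 0.5, 1.0):
    print(f"hit fraction {fraction:.1f}: {frontend_supply(fraction):.1f} ops/cycle")
```

In this simplified view, a loop that lives entirely in the Mop cache can be fed at up to 8 ops per cycle, while code that always misses is limited to the 5-wide decoders.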

The out-of-order window size is essentially doubled when compared to the Neoverse N1, with the ROB growing to 256 entries. This is actually a tad larger than what Arm was willing to disclose for the Cortex-X1, where the company had only talked about an “OoO window size of 224”, so in this regard this seems to be a differentiation from what we’ve seen in the X1.

On the back-end integer execution pipelines, the design also pulls in the many changes we’ve seen with the A77 generations, which amongst others include a doubling of the branch execution ports, and a new complex ALU capable of simple instructions such as additions as well as more complex operations such as multiplications and divisions.

Obviously enough, the new SIMD pipelines are very different on the V1 given that this is Arm’s first ever SVE capable microarchitecture. The design has two pipelines with seemingly two dedicated schedulers, with native capability for 256b wide SVE vectors. The design is fully backwards compatible for 128b NEON/FP operations in which the pipelines then essentially act as 4x128b units, meaning it has the same execution width as the X1 in that regard.
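The equivalence between the two operating modes can be checked with some simple lane arithmetic. This is a minimal sketch based only on the pipe counts and widths quoted above; it assumes every pipe can start one full-width operation per cycle, which is an idealisation rather than a disclosed throughput figure.

```python
def lanes_per_cycle(pipes, vector_bits, element_bits):
    """Number of elements processed per cycle if every pipe issues one
    full-width vector operation each cycle (idealised assumption)."""
    return pipes * (vector_bits // element_bits)

# SVE mode: two pipelines, each natively 256 bits wide.
sve_fp32 = lanes_per_cycle(pipes=2, vector_bits=256, element_bits=32)
# NEON/FP mode: the same hardware behaving as four 128-bit units.
neon_fp32 = lanes_per_cycle(pipes=4, vector_bits=128, element_bits=32)
# Same comparison for double precision.
sve_fp64 = lanes_per_cycle(pipes=2, vector_bits=256, element_bits=64)
neon_fp64 = lanes_per_cycle(pipes=4, vector_bits=128, element_bits=64)

print(f"FP32 lanes/cycle: SVE {sve_fp32}, NEON {neon_fp32}")
print(f"FP64 lanes/cycle: SVE {sve_fp64}, NEON {neon_fp64}")
```

Both modes work out to 512 bits of vector execution per cycle, which is why the NEON execution width matches that of the X1.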

Compared to the N1, the new design also supports new bFloat16 and Int8 data formats which greatly increase the AI and ML inferencing performance capabilities of the core.

On the memory subsystem side, we also see the increased unit count found on the Cortex-X1, including two load/store units and one load unit, meaning the core is capable of up to 3 loads per cycle and 2 stores per cycle maximum. SVE vector bandwidth is 2x32B per cycle for loads, and 32B per cycle for stores.
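For a sense of scale, the per-cycle SVE figures can be converted into peak L1 bandwidth per core. The sketch below uses the 2x32B load and 32B store numbers from the text; the 3.0GHz clock is a hypothetical value picked for illustration, as actual V1 frequencies depend on the implementation.

```python
SVE_LOAD_BYTES_PER_CYCLE = 2 * 32  # two 32-byte vector loads per cycle
SVE_STORE_BYTES_PER_CYCLE = 32     # one 32-byte vector store per cycle

def peak_gb_per_s(bytes_per_cycle, clock_ghz):
    """Peak bandwidth in GB/s: bytes per cycle times billions of cycles
    per second."""
    return bytes_per_cycle * clock_ghz

ASSUMED_CLOCK_GHZ = 3.0  # hypothetical clock, not a figure from Arm

load_bw = peak_gb_per_s(SVE_LOAD_BYTES_PER_CYCLE, ASSUMED_CLOCK_GHZ)
store_bw = peak_gb_per_s(SVE_STORE_BYTES_PER_CYCLE, ASSUMED_CLOCK_GHZ)

print(f"Peak SVE load bandwidth:  {load_bw:.0f} GB/s per core")
print(f"Peak SVE store bandwidth: {store_bw:.0f} GB/s per core")
```

These are theoretical peaks for vector accesses that hit in the L1; sustained figures would be lower.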

The core naturally includes the data parallelism improvements seen on the X1 in order to increase MLP (Memory-level parallelism) capabilities.

The L2 cache has also adopted a similar design to that of the X1: it is now 1 cycle faster at the same 1MB size, and has double the number of banks to increase access parallelism.

Arm here discloses a quite large reduction in the system level latency for the V1. Besides structural improvements, new generation prefetchers are a big part of this, such as the introduction of a new type of temporal prefetcher which is able to latch onto arbitrary access patterns over time and recognise subsequent iterations of the same pattern, and pull the data in.
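Arm hasn’t detailed how its temporal prefetcher is built, but the general idea can be shown with a minimal correlation-table sketch: remember which miss address followed which, and replay that chain the next time the same address shows up. The class name, the table organisation and the prefetch degree here are all our own assumptions for illustration, not a description of the V1’s hardware.

```python
class TemporalPrefetcher:
    """Toy temporal prefetcher: records the miss that followed each miss
    address and replays the recorded chain on a repeat visit."""

    def __init__(self, degree=2):
        self.successor = {}    # miss address -> address that followed it last time
        self.last_miss = None  # most recent miss address seen
        self.degree = degree   # how many addresses to prefetch ahead

    def on_miss(self, addr):
        # Learn: the previous miss was followed by this one.
        if self.last_miss is not None:
            self.successor[self.last_miss] = addr
        self.last_miss = addr

        # Predict: walk the recorded chain starting from this address.
        prefetches = []
        nxt = self.successor.get(addr)
        while nxt is not None and len(prefetches) < self.degree:
            if nxt == addr or nxt in prefetches:
                break  # avoid looping forever on short cycles
            prefetches.append(nxt)
            nxt = self.successor.get(nxt)
        return prefetches

# An arbitrary, non-strided access pattern that repeats.
pattern = [0x100, 0x9C0, 0x240, 0x7F80]
pf = TemporalPrefetcher(degree=2)
first_pass = [pf.on_miss(a) for a in pattern]   # nothing learned yet
second_pass = [pf.on_miss(a) for a in pattern]  # pattern is recognised
```

On the first pass the table is empty and nothing is prefetched; on the second pass each miss triggers prefetches for the next two addresses in the sequence, even though the addresses have no fixed stride.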

Arm discloses that the core has new dynamic prefetching behaviour that plays a major role in reducing L2 to interconnect traffic, which is a critical metric in large core count systems where every byte of bandwidth needs to be of actual use and cannot be wasted for wrongly speculated prefetching.


95 Comments


  • mode_13h - Thursday, April 29, 2021 - link

    Uh...

    > 2013, they had the A7 (tiny), A15 (small), and A57
    > Then ARM made the leap into 64bit processing around 2016.

    A57 is a 64-bit core.

    > Contrast that to the new x86 competition in AMD

    No. Why would we do that? They were competing in totally different markets, at the time. The only partial overlap was embedded Ryzen.

    > There hasn't been any upgrades for the "tiny" portfolio, being stuck to ... Cortex A35 CPU
    > There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU

    The A35 and A55 both launched in 2017.

    > they're a joke, and easily surpassable by the competitors.

    In terms of what? PPA? Perf/W? Perf/$? Might want to be sure you're comparing apples to apples and not comparing competing "small" core with ARM "tiny".

    > There hasn't been any new "large" category for iGPUs from ARM or competitors

    Samsung is using RDNA and MediaTek is licensing a Nvidia GPU for its upcoming SoCs.

    Might want to do a little more research, before writing another longpost. I agree that A55 could use a refresh, but ARMv9 will force that, anyway. I don't even know where A35 is used, but same story, there.

    It's worth noting that ARM has also been active in the microcontroller market, with both 32-bit and 64-bit offerings.
  • Kangal - Friday, April 30, 2021 - link

    Firstly, apologies.
    I know the A57 is 64bit, but there have been many (most?) implementations of it running in 32bit mode. The A57 was really a "rough draft" for ARM, in moving towards both "medium" sized cores and into 64bit computing. Hence, it feels more at home next to its A7 and A15 brethren.

    The contrast is there, and necessary to show the landscape of the time. The tech industry is a fast-paced one. And if your code/calculations are agnostic, so that they can run on any platform, you would consider all options (not that I recommend people go creating agnostic code, compared to specialized or hardware-accelerated code).

    The Cortex A35 launched in 2015. It's long due for an upgrade, or replacement. Where this core likes to be in is in small, low-power, and cheap devices. In particular the microcontroller market as you mentioned. ARM hasn't been as active in this field as you think they have, with many of the products being custom designs from the ODMs.

    I already mentioned the A55 was a slight refresh for the A53, and that itself is also surpassed. Have a look at Apple's "small" cores. They are Out-of-Order processors, they are slightly faster than an A73, they use slightly less power than an A53. It's mind boggling. Others disagree, and say they're actually faster than A75, and more efficient than A55... but at this scale we're splitting hairs. With that much room for difference, it's not inconceivable (heck it's likely) that an outside competitor like RISC-V will surpass the A55 in terms of Perf/W, Perf/PPA, Perf/$, or a combination of the lot. And remember, the Cortex-A53 is the most popular core out there, where it's getting stamped out on so many different Chinese products.

    Samsung isn't using Radeon iGPUs YET, and neither is MediaTek. Besides, we have yet to see them in the wild and find out details of their architecture. These might be licensed from AMD or Nvidia, but they might be "small" iGPUs instead of "large" iGPU designs. I did forget to mention that the Tegra X1, and some Nvidia SBC did actually use their "large" iGPU architecture (ie Maxwell etc).

    The gist of my rant is that ARM was a revolutionist early on, basically creating the market. Then they were extremely innovative and competitive, basically dominating the market. Now they are competitive but not as revolutionary nor as competitive/innovative as they used to. With ARMv9 they have a chance to start fresh, and return to status quo, by having a trifecta of products for the computing industry. I was pointing out the gaps in their history and portfolio. They shouldn't just focus on mobile phones, that's boring.
  • mode_13h - Friday, April 30, 2021 - link

    > The Cortex A35 launched in 2015.

    Okay, the date I saw was wrong. It seems to have been announced in November 2015. The A55 seems to have been announced in May 2017.

    > this core likes to be in is in small, low-power, and cheap devices.
    > In particular the microcontroller market as you mentioned.

    They have actual microcontrollers, though. The A35 is still too power-hungry (and expensive?) for most IoT devices.

    > Have a look at Apple's "small" cores.

    You focus on performance and efficiency, but what about area? Apple has a narrower focus and different process, cost, & area targets than ARM.

    The point we can definitely agree on is that ARM's bottom & middle tier cores should've been refreshed more frequently. But, everyone seems to think that ARM is directly competing with Apple, but it's not. Their objectives meaningfully differ, resulting in ARM probably being driven more towards making smaller cores than Apple.

    It's only at the top end of their mobile stacks that you can really say ARM and Apple are in direct competition. However, even on something like the A78, ARM is still put in a position of having to make compromises that Apple isn't.

    > ARM was a revolutionist early on, basically creating the market.
    > Now they are competitive but not as revolutionary nor as competitive/innovative as they used to.

    That's how these things work. A small upstart has a lot of freedom. The bigger a company gets, the more constrained it becomes by its customers, its market, the cost of changing, and the downside risk. I'm still just not totally convinced that entirely explains what we're seeing.

    If they can manage to cleave their server cores entirely from their mobile cores, and then really make big cores that are performance-first (instead of scaled up versions of mostly-performance cores, like the X1 and A78 situation), then we might see them start to compete at Apple's level. Basically, to compete they'd have to start by designing the X1 first, and then make the A78 by putting it on a diet.

    > They shouldn't just focus on mobile phones, that's boring.

    LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
  • grant3 - Saturday, May 1, 2021 - link

    > LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.

    Focusing on the same-ol' same-ol' business is exactly how once-profitable companies fade into irrelevance as technology moves on. Plenty of mediocre CEOs do that.

    A great CEO can find the future revenue opportunities and prove it to the company's owners.
  • mode_13h - Sunday, May 2, 2021 - link

    Yeah, but you can't afford to walk away from your bread and butter. Any new growth areas you pursue can't come at the expense of revenues in your core business. If you even threatened to starve your core business, you'd be out of a job before your new ambitions could ever get off the ground.

    Just look at what happened with Qualcomm: they tried to invest in new areas, but their investors absolutely wouldn't tolerate it. Granted, they're more exposed than ARM would be, either under SoftBank or Nvidia.
  • Kangal - Sunday, May 2, 2021 - link

    No, grant3 is exactly right.

    What you said is EXACTLY what Blockbuster said before they went bankrupt. In case you didn't know, the board members passed on the opportunity to buy Netflix for $50 Million. The CEO then tried to right that wrong by acquiring another competitor, and shifting their revenue stream. The board fired their CEO, saying that their late-fee revenue was the bread and butter of their business model. Blockbuster was so narrowly focused and stuck in the past that not only did they miss the opportunity of becoming a whole new behemoth, but they sunk their own ship at the same time.
  • mode_13h - Sunday, May 2, 2021 - link

    > What you said is EXACTLY what Blockbuster said before they went bankrupt.

    If grant3 is saying that Blockbuster should close half its stores while they're still profitable, to divert money into R&D on getting into the (then) almost non-existent streaming market, no company in the world would do that.

    Now, it's not like ARM is ignoring other markets, of course. They just can't turn their back on the mobile market, in order to do so.

    > Blockbuster was too narrow focused and stuck in the past

    The genius of capitalism is that the failure of Blockbuster to transition into a streaming platform didn't keep streaming from happening. Its investors could even get in on the game by shifting their investments into players in the streaming market. If the CEO was such a believer, he could've quit and gone to work for a streaming company or founded his own.

    Also, let's not forget that there have already been losers in streaming, and it wasn't clear Netflix would've successfully made the transition from movies-by-mail. Who remembers Google Video? Yahoo even bought some company in the space. And just last year, there was Quibi. I'm sure there are others I'm forgetting.

    I think we all want to see ARM succeed outside of mobile. They've been investing a lot, in order to do so. Some in this very thread have been complaining at their lack of focus on their smaller, lower-power cores (currently A35 & A55), which you could see as evidence they've already been making sacrifices to try and compete outside their niche. I don't know if that's accurate, but it's plausible.

    If Nvidia's acquisition goes through (as I expect it will), I hope and expect it will provide ARM with the funds to do even more ambitious things.
  • Spunjji - Friday, April 30, 2021 - link

    That's a sound argument for that expectation - it's definitely long since past time for an update.
  • dotjaz - Tuesday, April 27, 2021 - link

    Why would you need rumours when we know for a FACT that there will be an A55 successor unless b.L design is abandoned for no good reason. I'll give you a hint, b.L can't have mixed architectures that's why big cores stayed at ARMv8.2a for so long.
  • eastcoast_pete - Tuesday, April 27, 2021 - link

    Maybe the shift to ARMv9 will force ARM's hand with giving the LITTLE cores out-of-order designs; however, current bigLITTLE designs already mix big, out-of-order designs with LITTLE in-order cores like the A55. So, bL can and has worked with mixed architectures for quite a while. However, I hope you are correct in that the shift to ARMv9 will force the issue, and we'll finally get out-of-order LITTLE cores also on non-Apple devices.
