Barcelona Architecture: AMD on the Counterattack
by Anand Lal Shimpi on March 1, 2007 12:05 AM EST
Posted in: CPUs
Even More Tweaks
Translation Lookaside Buffers (TLBs for short) cache the mappings from virtual addresses to physical memory locations. TLB hit rates are usually quite high, but as programs grow larger and more demanding of memory, microprocessor designers generally have to enlarge their TLBs to keep up. With K8 AMD increased the size of its TLBs over K7, and with Barcelona AMD is repeating the process once more.
Barcelona's TLBs are slightly larger than K8's, and they now include support for 1GB pages, which are useful for database applications and virtualized workloads. AMD also added a 128-entry L2 TLB for 2MB pages, once again to help cope with newer programs using larger page sizes. The TLB improvements won't make any tangible impact on desktop applications, but server applications with large memory footprints should see a benefit.
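For context on why larger pages matter, here is a minimal, hypothetical sketch of how a Linux application might back a big buffer with 2MB pages through hugetlbfs, so that each TLB entry covers far more memory; the /mnt/huge mount point, file name, and buffer size are our own illustrative assumptions, not anything AMD specifies.

/* Minimal sketch: backing a buffer with 2MB huge pages so that far
 * fewer TLB entries are needed to cover a large working set.
 * Assumes a hugetlbfs mount at /mnt/huge (configuration dependent). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256MB = 128 x 2MB pages */

int main(void)
{
    int fd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open hugetlbfs file"); return 1; }

    /* Each 2MB page covers 512x more memory per TLB entry than a 4KB page. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... use buf as a large, TLB-friendly working set ... */

    munmap(buf, BUF_SIZE);
    close(fd);
    unlink("/mnt/huge/buf");
    return 0;
}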
When Intel introduced its second Pentium M, codenamed Dothan, one of the enhancements was a lower integer divide latency. Although details were slim at the time, AMD has indicated that it has reduced integer divide latency in Barcelona as well. We're not sure whether the changes are similar to what Intel did with Dothan, but don't expect the improvement to be particularly noticeable in real world applications. It's one of those tweaks that adds up to more efficient execution overall, not one that's going to give you double digit performance gains across the board.
In another attempt to effectively "widen" Barcelona without committing a significant number of transistors to doing so, AMD took a couple of instructions that were previously microcoded and turned them into fastpath instructions. A microcoded instruction takes significantly longer to decode than one that can go through one of the core's fastpath decoders. CALL and RET-Imm instructions are now fastpath, which is part of Barcelona's sideband stack optimization enhancements. MOVs from SSE registers to integer registers are now fastpath as well.
While on the topic of instructions, AMD also introduced a few new extensions to its ISA with Barcelona. There are two new bit manipulation instructions: LZCNT and POPCNT. Leading Zero Count (LZCNT) counts the number of leading zeros in an operand, while Pop Count (POPCNT) counts the number of bits set to 1 in an operand. Both instructions are targeted at cryptography applications.
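As a rough illustration (not AMD sample code), the snippet below shows the two counts using GCC-style builtins, which compilers can map to single LZCNT/POPCNT instructions on hardware that supports them; the helper names are ours.

/* Sketch: bit-count operations that a compiler can map to LZCNT/POPCNT
 * when the target CPU supports them. */
#include <stdio.h>

static unsigned leading_zeros(unsigned long long x)
{
    /* __builtin_clzll is undefined for 0; LZCNT itself returns the
     * operand width (64) in that case, so handle it explicitly. */
    return x ? (unsigned)__builtin_clzll(x) : 64u;
}

static unsigned population_count(unsigned long long x)
{
    /* Number of bits set to 1 -- what POPCNT computes in one instruction. */
    return (unsigned)__builtin_popcountll(x);
}

int main(void)
{
    unsigned long long v = 0x00F0ULL;
    printf("lzcnt(0x%llx)  = %u\n", v, leading_zeros(v));     /* 56 */
    printf("popcnt(0x%llx) = %u\n", v, population_count(v));  /* 4 */
    return 0;
}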
AMD also introduced four new SSE instructions: EXTRQ/INSERTQ and MOVNTSD/MOVNTSS. The first two combine mask and shift operations into a single instruction, while the latter two are scalar streaming stores (streaming stores that operate on scalar operands). We may see some of these same instructions included in Penryn and other future Intel processors.
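For the streaming store half of these extensions, here is a minimal sketch of how MOVNTSD might be used, assuming an SSE4a-capable CPU and a compiler that provides AMD's ammintrin.h intrinsics (built with -msse4a); the fill_nontemporal helper is our own illustration, not AMD sample code.

/* Sketch: writing scalar doubles with MOVNTSD so the stores bypass the
 * cache hierarchy -- useful when filling a large buffer the CPU will not
 * read back soon. Assumes SSE4a support and the ammintrin.h header. */
#include <stddef.h>
#include <emmintrin.h>
#include <ammintrin.h>

void fill_nontemporal(double *dst, double value, size_t n)
{
    __m128d v = _mm_set_sd(value);       /* scalar double in an XMM register */
    for (size_t i = 0; i < n; i++)
        _mm_stream_sd(&dst[i], v);       /* MOVNTSD: non-temporal scalar store */
    _mm_sfence();                        /* make the streaming stores globally visible */
}

Because the stores go around the caches, the fence at the end matters before any other thread reads the buffer; the payoff is that a large write-only fill doesn't evict useful data from L1/L2.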
83 Comments
chucky2 - Friday, March 2, 2007 - link
Can you post the link that originates at AMD's own website that says specifically that AM2+ CPUs are guaranteed to work - understandably maybe not supporting every new feature - in current AM2 boards? Not a news post from DailyTech, The Inquirer, Tom's, whatever... one that's on AMD's site itself.
And no, AMD could make AM2+ completely incompatible with current AM2 boards and they probably wouldn't see much of a drop, if any, from the large OEMs. The large OEMs would just ensure that when the AM2+ CPUs came in, AM2+ motherboards would likewise come in.
Believe me, I want to see the link...because I'm desperately awaiting 690G or MCP68, whichever comes first (which is probably MCP68 at the pace AMD is moving on 690G).
Chuck
yacoub - Thursday, March 1, 2007 - link
You say 128kb L1 per core but the diagram image just beneath that text shows a 64bit L1 cache. Please confirm which it is.
Thanks.
Awesome article, btw. Seems like quite a significant group of changes to the CPU. Looking forward to seeing how it stacks up against the best Quad Core2 Intel can offer. =)
yacoub - Thursday, March 1, 2007 - link
also, please forgive my hasty typing - I wrote "128kb" and "64bit" - I meant "128KB" and "64KB"
JarredWalton - Thursday, March 1, 2007 - link
L1 is 128K total - 64K data and 64K instruction.
Beenthere - Thursday, March 1, 2007 - link
AMD doesn't do knee-jerk reactions like Intel because AMD has superior products. AMD continues to take market share from Intel in every segment and Barcelona will continue that trend. Barcelona looks to be every bit as superior to Intel's hacked/patched/glued together chips as Opteron was when introduced. Intel's chips depend on huge cache size for their performance and that crutch won't work after the intro of Barcelona.
For those without a clue, AMD didn't start design of Barcelona last week or last year. It's been in the development pipeline for many years and the performance will demonstrate exactly why AMD's long term platform stability is the right choice for most enterprise buyers. Intel is gonna feel the pain again.
Roy2001 - Thursday, March 1, 2007 - link
Facts please, no BS.
zsdersw - Thursday, March 1, 2007 - link
Idiocy incarnate.
Regs - Thursday, March 1, 2007 - link
AMD, like Intel, starts numerous projects. Just not all of them make it to the finish line; actually, a lot of them don't even reach the end of the planning phase before being scrapped.
As for Intel and their large caches... well, I'd say it's amazing how half their die (if not more) is used for cache and they still had enough space for all the core logic that's kicking the crap out of the K8 now.
Common sense!
erwos - Thursday, March 1, 2007 - link
Looks like some good improvements coming down the pipe. The cache size issue makes me nervous, though - 512KB per core is starting to look a little antiquated, and there's no information about the bandwidth to the L3 cache (which, presumably, is slower than the L2).
SmokeRngs - Thursday, March 1, 2007 - link
In the past, AMD did not need the large cache sizes that Intel did for their processors. This was very obvious with the Netburst architecture. However, while Core2 is much better than Netburst, there are still disadvantages for Intel.
I'll explain a little background as far as I understand it. In the K7 and Netburst days, Intel had to have the cache to make up for their long pipeline. Branch mispredictions are going to happen, and the penalty on the long Netburst pipeline hurt its IPC badly. The K7's shorter pipeline did not carry the same misprediction penalty. With K8, the on-die memory controller also reduced the need for large L2 caches thanks to the lower latency when accessing main memory. This has been one of the major performance advantages of the K8 architecture.
The Core2 architecture obviously does not have an on-die memory controller, so the need for larger caches is still present, and Intel sees improvement from the larger caches. Barcelona still has the on-die memory controller, and that efficiency is still there and still reduces the need for large caches. This is just a difference between architectures. While a larger cache did improve K8 performance in some usage scenarios, it wasn't on the same scale as the improvements Intel gets from a larger cache.
AMD can't compete with Intel on cache size. However, other architectural differences make up for the lack of a large cache. Barcelona having a smaller cache does not seem to be a big problem; if it were, AMD probably would have gone with a larger cache to get the extra performance. Bigger does not always mean better, or at least not enough better to warrant the extra.
A smaller cache also means fewer transistors, which should mean better yields, lower power consumption, and cheaper chips to produce.