Qualcomm's New Snapdragon S4: MSM8960 & Krait Architecture Explored
by Brian Klug & Anand Lal Shimpi on October 7, 2011 12:35 PM EST
Posted in: Smartphones, Snapdragon, Arm, Qualcomm, Krait, MDP, Mobile, SoCs
Let's recap the current smartphone/tablet SoC landscape. Everything shipping today is built on a 4x-nm process at GlobalFoundries, Samsung, TSMC or UMC. Next year we'll see a move to 28nm (bringing better performance and power characteristics), but between now and the end of 2012 there will be a myriad of designs available on the market.
The table below encapsulates much of what you can expect over the next 12+ months:
2011/2012 SoC Comparison
SoC | Process Node | CPU | GPU | Memory Bus | Release
Apple A5 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1GHz | PowerVR SGX 543MP2 | 2 x 32-bit LPDDR2 | Now
NVIDIA Tegra 2 | 40nm | 2 x ARM Cortex A9 @ 1GHz | GeForce | 1 x 32-bit LPDDR2 | Now
NVIDIA Tegra 3/Kal-El | 40nm | 4 x ARM Cortex A9 w/ MPE @ ~1.3GHz | GeForce++ | 1 x 32-bit LPDDR2 | Q4 2011
Samsung Exynos 4210 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.2GHz | ARM Mali-400 MP4 | 2 x 32-bit LPDDR2 | Now
Samsung Exynos 4212 | 32nm | 2 x ARM Cortex A9 w/ MPE @ 1.5GHz | ARM Mali-400 MP4 | 2 x 32-bit LPDDR2 | 2012
ST-Ericsson NovaThor LP9600 (Nova A9600) | 28nm | 2 x ARM Cortex-A15 @ 2.5GHz | IMG PowerVR Series 6 (Rogue) | Dual Memory | 2013
ST-Ericsson NovaThor L9540 (Nova A9540) | 32nm | 2 x ARM Cortex A9 @ 1.85GHz | IMG PowerVR Series 5 | 2 x 32-bit LPDDR2 | 2H 2012
ST-Ericsson NovaThor U9500 (Nova A9500) | 45nm | 2 x ARM Cortex A9 @ 1.2GHz | ARM Mali-400 MP1 | 1 x 32-bit LPDDR2 | Now
ST-Ericsson NovaThor U8500 | 45nm | 2 x ARM Cortex A9 @ 1.0GHz | ARM Mali-400 MP1 | 1 x 32-bit LPDDR2 | Now
TI OMAP 4430 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.2GHz | PowerVR SGX 540 | 2 x 32-bit LPDDR2 | Now
TI OMAP 4460 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.5GHz | PowerVR SGX 540 | 2 x 32-bit LPDDR2 | Q4 11 - 1H 12
TI OMAP 4470 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.8GHz | PowerVR SGX 544 | 2 x 32-bit LPDDR2 | 1H 2012
TI OMAP 5 | 28nm | 2 x ARM Cortex A15 @ 2GHz | PowerVR SGX 544MPx | 2 x 32-bit LPDDR2 | 2H 2012
Qualcomm MSM8x60 | 45nm | 2 x Scorpion @ 1.5GHz | Adreno 220 | 1 x 32-bit LPDDR2* | Now
Qualcomm MSM8960 | 28nm | 2 x Krait @ 1.5GHz | Adreno 225 | 2 x 32-bit LPDDR2 | 1H 2012
The key is this: other than TI's OMAP 5 in the second half of 2012 and Qualcomm's Krait, no one else has announced plans to release a new microarchitecture in the near term. Furthermore, if we only look at the first half of next year, Qualcomm is the only company that's focused on significantly improving per-core performance through a new architecture. Everyone else is either scaling up in core count (NVIDIA) or clock speeds. As we've seen in the PC industry however, generational performance gaps are hard to overcome - even with more cores or frequency.
Qualcomm has an ARM architecture license enabling it to build its own custom microarchitectures that implement the ARM instruction set. This is similar to how AMD has an x86 license but designs its own chips rather than just producing clones of Intel processors. Qualcomm remains the only active player in the smartphone/tablet space that uses its architecture license to put out custom designs. The benefit of a custom design is typically better power and performance characteristics compared to the more easily synthesizable designs you get directly from ARM. The downside is that development time and costs go up tremendously.
Scorpion was Qualcomm's first Snapdragon CPU architecture. At a high level, it looked very much like an optimized ARM Cortex A8 design although the two had nothing in common outside of instruction set. Scorpion was a dual-issue, in-order architecture that eventually scaled to dual-core and 1.5GHz variants.
Scorpion was pretty much the CPU architecture of choice in the 2009 - 2010 timeframe. Throughout 2011 however, Qualcomm has been very quiet as dual Cortex A9 designs from NVIDIA, Samsung and TI have surpassed it in terms of performance.
Going into 2012, Qualcomm is set for a return to glory as it will be the first to deliver a brand new microprocessor architecture and the first to ship 28nm SoCs in volume. Qualcomm's next-generation SoCs will also be the first to integrate an LTE modem on-die, which should enable LTE on nearly all high-end devices at much better power levels than current multi-chip 4x-nm solutions. Today we're able to talk a bit about the architecture details and performance expectations of Qualcomm's next-generation SoC due out in the first half of 2012.
Krait Architecture
The Krait processor is the heart of Qualcomm's second generation Snapdragon and it's the core of all Snapdragon S4 SoCs. Krait takes the aging base of Scorpion and gives it a much needed dose of adrenaline.
Krait's front end is significantly wider. The architecture can fetch and decode three instructions per clock, and each decoder is equally capable of decoding any ARMv7-A instruction. The wider front end is a significant improvement over the 2-wide Scorpion core. It alone will be responsible for a tangible increase in IPC.
Architecture Comparison
Feature | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait
Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide
Pipeline Depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages
Out of Order Execution | N | N | Y | Partial | Y
FPU | VFP11 (pipelined) | VFPv3 (not-pipelined) | Optional VFPv3-D16 (pipelined) | VFPv3 (pipelined) | VFPv3 (pipelined)
NEON | N/A | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide)
Process Technology | 90nm | 65nm/45nm | 40nm | 40nm | 28nm
Typical Clock Speeds | 412MHz | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz
The execution back-end receives a similar expansion. Whereas the original Scorpion core only had three ports to its execution units, Krait increases that to seven. Krait can issue up to four instructions in parallel. The additional execution ports simply help prevent any artificial constraints on ILP. This is another area where Krait will be able to see significant IPC gains.
Krait's fetch and decode stages are obviously in-order, but the back-end is entirely out-of-order. Qualcomm claims that any instruction can be executed out of order, assuming that doing so doesn't create any new hazards. Instructions are retired in order.
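To make the in-order versus out-of-order distinction concrete, here is a toy issue model in Python. It is only a sketch under heavy assumptions, not a description of Krait's actual scheduler: the five-instruction program, its latencies, the unlimited scheduling window, the absence of port constraints and the implied register renaming (every destination register is written once) are all made up for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written (each one is written only once here)
    srcs: Tuple[str, ...]   # registers read
    latency: int            # cycles until the result is available (hypothetical)


def cycles_to_finish(program, width, out_of_order):
    """Return the cycle at which the last result becomes available."""
    # Registers produced inside the program are unavailable until issued;
    # any other register is treated as a live-in that is ready at cycle 0.
    ready_at = {ins.dest: float("inf") for ins in program}
    issued = [False] * len(program)
    cycle = 0
    finish = 0
    while not all(issued):
        slots = width
        for i, ins in enumerate(program):
            if slots == 0:
                break
            if issued[i]:
                continue
            if all(ready_at.get(src, 0) <= cycle for src in ins.srcs):
                issued[i] = True
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                slots -= 1
            elif not out_of_order:
                # In-order issue: a stalled instruction blocks everything behind it.
                break
        cycle += 1
    return finish


# Hypothetical program: a 3-cycle load followed by dependent and independent work.
PROGRAM = [
    Instr("r1", ("r0",), 3),       # load r1 <- [r0]
    Instr("r2", ("r1", "r1"), 1),  # add, waits on the load
    Instr("r3", ("r4", "r5"), 1),  # add, independent of the load
    Instr("r6", ("r3", "r4"), 1),  # add, depends on the previous add
    Instr("r7", ("r6", "r2"), 2),  # mul, joins both chains
]

in_order = cycles_to_finish(PROGRAM, width=2, out_of_order=False)
ooo = cycles_to_finish(PROGRAM, width=2, out_of_order=True)
print(f"in-order: {in_order} cycles, out-of-order: {ooo} cycles")
```

With the same issue width in both runs, the out-of-order pass finishes sooner because the independent adds slip past the stalled load consumer. The model ignores in-order retirement, which affects when results become architecturally visible but not when they are computed.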
Qualcomm lengthened Krait's integer pipeline slightly from 10 stages in Scorpion to 11 stages in Krait. Load/store operations tack on another two cycles and instructions that go through the Neon/VFP path further lengthen the pipe. ARM's Cortex A15 design by comparison features a 15-stage integer pipeline. Qualcomm's design does contain more custom logic than ARM's stock A15, which has typically given it a clock speed advantage. The A15's deeper pipeline should give it a clock speed advantage as well. Whether the two effectively cancel each other out remains to be seen.
Qualcomm Architecture Comparison
Feature | Scorpion | Krait
Pipeline Depth | 10 stages | 11 stages
Decode | 2-wide | 3-wide
Issue Width | 3-wide? | 4-wide
Execution Ports | 3 | 7
L2 Cache (dual-core) | 512KB | 1MB
Core Configurations | 1, 2 | 1, 2, 4
Krait has been upgraded to support the new virtualization instructions added in Cortex A15. Also like the A15, Krait enables LPAE for 40-bit memory addressing.
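For a sense of scale, the arithmetic behind 40-bit addressing can be sketched in a few lines of Python. The helper name is ours and purely illustrative; the only inputs are the 32-bit and 40-bit address widths. Keep in mind that LPAE widens physical addresses, while virtual addresses on ARMv7-A remain 32 bits wide.

```python
GIB = 1 << 30  # bytes in one gibibyte


def addressable_bytes(address_bits):
    """Bytes reachable with a byte-granular address of the given width."""
    return 1 << address_bits


classic = addressable_bytes(32)  # plain 32-bit physical addressing
lpae = addressable_bytes(40)     # 40-bit physical addressing with LPAE

print(f"32-bit: {classic // GIB} GiB")
print(f"40-bit: {lpae // GIB} GiB ({lpae // classic}x larger)")
```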
At a high level Qualcomm has built a 3-wide, out-of-order engine that feels very much like a modern version of Intel's old P6. Whereas designs from the A8 generation looked like modern Pentiums, Krait takes us into the era of the Pentium II.
Note that courtesy of the wider front-end and OoO execution engine, Krait should be a higher performance architecture than Intel's Atom. That's right, you'll be able to get better performance than some of the very first Centrino notebooks in your smartphones come 2012.
Performance Expectations
Performance of ARM cores has always been characterized by DMIPS (Dhrystone Millions of Instructions per Second). An extremely old integer benchmark, Dhrystone was popular in the PC market when I was growing up but was abandoned long ago in favor of more representative benchmarks. You can get a general idea of performance improvements across similar architectures assuming there are no funny compiler tricks at play. The comparison of single-core DMIPS/MHz is below:
ARM DMIPS/MHz
Metric | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait
DMIPS/MHz | 1.25 | 2.0 | 2.5 | 2.1 | 3.3
At 3.3 DMIPS/MHz, Krait should be around 30% faster than a Cortex A9 running at the same frequency. At launch Krait will also be clocked 25% higher than most A9s on the market today (1.5GHz vs. 1.2GHz), a gap that will only grow as Qualcomm introduces subsequent versions of the core. It's not unreasonable to expect a 30 - 50% gain in performance over existing smartphone designs. ARM hasn't published DMIPS/MHz numbers for the Cortex A15, although rumors place its performance around 3.5 DMIPS/MHz.
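The Dhrystone arithmetic is easy to reproduce. The short Python sketch below takes the DMIPS/MHz figures from the table above and the typical clock speeds from the architecture comparison table; the function names are ours. Treat the output as Dhrystone-only math rather than a performance prediction, since real workloads also depend on caches, memory and software.

```python
# DMIPS/MHz figures from the table above.
DMIPS_PER_MHZ = {
    "ARM11": 1.25,
    "Cortex A8": 2.0,
    "Cortex A9": 2.5,
    "Scorpion": 2.1,
    "Krait": 3.3,
}


def dmips(core, mhz):
    """Single-core Dhrystone throughput at a given clock speed."""
    return DMIPS_PER_MHZ[core] * mhz


def relative_gain(core_a, mhz_a, core_b, mhz_b):
    """Fractional Dhrystone advantage of core A over core B."""
    return dmips(core_a, mhz_a) / dmips(core_b, mhz_b) - 1.0


# Same clock: isolates the per-clock (IPC-like) difference.
same_clock = relative_gain("Krait", 1500, "Cortex A9", 1500)
# Typical clocks: 1.5GHz Krait against a 1.2GHz Cortex A9.
at_launch = relative_gain("Krait", 1500, "Cortex A9", 1200)

print(f"same clock: {same_clock:.0%}, at typical clocks: {at_launch:.0%}")
```

On paper the per-clock and clock speed advantages multiply, which is why the Dhrystone figure at typical clocks comes out higher than the 30 - 50% we expect in practice.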
Updated VeNum Unit
ARM's NEON instruction set is handled by a dedicated unit in all of its designs. Krait is no different. Qualcomm calls its NEON engine VeNum and has increased its issue capabilities by 50%. Whereas Scorpion could only issue two NEON instructions in parallel, Krait can do three.
Qualcomm's NEON data paths are still 128-bits wide.
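As a rough illustration of what a 128-bit data path and the wider issue mean, the Python sketch below counts how many elements of each size fit in one vector and computes the issue increase. The peak figure at the end is an idealized upper bound of our own: it assumes every issued instruction is a full 128-bit operation with single-cycle throughput, which Qualcomm has not claimed.

```python
VECTOR_BITS = 128  # NEON data path width on both Scorpion and Krait


def lanes(element_bits, vector_bits=VECTOR_BITS):
    """Number of elements of a given size that fit in one vector."""
    if vector_bits % element_bits:
        raise ValueError("element size must divide the vector width")
    return vector_bits // element_bits


def peak_lane_ops_per_cycle(issue_width, element_bits):
    """Idealized upper bound (our assumption): every issue slot does a full-width op."""
    return issue_width * lanes(element_bits)


lane_counts = {bits: lanes(bits) for bits in (8, 16, 32, 64)}
issue_gain = 3 / 2 - 1  # Krait issues three NEON instructions versus Scorpion's two

print(lane_counts)
print(f"issue increase: {issue_gain:.0%}")
print("32-bit lanes per cycle, idealized:",
      peak_lane_ops_per_cycle(2, 32), "->", peak_lane_ops_per_cycle(3, 32))
```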
Update: Qualcomm has published its whitepaper on the Snapdragon S4.
108 Comments
metafor - Friday, October 7, 2011
The SGX540 in the OMAP4460 is clocked significantly higher than (something TI is great at). So, while it isn't a powerhouse compared to Exynos or A5, it's more than sufficient. Google has never traditionally been top-of-the-line in terms of processors with their Nexus series. So it isn't out of the question they wouldn't be this time around either.
dagamer34 - Saturday, October 8, 2011
Technically, the Hummingbird SoC used in the Nexus S was top of the line at the time it was released (although it was eclipsed by dual-core phones after 1-2 months). Also the Nexus One was the first major phone I can remember that had 512MB of RAM.
However, it should be said that Google doesn't think about getting the best SoCs available for their product, but instead seeks to get the best deal on its components from bidding against vendors. It's very possible that Samsung knows they had the most powerful SoC outside of the A5 and wanted Google to pay it for the value it would provide. Google instead apparently went with TI, likely because TI is selling its chips cheaper in order to be the reference platform for Ice Cream Sandwich.
DanD85 - Friday, October 7, 2011
Funny I saw Adreno 225 having 8 SIMDS and 5 MADS per SIMDS that should be equal to 40 total MADS right? Why it's 80? Am I missing sth?
cptcolo - Friday, October 7, 2011
Looking at the chart on "The Adreno 225 GPU" and comparing to the frame rates in the iPad 2 review (http://www.anandtech.com/show/4216/apple-ipad-2-gp...), it looks like the PowerVR SGX543MP2 in the iPhone 4S will be about 33% faster. This is a very approximate estimate.
cptcolo - Friday, October 7, 2011
"Qualcomm claims that MSM8960 will be able to outperform Apple's A5 in GLBenchmark 2.x at qHD resolutions. We'll have to wait until we have shipping devices in hand to really put that claim to the test, but if true it's good news for Krait as the A5 continues to be the high end benchmark for mobile GPU performance."
ssvb - Friday, October 7, 2011
Please correct your CPU features comparison table (and in the previous articles too). ARM11 has a *pipelined* VFP, which actually makes it a lot faster than Cortex-A8 for double precision floating point workloads. You can have a look at the instruction cycle timings to get a better idea:
ARM11 VFP - http://infocenter.arm.com/help/topic/com.arm.doc.d...
Cortex-A8 VFP - http://infocenter.arm.com/help/topic/com.arm.doc.d...
Thanks.
Anand Lal Shimpi - Friday, October 7, 2011
Thank you! Fixed :)
Take care,
Anand
icrf - Friday, October 7, 2011
An ARM Cortex A9 has an 8 stage pipeline, not 9: http://www.arm.com/files/pdf/ARMCortexA-9Processor...
Anand Lal Shimpi - Friday, October 7, 2011
Thanks :)
Take care,
Anand
Blaster1618 - Friday, October 7, 2011
Maybe a noob, but I did not know that L0 memory could operate at GHz clock rate (5-10 times that of SD RAM clock rate). Good stuff, keep it coming. B-)