ARM Announces Next Generation 64-Bit Cortex-A72 CPU Design, Mali-T880 GPU

ARM Holdings plc (LON:ARM) has established itself as the ubiquitous king of the smartphone industry.  From Qualcomm, Inc. (QCOM) to NVIDIA Corp. (NVDA), most mobile chipmakers have embraced ARM’s intellectual property core designs, adding value by tweaking them, adding coprocessors, and adding proprietary GPUs to produce a finished SoC.

Even Apple, Inc. (AAPL), which is eschewing the Cortex-A53/A57 IP cores in the 64-bit era for its own core design (Cyclone), uses ARM’s instruction set.  In short, it’s hard to escape ARM in the mobile space (sorry, Intel Corp. (INTC)!).

I. Second Generation 64-bit IP Core Design Unveiled

In its never-ending quest to improve mobile performance, ARM on Tuesday announced a new reference spec — the Cortex-A72 IP core design.

Like the Cortex-A57, the Cortex-A72 is aimed at chips with up to eight cores: four higher-power cores paired with four lower-power Cortex-A53 cores.  And just like the previous generation model, it uses the ARMv8-A instruction set.  But from there the specs diverge.

The Cortex-A53/A57 big.LITTLE octacore and hexacore designs were intended to work best with 20 and 28 nm planar transistors.  The open spec allowed Taiwan Semiconductor Manufacturing Comp., Ltd. (TPE:2330) (TSMC) and smaller fabs to compete for chipmaking contracts.

The new Cortex-A72 follows a similar approach but is designed to be built on a 16 nm FinFET (FET = Field Effect Transistor) process.  Basically, the idea behind the FinFET is to extend a layer of doped silicon in a fin up into the gate.  When a voltage is applied to the gate, it acts on this fin in three dimensions, creating a three-dimensional set of conducting channels in the fin.  In effect, this technique extends the channel vertically in order to reduce current leakage at smaller feature sizes.  Less leakage means lower power consumption and higher clock speeds.

TSMC refers to its FinFET technology as 16FF+, which stands for 16 nm FinFET Plus.  TSMC currently supports test runs on this node, so it will be the first third party fab to support the new core.  Other fabs (e.g. Samsung Electronics Comp., Ltd. (KRX:005930) (KRX:005935)) should soon have support for 16FF+ online, as well.  The target process is a PoP (package on package) design, which bundles different cores or memory units together into a system-on-a-chip (SoC) by affixing upper chips to a lower baseplane chip via ball grid arrays (BGAs).

The senior director of TSMC’s Design Infrastructure Marketing Division, Suk Lee, comments:
TSMC’s 16FinFET+ process is already delivering exceptional results with SoCs based on Cortex-A57 thanks to rapid progress in yield and performance.  The combination of TSMC 16FF+ process technology and the implementation advantages of ARM POP IP gives our customers the opportunity to rapidly bring highly optimized mobile SoCs based on Cortex-A72 to market in early 2016.
Like its predecessors, the Cortex-A57 and Cortex-A15, the Cortex-A72 is intended to be clocked up to 2.5 GHz (although some, like Qualcomm, may push it to 3.0 GHz and beyond).  While clock speed is in a relative holding pattern, improvements to the out-of-order superscalar pipeline and the die shrink are expected to deliver substantial improvements in processing power.  ARM claims that the Cortex-A72 will be ~85 percent faster than the Cortex-A57 and roughly 3.5x as fast as the Cortex-A15 (2010).

ARM also claims 75 percent lower energy consumption than a 28 nm Cortex-A15 for premium workloads.  ARM’s latest big.LITTLE implementation will also debut next year and should cut power consumption for mixed workloads by an additional 40-60 percent.  Combined with the previously mentioned process gains, that would bring total power consumption down to 10-15 percent of 28 nm Cortex-A15 levels.
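
The 10-15 percent figure can be checked with a quick sketch.  It assumes the two savings simply multiply (the 75 percent cut first, then the additional 40-60 percent applied to what remains); the function name is illustrative and does not come from ARM.

```python
def remaining_power_fraction(first_saving, second_saving):
    """Fraction of baseline power left after two savings applied in sequence."""
    return (1.0 - first_saving) * (1.0 - second_saving)

# 75 percent saving versus a 28 nm Cortex-A15, then a further 40-60 percent
low = remaining_power_fraction(0.75, 0.60)
high = remaining_power_fraction(0.75, 0.40)
print(f"{low:.0%} to {high:.0%} of 28 nm Cortex-A15 power")
```

If the savings did not compound this way (for example, if they applied to different workloads), the combined figure would differ.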

Cortex-A72 will be ARM’s latest addition to its ubiquitous mobile core offerings [Image Source: BBC News]
ARM expects Qualcomm and other IP core licensees to have the new 64-bit core integrated and available in products sometime next year.  ARM says that in total there are “more than 10” licensees for the new design.

II. Fourth Generation Midgard — What We Know

In addition to the fresh 64-bit CPU IP core, ARM also announced a new Mali GPU IP core.  The new GPU is dubbed the Mali-T880.  ARM claims it will be 1.8x faster (it bumps up to 850 MHz in stock configuration, versus 650 MHz in the Mali-T760) and 40 percent more power efficient than the Mali-T760.

The Mali-T880 is the fourth-generation design in the Midgard GPU family.

Currently the biggest clients of ARM’s Mali GPUs are Samsung’s Exynos chips and assorted SoCs from Taiwanese chipmaker MediaTek Inc. (TPE:2454).  Samsung recently forsook Qualcomm’s Snapdragon 810, according to numerous sources.  If that decision holds up, the new Mali GPUs’ highest profile devices will likely be the Galaxy S Series and Galaxy Note Series flagship smartphones from Samsung.  (Qualcomm does not use Mali and instead uses its in-house Adreno; likewise NVIDIA uses its mobile GeForce derivatives.)

Mali-T880 packs up to 16 “shader cores” (SCs), each with three arithmetic pipelines.  Like its predecessor, each SC packs 10 ALUs.  Each SC also has instruction dispatching units, a memory fetcher, and a texture unit.  Looking at the SCs themselves, the Mali-T760 boasted 13 theoretical GFLOPS [source] of processing per core at 650 MHz (22.1 GFLOPS per core based on ARM’s methodology), so based on ARM’s statements (1.8x performance), we can expect around 23.4 theoretical GFLOPS per SC this time around.

In other words we can expect:

  • Mali-T880MP: 23.4 GFLOPS (1 shader core)
  • Mali-T880MP2: 46.8 GFLOPS (2 shader cores)
  • Mali-T880MP4: 93.6 GFLOPS (4 shader cores)
  • Mali-T880MP6: 140.4 GFLOPS (6 shader cores)
  • Mali-T880MP16: 374.4 GFLOPS (16 shader cores)
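
The list above follows from simple multiplication, sketched here.  It assumes peak compute scales linearly with shader-core count and takes the article's 13 GFLOPS per Mali-T760 core and ARM's claimed 1.8x gain at face value; the constant and function names are illustrative.

```python
T760_GFLOPS_PER_CORE = 13.0  # article's figure for the Mali-T760 at 650 MHz
CLAIMED_SPEEDUP = 1.8        # ARM's claimed gain for the Mali-T880

def t880_peak_gflops(shader_cores):
    """Estimated peak GFLOPS, assuming linear scaling with shader-core count."""
    return shader_cores * T760_GFLOPS_PER_CORE * CLAIMED_SPEEDUP

for cores in (1, 2, 4, 6, 16):
    print(f"{cores:2d} shader cores: {t880_peak_gflops(cores):.1f} GFLOPS")
```

These are theoretical peaks; real-world throughput would depend on memory bandwidth and workload.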

Recall that NVIDIA’s recently announced Tegra X1 packs 256 CUDA cores and boasts 512 GFLOPS of compute performance for FP32 (32-bit single-precision floating point).  But be aware, that’s at a much higher core clock.  At a comparable core clock (which would yield somewhat similar power usage), the Tegra X1 is sitting pretty at around 435 GFLOPS of peak compute.  Dividing that by 23.4 GFLOPS per shader core gives roughly 18.6, meaning it would take about 18.6 Mali-T880 shader cores to match the Tegra X1.  Put the other way around, each Mali-T880 shader core has the compute potential of roughly 13.8 CUDA cores (256 / 18.6) at the same 850 MHz clock, or about 11.7 CUDA cores at the Tegra X1’s 1 GHz (23.4 GFLOPS against 2 GFLOPS per CUDA core).

So, roughly speaking:
1 Mali-T880MP (@ 850 MHz) ⇔ (roughly) 12 CUDA cores (@ 1 GHz)
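
The clock-normalized comparison can be worked through in a few lines.  This is a rough sketch that assumes peak compute scales linearly with clock and uses only figures quoted in the article (512 GFLOPS and 256 CUDA cores at roughly 1 GHz, and an estimated 23.4 GFLOPS per Mali-T880 shader core at 850 MHz); the variable names are illustrative.

```python
X1_CUDA_CORES = 256
X1_PEAK_GFLOPS = 512.0     # FP32 peak at the Tegra X1's roughly 1 GHz clock
X1_CLOCK_MHZ = 1000.0
T880_CLOCK_MHZ = 850.0
T880_GFLOPS_PER_SC = 23.4  # per-shader-core estimate from the article

def scale_to_clock(gflops, from_mhz, to_mhz):
    """Rescale peak compute, assuming it is proportional to clock speed."""
    return gflops * to_mhz / from_mhz

# Tegra X1 peak compute if it ran at the Mali-T880's 850 MHz
x1_at_850 = scale_to_clock(X1_PEAK_GFLOPS, X1_CLOCK_MHZ, T880_CLOCK_MHZ)

# Mali-T880 shader cores needed to match the whole Tegra X1 at that clock
sc_to_match_x1 = x1_at_850 / T880_GFLOPS_PER_SC

# CUDA cores equivalent to one shader core, at 850 MHz and at 1 GHz
cuda_per_sc_same_clock = T880_GFLOPS_PER_SC / (x1_at_850 / X1_CUDA_CORES)
cuda_per_sc_stock_clock = T880_GFLOPS_PER_SC / (X1_PEAK_GFLOPS / X1_CUDA_CORES)

print(f"{sc_to_match_x1:.1f} shader cores match the Tegra X1 at 850 MHz")
print(f"{cuda_per_sc_same_clock:.1f} CUDA cores per shader core at 850 MHz")
print(f"{cuda_per_sc_stock_clock:.1f} CUDA cores per shader core at 1 GHz")
```

With these inputs, about 18.6 shader cores are needed to match the Tegra X1 at 850 MHz, and one shader core corresponds to roughly 13.8 CUDA cores at 850 MHz or 11.7 at 1 GHz.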

According to NVIDIA’s white paper [PDF], the Tegra X1 has 64 KB x 4 (256 KB total) of L1 cache; 512 KB of register space; and 2 MB of L2 cache to split amongst its cores.  Each quartet of Mali-T880MP cores gets 256-512 KB of L2 cache.  Thus the total L2 cache in a fully loaded sixteen-SC configuration (four quartets) will range from 1,024 KB (1 MB) up to 2,048 KB (2 MB), at best matching the Tegra X1’s.  So the Mali-T880 seems to have a lot of compute on its hands, but may have trouble feeding its cores for lack of L2 cache (compared to GeForce).
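
The L2 totals follow from the per-quartet figures, as in this small sketch.  It assumes one L2 slice per group of four shader cores, with a partial group still receiving a full slice; that rounding rule and the function name are assumptions for illustration, not something ARM has stated.

```python
def total_l2_kb(shader_cores, kb_per_quartet):
    """Total L2 in KB, assuming one L2 slice per group of four shader cores."""
    quartets = -(-shader_cores // 4)  # ceiling division
    return quartets * kb_per_quartet

# Sixteen shader cores at the low and high ends of the 256-512 KB range
print(total_l2_kb(16, 256), "KB to", total_l2_kb(16, 512), "KB")
```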

The GPU layout of the Tegra X1.
There are some commonalities between Mali-T880MP16 and the GeForce GPU found in the Tegra X1.  Both appear to have 16 texture units.  It’s unclear how the texturing logic is laid out in the current Adreno pipeline.

A key challenge for ARM’s Mali-T880 will be matching the Tegra X1’s impressive texture draw rate.  After drawing criticism (whoops, pun) in the Tegra K1 for a relatively poor rate of texturing, NVIDIA has greatly improved things, doubling its texturing rate with the Tegra X1 while being conservative in its computing power estimate.

To explain, NVIDIA was criticized in the Tegra K1 era for claiming a lot of GFLOPS of peak compute power (365, to be precise), but then seemingly being unable to live up to its hype in the real world with a 7.6 GTexel/s texturing rate.  The Tegra K1 had a claimed peak GFLOP/GTexel ratio of roughly 48.  Higher numbers in this case are bad, as they indicate that texturing isn’t keeping up with theoretical compute.

Sony Corp.’s (TYO:6758) PlayStation 3 RSX GPU, for example, only claimed 192 peak GFLOPS, but could fill 12 GTexel/s [source].  So the PS3 had a three-times lower ratio of 16 peak claimed GFLOP/GTexel, which suggested its claims were more realistic.

The Tegra X1’s ratio is 32 peak claimed GFLOP/GTexel, midway between the Tegra K1 and PS3.  That’s quite an impressive improvement.  The Mali-T760 also did well in this regard, with a ratio of 33 peak claimed GFLOP/GTexel [source].  The big question is whether the Mali-T880 can at least hold this number flat, bumping its texture draw rates enough to keep up with its compute capability claims (as NVIDIA has).
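
The GFLOP/GTexel ratio is just peak claimed compute divided by texture fill rate, as this sketch shows.  It covers only the two chips for which the article quotes both inputs; the function and dictionary names are illustrative.

```python
def gflop_per_gtexel(peak_gflops, gtexels_per_second):
    """Ratio of claimed peak compute to texture fill rate (lower is better)."""
    return peak_gflops / gtexels_per_second

# (peak GFLOPS, GTexel/s) as quoted in the article
chips = {
    "Tegra K1": (365.0, 7.6),
    "PlayStation 3 RSX": (192.0, 12.0),
}

for name, (gflops, gtexels) in chips.items():
    print(f"{name}: {gflop_per_gtexel(gflops, gtexels):.0f} GFLOP/GTexel")
```

The Tegra X1 and Mali-T760 are left out because the article gives their ratios without the underlying fill rates.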

Turning to Adreno, the Adreno 430 onboard the Qualcomm Snapdragon 810 reportedly packs 288 shader cores (ALUs, technically speaking), which max out at 600 MHz for a compute power of 388.8 GFLOPS.  That works out to 1.35 GFLOPS per ALU, so at an estimated 23.4 GFLOPS per shader core, 1 Mali-T880 SC is the equivalent of roughly 17.3 Adreno 430 ALUs.

The Qualcomm Adreno 430 (onboard the Snapdragon 810) will compete with the Mali-T880.
So, to recap:
1 Mali-T880 SC (@ 850 MHz) ⇔ 17 Adreno 430 shaders (@ 600 MHz) ⇔ 12 CUDA cores (@ 1 GHz)

The Adreno 420 had 128 KB of L1 cache and 2 MB of L2 cache.  Qualcomm has not yet announced the L1 cache, L2 cache, or register allotment for the Adreno 430, but it is presumably staying at 2 MB for the L2 cache (although an increase to 4 MB remains possible).

The Adreno 330 (found in the Snapdragon 800) at 450 MHz has a ratio of roughly 39 GFLOP/GTexel (with a real-world performance of 3.3 GTexel/s and 129.8 peak GFLOPS).  The Adreno 430 scores ~8.5 GTexel/s in GFXBench.  So assuming that score was achieved at the peak clock speed, that works out to a ratio of around 46, which is not very good.

Overall ARM’s “triple pipe” shader core design looks fairly competitive with NVIDIA’s GeForce derivatives and Adreno, although it may be a little short on cache.  The GPU also improves standards support, slightly, supporting more DirectX 11.1 features and supporting OpenGL ES 3.1.

III. The “IP Suite”

Turning finally to ARM’s “other” products, ARM also announced a display coprocessor, the Mali-DP550.  This coprocessor claims native 4K resolution display support.  It supports outputting up to 12 bits of color information per pixel.

Are 4K displays overkill for smartphones?  Perhaps, but in the phablet and tablet space (which the new platform also targets), they may be a more reasonable offering.  The display chip also supports compositing up to seven layers and supports 3D displays.

The Mali-DP550 is paired with the Mali-V550, a multicore imaging coprocessor that targets 4K video capture and low-power full HD (FHD) (aka 1080p) video encoding.  The Mali-V550 offers native support for 10-bit YUV color from the sensor.

Using one core, it can encode 1080p video at 60 fps.  Using all eight cores (a more power-intensive mode), it can encode 4K video at up to 120 fps.
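
The eight-core figure is consistent with simple pixel-rate scaling, as the sketch below suggests.  It assumes "4K" means 3840x2160 and that encode throughput scales linearly with core count; both are assumptions for illustration, and the names are hypothetical.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given resolution and frame rate."""
    return width * height * fps

one_core = pixel_rate(1920, 1080, 60)   # 1080p at 60 fps on a single core
uhd_120 = pixel_rate(3840, 2160, 120)   # 4K (assumed 3840x2160) at 120 fps

cores_needed = uhd_120 / one_core
print(f"4K at 120 fps needs {cores_needed:.0f}x the single-core pixel rate")
```

4K has four times the pixels of 1080p and 120 fps is twice 60 fps, which gives the eightfold factor.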

Finally, ARM introduced a new interconnect chip in the CoreLink family.  ARM calls this new chip the CoreLink CCI 500.

CCI stands for “cache coherent interconnect”, which refers to its role of handling communications between the CPU, GPU, imaging coprocessor, and display coprocessors.

The CCI supports a variety of new standards, including LPDDR4.  It also provides a pool of L3 cache that is shared between these units to accelerate performance.  ARM claims the CCI enhances memory performance in SoCs by 30 percent.  It offers this infographic to explain the role of the CoreLink CCI 500:

Together the ARM Cortex-A72/A53 CPU clusters, Mali-T880 GPU, Mali-V550 imaging coprocessor, Mali-DP550 display processor, and CoreLink CCI 500 interconnect chip form what ARM is billing the “premium experience IP suite”.  Essentially this is a ready-made SoC that will only need tiny tweaks and perhaps a couple of additions (e.g. a baseband coprocessor would be nice).

Among the OEMs looking to capitalize on this opportunity is MediaTek. MediaTek SVP Joe Chen commented in the press release:
The pace of innovation in mobile is accelerating at an unprecedented rate, which means we need to deliver the latest technology to our customers as fast as possible.  We are pleased to partner with ARM for the launch of Cortex-A72, bringing the ARMv8-A architecture to market with leading performance and energy-efficiency benchmarks. Ultimately it is all about providing a better experience for end users as the complexity of applications, content and devices increases.
Privately held Chinese fabless SoC maker Rockchip (Fuzhou Rockchip Electronics Co., Ltd.) also expressed interest in the suite.  Rockchip is a rising star and may soon pop up in products in the U.S., much as we are now seeing MediaTek chips enter the product stream.

For mobile device fans, 2016 will be an exciting year and it should be interesting to see how ARM’s latest and greatest IP core offerings perform in the real world.  We should get a taste of that performance courtesy of partner OEMs, including Samsung, MediaTek, and Rockchip, while others like NVIDIA and Qualcomm may embrace only the CPU IP core in their offerings.