IBM’s POWER9 Microprocessor: The Basis for Accelerated Systems

By Joe Clabby, Clabby Analytics

Over the past several years, all major server makers have focused on redesigning their server architectures to eliminate data-flow bottlenecks – building what have become known as “accelerated systems”. Likewise, the major developers of database software have found new ways to speed database processing. Together, these advances have made fast, efficient and affordable processing of very large databases (Big Data) a practical reality.

As far back as early 2013, Clabby Analytics started to report on the progress being made in accelerated system designs (our reports on VelociData, The Now Factory, IBM’s PureData system, IBM’s DB2 Analytics Accelerator and IBM Power Systems accelerated designs are available upon request). Accelerated system designs focus on speeding workload processing by improving system throughput. Improved throughput is achieved in a variety of ways, including increasing internal bus speed, reducing communications overhead, off-loading tasks to other types of processors, improving memory management, reducing I/O drag, tuning execution methods, creating new interfaces that streamline peripheral access to processing power, and more.

What intrigued us about the early designs was that several systems were making use of field programmable gate arrays (FPGAs) to accelerate data transfer (processing data at line speed); and some designs were also making use of graphics processing units (GPUs) to accelerate parallel processing. We also found that the makers of general-purpose CPUs (especially POWER and Intel processors) were more focused than ever before on processing large amounts of data (Big Data).

In 2014 we also started tracking database accelerators (our report on IBM’s DB2 BLU Acceleration vs. SAP’s Hana vs. Oracle’s Exadata is available upon request). What we observed was that new algorithms combined with new data processing methodologies were being used to accelerate database processing speed. Also in 2014, we noted that sales of in-memory database software from IBM, SAP, Oracle, Altibase, Exasol, Kognitio, McObject, ParStream, Quartet FS, VMware and VoltDB were on the rise – helping further accelerate the processing of large amounts of data (see this report for further details).

Today, in 2016, we’re starting to see one systems vendor pull ahead of its competitors when it comes to architecting accelerated systems. In a recent briefing, IBM shared with us its plans for further accelerating its current POWER8-based servers – and IBM also shared its future plans for its next generation POWER9 architecture.

Why Accelerated Systems?

Traditional server designs can be used to process Big Data, technical and cognitive workloads. But the bigger question is: “how efficiently?” Accelerated systems can operate exponentially faster than traditional designs – and this speed advantage results in the ability to process more and more data at a significantly lower cost.

Anecdote: When the computer system salesman told an IT executive that his new system could process data exponentially faster than the current system – the executive said “I don’t care about ‘exponentially faster’ claims”. The salesman, not missing a beat, then said “Okay – let me rephrase: How would you like to use significantly fewer systems and pay for significantly fewer software licenses in order to process your workloads?” The IT executive was suddenly more interested.

IBM POWER8 Accelerated Systems

In June 2014, Clabby Analytics took a close look at IBM’s POWER8 architecture when we compared it to Intel’s E5 v2 architecture (see this report). We found that both microprocessor/server environments had been designed to process Web, file and print, email, database, vertical-specific applications, high performance computing and cloud workloads.

We also found that POWER8 processors were more efficient than their E5 Xeon competitors (due to processing and bandwidth advantages, POWER8-based servers were able to deliver results more quickly). When we looked at IBM’s Power Systems designs, we found that POWER8-based servers were better suited for data-intensive environments from a performance and price/performance perspective (POWER8-based servers cost less than E5 v2-based competitors due to aggressive IBM pricing and numerous efficiency advantages).

Today, as we look more closely at POWER8-based system designs, we find that IBM has further accelerated its Power Systems by:

  • Using fast – and sometimes different – processors to handle specific workloads;
  • Enhancing memory access using crypto and memory expansion – and by introducing transactional memory;
  • Increasing on-chip cache sizes and memory buffers to make it possible to place data closer to central processing units (CPUs) and graphics processing units (GPUs), where that data can be processed more quickly;
  • Increasing internal bus speed to accelerate data flow within a server;
  • Streamlining input/output (I/O) access to processors using interfaces such as CAPI and RDMA (and, soon, NVLink) to reduce I/O overhead and simplify data movement between I/O devices and processors;
  • Streamlining networking subsystems to speed communications;
  • Using virtual addressing to allow accelerators to use the same memory that processors use in order to remove operating system and device overhead; and,
  • Introducing hardware managed cache coherence.

With the introduction of POWER8, IBM placed PCIe Gen 3 logic directly on the chip, then built an interface to this logic known as the Coherent Accelerator Processor Interface (CAPI). CAPI enables devices, flash storage and coprocessors to communicate directly, and at very high speed, with POWER8 processors. Examples of CAPI-based solutions include IBM’s Data Engine for NoSQL (which allows up to 40TB of flash storage to be used like extended memory); DRC Graphfind analytics; and Erasure Code Acceleration for Hadoop.

The reason this CAPI interface is so important is that it eliminates the need for a PCIe bridge, as well as the need to launch the thousands of operating system and driver instructions (perhaps as many as 22.5K instructions) that are run every time PCIe I/O resources are used. Instead, the logic for driving I/O resides on the chip, where the number of instructions is vastly reduced and the speed of interaction between the CPU and associated hardware devices is dramatically improved.
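To make the scale of that overhead concrete, here is a rough back-of-envelope model (not IBM’s methodology) of the CPU time consumed by the I/O software path, using the ~22.5K-instruction figure above. The CAPI-path instruction count and the per-core instruction rate are illustrative assumptions only:

```python
# Back-of-envelope model of per-I/O software overhead. The 22.5K
# figure comes from the text above; the CAPI-path count and the
# per-core instruction rate are assumed, for illustration only.

PCIE_PATH_INSTRUCTIONS = 22_500      # OS + driver work per PCIe I/O (from the text)
CAPI_PATH_INSTRUCTIONS = 500         # assumed: much shorter on-chip CAPI path
CORE_IPS = 4_000_000_000             # assumed: instructions retired per second per core

def overhead_core_seconds(instructions_per_io: int, ios_per_second: int) -> float:
    """Core-seconds per second spent on I/O software overhead."""
    return instructions_per_io * ios_per_second / CORE_IPS

IOS_PER_SECOND = 100_000             # assumed I/O rate
pcie = overhead_core_seconds(PCIE_PATH_INSTRUCTIONS, IOS_PER_SECOND)
capi = overhead_core_seconds(CAPI_PATH_INSTRUCTIONS, IOS_PER_SECOND)
print(f"PCIe path: {pcie:.4f} core-sec/sec; CAPI path: {capi:.4f} core-sec/sec")
```

Even under these generous assumptions, the conventional driver path consumes more than half a core at a modest 100K I/Os per second, which is why moving that logic on-chip pays off.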

Later this year, IBM will introduce a new interface known as NVLink to its POWER8 processors. Created by NVIDIA, NVLink tightly couples NVIDIA GPUs with POWER8, enabling POWER8 processors and NVIDIA GP100 GPUs to jointly process data. This new interface is 5X to 12X faster than PCIe Gen 3 connections – leading to even more rapid processing of data-intensive workloads.
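Simple arithmetic shows what that 5X-to-12X range means in practice. The ~16 GB/s PCIe Gen 3 x16 baseline and the 64 GB dataset size below are our own illustrative assumptions, not IBM or NVIDIA figures:

```python
# Rough transfer-time arithmetic for the 5X-12X NVLink speedup cited
# in the text. Baseline bandwidth and dataset size are assumptions.

PCIE_GEN3_X16_GBPS = 16.0   # assumed effective PCIe Gen 3 x16 bandwidth, GB/s
DATASET_GB = 64.0           # assumed working-set size moved to the GPU

t_pcie        = DATASET_GB / PCIE_GEN3_X16_GBPS         # baseline transfer time
t_nvlink_low  = DATASET_GB / (5 * PCIE_GEN3_X16_GBPS)   # at the 5X end of the range
t_nvlink_high = DATASET_GB / (12 * PCIE_GEN3_X16_GBPS)  # at the 12X end of the range

print(f"PCIe Gen 3: {t_pcie:.2f}s; NVLink: {t_nvlink_high:.2f}s-{t_nvlink_low:.2f}s")
```

For a workload that repeatedly shuttles large datasets between CPU and GPU memory, cutting each transfer from seconds to fractions of a second compounds quickly.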

Another method that IBM Power Systems use to accelerate data flow is remote direct memory access (RDMA), which enables data to be moved from the memory or storage of one system to the memory or storage of another – at line speed. The workloads that benefit most from RDMA are network-intensive applications that suffer from bandwidth- and latency-related data retrieval issues. These include:

  • Large scale simulations, rendering, large scale software compilation, streaming analytics and trading decisions – the kinds of applications found most often in massively parallel, high performance computing (HPC);
  • Hyper-appliance, hyperconverged and hyperscale environments where large volumes of data need to be moved between servers and associated storage; and,
  • Workloads where network latency slows database performance and interferes with virtual machine (VM) density.
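The reason RDMA helps these workloads is structural: it removes the intermediate buffer copies and kernel crossings of a conventional socket path, letting the network adapter move data directly between registered application memory on each system. The toy cost model below uses entirely assumed numbers; only the shape of the comparison matters:

```python
# Toy per-message cost model contrasting a conventional kernel socket
# path with an RDMA path. All cost constants are assumed, illustrative
# values; the structural point is that RDMA removes the per-message
# copies and kernel crossings that dominate for latency-sensitive work.

COPY_NS_PER_KB = 100        # assumed: cost to copy 1 KB between buffers
KERNEL_CROSSING_NS = 1_000  # assumed: syscall/interrupt cost per crossing
WIRE_NS_PER_KB = 80         # assumed: serialization time on the wire

def socket_send_ns(kb: int) -> int:
    # app buffer -> kernel socket buffer -> NIC: 2 copies, 2 crossings
    return 2 * COPY_NS_PER_KB * kb + 2 * KERNEL_CROSSING_NS + WIRE_NS_PER_KB * kb

def rdma_send_ns(kb: int) -> int:
    # NIC moves data directly from registered application memory:
    # zero-copy, no kernel involvement on the data path
    return WIRE_NS_PER_KB * kb

for kb in (1, 64):
    print(f"{kb:>3} KB  socket: {socket_send_ns(kb)} ns  rdma: {rdma_send_ns(kb)} ns")
```

In this model the fixed per-message kernel cost swamps small transfers, which is exactly the regime of the latency-sensitive database and VM workloads listed above.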

IBM’s accelerator-attach technologies such as CAPI and NVLink feed data to POWER8 exponentially faster than previous generation interconnects such as PCIe. To us, IBM appears to be more aggressive in accelerator-attach interfaces than its competitors, and – accordingly – we see this as a distinct competitive differentiator.

Also Noteworthy: Memory Activities

When POWER8-based systems were originally introduced, IBM announced that memory channel speed had been improved – again enabling more data to be delivered more quickly to processors, thus accelerating the processing of large volumes of data. IBM also announced its Durable Memory Interface (DMI). This interface is memory-agnostic, meaning it enables different types of memory to be attached to the bus (so Power Systems are no longer tied only to DRAM memory).

What POWER9 Means to the Future of Accelerated Systems

Just over a year from now (in the second half of 2017), IBM will introduce its next generation POWER9 processors. POWER9 will consist of a family of chips that will be focused on 1) analytics and cognitive; 2) new opportunities in cloud and hyperscale; and, 3) technical computing.

Some of the improvements that can be expected when POWER9 is delivered include:

  • 24 cores;
  • Improvements to IBM’s Vector-Scalar eXtension (VSX) – which improves floating point performance (especially useful for cryptographic operations);
  • Improvements in branch prediction and a shorter pipeline;
  • The use of “execution slices” to improve performance;
  • The use of large, low-latency eDRAM cache to accommodate big datasets;
  • State-of-the-art I/O speed (by leveraging PCIe Gen 4) – giving POWER9 3X faster bandwidth to I/O and storage compared with POWER8;
  • A new 25Gbps advanced accelerator bus;
  • On-chip compression and cryptographic accelerators;
  • Access to next generation NVLink 2.0 to increase speed by 33%, plus coherence across GPU and CPU memory to enhance the usability of GPUs; and,
  • Optimization for 2-socket servers using direct-attached DDR4 memory channels.

POWER9-based servers will be delivered in multiple scale-up and scale-out form factors. And IBM, as could be expected, will continue to concentrate on making its next generation POWER processors energy efficient, strong in security features and rich in Quality-of-Service functionality (such as high availability and resiliency).

Summary Observations

In days gone by, systems makers focused on increasing processor speed in order to improve processing performance. For almost 50 years, the number of transistors on a processor doubled about every two years (known as “Moore’s Law”) – so improving system performance constantly centered on improved processing power.

But in the late 2000s, processor clock speeds stopped scaling (even as transistor counts continued to grow), forcing systems designers to focus more heavily on tuning memory and input/output (I/O) device access, on improving internal bus speed, on improving communications and on streamlining memory access in order to make systems perform more quickly. These highly tuned systems have become known as “accelerated systems”.

IBM has been particularly aggressive in driving breakthroughs that attack bandwidth-limiting bottlenecks. It has created hardware accelerators; it has embedded accelerators at the chip level; it has implemented its own hardware interface (CAPI); and it is using NVIDIA’s NVLink to help connect GPUs directly to its POWER8 and POWER9 processors. From a memory perspective, it uses RDMA to speed memory access; it has increased memory channel bandwidth – and now it will soon offer its customers the opportunity to use multiple different types of memory in its Power Systems. POWER8 offers large memory caches – as will POWER9.

It should be clear to readers that Power Systems are all about moving data efficiently. IBM has focused strongly on overcoming latency issues and other bottlenecks – delivering some of the most powerful servers in the industry for data processing. Meanwhile, an entire ecosystem has grown up around POWER8, with major vendors from around the world contributing new, innovative solutions to the POWER platform.

As we look at IBM’s competitors, we see all leading systems makers working on accelerated system designs. However, we see several distinct differentiators in IBM’s Power Systems lines including several accelerator-attach technologies, high performance and strong cost/performance. We also see contributions and innovations introduced through the OpenPOWER Foundation as a major differentiator. From our perspective, IBM’s aggressive efforts in server acceleration and in database acceleration – combined with its broad portfolio of analytics software – make the company a formidable competitor in the world of accelerated, Big Data servers.

IBM’s mantra for POWER8 was “designed for data” – meaning that several of its central features were designed to accelerate the processing of large databases. With POWER9, we believe the new mantra should be “designed for accelerated systems leadership”, because the new POWER9 processor and system designs should significantly outperform competing Intel Xeon E7, Oracle SPARC M and SPARC64 system designs. For enterprises planning their future Big Data strategies, it is time to become very familiar with IBM’s fast-approaching POWER9 architecture.