Gary Ruggles, Sr. Product Marketing Manager, Synopsys
The massive growth in the production and consumption of data, particularly unstructured data like images, digitized speech, and video, is resulting in a huge increase in the use of accelerators. According to the Bank of America Merrill Lynch Global Semiconductors Report from October 2, 2016, “an estimated accelerator TAM of $1.64B in 2017 is expected to grow beyond $10B in 2020.”1 This trend towards heterogeneous computing in the data center means that, increasingly, different types of processors and co-processors must work together efficiently while sharing memory. This disaggregation can cause significant system bottlenecks, because accelerators hold large amounts of memory that must be shared coherently with the Host to avoid unnecessary and excessive data copying.
Compute Express Link (CXL), a new open interconnect standard, targets intensive workloads for CPUs and purpose-built accelerators where efficient, coherent memory access between a Host and Device is required. A consortium to enable this new standard was recently announced simultaneously with the release of the CXL 1.0 specification. This article describes some of the key features of CXL that system-on-chip (SoC) designers need to understand in order to determine the best use and implementation of this new interconnect technology into their designs for AI, machine learning, and cloud computing applications.
PCI Express (PCIe) has been around for many years, and the recently completed version of the PCIe base specification 5.0 now enables interconnection of CPUs and peripherals at speeds up to 32GT/s. However, in an environment with large shared memory pools and many devices requiring high bandwidth, PCIe has some limitations. PCIe doesn’t specify mechanisms to support coherency and can’t efficiently manage isolated pools of memory as each PCIe hierarchy shares a single 64-bit address space. In addition, the latency for PCIe links can be too high to efficiently manage shared memory across multiple devices in a system.
The CXL standard addresses some of these limitations by providing an interface that leverages the PCIe 5.0 physical layer and electricals while adding extremely low-latency paths for memory access and coherent caching between host processors and devices that need to share memory resources, such as accelerators and memory expanders. The standard modes supported by CXL center on a PCIe 5.0 PHY operating at 32GT/s in a x16 lane configuration (Table 1). To allow for bifurcation, 32GT/s is also supported in x8 and x4 lane configurations. Anything narrower than x4 or slower than 32GT/s is referred to as a degraded mode, which is not expected to be common in target applications. While CXL can offer significant performance advantages for many applications, some devices don’t need the close interaction with the Host and primarily need to signal work submission and completion events, often while working on large data objects or contiguous streams. For such devices, PCIe works quite well as an accelerator interface, and CXL offers no significant benefits.
Table 1: CXL supported operating modes
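As a rough illustration of this rule, the following C sketch (a hypothetical helper, not part of any CXL software interface) classifies a negotiated link configuration as standard or degraded, treating anything narrower than x4 or slower than 32GT/s as degraded:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper (not a real API): classify a negotiated CXL link.
 * Standard modes are x16, x8, or x4 at 32 GT/s; anything narrower than x4
 * or slower than 32 GT/s is treated as a degraded mode. */
static bool cxl_link_is_degraded(unsigned lane_width, unsigned speed_gts)
{
    bool standard_width = (lane_width == 16 || lane_width == 8 || lane_width == 4);
    bool standard_speed = (speed_gts == 32);
    return !(standard_width && standard_speed);
}

int main(void)
{
    printf("x16 @ 32 GT/s degraded? %d\n", cxl_link_is_degraded(16, 32)); /* 0 */
    printf("x2  @ 32 GT/s degraded? %d\n", cxl_link_is_degraded(2, 32));  /* 1 */
    printf("x8  @ 16 GT/s degraded? %d\n", cxl_link_is_degraded(8, 16));  /* 1 */
    return 0;
}
```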
The CXL standard defines 3 protocols that are dynamically multiplexed together before being transported via a standard PCIe 5.0 PHY at 32 GT/s:
The CXL.io protocol is essentially a PCIe 5.0 protocol with some enhancements and is used for initialization, link-up, device discovery and enumeration, and register access. It provides a non-coherent load/store interface for I/O devices.
The CXL.cache protocol defines interactions between a Host and Device, allowing attached CXL devices to efficiently cache Host memory with extremely low latency using a request and response approach.
The CXL.mem protocol gives a host processor access to Device-attached memory using load and store commands, with the host CPU acting as the requester and the CXL Device acting as the subordinate. It can support both volatile and persistent memory architectures.
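To make the division of labor concrete, the sketch below uses purely illustrative C types (not the specification’s actual message formats) to tag a transaction with the protocol that carries it:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a tagged transaction showing which of the three CXL
 * protocols it belongs to. These are not the real message formats. */
enum cxl_protocol {
    CXL_IO,     /* non-coherent load/store I/O: discovery, enumeration, register access */
    CXL_CACHE,  /* Device caching Host memory via low-latency request/response */
    CXL_MEM     /* Host load/store access to Device-attached memory */
};

struct cxl_transaction {
    enum cxl_protocol protocol; /* selects which link/transaction layer handles it */
    uint64_t address;           /* target address */
    uint32_t length;            /* payload length in bytes */
    int      is_write;          /* 1 = write/store, 0 = read/load */
};

int main(void)
{
    /* Example: the Host reads 64 bytes from Device-attached memory via CXL.mem. */
    struct cxl_transaction t = { CXL_MEM, 0x100000000ull, 64, 0 };
    printf("protocol=%d addr=0x%llx len=%u write=%d\n",
           t.protocol, (unsigned long long)t.address, t.length, t.is_write);
    return 0;
}
```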
As can be seen in Figure 1, the CXL.cache and CXL.mem protocols are combined and share a common link and transaction layer, while CXL.io has its own link and transaction layer.
Figure 1: Block diagram of a CXL device showing PHY, controller and application
The data from each of the three protocols are dynamically multiplexed together by the Arbitration and Multiplexing (ARB/MUX) block before being turned over to the PCIe 5.0 PHY for transmission at 32GT/s. The ARB/MUX arbitrates between requests from the CXL link layers (CXL.io and CXL.cache/mem) and multiplexes the data based on the arbitration results, which use weighted round-robin arbitration with weights that are set by the Host. The ARB/MUX also handles power state transition requests from the link layers, creating a single request to the physical layer for orderly power-down operation.
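The following C sketch is a simplified, hypothetical model of that arbitration scheme: two request sources with Host-programmed weights, each granted up to its weight per round. It illustrates the weighted round-robin idea only, not the controller’s actual logic:

```c
#include <stdio.h>

/* A minimal, hypothetical model of the ARB/MUX weighted round-robin described
 * above: two request sources (CXL.io and CXL.cache/mem) with Host-programmed
 * weights. Each non-empty source is granted up to its weight per round. */
enum arb_source { ARB_IO = 0, ARB_CACHEMEM = 1, ARB_NUM_SOURCES = 2 };

struct arb_mux {
    unsigned weight[ARB_NUM_SOURCES];   /* arbitration weights, set by the Host */
    unsigned pending[ARB_NUM_SOURCES];  /* flits waiting per source */
};

/* Run one arbitration round, printing which source wins each grant. */
static void arb_round(struct arb_mux *a)
{
    for (int src = 0; src < ARB_NUM_SOURCES; src++)
        for (unsigned n = 0; n < a->weight[src] && a->pending[src] > 0; n++) {
            a->pending[src]--;
            printf("grant -> %s\n", src == ARB_IO ? "CXL.io" : "CXL.cache/mem");
        }
}

int main(void)
{
    /* Example: the Host weights CXL.cache/mem 3:1 over CXL.io. */
    struct arb_mux a = { .weight = { 1, 3 }, .pending = { 4, 6 } };
    while (a.pending[ARB_IO] || a.pending[ARB_CACHEMEM])
        arb_round(&a);
    return 0;
}
```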
CXL transports data via fixed-width 528-bit flits, each composed of four 16-byte slots plus a two-byte CRC (4 x 16 + 2 = 66 bytes = 528 bits). Slots are defined in multiple formats and can be dedicated to the CXL.cache protocol or the CXL.mem protocol. The flit header defines the slot formats and carries the information that allows the transaction layer to correctly route data to the intended protocols.
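A minimal C sketch of that flit layout, with slot formats and header encoding deliberately simplified away, shows how the 528-bit width is reached:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of a CXL flit as described above: four 16-byte slots
 * plus a 2-byte CRC, for 66 bytes (528 bits) total. Slot formats and the
 * flit header encoding are simplified away here. */
#define CXL_SLOTS_PER_FLIT 4
#define CXL_SLOT_BYTES     16
#define CXL_CRC_BYTES      2

struct cxl_flit {
    uint8_t  slot[CXL_SLOTS_PER_FLIT][CXL_SLOT_BYTES]; /* header + protocol slots */
    uint16_t crc;                                       /* 2-byte CRC */
};

/* 4 x 16 + 2 = 66 bytes = 528 bits, matching the fixed flit width. */
_Static_assert(CXL_SLOTS_PER_FLIT * CXL_SLOT_BYTES + CXL_CRC_BYTES == 66,
               "CXL flit must be 66 bytes (528 bits)");

int main(void)
{
    printf("flit size: %zu bytes (%zu bits)\n",
           sizeof(struct cxl_flit), sizeof(struct cxl_flit) * 8);
    return 0;
}
```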
Since CXL uses the PCIe 5.0 PHY and electricals, it can effectively plug into a system anywhere PCIe 5.0 could be used, via Flex Bus, a flexible high-speed port that can be statically configured to support either PCIe or CXL. Figure 2 shows an example of a Flex Bus link. This approach enables a CXL system to take advantage of PCIe retimers; however, CXL is currently defined as a direct-attach CPU link only, so it cannot take advantage of PCIe switches. As the standard evolves, switching functionality may be added; if so, new CXL switches will need to be created.
Figure 2: The Flex Bus link supports native PCIe and/or CXL cards
Since the CXL.io protocol is used for initialization and link-up, it must be supported by all CXL devices, and if the CXL.io protocol goes down, the link cannot operate. The different combinations of the other two protocols result in a total of three unique CXL device types that are defined and can be supported by the CXL standard.
Figure 3 shows the three defined CXL device types along with their corresponding protocols, typical applications, and the types of memory access supported.
Figure 3: Three defined CXL device types
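For reference, the sketch below summarizes the protocol mix per device type as defined in the CXL specification (CXL.io is mandatory for all three); the C types, names, and example applications listed are illustrative only:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative summary of the three CXL device types and their protocol
 * mixes (CXL.io is mandatory for all of them). Names are not a real API. */
struct cxl_device_type {
    const char *name;
    const char *typical_example;
    bool uses_cache;  /* CXL.cache */
    bool uses_mem;    /* CXL.mem   */
};

static const struct cxl_device_type cxl_types[] = {
    { "Type 1", "accelerator or SmartNIC without device-attached memory", true,  false },
    { "Type 2", "accelerator with device-attached memory (e.g., a GPU)",  true,  true  },
    { "Type 3", "memory expander",                                        false, true  },
};

int main(void)
{
    for (size_t i = 0; i < sizeof(cxl_types) / sizeof(cxl_types[0]); i++)
        printf("%s (%s): CXL.io%s%s\n",
               cxl_types[i].name, cxl_types[i].typical_example,
               cxl_types[i].uses_cache ? " + CXL.cache" : "",
               cxl_types[i].uses_mem   ? " + CXL.mem"   : "");
    return 0;
}
```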
For Type 2 Devices, CXL defines two coherency “biases” that govern how coherent data is handled between Host- and Device-attached memory. The bias modes are referred to as Host bias and Device bias, and the operating mode can change as needed to optimize performance for a given task while the link is operating.
When a Type 2 Device (e.g., an accelerator) is working on data, between the time work is submitted by the Host and the time its completion is signaled back, the Device bias mode is used to ensure that the Device can access its Device-attached memory directly, without having to consult the Host’s coherency engines. The Device is thus guaranteed that the Host does not have the line cached. This gives the Device the best possible latency, making Device bias the main operating mode for work execution by the accelerator. The Host can still access Device-attached memory in Device bias mode, but the performance will not be optimal.
The Host bias mode prioritizes coherent access from the Host to the Device-attached memory. It is typically used during work submission when data is being written from the Host to the Device-attached memory, and it is used for work completion when the data is being read out of the Device-attached memory by the Host. In Host bias mode, the Device-attached memory appears to the Device just like Host-attached memory, and if the Device requires access, it is handled by a request to the Host.
The bias mode can be controlled using either software or hardware via the two supported mode management mechanisms, which are software-assisted and hardware autonomous. An accelerator or other Type 2 Device can choose the bias mode, and if neither mode is selected, the system defaults to the Host bias mode such that all accesses to Device-attached memory must be routed through the Host. The bias mode can be changed with a granularity of a 4KB page and is tracked via a bias table implemented within the Type 2 Device.
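As a minimal sketch of that bookkeeping, the following C code models a per-4KB-page bias table in a Type 2 Device, defaulting to Host bias; real implementations are micro-architecture specific, and the names here are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the per-page bias table kept by a Type 2 Device:
 * one entry per 4KB page of Device-attached memory, defaulting to Host bias. */
#define CXL_BIAS_PAGE_SHIFT 12          /* 4KB pages */
#define DEV_MEM_PAGES       (1u << 20)  /* example: 4GB of Device-attached memory */

enum cxl_bias { HOST_BIAS = 0, DEVICE_BIAS = 1 };

static uint8_t bias_table[DEV_MEM_PAGES]; /* zero-initialized => Host bias by default */

static void set_bias(uint64_t dev_addr, enum cxl_bias bias)
{
    bias_table[dev_addr >> CXL_BIAS_PAGE_SHIFT] = (uint8_t)bias;
}

static enum cxl_bias get_bias(uint64_t dev_addr)
{
    return (enum cxl_bias)bias_table[dev_addr >> CXL_BIAS_PAGE_SHIFT];
}

int main(void)
{
    /* Flip one 4KB page to Device bias before the accelerator works on it. */
    set_bias(0x20000, DEVICE_BIAS);
    printf("page at 0x20000: %s bias\n", get_bias(0x20000) == DEVICE_BIAS ? "Device" : "Host");
    printf("page at 0x30000: %s bias\n", get_bias(0x30000) == DEVICE_BIAS ? "Device" : "Host");
    return 0;
}
```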
An important feature of the CXL standard is that the coherency protocol is asymmetric. The Home agent resides only in the Host, so the Host controls the caching of memory and resolves system-wide coherency for a given address across requests from attached CXL Devices. This is in contrast to existing proprietary and open coherency protocols, particularly those for CPU-to-CPU connection, which are generally symmetric and make all interconnected devices peers.
While symmetry has some advantages, a symmetric cache coherency protocol is more complex, and that complexity has to be handled by every Device. Devices with different architectures may take different approaches to coherency that are optimized at the micro-architecture level, which can make broad industry adoption more challenging. By using an asymmetric approach controlled by the Host, different CPUs and accelerators can more easily become part of the emerging CXL ecosystem.
One can envision several protocols being used together within large, memory-coherent systems to handle CPU-to-CPU, CPU-to-attached-Device, and longer-distance chassis-to-chassis requirements. Currently, CXL is focused on providing an optimized solution for servers. Its inherent asymmetry means it may not be ideal for CPU-to-CPU or accelerator-to-accelerator connections, and with its reliance on PCIe 5.0 PHYs, a different transport may be better suited to rack-to-rack installations.
With CXL tightly coupled to PCIe 5.0, we expect to see products supporting CXL arriving in the same timeframe as PCIe 5.0. In an editorial from March 11, 2019, Intel says that they “plan to release products that incorporate CXL technology starting in Intel’s 2021 data center platforms, including Intel® Xeon® processors, FPGAs, GPUs and SmartNICs.”2
The CXL consortium has recognized the need for an interoperability and compliance program to help drive adoption of the standard. As a result, some minor updates will be made to the specification to support this requirement, and a compliance program seems likely to be put in place eventually.
The CXL standard, which appears to be gaining traction rapidly, offers benefits for devices that need to work efficiently on data while sharing memory coherently with a host processor. As of this writing, the CXL consortium has reached 75 members and is still growing. With a major CPU provider, Intel, backing the standard and rolling out CXL-enabled systems in 2021, significant industry adoption seems likely.
Contact us for more information on CXL.
References:
1. Bank of America Merrill Lynch Global Semiconductors Report, October 2, 2016