The rapidly evolving landscape of high-performance computing is shaped by exponential growth in data volume and complexity. This growth is propelled by the increasing number of devices per user, each demanding higher bandwidth, which escalates the need for more robust data transfer capabilities. PCI Express 6.0 has emerged in response, offering the I/O density and bandwidth needed to manage this burgeoning data load. The advancement comes with heightened power demands, however, especially notable in multi-port AI accelerators and switch-heavy configurations, creating potential efficiency challenges in data centers. PCIe 6.0 balances these demands with improved power management strategies, including new power states like L0p, which align power usage with bandwidth requirements.

PCIe 6.0 also introduces PAM-4, a multi-level signaling technology that transmits two bits per unit interval, doubling the capacity of NRZ's one bit per unit interval. PAM-4 brings its own signal integrity challenges, though, including crosstalk, signal reflections, and increased power consumption. As the industry navigates these advancements, managing power consumption and minimizing latency become increasingly vital. This article delves into the complexities of PCIe latency and power considerations, exploring strategies to optimize these critical aspects in HPC SoC design.

Strategies for Power Reduction in PCI Express 6.0

Reducing power consumption in PCI Express primarily involves addressing the PHY, the system's major power consumer, responsible for about 80% of total usage. Several strategies have been implemented to tackle this:

  1. L1 substates (L1.1 and L1.2) significantly cut PHY power.
  2. Adjusting the transmitter swing or disabling floating taps helps in short-range applications.

PCIe link power management with the specified L1 substates, along with techniques like reducing TX swing or disabling floating taps, effectively controls idle power consumption. Additionally, the focus is shifting towards optimizing power in digital logic, a key differentiator in mobile markets.

Furthermore, PCIe has developed features like Dynamic Power Allocation (DPA), Latency Tolerance Reporting (LTR), and Optimized Buffer Flush/Fill (OBFF) to reduce system-wide active power consumption. DPA allows for efficient power management across Endpoint Functions, especially in active states. LTR facilitates communication of latency requirements to the root complex, enabling more effective power management of central resources. OBFF, meanwhile, helps endpoints adjust traffic to minimize power impact, allowing the root complex to signal optimal times for traffic to reduce overall system power consumption.
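
To make LTR concrete, here is a minimal sketch of how a latency budget maps onto the value-plus-scale encoding that LTR messages use: a 10-bit value and a 3-bit scale whose multiplier is a power of 32, per the PCIe specification. The helper function and the example budget are illustrative, not taken from any driver API.

```python
# Illustrative sketch: packing a latency budget into the LTR
# value + scale format (10-bit value, 3-bit scale, multiplier 32**scale ns).
# The function name and example are hypothetical.

LTR_SCALES_NS = [32 ** n for n in range(6)]  # 1, 32, 1024, ... 33_554_432 ns

def encode_ltr(latency_ns: int) -> tuple[int, int]:
    """Return (value, scale) such that value * 32**scale >= latency_ns."""
    for scale, mult in enumerate(LTR_SCALES_NS):
        value = -(-latency_ns // mult)  # ceiling division: never under-report
        if value < 1024:                # the LTR value field is 10 bits
            return value, scale
    raise ValueError("latency exceeds the maximum reportable LTR range")

# Example: an endpoint that can tolerate 50 us before it needs service
value, scale = encode_ltr(50_000)
print(value, scale)  # (49, 2): 49 * 1024 ns is roughly 50.2 us
```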


Reducing Power with L1 and L2 States

Figure 1: L1 Substates: Optimizing system power in Stand-By and Deep Energy-Saving Mode

PCI Express uses different link power states to manage power consumption effectively. These range from L0 (fully active) to L3 (link off), with intermediate states like L0s (electrical idle or standby) and L1 (a lower-power standby or sleep state). Techniques for power reduction include periodic checks for electrical idle exit and powering off unnecessary circuits. The L1 state has substates, L1.1 and L1.2, designed to reduce power consumption further with varying exit latencies. The L2 state turns off all clocks and main power supplies, offering the highest power savings but the longest exit latencies.
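
The practical trade-off across these states is power savings versus exit latency: a state is only usable if its wake-up time fits what the endpoint can tolerate, which is exactly the information LTR conveys. Below is a minimal sketch of such a selection policy; the power and exit-latency figures are placeholders, not specification values.

```python
# Illustrative policy: pick the deepest link power state whose exit
# latency still fits the latency the endpoint can tolerate.
# All numbers below are placeholders, not spec values.

LINK_STATES = [  # (name, relative power, exit latency in ns), deepest last
    ("L0s",  0.70, 100),
    ("L1",   0.10, 2_000),
    ("L1.1", 0.05, 20_000),
    ("L1.2", 0.01, 70_000),
]

def deepest_allowed_state(tolerated_ns: int) -> str:
    best = "L0"  # stay fully active if nothing fits
    for name, _power, exit_ns in LINK_STATES:
        if exit_ns <= tolerated_ns:
            best = name
    return best

print(deepest_allowed_state(50_000))   # L1.1: L1.2's 70 us exit is too slow
print(deepest_allowed_state(500_000))  # L1.2: the deepest state fits the budget
```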

Using Clock-Gating to Reduce Activity

To enhance power efficiency further, techniques like clock gating temporarily disable the clock to unused circuitry. This approach is particularly effective in multi-port devices, where deactivating the clock in inactive ports yields significant power savings. Even with clock gating, CMOS devices still consume a small amount of power due to leakage current, so local and global clock-gating strategies must be combined with other techniques to optimize power across the system. PCIe 6.0's L0p state builds on this idea: it maintains necessary traffic with fewer active components when full bandwidth is not required, and the non-active lanes can be clock-gated to save dynamic power, as sketched below.
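
As a rough illustration of the savings, the sketch below models a multi-lane link in which clock-gated lanes stop consuming dynamic power but continue to leak. All per-lane figures are assumed placeholders, not measured values.

```python
# Rough power model for a multi-lane link with per-lane clock gating:
# gated lanes stop burning dynamic power but still leak.
# Per-lane numbers are illustrative placeholders.

DYNAMIC_MW_PER_LANE = 40.0  # dynamic power of a clocked lane
LEAKAGE_MW_PER_LANE = 2.0   # leakage, paid whether gated or not

def link_power_mw(total_lanes: int, active_lanes: int) -> float:
    dynamic = DYNAMIC_MW_PER_LANE * active_lanes
    leakage = LEAKAGE_MW_PER_LANE * total_lanes  # clock gating can't remove this
    return dynamic + leakage

print(link_power_mw(16, 16))  # 672.0 mW with all lanes clocked
print(link_power_mw(16, 4))   # 192.0 mW with 12 lanes clock-gated
```

Note that the leakage term is untouched by clock gating, which is precisely the gap power gating addresses.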

Power-Gating Enables L1 Substates

Power gating is a key technique in chip design, particularly for chips with smaller geometries, aimed at reducing leakage and switching power by shutting down inactive parts of the chip. This method is essential for enabling L1 substates in PCI Express (PCIe), significantly lowering power consumption, often by 75% to 85% in standard designs. Implementing power gating, however, involves complexities such as integrating a power controller, a switching network, isolation cells, and retention registers. It also requires adherence to the IEEE 1801 Unified Power Format (UPF) standard, ensuring that the design and verification of low power integrated circuits are up to industry standards. This approach is especially crucial in designs focused on maintaining low idle power levels, like those in battery-operated systems.
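
One consequence of that complexity is that entering and exiting a power-gated state itself costs energy, so gating only pays off when the idle interval is long enough to amortize the overhead. A back-of-envelope breakeven check, with placeholder numbers, might look like this:

```python
# Breakeven check for power gating: gating saves energy only if the idle
# interval exceeds the entry/exit energy overhead divided by the power saved.
# All figures are illustrative placeholders, not measured values.

def breakeven_idle_us(saved_mw: float, overhead_uj: float) -> float:
    """Minimum idle time (us) for which power gating saves net energy."""
    # saved_mw is uJ per ms, so overhead_uj / saved_mw is ms; scale to us
    return overhead_uj * 1000.0 / saved_mw

# e.g. gating saves 80 mW of leakage, entry + exit costs 4 uJ
print(breakeven_idle_us(80.0, 4.0))  # 50.0 us minimum idle residency
```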

L0p: New State to Support Scalable Power Consumption

Figure 2: L0p in PCIe 6.0

The L0p power state in PCIe 6.0 greatly improves power efficiency by aligning power usage with actual bandwidth needs. In a 16-lane PCIe 6.0 link, bandwidth requirements vary, and the full 64 GT/s transfer rate is not always needed. L0p enables dynamic scaling of the number of active lanes, reducing power consumption without renegotiating link width, which previous standards required; this ensures a seamless, uninterrupted data flow even during bandwidth changes. FLIT mode, where data is transmitted in 256-byte units, is integral to the L0p state; once in FLIT mode, a device stays there irrespective of signal changes or data rate fluctuations. L0p allows bandwidth adjustment at either end of a link, accommodating any of the supported PCIe link widths, but it requires equal scaling at both ends, so the entire link must operate at the bandwidth needed by the higher-demand direction. This capability ensures uninterrupted data transfer and can yield power savings of several picojoules per bit in an x16 link configuration.
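
Because energy per bit converts directly to watts at a given line rate, the effect of L0p lane scaling is easy to estimate. The sketch below uses a placeholder energy cost per bit; only the 64 GT/s per-lane rate comes from the specification.

```python
# Illustrative arithmetic for L0p lane scaling: link power scales with the
# number of active lanes, and pJ/bit times bits/s gives watts.
# The pJ/bit figure is an assumed placeholder, not a measured value.

PJ_PER_BIT = 10.0     # assumed energy cost per transferred bit
BITS_PER_LANE = 64e9  # raw bits per second per lane at 64 GT/s

def link_power_w(active_lanes: int, pj_per_bit: float = PJ_PER_BIT) -> float:
    return active_lanes * BITS_PER_LANE * pj_per_bit * 1e-12

print(link_power_w(16))  # ~10.2 W at full x16 width
print(link_power_w(4))   # ~2.6 W after L0p scales down to four lanes
```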

Power-Reduction by AVS or DVFS

Figure 3: Lowering the Voltage to Lower Power

Dynamic voltage and frequency scaling (DVFS) and its counterpart, adaptive voltage scaling (AVS), are effective techniques for reducing power consumption in chip design, with a particularly large impact on active power. Both allow real-time adjustment of voltage and frequency based on performance needs, reducing dynamic and static power alike. DVFS selects voltage in fixed steps from predefined frequency-based tables, while AVS adjusts voltage adaptively based on the chip's actual operating conditions. Chips are usually designed with margins for worst-case scenarios, so performance can be maintained even at reduced voltage, yielding considerable power savings. These techniques can improve dynamic power efficiency by 40% to 70% and leakage power efficiency by 2 to 3 times. However, their implementation adds complexity and requires a careful balance between performance and power, considering factors like level shifters, power-up sequences, and clock scheduling.
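
The leverage of these techniques comes from the standard dynamic power relation, where power is proportional to C·V²·f: it falls with the square of voltage and linearly with frequency, so scaling both compounds. A one-line sketch:

```python
# Dynamic power scales as P ~ C * V^2 * f, so lowering voltage and
# frequency together compounds the saving.

def dynamic_power_ratio(v_scale: float, f_scale: float) -> float:
    """Relative dynamic power after scaling voltage and frequency."""
    return (v_scale ** 2) * f_scale

# Dropping both voltage and frequency by 20%:
print(dynamic_power_ratio(0.8, 0.8))  # 0.512 -> roughly half the dynamic power
```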

Waiting is the Hardest Part

Latency is a critical concern in many applications, from Zoom calls, where delays hinder communication, to gaming, where even minor lag impacts performance. In high-performance computing, where systems grow increasingly complex with more CPU interactions, addressing latency is essential. Maintaining high link quality is a key mitigation strategy, as poor quality increases bit error rates, leading to more frequent data replays and effectively doubling transmission time for the affected packets. Innovations such as embedded switches in complex SoCs reduce latency by creating embedded endpoints within the chip, eliminating the need for external data transmission and shortening response times. Efficient management of tags for non-posted requests and of flow control credits also plays a vital role in preventing system stalls, which can inadvertently increase latency. Additionally, minimizing lane-to-lane skew, which may require extra buffering and thus add latency, is crucial, and components like switches and retimers must be used judiciously to balance these factors. CXL (Compute Express Link) also emerges as a promising option in certain applications, offering lower-latency alternatives to traditional PCI Express solutions.

Maintaining Link Quality

Figure 4: Synopsys PCIe 6.0 IP PAM-4 eyes at 64 GT/s, showing high link quality, which reduces system latency through fewer errors and fewer replays

Maintaining high link quality is crucial to reducing system latency by minimizing errors and data replays, yet not all influencing factors are within the control of SoC designers. Key controllable elements include adhering to the PCI Express electrical specifications for transmitters and receivers; reducing TX jitter, increasing RX jitter tolerance, and improving bit error rates are all vital to lowering latency. Package design also plays a significant role in maintaining signal quality by minimizing discontinuities and crosstalk, thereby reducing errors and replays. Board, connector, and cable design affect system latency as well, both directly and indirectly through signal loss. However, components like retimers or redrivers, while beneficial for signal quality, can paradoxically increase latency; designers must strike a delicate balance, enhancing signal integrity without unduly adding to the system's overall latency.
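
To see why link quality translates into latency, consider a first-order model in which a TLP that suffers a bit error must be replayed, adding a full retransmission penalty. The sketch below ignores PCIe 6.0's FEC and uses illustrative packet sizes and penalties, so it shows the trend rather than a spec-accurate figure.

```python
# First-order model of how bit error rate inflates average latency:
# an errored TLP is retransmitted, roughly doubling its transfer time.
# Packet size, base latency, and penalty are illustrative placeholders.

def replay_probability(ber: float, tlp_bits: int) -> float:
    """Probability that at least one bit of the TLP is corrupted."""
    return 1.0 - (1.0 - ber) ** tlp_bits

def expected_latency_ns(base_ns: float, ber: float, tlp_bits: int,
                        replay_penalty_ns: float) -> float:
    p = replay_probability(ber, tlp_bits)
    return base_ns + p * replay_penalty_ns  # first-order: at most one replay

# 512-byte TLP (4096 bits), 100 ns base latency, replay costs another 200 ns
print(expected_latency_ns(100.0, 1e-6, 4096, 200.0))  # ~100.8 ns
print(expected_latency_ns(100.0, 1e-4, 4096, 200.0))  # ~167 ns: link quality matters
```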

Embedded Endpoint Controllers in Embedded Switches

Figure 5: Synopsys Optimized EEP Compared to PIPE-to-PIPE Design

Embedded endpoints present a notable opportunity for latency reduction in certain scenarios, although they may be seen as niche solutions. In a typical PIPE-to-PIPE implementation, the application side connects directly at the PIPE interface, bypassing the need for a serial PHY and thus reducing latency. However, such setups still require connections to an endpoint controller and possibly an AMBA bridge, involving various logic blocks. By transitioning to an optimized embedded endpoint, much of this logic can be eliminated, significantly reducing latency. Comparisons between traditional PIPE-to-PIPE implementations and optimized embedded endpoints indicate a potential 70% reduction in latency, a substantial improvement within that segment of the system. Although not widely adopted or well known, because it evolved alongside the PCI Express specification, this approach holds considerable value, especially in designs with embedded switches and endpoints.

Faster Clocking

Boosting a system's clock speed, for example from 1 GHz in a PCI Express 5.0 design to 2 GHz, can theoretically cut latency in half, reducing cycle time from 1 ns to 0.5 ns and potentially bringing a 20 ns latency down to 10 ns. In practice the reduction is smaller, since not all latency components scale with clock speed. Higher clock speeds also introduce design challenges, such as the need for more pipeline stages to meet timing, which can paradoxically increase latency, and they often require higher voltages, increasing power consumption. These considerations make faster clocking a complex trade-off: designers must balance the benefit of reduced latency against the increased power consumption and the latency added by extra pipelining and rate adaptation. Ensuring compatibility across all components, like the PHY and controller, and the need for rate adapters, which can themselves add latency, are other crucial factors. While faster clocking can improve throughput, how it is implemented, especially in terms of added register stages, may not always deliver the desired throughput efficiency.

Pipelining

In digital systems, particularly controllers, pipelining is used to boost throughput and enable higher clock speeds, like moving from 1 GHz to 2 GHz. It involves adding extra pipeline stages, which can paradoxically increase latency as data must pass through more of them; in some cases, though, the net effect is lower overall latency. For instance, if a system at 1 GHz takes 25 clock cycles (25 ns) to complete a task, doubling the clock to 2 GHz might add 6 extra cycles, but each cycle now takes only 0.5 ns, reducing the original 25 ns latency to 15.5 ns, as the sketch below works through. Despite this potential benefit, finding the right balance between clock speed and pipeline stages is complex and requires careful planning to ensure it actually reduces latency without adversely affecting other aspects of the system.
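
The arithmetic is simple enough to state directly; this sketch just reproduces the numbers from the example above.

```python
# Worked example from the text: doubling the clock shortens each cycle,
# but the extra pipeline stages claw some of that benefit back.

def latency_ns(cycles: int, clock_ghz: float) -> float:
    return cycles / clock_ghz  # cycle time in ns = 1 / frequency in GHz

print(latency_ns(25, 1.0))      # 25.0 ns at 1 GHz
print(latency_ns(25 + 6, 2.0))  # 15.5 ns at 2 GHz despite 6 extra stages
```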

Other Latency Reduction Techniques

Figure 6: Analog and digital datapath optimization must both be considered for overall PCIe latency optimization

Reducing latency in high-speed digital systems, particularly in the PHY and its interface with the controller, involves several key techniques. In the PHY, accurately aligning the transmitter clocks is crucial: misalignment, or skew, increases the need for buffering and thus adds latency, so minimizing skew is essential. For the connection between the PHY and the controller, a direct link is ideal, since each additional register at this interface typically adds a cycle of latency.

On the design front, optimizing the ADC array is particularly important in 64 GT/s systems using PAM-4 modulation, which increasingly rely on ADC-based receivers. Streamlining the data path and fine-tuning the ADC array can effectively reduce receiver latency. Furthermore, adaptive clock and data recovery (CDR) locking and bandwidth optimization can also reduce RX latency by ensuring the system doesn't waste time on unnecessary adaptations or redundant clock recovery operations. Implementing these strategies, which span both hardware configuration and internal design optimization, is key to achieving lower latency in such systems.

Demands for Massive Amounts of Data are Driving HPC to Increasing Levels of Complexity

The increasing demands for managing massive data are driving HPC towards greater complexity, where efficient power management has become a critical issue, even in data centers. This marks a notable change from the past when data centers, often with ample power supply from the grid, didn't focus much on power consumption. Nowadays, the situation is quite different, with growing concerns about hotspots, environmental initiatives, and the escalating costs of power, making efficient power usage an essential priority.

PCI Express has responded to this challenge by introducing various power management features, such as L1 substates, power gating, and the new L0p state in PCIe 6.0, which are vital as data transfer rates reach 64 GT/s. At these speeds, even a small amount of latency can significantly impact system efficiency and lead to potential stalls, so managing latency is becoming ever more crucial in system design. Additionally, the increasing need for coherency in shared computing tasks is bringing protocols like CXL into the spotlight. CXL, with its low-latency features and coherency, offers a promising way to balance power, latency, and throughput effectively. Excitement around CXL is growing, especially with the advent of CXL 3.0 and the development of CXL fabrics, marking a significant step forward in high-speed computing and data management.

Summary

Synopsys' vast knowledge base and expertise from hundreds of successful PCIe and CXL implementations allow us to boost a design's performance, optimize its power, and reduce its latency from the get-go. We draw on experience with a wide array of customer configurations, from complex controller configurations to many-link designs spanning a variety of lane combinations.

Synopsys has a complete, silicon-proven PCIe 6.0 and CXL 3.0 PHY, controller, IDE security module, and verification IP. Additionally, we provide guidance on connectivity, simulation bring-up, and back-end synthesis, advising on physical placement, clock tree rebuilding and balancing, routing, timing closure guidelines, and the timing-critical paths. We lead the industry in supporting a wide range of features and capabilities to debug firmware and hardware and to optimize PPA and latency.
