By Richard Solomon, Technical Marketing Manager, Synopsys
The PCI Express® protocol includes a robust set of Reliability, Availability, Serviceability (RAS) features, but it is up to the SoC designer to ensure this protection is maintained throughout the SoC. For the past few years, most design teams have considered the bus itself to be the primary source of data transmission errors. However, with the advent of teen-nanometer FinFET processes and the migration of enterprise-grade storage to direct-attach PCI Express, more and more attention is turning to on-chip data protection.
PCI Express transmitters add a 32-bit Link CRC (LCRC) to every data packet they send, and the receiver is required to recalculate the received packet’s CRC (Cyclic Redundancy Check) and check it against the incoming LCRC to confirm good receipt (Figure 1). Any bad packets are NAK’d and retransmitted automatically. What happens to the data after receipt, however, is up to the PCI Express controller and SoC designers.
Figure 1: PCIe® LCRC covers packet header and data payload
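The append-and-recheck flow described above can be sketched in a few lines of Python. This is only an illustration of the idea: `zlib.crc32` is used as a stand-in 32-bit CRC, and the function names `send_packet` and `receive_packet` are hypothetical. The exact LCRC coverage, bit ordering, and retry protocol are defined by the PCI Express specification and are not reproduced here.

```python
import zlib


def send_packet(header: bytes, payload: bytes) -> bytes:
    """Transmitter side: append a 32-bit CRC computed over header + payload."""
    body = header + payload
    crc = zlib.crc32(body)  # stand-in CRC, not the exact PCIe LCRC bit mapping
    return body + crc.to_bytes(4, "little")


def receive_packet(packet: bytes):
    """Receiver side: recompute the CRC and compare it with the received one.

    Returns (ok, body). ok == False models the case where the packet
    would be NAK'd and retransmitted.
    """
    body = packet[:-4]
    received_crc = int.from_bytes(packet[-4:], "little")
    return zlib.crc32(body) == received_crc, body


if __name__ == "__main__":
    pkt = send_packet(b"\x01\x02", b"payload")
    print(receive_packet(pkt)[0])             # True: CRC matches

    corrupted = bytearray(pkt)
    corrupted[3] ^= 0x10                      # flip one bit in transit
    print(receive_packet(bytes(corrupted))[0])  # False: would be NAK'd
```

Note that the check ends at `receive_packet`: nothing in this sketch protects `body` once it has been accepted, which is the gap the rest of this article addresses.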
CRC is an error detection scheme that is well suited to serial data streams, but its overhead has generally left it unused for parallel data – such as the received and de-serialized PCI Express packets in a PCI Express controller.
One fairly simple technique for error detection, which lends itself well to such parallel data, is parity protection (Figure 2). One parity bit is appended to the datapath for each n bits (usually 8) of actual data, such that the total count of ones in the binary field, parity bit included, is either even or odd, depending on the variation chosen. For example, in an 8-bit even parity system, the parity bit for the value 255 (1111_1111 in binary) is 0, as there are 8 ones in the binary representation of 255. In that same system, the parity bit for the value 254 (1111_1110 in binary) would be 1, as there are only 7 ones in the binary representation of 254, so an additional 1 is needed in the parity bit to make the total count even. (As a historical note, back in parallel bus days, even parity was preferred because the pull-up resistors common to many busses meant an undriven bus would be read as 1111_1111 with parity 1, for 9 ones and thus an even parity error.)
Figure 2: Even parity examples for binary data
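As a minimal sketch of the byte-parity scheme just described (the function names are illustrative and not taken from any particular controller), the parity bit and the receiver-side check can be written as:

```python
def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) even."""
    return bin(byte & 0xFF).count("1") % 2


def odd_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) odd."""
    return even_parity_bit(byte) ^ 1


def even_parity_ok(byte: int, parity: int) -> bool:
    """Receiver-side check for an even parity system."""
    return (bin(byte & 0xFF).count("1") + parity) % 2 == 0


if __name__ == "__main__":
    print(even_parity_bit(0b1111_1111))    # 0: eight ones, already even
    print(even_parity_bit(0b1111_1110))    # 1: seven ones, one more needed
    # Undriven bus pulled high: all data bits and the parity bit read as 1.
    print(even_parity_ok(0b1111_1111, 1))  # False: nine ones, parity error
```

The last call reproduces the historical note above: with even parity, a floating bus pulled to all ones is flagged as an error rather than accepted as valid data.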
Parity, however, can only detect errors, not correct them, and it misses any error that flips an even number of bits within the protected field. In the 255 example, consider two bits in error at once: both 1111_1111 (the correct value) and 1110_1110 (bits 0 and 4 inverted) have 0 for their parity bit, so the 2-bit error goes undetected (Figure 3).
Figure 3: Two bit errors going undetected
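The blind spot is easy to reproduce. The following short sketch (names are illustrative) flips one bit and then two bits of the value 255 and compares the even parity bit each time:

```python
def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) even."""
    return bin(byte & 0xFF).count("1") % 2


good = 0b1111_1111
one_bit_error = good ^ 0b0000_0001   # bit 0 inverted  -> 1111_1110
two_bit_error = good ^ 0b0001_0001   # bits 0 and 4 inverted -> 1110_1110

stored_parity = even_parity_bit(good)  # parity generated from the good data

# A mismatch between stored and recomputed parity means "error detected".
single_detected = even_parity_bit(one_bit_error) != stored_parity
double_detected = even_parity_bit(two_bit_error) != stored_parity

print(single_detected)  # True: the single-bit error is caught
print(double_detected)  # False: the two-bit error slips through
```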
The solution is to use more bits of protection information per chunk of data. Whole volumes can be (and have been) written about the mathematics behind such codes, but with some number of extra bits, it is possible not only to detect that the data has been corrupted but also to identify which specific bit(s) are in error. These techniques are generally referred to as Error Correcting Codes (ECC) because they invert the bad bit(s) to recover good data from bad. The tradeoff lies in the number of bits used for ECC: correcting more simultaneous errors takes more check bits, and the protection code could soon become larger than the data it protects. For this reason, the industry has largely settled on ECC that can detect and correct a single erroneous bit, but only detect when two bits are in error (often abbreviated SECDED, for single-error correct, double-error detect). Some ECCs in current use require 7 check bits to cover 32 bits of data, or 8 check bits for 64 bits of data.
Figure 4: Using ECC for multi-bit error detection
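To make the correct-one, detect-two behavior concrete, here is a small sketch of one classic construction: a Hamming code extended with an overall parity bit. It is an assumption for illustration only; the article does not say which code any particular controller uses, and production designs typically use hardware-optimized check matrices. The function names and the bit-list representation are likewise hypothetical.

```python
def _hamming_check_bits(k: int) -> int:
    """Smallest r such that 2**r >= k + r + 1 (Hamming bound for k data bits)."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def secded_encode(data_bits):
    """Return a codeword list: index 0 is overall parity, 1..n are Hamming positions."""
    k = len(data_bits)
    r = _hamming_check_bits(k)
    n = k + r
    code = [0] * (n + 1)
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(data)
    for i in range(r):                   # check bit at position 2**i
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2              # overall parity over positions 1..n
    return code


def secded_decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected', or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    overall_ok = sum(code) % 2 == 0
    fixed = list(code)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and syndrome <= n:
        fixed[syndrome] ^= 1             # single error: syndrome names the bad bit
        status = "corrected"
    else:
        status = "uncorrectable"         # e.g. two bits in error: detect only
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0]
    cw = secded_encode(word)

    one = list(cw); one[5] ^= 1
    print(secded_decode(one) == ("corrected", word))   # True

    two = list(cw); two[5] ^= 1; two[9] ^= 1
    print(secded_decode(two)[0])                       # uncorrectable
```

In the single-error case the syndrome is the position of the flipped bit, so correction is a single inversion. In the double-error case the overall parity still looks even while the syndrome is nonzero, which is what lets the decoder flag the word as bad instead of "correcting" it into a third, wrong value.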
So how much protection is enough? As mentioned earlier, the answer to this question depends heavily on the SoC’s application and silicon technology.
For something like a display controller, an error in the datapath may only cause a screen flicker or momentary artifact. For a consumer application, this may be an acceptable outcome and not worth additional design effort and cost. For an aerospace application, however, the same may not be true. Likewise, storage controllers place a very high emphasis on data integrity, as most end users consider their disk or SSD to be reliable media and expect their data to be written and read correctly. The enterprise space, where the data in question might be bank account balances or financial transaction records, also requires high levels of reliability. In more error-sensitive applications, parity is certainly valuable for detecting an error before it can lead to further corruption, but ECC holds a high appeal for its ability to allow the SoC to continue operating correctly and without loss of data.
Over time, the robustness (real and perceived) of on-chip memories has shifted. In the very early days of multi-micron NMOS processes, on-chip RAM was considered only mostly reliable and therefore almost always protected by parity. As the industry moved to CMOS and sub-micron technologies, RAM reliability improved to the point that many designers neglected any sort of data protection. Now as we find ourselves amidst a shift to FinFET technologies in 16nm, 14nm, and even smaller geometries, questions are again rising about the reliability of on-chip RAMs. With large capacity RAM taking up a good portion of the area of a small die, the likelihood of an external disturbance from a cosmic ray strike or similar event becomes even higher. With on-chip CPUs in SoCs now routinely having 64-bit data paths, the data overhead for ECC compared to parity rapidly disappears. Notice that for our example byte parity (1 parity bit for 8 bits of data), a 64-bit datapath requires 8 protection bits, the same number an ECC can use to provide single-bit correction with double-bit detection.
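The overhead comparison can be checked with a few lines of arithmetic. The sketch below assumes a Hamming-style code plus one overall parity bit for the ECC case (an assumption, since other constructions exist) and byte parity as described earlier; the function names are illustrative.

```python
def byte_parity_bits(width: int) -> int:
    """One parity bit per 8 data bits (width assumed to be a multiple of 8)."""
    return width // 8


def secded_check_bits(width: int) -> int:
    """Check bits for single-error-correct, double-error-detect protection.

    Smallest r with 2**r >= width + r + 1 (Hamming bound), plus one
    overall parity bit for double-error detection.
    """
    r = 0
    while (1 << r) < width + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    for width in (32, 64):
        print(width, byte_parity_bits(width), secded_check_bits(width))
    # 32 4 7
    # 64 8 8
```

Under these assumptions a 32-bit path pays 7 bits for ECC versus 4 for byte parity, while at 64 bits both cost 8, which matches the figures quoted above.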
The on-chip RAM is of greatest concern, so many SoC designers are moving to utilize the simpler option of parity protection on the actual datapaths and the more capable option of ECC on their RAM. RAM accesses that occur over multiple clock cycles lend themselves well to the correction aspect of ECC, because at high frequencies ECC logic may need more than one clock cycle to handle a correctable error. Because correction can require more than a single clock cycle, many design teams are finding that ECC is best implemented in the control logic rather than inside the RAM itself. In the case of a PCI Express controller, natural internal pipelining can accommodate the correction phase of ECC without loss of throughput and without forcing complex logic onto an on-chip RAM subsystem that is trying to meet ever smaller access time requirements.
As designers look to implement PCI Express in today’s design and application environments, they should ensure that their PCI Express controller supports at least byte parity (in their choice of odd or even variations) on its entire datapath, and at least some type of ECC on its RAM. Some designs may even want to consider using ECC on the datapath instead of parity – accepting slightly more complex implementation as a tradeoff for the higher reliability demanded in PCIe storage and other data-critical markets.
For information on how Synopsys PCIe Controller IP can solve your PCIe RAS data protection issues, please go to http://www.synopsys.com/pcie