Venugopal Santhanam, Staff Engineer, Synopsys & Malte Doerper, Product Marketing Manager, Synopsys
The demand for high-definition, rich visual content in a wide range of mobile devices is driving the evolution of next-generation displays. Displays need to deliver higher resolutions, faster refresh rates, higher fidelity with more colors, and better brightness and contrast. Such requirements have created bandwidth demands that are growing faster than the bandwidth of current interface standards such as MIPI DSI, JEDEC DisplayPort, and HDMI. To meet these new demands while allowing interoperability, all the key display standards bodies, including the MIPI Alliance, HDMI Forum, and VESA, have collaborated to develop the visually lossless Display Stream Compression (DSC) standard. The VESA DSC 1.2a and 1.1 algorithms perform video compression and decompression in real time for streaming video data. The DSC v1.2a standard targets cabled connections and allows scaling beyond the uncompressed limits of 48 Gbps for HDMI 2.1 and 32.4 Gbps for VESA DisplayPort 1.4. The DSC v1.1 standard targets mobile devices and allows reduced link speeds to decrease system power and extend battery life in MIPI DSI and VESA Embedded DisplayPort.
The industry is expected to widely adopt the VESA DSC standard because today’s 8K60 resolutions and 4:2:2 color range demand the uncompressed limit of 48 Gbps for HDMI 2.1. Furthermore, increases in refresh frequencies are under consideration [1]. As per the January 4, 2018 IHS Markit report by Norman Akhtar and Dinesh Kithany [2], the display device market is expected to grow significantly, “registering a five-year CAGR of 131 percent” over the next five years. Similarly, as per the Persistence Market Research article published in March 2018 [10], the AR/VR industry, also in need of high refresh frequencies and resolutions in small form factors, has “a CAGR value estimated through the forecast period 2018-2026 of more than 49 percent.”
Figure 1: Global device shipments (billions of units) by wired video interface technology [2]
The growth shown in Figure 1 demonstrates that the VESA DSC standard is needed across the different display interface standards and justifies the need for a VESA DSC compression and decompression implementation.
This article will explain the implementation challenges of VESA DSC and how designers can overcome such challenges with a complete and compliant VESA DSC Encoder and Decoder IP solution.
Multiple Slices Per Line
The DSC algorithm is highly scalable in terms of parallelizing the codec operations. This capability is important because the datapath is very compute intensive. To meet the pixel-clock timing constraints, the frame is split into equal-size slices along the vertical and horizontal directions and then processed in parallel. The amount of parallelization needed depends on the pixel rate. Table 1 notes the HDMI 2.1 protocol resolutions and the respective number of slices needed to keep the datapath clock at around 300 MHz.
Table 1: VESA DSC’s HDMI 2.1 protocol resolutions and the respective number of slices needed to keep the datapath clock at around 300 MHz
Table 1 shows that the required number of slices per line increases from four to eight when the video format changes from 4K60 to 8K60. Therefore, the DSC implementation needs a highly programmable control unit that can route the data to the required number of datapath slices.
Slicing the frame allows the codec operating frequency to be reduced. However, all slices need to be of the same width and height. The width per slice is calculated by dividing the frame width by the total number of slices and rounding up. For example, if the picture width is 5120 pixels and the number of required slices per line is 12, then ceiling(5120/12) = ceiling(426.66) = 427 is used as the slice width. The ceiling() function rounds to the smallest integer greater than or equal to the fraction.
This leaves the right-most slice with 5120 - 427 x 11 = 423 pixels. Since the width of all slices must be equal, the right-most slice needs padding of four pixels. The contents of the padded pixels are chosen per the VESA DSC 1.2a algorithm, which changes based on the mode of operation.
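The slice-width and right-most-slice padding arithmetic above can be sketched as follows (a minimal illustration; the function name is ours, not from the DSC specification):

```python
import math

def slice_padding(picture_width: int, slices_per_line: int):
    """Return the common slice width and the number of pixels that must
    be padded into the right-most slice, per the ceiling rule above."""
    slice_width = math.ceil(picture_width / slices_per_line)
    # Pixels actually left over for the right-most slice
    rightmost_pixels = picture_width - slice_width * (slices_per_line - 1)
    return slice_width, slice_width - rightmost_pixels

# Worked example from the text: a 5120-pixel line split into 12 slices
width, pad = slice_padding(5120, 12)  # → (427, 4)
```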
As the frame’s width and number of active slices change based on the video mode, the number of padded pixels that the encoder inserts changes accordingly. Table 2 summarizes the required padding for each number of slices per line.
Table 2: Padding calculations for number of slices per line
On the DSC decoder side, the removal of the padded pixels is in sync with the padding process on the encoder side. The challenge for the decoder implementation is that there is no sideband communication to convey the padding; the decoder must determine the amount of padding and identify the right-most slice based on the video mode to drop the padded pixels accordingly.
For example, a configuration with a maximum of 8 slices per line could have 1, 2, 4, or 8 active slices. Slices are numbered sequentially from slice-0 and slice-1 up to slice-7. When operating with 2 active slices (meaning only slice-0 and slice-1 are active), slice-1 becomes the right-most slice; hence padding, if necessary, must be done in this slice alone at the end of every line.
Table 3 lists the right most active slices based on the number of slices enabled in the configuration and the number of active slices chosen under the video mode.
Table 3: Maximum number of slices and active slices in video mode
Pixel lines must also be padded when the picture height is not a multiple of the slice height. For example, if the picture height is 960 lines and the slice height is 500 lines, then padding of the bottom-most slice(s) is necessary (Figure 2).
Figure 2: Padding scenario when the picture height is not a multiple of the slice height
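The vertical padding in this scenario can be computed the same way as the horizontal case (again, an illustrative helper, not DSC specification code):

```python
import math

def bottom_padding_lines(picture_height: int, slice_height: int) -> int:
    """Lines padded into the bottom-most slice row when the picture
    height is not a multiple of the slice height."""
    slice_rows = math.ceil(picture_height / slice_height)
    return slice_rows * slice_height - picture_height

# Worked example from the text: a 960-line picture with 500-line slices
pad_lines = bottom_padding_lines(960, 500)  # → 40
```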
The HDMI, DisplayPort, and MIPI DSI protocols use the DSC algorithm in Constant Bit Rate (CBR) mode, where the algorithm generates compressed video stream data averaging around a constant value (equal to the bits-per-pixel field of the Picture Parameter Set (PPS)). However, towards the end of the slice, the algorithm's behavior might require generating pad data (consisting of streams of 0s) to keep the bit rate constant.
The amount of zero padding is calculated by:
Number of zero pad bits = (slice height x chunk size x 8) – total number of compressed bits generated for the slice
On the encoder side, the implementation challenge is that the exact number of zero-padded bits can be determined only after the entire slice is encoded. The amount of zero padding typically varies and is a function of several parameters such as slice width, target bits per pixel (compression ratio), and chunk size. Also, the zero padding must be completed before the operation for the next slice is initiated.
Similarly, on the decoder side, after slice processing is completed (the exact number of compressed bits is consumed during decompression), the zero-padded bits must be removed.
To illustrate zero padding, the following example configurations are chosen:
Picture width = 3840 pixels
Picture height = 3909 lines
Slice width = 1920 pixels
Slice height = 489 lines
Bits per component = 10
Bits per pixel = 15 (hence a compression ratio of 2.0, since 3 components x 10 bits = 30 bits per uncompressed pixel, and 30/15 = 2.0)
Chunk size bytes = 3600
Total bytes generated by DSC algorithm = 1,759,794
Total bytes to be generated as per slice budget = slice height x chunk size
= 489 x 3600 bytes
= 1,760,400 bytes
Total deficit bytes for which zero padding is done = 1,760,400 - 1,759,794
= 606 bytes
In this example, the DSC encoder sends 606 bytes of zero-padded data after the entire slice is encoded. On the encoder side, each datapath slice must ensure that the zero-padded data is sent out before data processing of the next slice starts; otherwise, it might block the processing.
On the decoder side, the example shows that the zero-padded bytes need to be flushed out before decoding of the next slice.
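The zero-padding arithmetic from the worked example above can be sketched as (an illustrative helper, reproducing the slice-budget formula from the text):

```python
def zero_pad_bytes(slice_height: int, chunk_size_bytes: int,
                   generated_bytes: int) -> int:
    """Deficit bytes the encoder must emit as zeros so the CBR slice
    budget (slice height x chunk size) is met exactly."""
    budget = slice_height * chunk_size_bytes
    return budget - generated_bytes

# Worked example from the text
pad = zero_pad_bytes(489, 3600, 1_759_794)  # → 606
```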
Synopsys’ DesignWare VESA DSC IP automatically strips the zero padded bits before sending the decompressed stream to the controller.
To achieve the targeted pixel rate, the number of parallel datapaths should be increased to 2, 4, 8, 12, or 16, which allows the operating frequency to be scaled down. Further, a careful examination of the algorithm reveals certain sections that could be implemented as parallel or semi-parallel structures based on area and timing tradeoffs. For example, some parts of the algorithm (such as the rate control algorithm for Quantization Parameter (QP) updates) can be spread across three consecutive clock cycles, or equivalently one group time.
Synopsys’ DesignWare VESA DSC IP provides the flexibility needed for area and timing tradeoffs (where parallel and semi-parallel datapath structures are implemented) based on the synthesis target node. On nodes where meeting the target timing is challenging, the parallelism can be increased with a marginal increase in gate count to achieve better timing.
On the sink side, after the codec operation is complete, it is necessary to regenerate the required video sync signals per the target standard (HDMI, MIPI DSI, or DisplayPort). This poses an integration challenge, as the VESA DSC standard does not describe any details for video sync signal regeneration. To solve this challenge, it may be necessary to have a buffer between the DSC and the controller. The DSC writes into this buffer, and the controller could read it after a certain fill level is reached so that an entire line of compressed data is read out in a single burst.
One of the major hurdles in wearable AR/VR devices is power consumption, and memory operation has a significant impact on it. As the DSC algorithm requires buffering at different stages of computation (in pixel buffers, line buffers, rate buffers, balance FIFOs, and syntax element FIFOs), an optimal memory architecture can significantly reduce power. Because the DSC algorithm’s memory is accessed simultaneously for both write and read operations, a two-port RAM-based memory interface seems the natural choice.
However, single-port RAMs of considerable size typically occupy a smaller area and consume less power than two-port or dual-port RAMs of the same capacity. Interfacing with single-port RAMs therefore requires some intelligence on the DSC architecture side to schedule the memory write and read requests.
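One way to picture this scheduling problem is a toy behavioral model in which a single-port RAM serves at most one access per cycle, so writes and reads must be interleaved in time (a sketch of the concept only, not the DesignWare implementation):

```python
class SinglePortRAM:
    """Toy single-port RAM: at most one access (read or write) per cycle."""
    def __init__(self, depth: int):
        self.mem = [0] * depth

    def cycle(self, op: str, addr: int, data: int = 0):
        if op == "write":
            self.mem[addr] = data
            return None
        return self.mem[addr]  # op == "read"

# Time-multiplex the single port: the producer writes on one cycle and
# the consumer reads on the next, so a continuous data stream passes
# through without needing a two-port RAM.
ram = SinglePortRAM(depth=8)
out = []
for i in range(8):
    ram.cycle("write", i % 8, i)          # producer's cycle
    out.append(ram.cycle("read", i % 8))  # consumer's cycle
# out == [0, 1, 2, 3, 4, 5, 6, 7]
```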
Synopsys’ DesignWare VESA DSC IP has the necessary memory interface and memory scheduling logic to interface with single port memories and addresses the need for reduced memory area and power.
An incorrect DSC configuration can result in an unresponsive system; thus, error reporting features are essential for quick error detection and resolution. The DSC blocks are configured through the Picture Parameter Set (PPS). The encoder and decoder need to have the same PPS settings for the system to work. The decoder indicates its capability through the Capability Parameter Set (CPS), requiring the PPS to be set accordingly.
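As a sketch of such a configuration check, the snippet below compares a few PPS settings against the decoder's advertised capabilities. The dictionary field names here are illustrative only; they are not the actual PPS/CPS bit fields defined by the DSC standard.

```python
def pps_within_cps(pps: dict, cps: dict) -> list:
    """Return error strings for PPS settings that exceed the decoder's
    advertised capabilities (field names are hypothetical)."""
    errors = []
    if pps["bits_per_component"] > cps["max_bits_per_component"]:
        errors.append("bits_per_component exceeds decoder capability")
    if pps["slice_width"] > cps["max_slice_width"]:
        errors.append("slice_width exceeds decoder capability")
    return errors

# A configuration the decoder can accept yields no errors
pps = {"bits_per_component": 10, "slice_width": 1920}
cps = {"max_bits_per_component": 12, "max_slice_width": 2560}
assert pps_within_cps(pps, cps) == []
```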
For the datapath operation, a self-recovery mechanism is essential. The DSC algorithm is structured so that each slice operation is self-governed, which implies that an error in one slice does not impact the operation of the next slice.
Additionally, the decoder needs to check for errors that may occur due to data corruption in the DSC stream.
The DSC standard supports aggregating multiple video streams and transferring them over a single link, which adds implementation complexity. Figure 3 shows a DSC host and device controller with four slices per line, where all four slices are serviced by a single link.
Figure 3: System with DSC Codec with four slices per line for a link
Figure 4 shows a DSC mode where two video streams travel over one link. Hence, with the same DSC configuration, two streams can be served to carry different video content.
Figure 4: System with DSC showing two slices per line serving two separate video streams
As demand for high-definition, rich visual content in a wide range of devices increases, so does the need for higher bandwidth. Display standards organizations such as the MIPI Alliance, VESA, and the HDMI Forum have collaborated to develop the visually lossless Display Stream Compression (DSC) standard to allow scaling beyond the uncompressed link limits of 48 Gbps for HDMI 2.1 and 32.4 Gbps for VESA DisplayPort 1.4. Industry reports predict that mobile devices such as smartphones, AR/VR headsets, and automotive systems have a growing need for higher display bandwidth, hence the need for a compression and decompression solution such as the VESA DSC standard. Implementing the algorithm can be challenging: some of the implementation challenges include support for multiple slices per line, determining the required amount of pixel padding, reducing memory footprint and power, error detection and recovery, and stream aggregation.
To minimize integration risk and accelerate time-to-market, Synopsys offers designers a complete and compliant VESA DSC IP solution that interoperates with DesignWare HDMI 2.1, DisplayPort, and MIPI DSI IP. It scales up to 16 slices to provide high-performance data transmission. The VESA DSC encoder and decoder are compliant with the DSC 1.1 and 1.2a standards, supporting the required 120Hz refresh rate for up to 10K resolutions.
References: