Cloud native EDA tools & pre-optimized hardware platforms
Derya Eker, ARC Processors Engineering Manager, Synopsys
Diego Gonzalez Montes, ARC Processors R&D Engineer, Synopsys
Today’s high-end systems-on-chips (SoCs) need to handle increasingly compute-intensive workloads but must carefully balance power-to-performance tradeoffs. The demand for wide deployment of artificial intelligence (AI) and deep learning is surging. Face recognition is paramount in mobile phones and is extending to smart wearables. Identifying objects and surroundings in augmented- and virtual-reality headsets pushes the envelope further. Self-driving cars apply deep learning to interpret, predict and respond to data coming from their surroundings for safer, smarter autonomous driving.
To optimize for both power and performance, hardware is becoming more tightly intertwined with software. Designers must make key architectural choices, such as hardware/software workload partitioning and IP vendor selection, in the early phases of product development. Today’s SoCs represent a multimillion-dollar investment, so accurate estimation of SoC power can determine whether your chip is a success or a failure.
By the nature of deep learning applications, most of the processing elements in a chip are busy for long periods to sustain compute power. The power dissipation must fit within the power budget of the target device, whether a smartphone or an autonomous car. In addition, battery lifetime, thermal issues affecting reliability, packaging, and cooling all add further constraints. Designing to hit your power budget is critical, so accurate and predictable power estimation early on is increasingly important.
Before discussing the challenges of estimating power both accurately and as early as possible, let’s briefly look at how designers calculate power. Power dissipation (Ptotal) of a device can be split into two types: dynamic power consumption (Pdynamic) caused by the switching activity, and static power consumption (Pstatic). The main parameters that impact these two components are summarized in Figure 1.
Figure 1: Variables in power consumption to consider in vision SoC designs
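The split into dynamic and static power can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the sketch assumes the standard first-order CMOS model: dynamic power as activity factor times switched capacitance times supply voltage squared times clock frequency, and static power as supply voltage times leakage current. The function and parameter names are illustrative and do not come from any Synopsys tool.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power in watts, first-order model: alpha * C * Vdd^2 * f.

    activity      -- average fraction of capacitance switched per cycle (0..1)
    capacitance_f -- total switched capacitance in farads
    vdd_v         -- supply voltage in volts
    freq_hz       -- clock frequency in hertz
    """
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def static_power(vdd_v, leakage_a):
    """Leakage power in watts, first-order model: Vdd * I_leak."""
    return vdd_v * leakage_a


def total_power(activity, capacitance_f, vdd_v, freq_hz, leakage_a):
    """Ptotal = Pdynamic + Pstatic."""
    return (dynamic_power(activity, capacitance_f, vdd_v, freq_hz)
            + static_power(vdd_v, leakage_a))
```

Because supply voltage enters the dynamic term squared, halving it cuts dynamic power to a quarter in this model, while the static term only scales linearly with voltage for a fixed leakage current.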
Figure 2 shows the impact of process technology scaling on both dynamic power and static power. As we move to smaller technology nodes, dynamic power goes down. However, static power becomes more dominant due to increased leakage current (left). In addition, within a single technology node, the threshold voltage (Vth) of the cells affects frequency/delay and leakage power (right).
To achieve higher frequency and performance, designers might want to use low-Vth cells, which switch faster. However, a lower threshold voltage increases leakage current, which is part of static power. Throughout the design process, designers must constantly trade off power against performance.
Figure 2: Impact of process technology scaling on dynamic and static power
Designers can apply a wide range of techniques to reduce power.
As performance and power are closely interlinked in the AI and vision domain, individual metrics on performance and power do not give the full picture. There are many factors affecting accuracy and correctness of power estimation. Therefore, the conditions under which power is estimated must be explicitly clarified. Let’s start with the commonly used metrics:
Energy in joules per frame for representative graphs is the most accurate metric for evaluating the power consumption of CNN applications. However, computing the average power per frame is challenging. In many cases, to maximize throughput, systems process multiple images simultaneously, either in batch or pipeline mode. Since power and performance are closely related, power measurements should be done with the right batch size and/or when the pipeline is in a steady state. Processing only a single frame can take hundreds of millions of cycles. Reaching the correct steady state for measurement requires many more cycles, sometimes on the order of billions. State-of-the-art simulation tools cannot handle this kind of workload in a reasonable time, not even for the smallest graphs.
Instead, designers often measure the energy efficiency of a single convolution layer of a graph such as the multi-layered SegNet graph (Figure 3). However, a common pitfall is to extrapolate that result to the full graph. Taking such shortcuts can be misleading for several reasons:
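As a minimal sketch of the joules-per-frame metric, the Python function below integrates a sampled power trace over a measurement window and divides by the number of frames completed in that window. The trace format, the fixed sample period, and the `warmup_samples` parameter for discarding the samples taken before the pipeline reaches a steady state are assumptions made for illustration, not the output format of any particular tool.

```python
def energy_per_frame(power_samples_w, sample_period_s, frames_completed,
                     warmup_samples=0):
    """Average energy per frame in joules over a steady-state window.

    power_samples_w  -- power samples in watts, taken at a fixed period
    sample_period_s  -- time between consecutive samples in seconds
    frames_completed -- frames finished inside the steady-state window
    warmup_samples   -- leading samples to discard (pipeline still filling)
    """
    if frames_completed <= 0:
        raise ValueError("frames_completed must be positive")
    if sample_period_s <= 0:
        raise ValueError("sample_period_s must be positive")
    steady = power_samples_w[warmup_samples:]
    if not steady:
        raise ValueError("no samples left after discarding warm-up")
    # Rectangle-rule integration of power over time gives energy in joules.
    energy_j = sum(steady) * sample_period_s
    return energy_j / frames_completed
```

Note that `frames_completed` must count only the frames finished inside the window that remains after warm-up is discarded. Mixing warm-up frames into the count would understate the energy per frame.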
Hence, depending on position or graph architecture, the same layer may require a different amount of energy. In addition, other layers, such as activation functions, element-wise operations, and deconvolution also need to be accounted for.
Figure 3: SegNet architecture implements multiple layers. Depending on position or graph architecture, the same layer may require a different amount of energy, so no single layer can be extrapolated to represent the entire graph
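The extrapolation pitfall can be illustrated with a small Python sketch. It compares a full-graph energy obtained by summing per-layer energies against an estimate obtained by multiplying one layer's energy by the layer count. The layer names and the dictionary layout are hypothetical. Real per-layer energies would have to come from measurement.

```python
def graph_energy(layer_energies_j):
    """Full-graph energy per frame: the sum of every layer's energy (joules)."""
    return sum(layer_energies_j.values())


def extrapolated_energy(layer_energies_j, reference_layer):
    """Naive estimate: one layer's energy multiplied by the layer count."""
    return layer_energies_j[reference_layer] * len(layer_energies_j)


def extrapolation_error(layer_energies_j, reference_layer):
    """Relative error of the naive estimate against the per-layer sum.

    Positive means the extrapolation overestimates, negative means it
    underestimates.
    """
    actual = graph_energy(layer_energies_j)
    estimate = extrapolated_energy(layer_energies_j, reference_layer)
    return (estimate - actual) / actual
```

With placeholder values of 4, 2, 1 and 3 joules for four layers, extrapolating from the most expensive layer overestimates the graph by 60%, and extrapolating from the cheapest one underestimates it by 60%. The sign and size of the error depend entirely on which layer happens to be measured.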
Orthogonal to the stimuli used, power estimation accuracy is greatly affected by the applied power estimation methodology combined with the abstraction level of the design that is measured:
The more detail the actual implementation captures, the more accurate the power estimate becomes. RTL simulation of a small synthetic benchmark may complete in minutes, but for a netlist it can take hours or days. Simulation of a very deep CNN graph with all implementation details included may require weeks. This simulation time challenge increases the risk that IP vendors skip such detailed power analysis and accurate power estimation. As a result, the actual power consumption may exceed the power budget, a clear product risk that surfaces late, during the SoC power sign-off phase.
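The simulation-time problem is simple arithmetic: wall-clock time is the number of cycles to execute divided by the speed of the engine that executes them. The Python sketch below makes that explicit. The function name and the idea of characterizing an engine by a single sustained cycles-per-second figure are simplifying assumptions, and actual speeds vary widely with design size and abstraction level.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(cycles, cycles_per_second):
    """Wall-clock days needed to execute `cycles` design clock cycles.

    cycles            -- number of design clock cycles in the workload
    cycles_per_second -- sustained speed of the simulator or emulator
    """
    if cycles_per_second <= 0:
        raise ValueError("cycles_per_second must be positive")
    return cycles / cycles_per_second / SECONDS_PER_DAY
```

Plugging in the cycle count of a steady-state CNN workload and the measured speed of a given engine shows quickly whether a run fits in a project schedule.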
Executing billions of cycles of a CNN graph on a full-layout netlist, as maximum accuracy in power measurement requires, is beyond what simulation tools alone can deliver. Synopsys’ ZeBu® Empower provides a solution that helps both IP developers and SoC designers compute power accurately for hundreds of millions of processed cycles in minutes or hours instead of weeks or months. ZeBu Empower also supports advanced use modes, including power management verification, comprehensive debug, integration with Synopsys’ verification ecosystem, hybrid emulation with virtual prototypes, and architectural exploration and optimization. Access to ZeBu Empower therefore enables both easy exploration of power/performance tradeoffs with application software on various candidate hardware architectures and efficient generation of sign-off quality power estimates, which helps tune the power consumption of all elements in a system at each stage of the design cycle. Designers using Synopsys’ DesignWare® ARC® EV7x Vision Processors are adopting the ZeBu Empower software-based power estimation and sign-off flow to get the most accurate and realistic power estimates when using the EV7x processor for high-performance deep learning applications.
Estimating the power consumption of IP blocks for AI applications in an SoC can be a challenge. Designers need to carefully consider all aspects of the power estimation process to ensure that the decisions they make early on allow them to stay within their power budgets when silicon comes back. Verifying a design on ZeBu Empower is a more accurate means of estimating and tuning power consumption than deriving estimates from a single convolutional layer.