Cloud native EDA tools & pre-optimized hardware platforms
Ron Lowman, Product Marketing Manager, Synopsys
Over the past decade, designers have developed silicon technologies that run advanced deep learning mathematics fast enough to explore and implement artificial intelligence (AI) applications such as object identification, voice and facial recognition, and more. Machine vision applications, which are now often more accurate than a human, are one of the key functions driving new system-on-chip (SoC) investments to satisfy the development of AI for everyday applications. Using convolutional neural networks (CNNs) and other deep learning algorithms in vision applications has made such an impact that AI capabilities within SoCs are becoming pervasive. Semico's 2018 AI report summarized it effectively: "...some level of AI function in literally every type of silicon is strong and gaining momentum."
In addition to vision, deep learning is used to solve complex problems such as 5G implementation for cellular infrastructure and the simplification of 5G operational tasks through networks that configure, optimize, and repair themselves, known as Self-Organizing Networks (SON). 5G networks add new layers of complexity, including beamforming, additional spectrum in the mmWave bands, carrier aggregation, and higher bandwidths, all of which will require machine learning algorithms to optimize and handle the data appropriately in a distributed system.
Both industry giants and hundreds of startups are focused on driving AI capabilities into scores of new SoCs and chipsets in industries across the spectrum, from cloud server farms to home assistants in every kitchen. SoC designers are drawing on more from biology than just the neural networks they aim to replicate: they are embracing both the fundamental building blocks of a device (its nature, or DNA) and the nurturing of an AI design (its environment of design tools, services, and expertise) to outperform their competition and consistently improve their products.
Adding AI capabilities to SoCs has highlighted weaknesses in today's SoC architectures for AI. Vision, voice recognition, and other deep learning/machine learning algorithms are resource-starved when implemented on SoCs built for non-AI applications. The IP selected and integrated clearly determines the baseline effectiveness of an AI SoC and makes up its "DNA," or nature (see: The DNA of an Artificial Intelligence SoC). For example, introducing custom processors, or arrays of processors, can accelerate the massive matrix multiplications needed in AI applications.
The element of nurture, however, affects how those pieces function together in hardware and how the IP can be tuned for a more effective AI SoC. Optimizing, testing, and benchmarking the SoC's performance requires tools, services, and/or expertise, and nurturing the design with these customizations and optimizations during the design process can ultimately determine the SoC's success in the market.
Using tools, services, and expertise to reduce power consumption, improve performance, and control cost is becoming more important as AI SoCs continue to increase in complexity. Designers need a wide array of nurturing methods to accelerate their design process and achieve silicon success.
Relying on traditional design processes will not produce the high-performance, market-leading AI solutions that every company aims for. Designers must consider a wide array of semiconductor solutions. A 2018 Semico market report states that "Architectures for both training and inference are continually being refined to arrive at the optimum configuration to deliver the right level of performance."
Datacenter architectures include GPUs, FPGAs, ASICs, CPUs, accelerators, and High-Performance Computing (HPC) solutions, while the mobile market is a potpourri of heterogeneous on-chip processing solutions such as ISPs, DSPs, multi-core application processors, and audio and sensor processing subsystems. These heterogeneous solutions are leveraged effectively with proprietary SDKs to accommodate AI and deep learning capabilities. In addition, the automotive market sees large variations based on expected autonomous capabilities; the bandwidths and compute capabilities of Level 5 autonomous SoCs, as can be expected, far exceed those of Level 2+ SoCs.
Three consistent challenges recur within these AI designs: specialized processing and memory performance, architectural flexibility for rapidly evolving algorithms, and meaningful benchmarking.
One of the biggest hurdles for machine learning algorithms is that the memory access and processing capabilities of traditional SoC architectures are not as efficient as needed. For example, popular von Neumann architectures have been criticized as not effective enough for AI, resulting in a race to build a better machine (i.e., a better SoC system design).
Those fortunate enough to be designing second- and third-generation AI-targeted SoCs have added more efficient AI hardware accelerators and/or have chosen to add capabilities to existing ISPs and DSPs to accommodate neural network challenges.
However, simply adding an efficient matrix multiplication accelerator or a high-bandwidth memory interface is proving helpful but insufficient for market leadership in AI, reinforcing the need for AI-specific optimizations throughout system design.
Machine learning and deep learning apply to a wide variety of applications, so designers vary widely in how they define the objective of a specific hardware implementation. In addition, the underlying math for machine learning is changing rapidly, making architectural flexibility a strong requirement. Vertically integrated companies may be able to narrow the scope of their designs to a specific purpose, increasing optimization, while still accommodating the flexibility to match additional, evolving algorithms.
Finally, benchmarking across AI algorithms and chips is still in its infancy, as discussed in The Linley Microprocessor Report's "AI Benchmarks Remain Immature":
“Several popular benchmark programs evaluate CPU and graphics performance, but even as AI workloads have become more common, comparing AI performance remains a challenge. Many chip vendors quote only peak execution rate in floating-point operations per second or, for integer-only designs, operations per second. But like CPUs, deep-learning accelerators (DLAs) often operate well below their peak theoretical performance owing to bottlenecks in the software, memory, or some other part of the design. Everyone agrees performance should be measured when running real applications, but they disagree on what applications and how to run them.” (January 2019)
Interesting new benchmarks are beginning to address specific markets. As an example, MLPerf is currently tackling the effectiveness of training AI SoCs and has plans to expand. While this is a great start to addressing the challenges of benchmarking, training represents only a small subset of the many different markets, algorithms, frameworks, and compression techniques that influence a system's results.
Another organization, AI-Benchmark, focuses on benchmarking the AI capabilities of mobile phones. Mobile phones use a handful of chipsets, and some early-generation versions include no AI acceleration beyond their traditional processors, relying instead on AI-specific software development kits (SDKs). These benchmarks show that leveraging existing, non-AI-optimized processing solutions does not provide the required throughput.
The processor or array of processors selected typically has a maximum rating in operations per second, or a top frequency for a specific process technology, and its performance is further dictated by the capability of each instruction. Interface IP (PCIe®, MIPI, DDR) and foundation IP (Logic Libraries, Memory Compilers) likewise have maximum theoretical memory bandwidth and data throughput levels that, in the case of interface IP, are often defined by standards organizations.
However, the true performance of a system is not the sum of these parts; it lies in the ability to properly connect processors, memory interfaces, and data pipes together. The total system performance is a result of the capabilities of each integrated piece and how well they are optimized with one another.
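To illustrate why system throughput is not simply the sum of datasheet maximums, the sketch below applies a simple roofline-style bound: achievable throughput is limited by either the compute peak or the memory bandwidth multiplied by the workload's arithmetic intensity, whichever is lower. The accelerator figures and layer intensities are illustrative assumptions, not ratings for any particular IP.

```python
# Minimal roofline-style estimate with assumed, illustrative numbers only.

def attainable_tops(peak_tops: float, mem_bw_gbps: float, ops_per_byte: float) -> float:
    """Throughput is capped by the compute peak or by memory bandwidth
    times the workload's arithmetic intensity (ops per byte moved)."""
    bandwidth_bound_tops = mem_bw_gbps * ops_per_byte / 1000.0  # GB/s * ops/B -> Gops/s -> TOPS
    return min(peak_tops, bandwidth_bound_tops)

# Hypothetical accelerator: 100 TOPS peak compute, 256 GB/s memory bandwidth.
peak, bw = 100.0, 256.0

for name, intensity in [("memory-bound layer", 20),
                        ("balanced layer", 400),
                        ("compute-bound layer", 2000)]:
    tops = attainable_tops(peak, bw, intensity)
    print(f"{name:20s} intensity={intensity:5d} ops/B -> ~{tops:6.1f} TOPS "
          f"({100 * tops / peak:.0f}% of peak)")
```

In this toy model the memory-bound layer reaches only a few percent of the rated peak, which is exactly the gap between datasheet numbers and delivered system performance that architectural optimization targets.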
While designers have made rapid advancements in processors, SDKs, the underlying math, and other contributing aspects of an AI SoC design, these changes have made apples-to-apples comparisons difficult.
Compression will be a critical component for edge AI, such as a camera performing real-time facial recognition, a car navigating autonomously, or a video pipeline performing super-resolution. The market appears to be just scratching the surface of compression. Understanding the type of algorithm, and the level of accuracy that a given level of compression preserves, is difficult and can require iterations of trial and error.
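As a minimal sketch of this tradeoff, assuming a symmetric linear INT8 quantization scheme applied to randomly generated weights (not any production compression flow), the snippet below shows the storage savings and the reconstruction error that compression introduces.

```python
import numpy as np

# Post-training-style quantization sketch with random, illustrative weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=10_000).astype(np.float32)

# Symmetric linear quantization to signed 8-bit integers.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

mse = float(np.mean((weights - dequant) ** 2))
ratio = weights.nbytes / q.nbytes  # 4 bytes per FP32 weight -> 1 byte per INT8 weight
print(f"compression ratio: {ratio:.0f}x, mean-squared error: {mse:.2e}")
```

In a real flow, the accuracy impact of each compression level would be re-measured on a validation set, which is where the trial-and-error iterations come from.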
While the role of hardware/software co-design has been discussed for many years, AI SoCs are likely to magnify its importance in actual implementation. The concept of co-designing an AI chip is not limited to hardware and software: memories and processors will also need to be co-designed specifically for AI.
For instance, co-design is evident in how many Google TPUs are paired with each Intel Xeon host processor in a system, a ratio outlined in the configuration and software programming manuals for their single-board computers.
Using different AI frameworks for the same AI algorithm is another example of where co-design can increase efficiency. The output of each framework can require a different memory capacity, and understanding those memory capacities prior to hardware design enables designers to optimize for power, area, and performance.
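To illustrate the point, the sketch below compares the weight footprint of a single hypothetical convolution layer exported at different precisions, as different framework toolchains may produce; the layer shape and precisions are assumptions chosen for illustration.

```python
# Illustrative weight-footprint comparison for one hypothetical conv layer:
# 3x3 kernel, 256 input channels, 256 output channels, plus bias terms.
weight_count = 3 * 3 * 256 * 256 + 256

bytes_per_element = {"fp32": 4, "fp16": 2, "int8": 1}
for fmt, nbytes in bytes_per_element.items():
    footprint_kib = weight_count * nbytes / 1024
    print(f"{fmt}: {footprint_kib:,.0f} KiB of weight storage")
```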
Co-designing memory and processing for AI is imperative. For instance, deep learning algorithms require storage of weights, activations, and other components. Interestingly, the activations can be recalculated each time instead of stored, reducing memory requirements. Even though the additional processing resources, or additional time spent processing, must be accounted for, the savings in memory and power consumption can outweigh the penalties. On a similar note, in-memory-compute technologies may play a future role in AI SoCs.
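The store-versus-recompute tradeoff can be sketched with simple arithmetic: dropping stored activations for selected layers saves memory at the cost of re-running those layers' forward passes. The per-layer byte and operation counts below are placeholders, not measurements of any real network.

```python
# Store-vs-recompute sketch with assumed, per-layer placeholder numbers.
layers = [
    # (activation bytes to store, ops needed to recompute that activation)
    (4_000_000, 2_000_000_000),
    (2_000_000, 1_500_000_000),
    (1_000_000, 1_000_000_000),
]

store_all_bytes = sum(act for act, _ in layers)

# Recompute the two largest activations instead of storing them.
recomputed = sorted(layers, key=lambda x: x[0], reverse=True)[:2]
saved_bytes = sum(act for act, _ in recomputed)
extra_ops = sum(ops for _, ops in recomputed)

print(f"activation memory saved: {saved_bytes / 1e6:.1f} MB "
      f"({100 * saved_bytes / store_all_bytes:.0f}% of total)")
print(f"extra compute per pass:  {extra_ops / 1e9:.1f} Gops")
```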
These co-design examples are driven by new investments in AI, and this trend will continue to require new and additional expertise.
The tradeoffs required when co-designing a system architecture can be optimized by AI implementation experts. Experts not only carry knowledge of how things worked in a prior design; they also have a strong understanding of the tools and services that best enable a design's success. During the AI design process, designers are adopting simulation, prototyping, and architectural exploration to implement best design practices quickly.
For example, consider a chipset that must perform a very difficult task within a very limited power budget. The pipes within the SoC must be wide enough to move data between processors, memory, and other system components without consuming significant resources. The narrower the pipes, the more processors and memory can be added; the wider the pipes, the less processing and memory are available, which directly impacts AI performance. These tradeoffs can be modeled in simulators, prototyping environments, and architectural exploration tools, giving the AI design critical market advantages.
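The pipe-versus-processor tradeoff can be explored with a toy analytical model well before committing to silicon, which is essentially what architectural exploration tools automate at far greater fidelity. The area budget, per-unit costs, and workload intensity below are all assumed placeholders.

```python
# Toy architectural sweep: split a fixed area budget between interconnect
# width (bandwidth) and compute units, then take the roofline-style minimum.
AREA_BUDGET = 100.0      # arbitrary area units (assumed)
AREA_PER_GBPS = 0.05     # interconnect area per GB/s of bandwidth (assumed)
AREA_PER_TOPS = 2.0      # compute area per TOPS of peak throughput (assumed)
OPS_PER_BYTE = 150.0     # arithmetic intensity of the target workload (assumed)

best = None
for pipe_share in (0.1, 0.2, 0.3, 0.4, 0.5):
    bw_gbps = pipe_share * AREA_BUDGET / AREA_PER_GBPS
    peak_tops = (1 - pipe_share) * AREA_BUDGET / AREA_PER_TOPS
    achieved = min(peak_tops, bw_gbps * OPS_PER_BYTE / 1000.0)
    print(f"pipes {pipe_share:.0%} of area -> {bw_gbps:5.0f} GB/s, "
          f"{peak_tops:4.1f} TOPS peak, ~{achieved:4.1f} TOPS achieved")
    if best is None or achieved > best[1]:
        best = (pipe_share, achieved)

print(f"best split in this toy model: {best[0]:.0%} of area on pipes")
```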
The development process for SoCs continues to change but inherently includes standard stages such as system specification and architectural design; logic and functional circuit design; physical design, verification, and analysis; fabrication, packaging, and testing; and post-silicon validation. New AI capabilities add complexity at each stage. The integrated IP clearly dictates some theoretical maximum capabilities, but how the design is nurtured determines how close the implementation creeps to those theoretical maximums.
Because traditional architectures have proven inefficient for AI SoCs, system specification now demands more and more architectural exploration to optimize the design, and architectural services have grown correspondingly more important.
Moreover, AI SoCs are being modified from generation to generation, leveraging experienced design teams for optimization and customization. Deep learning algorithms include many stored weights, which are ideally kept in on-chip SRAM to save power and processing effort, and customizing SRAM compilers for power and density is a clear trend.
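The power motivation for keeping weights on chip can be sketched with the commonly cited rule of thumb that off-chip DRAM accesses cost roughly two orders of magnitude more energy than on-chip SRAM accesses; the per-byte figures below are illustrative placeholders rather than process data.

```python
# Rough energy comparison for fetching all weights once per inference.
# Per-byte energies are illustrative placeholders, not measured process data.
WEIGHT_BYTES = 8 * 1024 * 1024        # 8 MB of weights (assumed model size)
SRAM_PJ_PER_BYTE = 5.0                # assumed on-chip SRAM access energy
DRAM_PJ_PER_BYTE = 500.0              # assumed off-chip DRAM access energy

sram_uj = WEIGHT_BYTES * SRAM_PJ_PER_BYTE / 1e6
dram_uj = WEIGHT_BYTES * DRAM_PJ_PER_BYTE / 1e6
print(f"on-chip SRAM:  ~{sram_uj:8.1f} uJ per full weight fetch")
print(f"off-chip DRAM: ~{dram_uj:8.1f} uJ per full weight fetch")
print(f"ratio: ~{dram_uj / sram_uj:.0f}x")
```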
AI brings new challenges with respect to security breaches. The data for AI tends to be private, the algorithms developed are very expensive, and the corruption of even a single bit can be catastrophic to the accuracy of the end results. Implementing a full Root of Trust subsystem or secure enclave can be valuable, but may require additional consulting to ensure protection against specific breaches based on the threat profiles defined early in the SoC process.
Machine learning math can require scalar, vector, and massive matrix multiplication, and specialized processors can be designed to optimize specific algorithms. Custom processors are one of the most popular IP developments for new AI SoC solutions. Tools to design custom processors are becoming inherently valuable, both to ensure that gate-level optimizations are leveraged and reused and to keep up with the ecosystem required to support custom processors. For instance, RISC-V has gained popularity; however, it defines only an instruction set, which often needs additional special instructions to handle machine learning, along with the necessary compilers and specific design instantiations for optimization. The costs of design, support, and software implementation must be planned for and supported long term by internal design teams, and having tools and support to manage this brings great benefits for successful implementations.
Developing an AI SoC requires some of the most innovative IP on the market. Examples include the quick adoption of new technologies such as HBM2e, PCIe5, CCIX, and the latest in MIPI. To nurture the design implementation of these standard technologies, designers require advanced simulation and prototyping solutions with support for early software development and performance validation. These tools are being adopted much more regularly for AI, again due to the immaturity and complexity of the designs.
A pre-built AI SoC verification environment can only be leveraged by those with AI SoC development experience. Design services firms and companies designing their second and subsequent generations of chipsets therefore hold an inherent time-to-market advantage over newcomers. Designers can rely on services as an effective way to leverage AI SoC expertise for faster time-to-market, freeing internal design teams to focus on the differentiating features of the design.
Hardening services for interface IP, another optimization lever, enable lower-power and lower-area implementations. Hardened IP frees room on the SoC for the valuable on-chip SRAM and processor components needed for better AI performance.
Finally, benchmarking different AI graphs easily and quickly comes with expertise and established tool chains. Hand-writing these graphs for benchmarking could be an arduous task, but a necessary one to understand whether the SoC design can provide the needed value. Relying on processors whose tools can benchmark these graphs effectively and quickly expedites the system design and ensures it meets requirements.
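As a minimal sketch of the kind of harness such tool chains provide, the snippet below times an arbitrary inference callable over repeated runs and reports latency statistics; the run_graph stub is a hypothetical stand-in for whatever compiled graph or processor toolchain is actually under evaluation.

```python
import statistics
import time

def benchmark(run_graph, warmup: int = 5, iters: int = 50) -> dict:
    """Time a single-graph inference callable and summarize latency in ms."""
    for _ in range(warmup):          # warm caches / runtime before measuring
        run_graph()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_graph()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "mean_ms": statistics.mean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p95_ms": samples[int(len(samples) * 0.95) - 1],  # approximate percentile
    }

def run_graph():
    # Hypothetical stand-in for invoking a compiled AI graph on the target.
    time.sleep(0.002)

print(benchmark(run_graph))
```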
AI SoCs are using some of the most advanced FinFET process nodes to increase performance, cut power, and increase on-chip memory and compute capabilities. But from a testability standpoint, the latest process nodes increase the number of test modes and the potential for soft defects. Test integration, repair, and diagnostic capabilities help designers overcome the testability hurdle; tools such as Synopsys' DesignWare STAR Memory System and STAR Hierarchical System are effective in solving AI test needs.
New technologies such as HBM2 and the future HBM2e require special packaging capabilities, leading to the need for careful bump planning and other packaging expertise in the development of AI SoCs.
As AI capabilities enter new markets, the IP selected for integration provides the critical components of an AI SoC. But beyond the IP, designers are finding a clear advantage in leveraging AI expertise, services, and tools to ensure the design is delivered on time, with a high level of quality and value to the end customer, for new and innovative applications.