Examining Silent Data Corruption: A Lurking, Persistent Problem in Computing

Jyotika Athavale, Randy Fish

Jul 24, 2024 / 4 min read

Many computing errors have historically been blamed on bad code, flawed algorithms, or user error. That makes sense: many performance issues are easily traced to software, which has long appeared to be a major root cause of computing errors.

Or has it?

Over the last decade or so, a sleeping giant has been uncovered, lurking in the components that undergird all computing: hardware. More specifically, a hardware problem known as Silent Data Corruption (SDC) is to blame for many performance issues. As computing scales rapidly to meet the demands of AI and machine learning workloads, the problem of Silent Data Corruption has only grown more acute.

But what is Silent Data Corruption? How do we stop it? And why is it such a pervasive, difficult problem to address?

We sat down with Rama Govindaraju, principal engineer at Google, and Robert S. Chappell, partner hardware architect at Microsoft, to get to the bottom of these questions and more.


What Is Silent Data Corruption?

Silent Data Corruption happens when an impacted device inadvertently causes silent (unnoticed) errors in the data it processes.

For example, an impacted CPU might miscalculate data (such as 1+1=3), and there may be no indication of these errors unless regular scans are conducted, hence the “silent” moniker. In short, these miscalculations are hard to detect and rectify.
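The "regular scans" mentioned above are, at their simplest, known-answer tests: rerun a computation whose correct result is already known and flag any mismatch. The snippet below is a minimal, hypothetical Python sketch of that idea; real screening suites exercise far more of a core's logic and run across entire fleets.

```python
# Minimal sketch of a known-answer scan (not a production screening tool).
# A healthy core always reproduces the precomputed result; a core with a
# silent defect can return a wrong value without raising any hardware fault.

def known_answer_check(iterations: int = 1_000_000) -> bool:
    """Recompute a value with a known-correct result and compare."""
    expected = 499_999_500_000  # sum of 0..999,999, precomputed offline
    actual = sum(range(iterations))
    return actual == expected

if __name__ == "__main__":
    if not known_answer_check():
        # Reaching this branch suggests a silent miscalculation, since no
        # exception or machine-check error was reported along the way.
        print("WARNING: possible silent data corruption detected")
    else:
        print("check passed")
```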

If something goes wrong with the software, there are fail-stop mechanisms, user notifications, and various other alerts or indications that something needs to be fixed. With SDC incidents in hardware, there is no notification that something has been miscalculated, leading to corrupted datasets that can go completely undetected.

Due in part to SDC’s stealthy nature, it’s difficult to detect exactly how long it has been a phenomenon in computing. However, it has at the very least been a known problem in the industry over the last seven to eight years or so. 

The challenges we currently face to solve this problem are multi-faceted:

  • Product Lifecycles: Chips and processors have a long production cycle, and while the problem of SDC has been known for several years, it can take a few more years before fixes are reflected in new hardware. This dynamic means we are still fighting this issue from behind.
  • Cost: Like many issues in computing, cost is a large factor. Whether it means changing production cycles or delaying product releases until parts are fully vetted, many leaders are wary of taking on a costly change to prevent SDC.
  • Proving the Problem/Showing ROI: Relatedly, the scale of SDC errors has proven hard to measure, which makes it difficult to convince decision-makers that this is a serious problem.

Regarding the cost question, we are of the mind that the errors caused by SDC will cost organizations many, many times more to fix than to prevent ahead of time. Trying to debug problems caused by SDC can take many months, which is simply not scalable for most businesses. Benjamin Franklin once said, “An ounce of prevention is worth a pound of cure.” That sentiment is apt here.

Let’s say only one in 1,000 chips is defective. That doesn’t sound like much, and maybe in the past it wouldn’t have caused too many issues. But in today’s world, machine learning algorithms are running on tens of thousands of chips. This means that over a long run time, these corruptions can derail entire datasets and demand massive expenditure to fix. As those working in the AI field know all too well, workloads are increasingly occupying a huge footprint and disruptions are increasing by the day.
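A quick back-of-the-envelope calculation shows why a "rare" defect rate matters at this scale. The 1-in-1,000 rate comes from the example above; the 20,000-chip fleet size below is an assumed, illustrative number, not a figure from the panel.

```python
# Rough illustration: a rare per-chip defect rate becomes a near-certainty
# once a single job spans tens of thousands of chips.
p_defective = 1 / 1000      # per-chip defect rate from the example above
fleet_size = 20_000         # assumed size of one large ML training fleet

expected_defective = p_defective * fleet_size
prob_at_least_one = 1 - (1 - p_defective) ** fleet_size

print(f"Expected defective chips in the fleet: {expected_defective:.0f}")
print(f"Probability the job touches at least one: {prob_at_least_one:.10f}")
# -> roughly 20 defective chips, and the probability is effectively 1.0
```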

We need to ask ourselves how we can become better at screening. We need to research up and down the entire stack, from hardware to software and everything in between. More critically, we need holistic solutions starting in design or even process technology.


Rate of defect screening with DCDIAG test on third-generation Intel Xeon Scalable SoCs (Source: Intel)

A Growing Problem

Unfortunately, SDC is a problem that is getting worse as time goes on. The scale of computing needs is not slowing or even leveling out; rather, it’s accelerating at an unprecedented rate. Integration is increasing, with more resources packed into single parts, which leads to more complex processes in both production and deployment. Systems are stressed more than ever, and there’s little relief in sight.

Our challenge now is explaining the scope of the problem as we understand it presently. We are still trying to get our arms around just how widespread SDC is. It’s difficult to attack a problem that is not fully illuminated, much less explain to others why said problem should be taken seriously.

Industry leaders may be justified in not expending resources on a problem they don’t fully understand. But errors at the scale SDC can introduce are far more costly to fix after the fact than to prevent. We need academics and researchers, business leaders, engineers, and everyone in the production and business operations ecosystems to come together and take this problem seriously before it gets out of control.

To learn more, check out Speaking Up About Silent Data Corruption and the second part of this blog series that will highlight solutions for addressing SDC. You can also watch the full-length video of our panel session with Google and Microsoft below.
