Once and future code snippets: How AI reignites risk

Phil Odence

Jul 24, 2024 / 3 min read

Fifteen years ago, code snippets copied from copyleft-licensed open source projects represented the biggest open source risk in software. The Heartbleed vulnerability, discovered in April 2014, brought concerns about the security of open source components to the fore, and license risk took a bit of a back seat. But the problem never went away. Now, the advent of Generative AI as a tool for writing software is shining a new light on the issue.

Code snippets are a legal concern only because of the way some open source licenses are written. One of the most common open source licenses is GPL 2.0. The obligations of this license apply to “work containing the Program or a portion of it.” “Program” refers to a GPL-licensed component, and a snippet is a portion of a component, theoretically of any size. Copyleft licenses are sometimes referred to as “viral” because just a little germ of code can “infect” the application in which it’s used with the obligations of the license. And compliance can be especially difficult for commercial applications.

This particular license is not the only one of concern, however. Code copied from Stack Overflow, for example, is available under the Creative Commons Attribution-ShareAlike (CC BY-SA) license, which raises similar flags.


Managing open source risk with SCA

Tracking open source usage to manage such risks has always been fundamentally challenging. Any developer with a browser has access to literally millions of open source components and the ability to download and paste them into their own code. Companies can and should have policies and processes to guide developers in their open source use, as well as programs to educate them about the tools that can detect and identify the open source in a codebase, including snippets, which are particularly tricky to detect without the right tool.

The tools that dissect code this way are known as software composition analysis (SCA) tools, and there are a number on the market. Most address security and assume disciplined use of package managers, which allow developers to pull complete open source components (or libraries) into their code. Essentially, a developer specifies “go get X” in a build file, and the package manager gets X. SCA tools interpret that instruction and conclude that component X is in the code.
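To make that concrete, here is a minimal sketch in Python of how a manifest-based tool reasons. It assumes a pip-style requirements.txt as the build file; the file name, regex, and function here are illustrative, not any particular product's implementation.

```python
import re
from pathlib import Path

# Hypothetical illustration of manifest-based detection: real SCA tools
# handle many manifest formats, lockfiles, and transitive dependencies.
REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*==\s*([A-Za-z0-9_.-]+)")

def declared_components(manifest_path: str) -> list[tuple[str, str]]:
    """Return (name, version) pairs declared in a pip-style manifest."""
    components = []
    for line in Path(manifest_path).read_text().splitlines():
        match = REQUIREMENT.match(line)
        if match:
            components.append((match.group(1), match.group(2)))
    return components

if __name__ == "__main__":
    # Given a line like "requests==2.31.0", the tool concludes that the
    # requests component, version 2.31.0, is present in the application.
    for name, version in declared_components("requirements.txt"):
        print(f"Component detected: {name} {version}")
```

Note what this approach never sees: code pasted directly into a source file leaves no trace in any manifest.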

That simple approach works reasonably well for security vulnerabilities because most vulnerabilities are part of the overall function of a component and are not likely to manifest in a 100-line snippet of code. But to gain full visibility into licensing issues, you need to detect snippets. The other limitation of this approach is that it assumes open source is incorporated only via package managers. In reality, though, open source ends up in software via multiple paths. So this is a good 80% solution for modern application development, but gaining a comprehensive picture requires additional techniques.

Identifying snippets requires sophisticated algorithms and a comprehensive knowledgebase of the millions of open source components in order to efficiently see if any parts of a codebase match open source code. A tool needs to have been architected specifically to include that capability (to augment other techniques like the package manager approach), and there are few on the market that meet this requirement.
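Snippet matching is generally built on code fingerprinting rather than literal text comparison. The sketch below shows one common approach, hashing overlapping windows of normalized lines; it is an assumption about the general technique, not Black Duck's actual algorithm, and a real tool precomputes and indexes fingerprints across its knowledgebase of millions of components rather than comparing files pairwise.

```python
import hashlib

# Illustrative fingerprinting sketch; commercial snippet matchers use far
# more robust normalization, winnowing, and indexing to operate at scale.
WINDOW = 5  # consecutive normalized lines hashed per fingerprint

def normalize(source: str) -> list[str]:
    """Strip whitespace and drop blank lines so trivial edits don't defeat a match."""
    lines = (line.strip() for line in source.splitlines())
    return [line for line in lines if line]

def fingerprints(source: str) -> set[str]:
    """Hash every overlapping WINDOW-line run of the normalized source."""
    lines = normalize(source)
    return {
        hashlib.sha256("\n".join(lines[i : i + WINDOW]).encode()).hexdigest()
        for i in range(len(lines) - WINDOW + 1)
    }

def snippet_overlap(codebase_src: str, known_oss_src: str) -> set[str]:
    """Fingerprints shared by the two sources; any overlap flags a possible copied snippet."""
    return fingerprints(codebase_src) & fingerprints(known_oss_src)
```

Normalizing before hashing means whitespace edits alone can't hide a copied snippet; production systems go further, tolerating renamed identifiers and reordered lines, which is part of why so few tools offer the capability.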

Addressing Generative AI risks in open source

In late 2022, a wave of Generative AI tools that could write software code caught the world’s attention. These tools will undoubtedly transform the way software is developed. However, they were trained on open source, and it didn’t take long for court cases to be brought alleging that they reproduce verbatim snippets without identifying them or complying with their license requirements. So even companies with processes to control their developers’ open source use now have a new source of legal risk to contend with. There are also concerns about the security of AI-generated code.

So how does a company protect itself? Development organizations must be mindful of how they use Generative AI tools in software development, and many need tools like Black Duck® SCA, which can detect snippets of open source. And in an M&A transaction, acquirers need to understand that a target company’s developers might have used Generative AI tools, even under management’s radar. Questions about Generative AI tool use should be part of due diligence, and audits down to the code snippet level should be the norm.

This blog post was verified by Mike McGuire.

