Snippet Scanning, Explained
When developers build applications, they often use entire open source files. But they may sometimes also pull in parts of files. These smaller blocks or lines of code are known as snippets.
Historically, programmers have copied snippets from sites like Stack Overflow. AI coding tools like GitHub Copilot are also potential snippet sources: the tool may output a snippet of the open source code on which it was trained.
Although a snippet is only part of a larger file, the open source license that governs the full file still applies to the snippet. If, say, a file is licensed under the GPL v3 license, a snippet from that file would be as well. Similarly, if a file is impacted by security vulnerabilities, using a snippet from that file may pose security risks as well.
Snippet scanning tools work by analyzing a snippet and matching it against a database of known open source components. The goal is to identify the full file (and accompanying license) where the snippet originated, along with potential vulnerabilities that impact it. This way, organizations can make sure they’re complying with open source licensing requirements, managing vulnerabilities, and keeping their software bill of materials (SBOM) accurate and up to date. The commonly used SPDX and CycloneDX SBOM formats both support snippets.
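To make the matching step concrete, here is a minimal Python sketch of a fingerprint lookup. It is an illustration under stated assumptions, not a description of how any particular scanner is built: the names `fingerprint`, `index_snippet`, `match_snippet`, and `KNOWN_SNIPPETS`, the whitespace normalization, and the sample file path and license are all hypothetical.

```python
import hashlib

# Hypothetical knowledge base: fingerprint -> metadata about the originating file.
KNOWN_SNIPPETS = {}


def fingerprint(code: str) -> str:
    """Hash a snippet after dropping blank lines and surrounding whitespace."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def index_snippet(code: str, file_path: str, license_id: str) -> None:
    """Record where a known open source snippet comes from."""
    KNOWN_SNIPPETS[fingerprint(code)] = {"file": file_path, "license": license_id}


def match_snippet(code: str):
    """Return the originating file metadata, or None if the snippet is unknown."""
    return KNOWN_SNIPPETS.get(fingerprint(code))


# Index one snippet from an (invented) GPL-licensed file.
index_snippet("def add(a, b):\n    return a + b\n", "mathlib/ops.py", "GPL-3.0-only")

# The same code with different indentation and blank lines still matches.
result = match_snippet("def add(a, b):\n\n        return a + b")
print(result)
```

A real scanner would need a far more robust normalization and a much larger index, but the shape is the same: reduce the snippet to a comparable form, then look it up to recover the parent file and its license.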
How FOSSA Snippet Scanning Works
The growing popularity of AI coding tools — and the concerns we’ve heard from customers about the potential snippet-related risks — prompted FOSSA to add snippet scanning functionality to our software composition analysis solution.
As with other snippet scanning tools, FOSSA detects snippets, matches them to full files, and surfaces important dependency metadata (such as the file name and open source license).
However, our specific approach to snippet scanning differs from that of some vendors. Snippet scanning tools generally use one of two methodologies to match snippets to their parent files:
- Function-level matching
- Expression-level matching
These are significantly different snippet-scanning implementations. Function-level matching detects whole functions in a file, whereas expression-level matching matches on individual expressions, of which there are commonly several per line. FOSSA’s snippet scanning features use function-level matching since we believe it produces significantly less noise than expression-level matching while still focusing on what users actually care about.
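The difference in granularity can be sketched with Python’s standard `ast` module. This is only an illustration of why expression-level extraction yields many more candidate units than function-level extraction; the sample source and the `count_units` helper are hypothetical and say nothing about how any vendor extracts snippets.

```python
import ast

# Hypothetical sample file containing two small functions.
SOURCE = '''
def area(w, h):
    return w * h

def perimeter(w, h):
    return 2 * (w + h)
'''


def count_units(source: str):
    """Return (number of functions, number of expression nodes) in the source."""
    tree = ast.parse(source)
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # Every name, constant, and operation is its own expression node.
    expressions = [node for node in ast.walk(tree) if isinstance(node, ast.expr)]
    return len(functions), len(expressions)


function_count, expression_count = count_units(SOURCE)
print(f"functions: {function_count}, expressions: {expression_count}")
```

Even in this tiny file, the expression count is several times the function count, and each expression is a potential match candidate.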
Snippet Scanning Example
Function-level matching would, for example, index the function “levenshtein” in the code block below, and then check whether this function exists in the user’s codebase.
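The original listing is not reproduced in this text. As a stand-in, here is a minimal Python sketch of what a function named `levenshtein` might look like; the language and the implementation are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between two strings."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
print(distance)  # 3
```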
Expression-level matching would treat many permutations of the expressions inside this function as separate snippets. For example, depending on how aggressively the expression extraction is tuned, each line in this example function could consist of multiple snippets.
Function-level matching reduces noise because far fewer permutations fit our match criteria. We believe this better meets our users’ needs: each snippet we detect corresponds to a complete function rather than to a fragment of one.
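A function-level check of this kind can be sketched as follows. This is a simplified illustration, not FOSSA’s implementation: it assumes Python source, fingerprints each function by hashing its syntax tree, and uses hypothetical names (`function_fingerprints`, `find_matches`) and invented sample files.

```python
import ast
import hashlib


def function_fingerprints(source: str) -> dict:
    """Map each function name to a hash of its syntax tree.

    Hashing the tree rather than the raw text means comments and
    formatting differences do not affect the fingerprint.
    """
    fingerprints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            digest = hashlib.sha256(ast.dump(node).encode("utf-8")).hexdigest()
            fingerprints[node.name] = digest
    return fingerprints


def find_matches(indexed_source: str, user_source: str) -> list:
    """Return names of user functions whose fingerprint appears in the index."""
    known = set(function_fingerprints(indexed_source).values())
    user = function_fingerprints(user_source)
    return sorted(name for name, digest in user.items() if digest in known)


# Invented open source file that has been indexed.
OPEN_SOURCE_FILE = '''
def clamp(value, low, high):
    return max(low, min(value, high))
'''

# Invented user file: clamp was copied and reformatted, main is original work.
USER_FILE = '''
import sys

def clamp(value, low, high):
    # copied from somewhere
    return max(low,
               min(value, high))

def main():
    print(clamp(int(sys.argv[1]), 0, 10))
'''

matches = find_matches(OPEN_SOURCE_FILE, USER_FILE)
print(matches)  # ['clamp']
```

Only the copied function is reported. The expressions inside `main`, such as the call to `int`, are never considered on their own, so they cannot produce spurious matches.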
Additionally, our view is that expressions are an unlikely source of meaningful risk. Developers rarely copy individual expressions without modification, because an expression lifted out of its original context typically causes compiler errors; if an expression can be copied as-is, it is so generic that a match on it is purely noise.
Learn More About FOSSA Snippet Scanning
You can get a demo of our snippet scanning feature — along with guidance on managing risks that can come from AI coding tool output — in our on-demand webinar: “Managing GitHub Copilot Security and Legal Risks with FOSSA.”