Benchmarking SAST
Choosing the Right Static Analyzer
Static Application Security Testing (SAST) is a powerful approach for catching vulnerabilities early in the development lifecycle. By scanning source code for security flaws without executing it, a SAST tool can potentially prevent bugs from ever reaching production. But not all SAST tools are equal, and without proper benchmarks to compare them, teams can’t confidently choose the most effective solution.
Why is Benchmarking SAST Tools Needed
It is tempting to pick a scanner based on vendor claims or anecdotal experience, but without objective benchmarks, organizations risk choosing a tool that leaves critical vulnerabilities undetected or overwhelms developers with false alarms. A lack of robust benchmarking leads to ambiguity, as teams cannot be confident about a tool’s real accuracy or coverage. In fact, many companies do not have the time or resources to run thorough evaluations, so they resort to cursory tests (like scanning a couple of random open-source projects) to gauge SAST tools. This ad-hoc approach is unreliable and can misrepresent a tool’s capabilities.
When done well, benchmarking delivers:
Objective scoring: Tool-agnostic metrics (precision, recall, F1) that can be re-run across versions.
Signal-to-Noise Clarity: With standardized tests, teams can quantify how well a tool finds real vulnerabilities versus how much noise it produces.
Apples-to-apples comparison: Every tool scans the same vulnerable programs under identical conditions.
Continuous improvement & healthy competition: Tool developers can continuously run benchmarks to improve their engines and refine rules over time.
Existing Synthetic Benchmarks
I explored the commonly used suites and why they are not ideal for assessing SAST tools. These popular suites are useful, but they have fundamental limits that make them imperfect proxies for real-world performance:
OWASP Benchmark (Java-only): A free, open-source suite designed to test detection accuracy. Its narrow scope means results can be irrelevant to non-Java stacks.
SARD / Juliet (NIST): Massive test sets (primarily C/C++) mapping to CWEs. Many cases use paired “good()/bad()” patterns, which can inflate apparent accuracy and do not reflect real application complexity.
OpenSSF CVE Benchmark (JS/TS): Uses 200+ real-world JavaScript/TypeScript CVEs and runs tools on both vulnerable and patched versions to measure false negatives and false positives. Strong on realism versus synthetic suites, but currently limited to JS/TS and the covered CVE set.
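The vulnerable-vs-patched idea behind the OpenSSF CVE Benchmark can be illustrated with a short sketch. Everything here is hypothetical: `run_tool()` stands in for invoking a real scanner and parsing its report, and the file names and line numbers are toy data, not part of any real benchmark API.

```python
# Sketch of the vulnerable-vs-patched evaluation idea behind the
# OpenSSF CVE Benchmark. run_tool() is a hypothetical stand-in for
# invoking a real SAST scanner and parsing its report.

def run_tool(checkout):
    # Placeholder: a real harness would run the scanner on the checkout
    # and return the set of flagged (file, line) locations.
    return checkout["findings"]

def evaluate_pair(vulnerable, patched, vulnerable_lines):
    """Score one CVE: the tool should flag the known vulnerable
    locations before the fix and stay silent after it."""
    found_before = run_tool(vulnerable) & vulnerable_lines
    found_after = run_tool(patched) & vulnerable_lines
    return {
        "true_positive": bool(found_before),  # caught the real bug
        "false_positive": bool(found_after),  # still fires on fixed code
    }

# Toy data standing in for two checkouts of the same project.
vuln = {"findings": {("app.js", 42), ("util.js", 7)}}
fixed = {"findings": set()}  # the patch removed the flagged pattern
result = evaluate_pair(vuln, fixed, vulnerable_lines={("app.js", 42)})
```

Running a tool against both sides of each CVE is what lets this style of benchmark measure false negatives and false positives at the same time.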
It’s important to note that this does not mean such lists, test suites, and benchmarks are useless. Most tools were created to educate developers and raise awareness of common security issues. While not great for measuring a SAST tool’s true performance, these resources can still play a significant role in improving security expertise within an organization.
What to Measure: Core Evaluation Metrics
Accuracy (The Signal-to-Noise Ratio): In the context of SAST, accuracy refers to prioritizing true positives (TP) while minimizing false positives (FP). A highly accurate tool flags real, exploitable vulnerabilities and raises few false alarms. This concept is essentially precision: the proportion of findings that are legitimate. High precision is critical because excessive FPs create noise that overwhelms developers. If a scanner inundates the team with dozens of alerts that turn out to be non-issues, engineers will begin to ignore or disable it. As one industry guide notes, false positives are the number-one factor that deters developers from using SAST products: they are a productivity killer.
Completeness: It measures how many of the real security issues in the code the tool is able to find. It is essentially the recall (coverage of true positives). NIST defines completeness as “a measure of the real issues found (TPs) versus all possible issues (TPs and false negatives)”. A perfectly complete tool (100% recall) would catch every vulnerability in the target code. In reality, no static analyzer is perfect—some bugs will slip through (those misses are false negatives, FN). A higher completeness means fewer missed issues and better overall protection.
In practical terms, recall answers the question: how many real bugs did the tool fail to catch? If the tool misses a lot of serious vulnerabilities, its completeness is low, which could leave an application exposed. However, completeness must be balanced with accuracy. It is easy to achieve 100% recall if the tool simply flags everything as a potential issue, but then precision plummets. A good SAST tool strikes a balance: maximize detection of actual problems while keeping inaccuracies minimal. Typically, improving one metric can hurt the other, so the best tools use smart techniques to get both metrics as high as possible. As a rule of thumb, fewer false negatives is better (especially for high-severity issues).
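These definitions are easy to pin down in code. Here is a minimal sketch of the three headline metrics, given counts of true positives, false positives, and false negatives from a benchmark run (the example counts at the end are purely illustrative):

```python
def precision(tp, fp):
    # Fraction of reported findings that are real issues (signal vs. noise).
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of real issues the tool actually caught (completeness).
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; it punishes imbalance, so a
    # tool that flags everything (perfect recall, poor precision) scores low.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# Example run: 40 real issues caught, 10 false alarms, 20 misses.
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727
```

The F1 formulation captures the trade-off discussed above: driving recall to 100% by flagging everything collapses precision, and the harmonic mean makes that collapse visible in a single number.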
Additional Values and Qualitative Factors: Beyond raw accuracy and completeness, it’s worth considering the unique value-adds a SAST tool brings. These qualitative factors can significantly impact the tool’s effectiveness and its fit for the organization:
Coverage & depth (languages/frameworks) - Having an extensive list of supported languages and frameworks is a first step, but one must also consider how well that support translates into meaningful results. Does the tool handle modern frameworks and language-specific nuances effectively?
Rule quality and tuning process - Evaluate the quality of the tool’s default rules and the process for tuning them. A scanner with well-crafted rules (and easy customization when needed) will produce more relevant findings and fewer false alarms.
Maintenance and update velocity - How actively is the tool maintained and improved? For example, a SAST engine that relies on the open-source community to contribute new rules with minimal review may be prone to a high number of FPs and yield inconsistent results across different languages and vulnerability categories. Regular, rigorous updates from dedicated security researchers can improve both the depth and accuracy of detection.
Integration & Workflow: Consider how well the tool integrates into the development lifecycle (CI/CD pipelines, IDE plugins, pull request annotations, etc.). A good SAST solution should enhance developer workflow by catching issues early without creating excessive friction.
Limitations of Existing Benchmarks
If benchmarking is so important, why is it still challenging to compare SAST tools? The reality is that current benchmarks and test approaches have significant limitations:
Limited in languages: There is no single test suite or set of vulnerable apps that covers multiple languages. OWASP Benchmark, for example, only contains Java issues. Similarly, intentionally vulnerable demo apps (like OWASP WebGoat, NodeGoat, Juice Shop, etc.) are each written in a specific language/framework, limiting their broader relevance.
Realism Gap: Most synthetic test cases do not mirror the complexity of production systems. Benchmark code tends to have linear data flows (e.g., a simple function taking an input and using it unsafely) and seldom involves modern frameworks or large codebases. In reality, vulnerabilities can span multiple modules, use complex library call chains, or depend on specific framework configurations – scenarios not captured in small benchmarks. A tool might ace synthetic tests yet miss issues in a large real-world codebase.
Overfitting: Once a test suite becomes a yardstick, vendors (and open-source tool authors) can tune specifically to excel at those tests. Organizations might build SAST capabilities around the known benchmark cases, yielding impressive scores that do not translate to real-world performance. Even intentionally vulnerable apps can be “gamed” or overly familiar to tools, reducing their value as unbiased benchmarks.
A Note on Using Multiple Tools: One might ask: why not use all the SAST tools together to maximize coverage? In theory, running multiple SAST engines in parallel will indeed find more issues (each may catch things the others miss) and even help cross-verify findings. However, in practice this approach multiplies the false positives, duplicate findings, report formats, and configuration overhead, creating a mountain of findings to sort through. The effort of maintaining multiple SAST tools and triaging overlapping findings can quickly outweigh the marginal benefit of discovering a few extra issues. In a continuous development pipeline, this is usually not sustainable.
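To make the triage burden concrete, here is a naive sketch of merging findings from several tools by normalizing each report to a (file, line, CWE) key. The tool names and findings are invented for illustration; even this crude deduplication shows that most of the work lies in reconciling formats, and anything the key fails to match becomes a duplicate for a human to sort:

```python
from collections import defaultdict

def merge_findings(reports):
    """Naively merge per-tool findings keyed on (file, line, cwe).
    reports: {tool_name: [{"file": ..., "line": ..., "cwe": ...}, ...]}
    Returns each unique finding mapped to the set of tools that flagged it."""
    merged = defaultdict(set)
    for tool, findings in reports.items():
        for f in findings:
            merged[(f["file"], f["line"], f["cwe"])].add(tool)
    return merged

# Invented reports from two hypothetical scanners.
reports = {
    "tool_a": [{"file": "login.py", "line": 10, "cwe": "CWE-89"},
               {"file": "api.py", "line": 3, "cwe": "CWE-79"}],
    "tool_b": [{"file": "login.py", "line": 10, "cwe": "CWE-89"},
               {"file": "cfg.py", "line": 8, "cwe": "CWE-798"}],
}
merged = merge_findings(reports)
# Only 1 of the 3 unique findings is corroborated by both tools; the
# rest still need manual triage, and tools that report line numbers or
# CWE ids differently will not collapse under this key at all.
```

In practice, real tools disagree on locations, severities, and taxonomies, which is exactly why the overhead grows faster than the coverage gain.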
From Claims to Proof: Validating SAST with AI
Beyond reducing noise, AI is also expanding the detection capabilities of SAST. In fact, the inspiration for this blog comes with an eye to the future: I am exploring the development of a benchmarking framework that builds on these ideas, potentially using automation or AI to run multiple SAST tools and aggregate their performance into a single score or percentage. The goal is to provide an easy-to-understand “SAST scorecard” that encapsulates how well a tool balances catching real vulnerabilities against raising false alarms across a variety of scenarios. This is an exciting frontier, and if successful, it could give teams a much-needed, reliable way to assess SAST tools without doing all the heavy lifting themselves.
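As a first approximation of such a scorecard, one could average per-category F1 scores, weighted by how much each vulnerability class matters to the organization. The category names, scores, and weights below are purely illustrative; a real framework would derive them from the benchmark corpus:

```python
def scorecard(per_category_f1, weights):
    """Collapse per-category F1 scores into a single 0-100 score.
    Both the categories and the weights are hypothetical inputs,
    not part of any existing benchmark."""
    total = sum(weights.values())
    weighted = sum(f1 * weights[cat] for cat, f1 in per_category_f1.items())
    return 100 * weighted / total

# Invented per-category results for one tool under test.
f1_by_category = {"injection": 0.82, "xss": 0.74, "secrets": 0.91}
weights = {"injection": 3, "xss": 2, "secrets": 1}  # org-specific priorities
score = scorecard(f1_by_category, weights)  # a single "SAST score" out of 100
```

A weighted average is only one possible aggregation; the design choice worth debating is how to weight high-severity categories so that a tool cannot buy a good score by excelling only on low-impact checks.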
References:
OWASP Foundation. OWASP Benchmark Project. (OWASP)
van Schaik, Bas. “Introducing the OpenSSF CVE Benchmark.” OpenSSF Blog, December 9, 2020. (OpenSSF)
Hofesh, Bar. “Analyzing the Limitations of OWASP Juice Shop as a Benchmarking Target for DAST Tools.” Bright Security Blog, published February 27, 2024 (updated March 25, 2025). (Bright Security)
NIST Software Quality Group (SAMATE). Software Assurance Reference Dataset (SARD). Created February 3, 2021; updated April 22, 2024. (NIST)
National Security Agency, Center for Assured Software. CAS Static Analysis Tool Study — Methodology. December 2012. (NIST)
Gigleux, Alexandre. “Enhancing SAST Detection: Sonar’s Scoring on the Top 3 Java SAST Benchmarks.” Sonar Blog, September 26, 2023. (SonarSource)
Biton, Asaf, and Shani Gal. “3 parameters to measure SAST testing.” Snyk Blog, August 3, 2021. (Snyk)
Biton, Asaf, and Shani Gal. “You can’t compare SAST tools using only lists, test suites, and benchmarks.” Snyk Blog, June 16, 2021. (Snyk)

