AI in AppSec
July 15, 2025

Constructing a Trustworthy Evaluation Methodology for Contextual Security Analysis

Contemporary software security engineering requires more than deterministic pattern‑matching. Logic flaws, authorization gaps, and configuration defects rarely manifest as patterns a regular expression can match. DryRun Security addresses this gap with our agentic Contextual Security Analysis (CSA), an AI‑native engine that reasons probabilistically about code changes. Probabilistic reasoning, however, invites questions about reliability: How do we bound hallucinations, enforce instruction compliance, and sustain precision under production traffic that exceeds 25,000 pull requests each week? This article outlines our evaluation methodology and the engineering disciplines that keep CSA trustworthy.

Multi‑Pass Analytical Pipeline

Each pull request (PR) traverses three analytical passes that progressively refine context:

  1. Whole‑PR Synthesis - We assess the diff holistically to understand the feature narrative, data‑flow implications, and architectural touch points.
  2. Hunk‑Scoped Reasoning - Individual code hunks are isolated so the model can focus on localized semantics without being distracted by unrelated edits.
  3. On‑Demand Context Acquisition - Agentic workers retrieve supporting files, dependency declarations, and configuration artifacts only when required, preventing context‑window saturation while guaranteeing completeness.

This layered pipeline mimics human review: broad comprehension followed by focused scrutiny and reference checks.
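
To make the flow concrete, below is a minimal Python sketch of how three such passes could be orchestrated. The function names, the model interface (summarize, review_hunk, refine), and the repo_reader callback are illustrative assumptions for this article, not DryRun Security's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    category: str
    explanation: str


def analyze_pull_request(description, hunks, repo_reader, model):
    """Illustrative three-pass flow: whole-PR synthesis, hunk-scoped
    reasoning, then on-demand context acquisition.

    hunks: list of (file_path, patch_text) tuples from the PR diff.
    repo_reader: callable that returns a file's contents by path.
    model: stand-in for the LLM agent, exposing summarize / review_hunk / refine.
    Returns a list of Finding records.
    """
    # Pass 1: whole-PR synthesis builds a narrative of the change.
    narrative = model.summarize(description, [patch for _, patch in hunks])

    findings = []
    for file_path, patch in hunks:
        # Pass 2: hunk-scoped reasoning focuses on one localized change
        # without being distracted by unrelated edits elsewhere in the PR.
        draft = model.review_hunk(narrative, file_path, patch)

        # Pass 3: fetch supporting files, dependency declarations, or
        # configuration only when the model asks for them, so the context
        # window is spent on material that actually matters.
        for needed_path in list(draft.requested_context):
            draft = model.refine(draft, needed_path, repo_reader(needed_path))

        findings.extend(draft.findings)
    return findings
```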

LLM‑as‑Judge Evaluative Framework

DryRun Security employs a secondary, held‑out language model to audit a statistically significant random sample of production findings every 24 hours. The audit model is version‑segregated from the generation model to avoid correlated error modes and to preserve evaluation objectivity. Each sampled finding is graded along three independent axes:

  • Instruction Compliance
    Our evaluation model interrogates:
    • Does the output exactly match the required JSON schema (field names, order, and data types)?
    • Are all mandatory keys present, and are optional keys omitted when empty?
    • Do reported line numbers correspond to the actual lines in the pull‑request diff?
    • Is the commentary strictly limited to security‑relevant issues, avoiding style or performance advice?
  • Hallucination Detection
    We probe:
    • Does the analysis reference files, functions, or dependencies that do not exist in the repository?
    • Are vulnerability categories labeled accurately, without inventing exploit types?
    • Are cited variables, endpoints, and stack traces real and correctly scoped?
    • Is any remediation guidance fabricated or unrelated to the detected issue?
  • Vulnerability Correctness
    We verify:
    • Does a synthetic exploit path demonstrate that the vulnerability is reachable and exploitable?
    • Does the severity assignment align with established taxonomies such as CWE and the OWASP Top 10?

Findings that fail any axis are routed to a remediation queue for human triage and follow-up adjustments. While Instruction Compliance, Hallucination Detection, and Vulnerability Correctness are our foundational axes, we also evaluate additional dimensions as needed, such as agent tool‑usage patterns, retrieval ordering, and other context‑handling metrics, to ensure holistic coverage.
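
As an illustration of how such an audit could be wired together, the following Python sketch samples production findings and grades each along the three axes. The schema fields, the item keys, the judge interface, and the sample size are assumptions made for this sketch; they are not the production schema or thresholds.

```python
import json
import random

# Hypothetical output contract for a single finding (field name -> type);
# the real schema's fields are not published in this article.
REQUIRED_SCHEMA = {"file": str, "line": int, "category": str, "explanation": str}


def instruction_compliance(raw_output, diff_line_numbers):
    """Axis 1: does the output match the schema (field names, order, types)
    and point at a line that actually exists in the PR diff?"""
    try:
        finding = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(finding, dict) or list(finding) != list(REQUIRED_SCHEMA):
        return False                                      # names and order
    if not all(isinstance(finding[k], t) for k, t in REQUIRED_SCHEMA.items()):
        return False                                      # data types
    return finding["line"] in diff_line_numbers


def audit_sample(findings, judge, sample_size=200):
    """Grade a random sample of production findings on three axes and
    return the failures destined for the human remediation queue.
    judge is a held-out model, version-segregated from the generator;
    its method names here are illustrative."""
    failures = []
    for item in random.sample(findings, min(sample_size, len(findings))):
        grades = {
            "instruction_compliance": instruction_compliance(
                item["raw_output"], item["diff_line_numbers"]),
            # Axis 2: hallucination detection (files, functions, categories real?)
            "hallucination_free": judge.grounded_in_repository(item),
            # Axis 3: vulnerability correctness (reachable, correctly rated?)
            "vulnerability_correct": judge.exploit_path_confirmed(item),
        }
        if not all(grades.values()):
            failures.append({"finding_id": item["id"], **grades})
    return failures
```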

Operating at Production Scale

Moving from prototype to a system that processes tens of thousands of PRs weekly exposed scaling pathologies: truncated context, token exhaustion, and evaluation drift. We mitigated these by

  • Continuous dataset curation to reflect live customer code patterns,
  • Version‑controlling prompts and evaluation criteria as immutable contracts, and
  • Holding releases to a service‑level‑objective pass rate, with regression dashboards that trigger hotfix releases when accuracy degrades.
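
A minimal sketch, under an assumed file layout and an assumed SLO value, of how two of these disciplines might look in practice: prompts and evaluation criteria pinned by content hash so they cannot drift silently, and a release gate that compares the measured pass rate to the service‑level objective.

```python
import hashlib
from pathlib import Path

# Illustrative SLO only; the article does not publish the actual target.
SLO_PASS_RATE = 0.95


def prompt_fingerprint(prompt_path: Path) -> str:
    """Treat a prompt (or evaluation-criteria) file as an immutable
    contract by pinning the SHA-256 hash of its exact contents."""
    return hashlib.sha256(prompt_path.read_bytes()).hexdigest()


def gate_release(eval_results, pinned_hashes, prompt_dir: Path):
    """Fail the release if a pinned prompt drifted without review, or if
    the audited pass rate fell below the service-level objective.

    eval_results: list of booleans, one per audited finding (True = passed).
    pinned_hashes: mapping of prompt filename -> expected SHA-256 hex digest.
    """
    for name, expected in pinned_hashes.items():
        if prompt_fingerprint(prompt_dir / name) != expected:
            raise RuntimeError(f"Prompt contract changed without review: {name}")

    pass_rate = sum(eval_results) / max(len(eval_results), 1)
    if pass_rate < SLO_PASS_RATE:
        raise RuntimeError(
            f"Pass rate {pass_rate:.2%} is below the {SLO_PASS_RATE:.0%} SLO; "
            "regression dashboard should trigger a hotfix release.")
```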

Alignment with Industry Evaluation Principles

NVIDIA outlines six imperatives for generative‑AI evaluation: user‑satisfaction validation, coherence, benchmarking, risk identification, forward‑looking improvement, and real‑world applicability. The DryRun Security methodology operationalizes these imperatives in the application‑security domain:

  • User Satisfaction - First‑scan success is treated as a Tier‑1 quality gate; excessive false positives raise an incident.
  • Coherence - Explanations mirror senior engineer code reviews, ensuring developers receive actionable guidance.
  • Benchmarking - Each weekly CSA release competes against the prior release in a regression tournament spanning multiple languages and vulnerability corpora (a simplified sketch of such a comparison follows this list).
  • Risk Identification - We pay particular attention to category inflation, which erodes developer trust.
  • Directed Improvement - Evaluation failures become templates for synthetic test generation in the next sprint.
  • Real‑World Readiness - All evaluations use authentic PRs or open‑source repositories with comparable complexity profiles.
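
To illustrate the benchmarking imperative, here is a simplified Python sketch of the comparison a regression tournament could compute: scoring two releases against the same labeled corpus and reporting precision and recall deltas. The keying scheme and metrics are simplifying assumptions; the production tournament spans multiple languages and vulnerability corpora.

```python
def release_tournament(labels, prior_release, current_release):
    """Score two CSA releases against the same labeled corpus.

    labels: mapping of a finding key (e.g. "repo:file:line:category") to
        True/False, i.e. whether that finding is a real vulnerability.
    prior_release / current_release: sets of finding keys each release reported.
    Returns the precision and recall deltas of current vs. prior.
    """
    def precision_recall(reported):
        true_positives = sum(1 for key in reported if labels.get(key, False))
        actual_positives = sum(1 for is_real in labels.values() if is_real)
        precision = true_positives / len(reported) if reported else 0.0
        recall = true_positives / actual_positives if actual_positives else 0.0
        return precision, recall

    prior_p, prior_r = precision_recall(prior_release)
    current_p, current_r = precision_recall(current_release)
    return {
        "precision_delta": current_p - prior_p,
        "recall_delta": current_r - prior_r,
    }
```

Under this framing, a negative delta on either metric flags a regression before the weekly release ships.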

Why Context Outperforms Patterns

Deterministic SAST tools excel at syntax‑level defects but underperform on context‑dependent risk, e.g., the addition of a payment gateway, the introduction of custom RBAC rules, or a YAML change that can disable DNS. By modeling code modifications as narratives with dynamic context retrieval, CSA identifies these non‑obvious hazards while maintaining a measurable accuracy envelope.

Evaluation is not a feature: it is the control plane that enables probabilistic analysis to replace deterministic scanners in production pipelines. Through multi‑pass context modeling, an independent LLM‑as‑Judge framework, and industrialized regression testing, DryRun Security delivers accuracy as a continuous service. The result is an AI‑powered code‑review assistant that earns trust the same way human experts do—by being consistently correct.

Context beats patterns. Accuracy, delivered.

Next Steps

Run a two‑week proof of value with DryRun Security to see which contextual risks surface in your code, or book a meeting with us; we’d be happy to answer your questions and show you how DryRun Security can help you find hard-to-find flaws.