Contemporary software security engineering requires more than deterministic pattern‑matching. Logic flaws, authorization gaps, and configuration defects rarely manifest as regular expressions. DryRun Security addresses this gap with our agentic Contextual Security Analysis (CSA), an AI‑native engine that reasons probabilistically about code changes. Probabilistic reasoning, however, invites questions about reliability: How do we bound hallucinations, enforce instruction compliance, and sustain precision under production traffic that exceeds 25,000 pull requests each week? This article outlines our evaluation methodology and the engineering disciplines that keep CSA trustworthy.
Multi‑Pass Analytical Pipeline
Each pull request (PR) traverses three analytical passes that progressively refine context:
- Whole‑PR Synthesis - We assess the diff holistically to understand the feature narrative, data‑flow implications, and architectural touch points.
- Hunk‑Scoped Reasoning - Individual code hunks are isolated so the model can focus on localized semantics without being distracted by unrelated edits.
- On‑Demand Context Acquisition - Agentic workers retrieve supporting files, dependency declarations, and configuration artifacts only when required, preventing context‑window saturation while guaranteeing completeness.
This layered pipeline mimics human review: broad comprehension followed by focused scrutiny and reference checks.
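As a rough sketch of how such an orchestration might look, the Python below wires the three passes together; the function names, data shapes, and retrieval cap are illustrative assumptions rather than our production implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class HunkVerdict:
    findings: list[dict]                                   # security findings for one hunk
    requested_paths: list[str] = field(default_factory=list)

    @property
    def needs_context(self) -> bool:
        return bool(self.requested_paths)

def analyze_pull_request(
    diff: str,
    hunks: list[str],
    summarize: Callable[[str], str],                       # pass 1: whole-PR model call
    review_hunk: Callable[..., HunkVerdict],               # passes 2-3: hunk-level model call
    read_files: Callable[[list[str]], dict[str, str]],     # repository accessor
    max_retrievals: int = 3,
) -> list[dict]:
    """Three-pass review: whole-PR synthesis, hunk-scoped reasoning,
    then on-demand context retrieval when the model asks for it."""
    findings: list[dict] = []

    # Pass 1: whole-PR synthesis -- build a narrative of the change.
    narrative = summarize(diff)

    for hunk in hunks:
        # Pass 2: hunk-scoped reasoning against the PR narrative.
        verdict = review_hunk(hunk, narrative=narrative, context=None)

        # Pass 3: fetch supporting files only when requested, with a cap
        # so the context window stays lean under production load.
        for _ in range(max_retrievals):
            if not verdict.needs_context:
                break
            extra = read_files(verdict.requested_paths)
            verdict = review_hunk(hunk, narrative=narrative, context=extra)

        findings.extend(verdict.findings)

    return findings
```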
LLM‑as‑Judge Evaluative Framework
DryRun Security employs a secondary, held‑out language model to audit a statistically significant random sample of production findings every 24 hours. The audit model is version‑segregated from the generation model to avoid correlated error modes and to preserve evaluation objectivity. Each sampled finding is graded along three independent axes:
- Instruction Compliance
Our evaluation model interrogates:
- Does the output exactly match the required JSON schema (field names, order, and data types)?
- Are all mandatory keys present, and are optional keys omitted when empty?
- Do reported line numbers correspond to the actual lines in the pull‑request diff?
- Is the commentary strictly limited to security‑relevant issues, avoiding style or performance advice?
- Hallucination Detection
We probe:
- Does the analysis reference files, functions, or dependencies that do not exist in the repository?
- Are vulnerability categories labeled accurately, without inventing exploit types?
- Are cited variables, endpoints, and stack traces real and correctly scoped?
- Is any remediation guidance fabricated or unrelated to the detected issue?
- Vulnerability Correctness
We verify:
- Does a synthetic exploit path demonstrate that the vulnerability is reachable and exploitable?
- Does the severity assignment align with established taxonomies such as CWE and the OWASP Top 10?
Findings that fail any axis are routed to a remediation queue for human triage and follow‑up adjustments. While Instruction Compliance, Hallucination Detection, and Vulnerability Correctness are our foundational axes, we also evaluate additional dimensions as needed, such as agent tool‑usage patterns, retrieval ordering, and other context‑handling metrics, to ensure holistic coverage.
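To make the grading concrete, here is a minimal sketch of a per‑finding judge record and the routing rule described above; the field names and the `extra_dimensions` map are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Axis(str, Enum):
    INSTRUCTION_COMPLIANCE = "instruction_compliance"
    HALLUCINATION_DETECTION = "hallucination_detection"
    VULNERABILITY_CORRECTNESS = "vulnerability_correctness"

@dataclass
class AxisGrade:
    axis: Axis
    passed: bool
    rationale: str                      # the judge's short justification, kept for triage

@dataclass
class JudgedFinding:
    finding_id: str
    grades: list[AxisGrade]
    extra_dimensions: dict[str, bool] = field(default_factory=dict)  # e.g. tool-usage checks

    @property
    def failed_axes(self) -> list[Axis]:
        return [g.axis for g in self.grades if not g.passed]

def route(judged: JudgedFinding, remediation_queue: list[JudgedFinding]) -> bool:
    """A finding that fails any axis (or any extra dimension) goes to the
    remediation queue for human triage; a fully passing finding ships as-is."""
    if judged.failed_axes or not all(judged.extra_dimensions.values()):
        remediation_queue.append(judged)
        return True
    return False
```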
Operating at Production Scale
Moving from prototype to a system that processes tens of thousands of PRs weekly exposed scaling pathologies: truncated context, token exhaustion, and evaluation drift. We mitigated these by:
- Continuously curating datasets to reflect live customer code patterns,
- Version‑controlling prompts and evaluation criteria as immutable contracts, and
- Enforcing a service‑level objective pass rate, with regression dashboards that trigger hotfix releases when accuracy degrades.
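As a simplified illustration of that regression gate, the sketch below compares a release's evaluation pass rate against a placeholder SLO and the prior release; the 0.95 threshold and two‑point drift tolerance are assumed values, not our actual objectives.

```python
from dataclasses import dataclass

SLO_PASS_RATE = 0.95        # placeholder service-level objective
DRIFT_TOLERANCE = 0.02      # placeholder allowed drop vs. the prior release

@dataclass
class EvalRun:
    prompt_version: str     # prompts and criteria are version-controlled contracts
    judged_total: int
    judged_passed: int

    @property
    def pass_rate(self) -> float:
        return self.judged_passed / self.judged_total if self.judged_total else 0.0

def regression_alerts(current: EvalRun, previous: EvalRun) -> list[str]:
    """Return alert messages when accuracy falls below the SLO or drops
    noticeably against the prior release (which would trigger a hotfix)."""
    alerts: list[str] = []
    if current.pass_rate < SLO_PASS_RATE:
        alerts.append(
            f"{current.prompt_version}: pass rate {current.pass_rate:.1%} is below the SLO"
        )
    if current.pass_rate < previous.pass_rate - DRIFT_TOLERANCE:
        alerts.append(
            f"{current.prompt_version}: regression vs {previous.prompt_version} "
            f"({previous.pass_rate:.1%} -> {current.pass_rate:.1%})"
        )
    return alerts
```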
Alignment with Industry Evaluation Principles
NVIDIA outlines six imperatives for generative‑AI evaluation: user‑satisfaction validation, coherence, benchmarking, risk identification, forward‑looking improvement, and real‑world applicability. The DryRun Security methodology operationalizes these imperatives in the application‑security domain:
- User Satisfaction - First‑scan success is treated as a Tier‑1 quality gate; excessive false positives raise an incident.
- Coherence - Explanations mirror senior engineer code reviews, ensuring developers receive actionable guidance.
- Benchmarking - Each weekly CSA release competes against the prior release in a regression tournament spanning multiple languages and vulnerability corpora.
- Risk Identification - We focus in particular on category inflation, which erodes developer trust.
- Directed Improvement - Evaluation failures become templates for synthetic test generation in the next sprint.
- Real‑World Readiness - All evaluations use authentic PRs or open‑source repositories with comparable complexity profiles.
Why Context Outperforms Patterns
Deterministic SAST tools excel at syntax‑level defects but underperform on context‑dependent risk, e.g., the addition of a payment gateway, the introduction of custom RBAC rules, or a YAML change that can disable DNS. By modeling code modifications as narratives with dynamic context retrieval, CSA identifies these non‑obvious hazards while maintaining a measurable accuracy envelope.
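As a hypothetical illustration of the RBAC case, the snippet below contains nothing a signature‑based rule would flag, yet it introduces an authorization gap that only becomes visible with knowledge of the surrounding ownership checks; the roles and helper are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    role: str

@dataclass
class Payment:
    id: str
    account_owner_id: str

# Role table extended in the hypothetical PR: "billing_admin" is the new role.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "refund"},
    "billing_admin": {"read", "refund"},
    "support": {"read"},
}

def can_refund(user: User, payment: Payment) -> bool:
    # Syntactically clean: no dangerous API call, nothing for a regex rule to match.
    # Contextually risky: the pre-existing ownership check
    # (user.id == payment.account_owner_id) no longer gates the new role,
    # so any billing_admin can refund payments on accounts they do not own.
    return "refund" in ROLE_PERMISSIONS.get(user.role, set())
```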
Evaluation is not a feature: it is the control plane that allows probabilistic analysis to replace deterministic scanners in production pipelines. Through multi‑pass context modeling, an independent LLM‑as‑Judge framework, and industrialized regression testing, DryRun Security delivers accuracy as a continuous service. The result is an AI‑powered code‑review assistant that earns trust the same way human experts do: by being consistently correct.
Context beats patterns. Accuracy, delivered.
Next Steps
Run a two‑week proof of value with DryRun Security to see which contextual risks surface in your code, or book a meeting with us; we’d be happy to answer your questions and show you how DryRun Security can help you catch hard-to-find flaws.