AI in AppSec
June 16, 2026

The AI Security Industry Has a Measurement Problem

Everyone is building scanners. Almost nobody is building proof.

Teams have recently been asked some variation of the same question: “Why don’t you just use <insert latest AI tool/model> for your <SAST/DAST/vendor> stuff?”

Sometimes it is Kiro. Sometimes it is Claude Code. Sometimes it is Codex. Sometimes it is whatever new cyber model just dropped with a slick name, a leaderboard screenshot, and a benchmark score that looks like it was designed to end budget conversations on sight.

The name changes, but the assumption is always the same: if AI can find vulnerabilities, why are we still using security consultants, running a bug bounty program, or buying security products?

It is a fair question. It is also one of those questions that sounds simpler than it is.

The problem is not that the technology is fake. The problem is almost the opposite. The technology is real enough, impressive enough, and useful enough that it can make smart people believe they are farther along than they actually are. A good demo can create a lot of confidence very quickly. In security, confidence that has not been earned is where things start getting expensive.

Agents Are Amazing

Let’s get something out of the way. Agents are awesome. I am not skeptical of AI agents. I think they are one of the most important advances application security has seen in years.

They can pseudo-reason through code in ways older tools could not. They can investigate instead of just pattern-match. They can follow execution paths, inspect surrounding context, use tools, connect evidence, and often uncover vulnerabilities that traditional approaches either miss or bury under noise. For anyone who has spent years watching AppSec tools struggle with context, this is not a small improvement. It is a meaningful shift.

That is why I care so much about getting this right. The more powerful the technology becomes, the more careful we need to be about what we think it proves.

A Finding Is Not a Program

Every week I see another post from someone who built an AI security tool. It might be a multi-model harness, an autonomous pentesting agent, a pull request reviewer, or a scanner wired together from a model, a few tools, and a clever workflow. The system runs. It finds something real. The screenshot looks good. The explanation is convincing.

Then the comments start.

Why do we need vendors anymore? Why can’t our security team just build this? Why are we still paying consultants? Why do we have all these AppSec tools? Why don’t we just use Kiro for our SAST work?

Fair questions, but the conversation skipped quite a few important steps along the way.

Finding the first vulnerability is not the finish line. It is the starting line.

A demo asks, “Can this thing find a bug?” 

A security program has to ask something much harder: can we depend on this system over time?

That question changes everything. It is a very different problem from producing a good demo.

The Three-Week Trap

I have seen the same pattern play out enough times that it is starting to feel familiar.

An engineer spends a few weeks building an internal harness. The harness calls a model, looks at code, produces findings, and some of those findings are real. Everyone gets excited. There is something remarkable about seeing a system reason its way into a vulnerability that would have taken a human a long time to find.

Then leadership sees the output and draws a very tempting conclusion: if we built this in three weeks, why are we spending so much money on products, vendors, consultants, and specialists?

That conclusion is understandable. It is also dangerous.

These are the questions leadership should be asking instead:

  • Can it find the same issue tomorrow?
  • Can it find the same issue a hundred times in a row?
  • Can it classify findings consistently?
  • Can it track them over time?
  • Can it survive model changes, rate limits, and outages?
  • Can it survive production traffic?
  • Can it survive cost constraints?
  • Can it survive edge cases?
  • Is any of this provable?
  • How are we measuring that proof?
  • How will we know when we get it wrong?

Once you start asking those questions, the real engineering cost begins to surface.

AI makes this trap easier to fall into because the output looks polished. The report has structure. The language is confident. The vulnerability may be legitimate. It feels like a product. But a finding that looks right is not the same thing as a system you can trust.
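
To make the “hundred times in a row” question concrete, here is a minimal sketch of a repeatability check. It assumes each scan run has already been reduced to a set of finding identifiers and that you keep a list of issues the system is expected to find. The identifiers and the `detection_rates` helper are hypothetical and not taken from any particular tool.

```python
from collections import Counter


def detection_rates(runs, expected):
    """Return, for each expected finding, the fraction of runs that reported it."""
    if not runs:
        raise ValueError("at least one run is required")
    counts = Counter()
    for reported in runs:
        # Only count findings we expected; extra findings are ignored here.
        for finding_id in expected & set(reported):
            counts[finding_id] += 1
    return {fid: counts[fid] / len(runs) for fid in sorted(expected)}


# Hypothetical results from four runs against the same, unchanged code.
runs = [
    {"sqli-orders", "idor-invoices"},
    {"sqli-orders"},
    {"sqli-orders", "idor-invoices"},
    {"sqli-orders", "ssrf-webhook"},
]
expected = {"sqli-orders", "idor-invoices", "ssrf-webhook"}

rates = detection_rates(runs, expected)
for finding_id, rate in rates.items():
    print(f"{finding_id}: found in {rate:.0%} of runs")
```

In this made-up data, the SQL injection shows up every time while the SSRF appears in only one run out of four. A single screenshot of that one run would look identical to a screenshot from a system that finds it reliably.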

The Work Nobody Wants to Demo

The industry is understandably focused on scanners because scanners are easy to explain. Point this at your target and it finds problems. That is a clean story. It fits in a demo. It works in a launch post.

But the moment you get detection working, an entirely new set of challenges appears.

Vulnerability identity is a real problem. Deduplication is a real problem. Regression testing is a real problem. False negative analysis is a real problem. Benchmarking, evidence quality, observability, cost control, noise suppression, triage workflows, and knowing when a system has drifted are all difficult engineering challenges.

These are not side quests. They are the difference between “we found a bug” and “we can run a security program around this.”

Security teams do not just need findings. They need assurance.
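
As one illustration of what regression testing and drift detection can look like, the sketch below compares a baseline run with a candidate run (for example, after a model or prompt change). It assumes findings already carry stable identifiers, represented here as hypothetical strings, and that some subset has been confirmed as real by a human. The function names are illustrative only.

```python
def compare_runs(baseline, candidate):
    """Split findings into those lost, newly reported, and kept between two runs."""
    baseline, candidate = set(baseline), set(candidate)
    return {
        "lost": sorted(baseline - candidate),
        "new": sorted(candidate - baseline),
        "kept": sorted(baseline & candidate),
    }


def regressed(baseline, candidate, confirmed):
    """True if the candidate run dropped any finding previously confirmed as real."""
    previously_found = set(baseline) & set(confirmed)
    return bool(previously_found - set(candidate))


# Hypothetical identifiers; "confirmed" holds findings a human has verified.
baseline = {"sqli-orders", "idor-invoices", "xss-profile"}
candidate = {"sqli-orders", "xss-profile", "ssrf-webhook"}
confirmed = {"sqli-orders", "idor-invoices"}

diff = compare_runs(baseline, candidate)
print(diff)
print("regression:", regressed(baseline, candidate, confirmed))
```

The new SSRF finding would make a good demo. The confirmed IDOR that quietly disappeared is the part a security program needs to notice.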

AI Makes Measurement Harder, Not Optional

Agentic systems introduce a kind of variability that security teams are not used to dealing with at this level.

The same code can produce different investigation paths across runs. The model might inspect different files, endpoints, or tools. It might prioritize different evidence, or explain the issue in a different way. One run might describe a vulnerability as an authorization bypass. Another might call the same underlying issue insecure direct object reference. Another might frame it as broken access control. The titles change. The severity can change. The evidence chain can change. The recommended fix can change.

The code did not change. The vulnerability did not change. The model’s path through the problem changed.

That is fine when you are experimenting. It is much harder when you are trying to operate a security program.

This is why measurement cannot be treated as an afterthought. With AI security systems, measurement is part of the product. If you cannot evaluate the system, you cannot safely depend on the system.
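
One way to keep a finding's identity stable when titles drift between runs is to fingerprint it on where it is and what class it belongs to, not on how the model worded it. The sketch below is a simplified assumption of how that could work: the alias table, the choice of fields, and the `fingerprint` function are all hypothetical, and a real system would need to handle moved code, renamed symbols, and classes that do not map cleanly.

```python
import hashlib

# Hypothetical alias table: different titles for the same underlying class.
CANONICAL_CLASS = {
    "authorization bypass": "broken-access-control",
    "insecure direct object reference": "broken-access-control",
    "idor": "broken-access-control",
    "broken access control": "broken-access-control",
}


def canonical_class(title):
    """Normalize case and whitespace, then map known aliases to one class name."""
    key = " ".join(title.lower().split())
    return CANONICAL_CLASS.get(key, key)


def fingerprint(repo, path, symbol, title):
    """Hash the location and canonical class; the free-text title is not hashed."""
    parts = [repo, path, symbol, canonical_class(title)]
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


# Three runs describe the same issue in the same (hypothetical) handler.
titles = [
    "Authorization Bypass",
    "Insecure Direct Object Reference",
    "Broken access control",
]
prints = {fingerprint("shop-api", "views/invoices.py", "get_invoice", t) for t in titles}
print(len(prints))  # one fingerprint for all three titles
```

With something like this in place, three differently worded reports collapse into one tracked finding, which is what deduplication and tracking over time depend on.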

Production & Hard Lessons

One of the mistakes I made early was assuming that as models got better, the surrounding engineering would get much easier. Better reasoning would mean fewer orchestration problems. Larger context windows would mean less complexity. Stronger models would smooth out the rough edges.

While it is true that better models led to incrementally better analysis, I can confidently say that the engineering did not get easier.

Rate limits still matter. Cost still matters. Latency still matters. Tool failures still matter. Agentic steering still matters. Caching still matters. Observability still matters. Evaluation still matters. Repeatability still matters. None of that goes away because the model got smarter.

Production traffic has a way of finding every assumption you made in the prototype and politely setting it on fire.

Real repositories are messy. Build systems fail in ways no one put in the happy-path demo. Monorepos punish naive assumptions. Frameworks get customized until they barely resemble the documentation. Engineering teams do things that make perfect sense for their business and absolutely no sense for your tool’s architecture. Extraneous model usage in a prototype is easy to ignore. Extraneous model usage across thousands of scans can become a budget problem in minutes. 

A missed tool call in a demo is a bug to fix later. A missed tool call in production can mean a missed vulnerability. A flaky workflow in a side project is tolerable. A flaky workflow that developers and security teams depend on becomes a trust problem.

And once a security tool loses trust, getting it back is hard.

The Question Leaders Should Be Asking

When someone tells me they have built an internal AI security platform, I don’t ask which model they are using, which agentic framework and tools they employ, or even about their design.

What I want to know is how they are measuring it.

Not in the abstract. Not “we looked at some results and they seemed good.” I mean actual evaluation. 

  • What is the test set, what are the conditions, and how is it measured? 
  • What vulnerabilities should the system find? What does it consistently miss? 
  • How repeatable are the results? How are findings fingerprinted? How are duplicates handled? 
  • What happens when the model changes? What happens when you tweak your agentic flow? 
  • What happens when the tools fail or you hit rate limits, errors, and outages?
  • How do you know whether the system improved? How do you know it did not quietly get worse?
  • What is the token cost per vulnerability, at scale, over a period of time? 
  • How are you measuring and managing those cost fluctuations?

These are the very serious questions any AI security tool should be able to answer. The industry has gotten very good at capability demonstrations. 

What we have not gotten nearly good enough at is capability validation.
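
As a rough sketch of what "actual evaluation" can mean, the example below scores one scan against a labeled test set and reports precision, recall, and token cost per true positive. It assumes findings have stable identifiers and that token usage is available from your model provider's usage reporting. The `evaluate` function, the identifiers, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    precision: float
    recall: float
    tokens_per_true_positive: float


def evaluate(reported, ground_truth, tokens_used):
    """Score reported findings against the labeled issues for one scan."""
    reported, ground_truth = set(reported), set(ground_truth)
    tp = len(reported & ground_truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    # No true positives means the cost per real finding is unbounded.
    cost = tokens_used / tp if tp else float("inf")
    return EvalResult(precision, recall, cost)


# Hypothetical scan: four findings reported, five issues known to exist.
result = evaluate(
    reported={"sqli-orders", "idor-invoices", "xss-profile", "false-alarm"},
    ground_truth={"sqli-orders", "idor-invoices", "xss-profile",
                  "ssrf-webhook", "jwt-alg"},
    tokens_used=1_200_000,
)
print(result)
```

Tracking these three numbers per run, per model version, and per workflow change is one simple way to tell whether the system improved or quietly got worse.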

This Is Not About Vendors Winning

Some companies absolutely should build their own AI security capabilities. Some have the headcount, talent, time, and budget to do it. Some internal teams will build things vendors have not thought of yet.

I encourage people to experiment, to learn, and to build.

I would NOT, however, recommend making critical business and security decisions based on unvalidated assumptions.

The danger isn’t the demo. The danger is believing the demo represents a mature capability. The danger is cutting expertise before you've validated what that expertise was doing. The danger is restructuring processes and teams around confidence that hasn't been earned.

Because if we're wrong, the cost isn't a bad LinkedIn post. The cost is burned-out defenders. The cost is missed vulnerabilities. The cost is security teams trapped between old tools they can't replace and new tools they can't trust. The cost is breaches.

So yes, please, build! It is the best education you can receive so that you can have truly informed conversations on the subject. Just remember that the first vulnerability feels like a breakthrough. The next ten thousand are where engineering starts. Adjust business decisions accordingly.

Proof Is the Part That Matters

The question underneath all of this is not whether AI is significantly better at detecting vulnerabilities. It is. Full stop. That argument is over.

The question is whether we can build systems around AI that are consistent enough, measurable enough, observable enough, and trustworthy enough to support real security work. That is the part that will separate useful products from impressive demos.

That work is less exciting than the first screenshot. It is also where the value is.

If we get this wrong, security teams will end up stuck between old tools they no longer trust and new tools they cannot yet depend on. Leaders will make budget decisions based on confidence that was never validated. And the vulnerabilities we miss will not care how good the demo looked.

That is the measurement problem.

The first vulnerability is where the excitement starts. The proof is where the security starts.
