Tool	Accuracy of Findings	Detects Non-Pattern-Based Issues?	Coverage of SAST Findings	Speed of Scanning	Usability & Dev Experience
DryRun Security	Very high – caught multiple critical issues missed by others	Yes – context-based analysis, logic flaws & SSRF	Broad coverage of standard vulns, logic flaws, and extendable	Near real-time PR feedback	Clear PR comments, expandable policies with no scripting or coding (NLCP)
Snyk Code	High on well-known patterns (SQLi, XSS), but misses other categories	Limited – AI-based, focuses on recognized vulnerabilities	Good coverage of standard vulns; may miss SSRF or advanced auth logic issues	Fast, often near PR speed	Decent GitHub integration, but rules are a black box
GitHub Advanced Security (CodeQL)	Very high precision for known queries, low false positives	Partial – strong dataflow for known issues, needs custom queries	Good for SQLi and XSS but logic flaws require advanced CodeQL experience.	Moderate to slow (GitHub Action based)	Requires CodeQL expertise for custom logic
Semgrep	Medium, but there is a good community for adding rules	Primarily pattern-based with limited dataflow	Decent coverage with the right rules, can still miss advanced logic or SSRF	Fast scans	Has custom rules, but dev teams must maintain them
SonarQube	Low – misses serious issues in our testing	Limited – mostly pattern-based, code quality oriented	Basic coverage for standard vulns, many hotspots require manual review	Moderate, usually in CI	Dashboard-based approach, can pass “quality gate” despite real vulns

Tool	Accuracy of Findings	Detects Non-Pattern-Based Issues?	Coverage of C# Vulnerabilities	Scan Speed	Developer Experience
DryRun Security	Very high – caught all critical flaws missed by others	Yes – context-based analysis finds logic errors, auth flaws, etc.	Broad coverage of OWASP Top 10 vulns plus business logic issues	Near real-time (PR comment within seconds)	Clear single PR comment with detailed insights; no config or custom scripts needed
Snyk Code	High on known patterns (SQLi, XSS), but misses logic/flow bugs	Limited – focuses on recognizable vulnerability patterns	Good for standard vulns; may miss SSRF or auth logic issues	Fast (integrates into PR checks)	Decent GitHub integration, but rules are a black box (no easy customization)
GitHub Advanced Security (CodeQL)	Low – missed everything except SQL Injection	Mostly pattern-based	Low – only discovered SQL Injection	Slowest of all but finished in 1 minute	Concise annotation with a suggested fix and optional auto-remediation
Semgrep	Medium – finds common issues with community rules, some misses	Primarily pattern-based, limited data flow analysis	Decent coverage with the right rules; misses advanced logic flaws	Very fast (runs as lightweight CI)	Custom rules possible, but require maintenance and security expertise
SonarQube	Low – missed serious issues in our testing	Mostly pattern-based (code quality focus)	Basic coverage for known vulns; many issues flagged as “hotspots” require manual review	Moderate (runs in CI/CD pipeline)	Results in dashboard; risk of false sense of security if quality gate passes despite vulnerabilities

Dimension	Why It Matters
Surface	Entry points & data sources highlight tainted flows early.
Language	Code idioms reveal hidden sinks and framework quirks.
Intent	What is the purpose of the code being changed/added?
Design	Robustness and resilience of changing code.
Environment	Libraries, build flags, infra metadata, and infrastructure as code (IaC) all give clues about the risks in the changing code.

KPI	Pattern-Based SAST	DryRun CSA
Mean Time to Regex	3–8 hrs per noisy finding set	Not required
Mean Time to Context	N/A	< 1 min
False-Positive Rate	50–85%	< 5%
Logic-Flaw Detection	< 5%	90%+

	Severity
Location	utils/authorization.py :L118	utils/authorization.py :L49 & L82 & L164
Issue	JWT Algorithm Confusion Attack: jwt.decode() selects the algorithm from unverified JWT headers.	Insecure OIDC Endpoint Communication: urllib.request.urlopen called without explicit TLS/CA handling.
Impact	Complete auth bypass (switch RS256→HS256, forge tokens with public key as HMAC secret).	Susceptible to MITM if default SSL behavior is weakened or cert store compromised.
Remediation	Replace the dynamic algorithm selection with a fixed, expected algorithm list. Change line 118 from algorithms=[unverified_header.get('alg', 'RS256')] to algorithms=['RS256'] to only accept RS256 tokens. Add algorithm validation before token verification to ensure the header algorithm matches expected values.	Create a secure SSL context using ssl.create_default_context() with proper certificate verification. Configure explicit timeout values for all HTTP requests to prevent hanging connections. Add explicit SSL/TLS configuration by creating an HTTPSHandler with the secure SSL context. Implement proper error handling specifically for SSL certificate validation failures.
Key Insight	This vulnerability arises from trusting an unverified portion of the JWT to determine the verification method itself	This vulnerability stems from a lack of explicit secure communication practices, leaving the application reliant on potentially weak default behaviors.
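
The two remediations in the table above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from utils/authorization.py: the function names and the ALLOWED_ALGORITHMS constant are assumptions made for the example. The header check only covers the "validate the algorithm before verification" step; the actual signature check would still pass a fixed list such as algorithms=['RS256'] to jwt.decode(), as the table describes.

```python
import base64
import json
import ssl
import urllib.request

# Fixed allowlist: the accepted algorithm is never taken from the token itself.
ALLOWED_ALGORITHMS = ("RS256",)


def header_algorithm(token: str) -> str:
    """Read the `alg` value from a JWT header without trusting it."""
    header_b64 = token.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore stripped padding
    header = json.loads(base64.urlsafe_b64decode(padded))
    return str(header.get("alg", ""))


def check_algorithm(token: str) -> None:
    """Reject tokens whose header names an algorithm outside the allowlist."""
    alg = header_algorithm(token)
    if alg not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unexpected JWT algorithm: {alg!r}")


def build_secure_context() -> ssl.SSLContext:
    """Create a context that verifies certificates and hostnames explicitly."""
    return ssl.create_default_context()


def build_secure_opener(context: ssl.SSLContext) -> urllib.request.OpenerDirector:
    """Build an opener whose HTTPS handler uses the given context."""
    return urllib.request.build_opener(urllib.request.HTTPSHandler(context=context))
```

A caller would run check_algorithm(token) before verification and fetch OIDC metadata with build_secure_opener(build_secure_context()).open(url, timeout=5), catching ssl.SSLCertVerificationError separately from other network errors. The 5-second timeout is an arbitrary example value.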

AI in AppSec

June 16, 2026

The AI Security Industry Has a Measurement Problem

Everyone is building scanners. Almost nobody is building proof.

Lately, teams keep getting asked some variation of the same question: “Why don’t you just use <insert latest AI tool/model> for your <SAST/DAST/vendor> stuff?”

Sometimes it is Kiro. Sometimes it is Claude Code. Sometimes it is Codex. Sometimes it is whatever new cyber model just dropped with a slick name, a leaderboard screenshot, and a benchmark score that looks like it was designed to end budget conversations on sight.

The name changes, but the assumption is always the same: if AI can find vulnerabilities, why are we still using security consultants, running a bug bounty program, or buying security products?

It is a fair question. It is also one of those questions that sounds simpler than it is.

The problem is not that the technology is fake. The problem is almost the opposite. The technology is real enough, impressive enough, and useful enough that it can make smart people believe they are farther along than they actually are. A good demo can create a lot of confidence very quickly. In security, confidence that has not been earned is where things start getting expensive.

Agents Are Amazing

Let’s get something out of the way. Agents are awesome. I am not skeptical of AI agents. I think they are one of the most important advances application security has seen in years.

They can pseudo-reason through code in ways older tools could not. They can investigate instead of just pattern-match. They can follow execution paths, inspect surrounding context, use tools, connect evidence, and often uncover vulnerabilities that traditional approaches either miss or bury under noise. For anyone who has spent years watching AppSec tools struggle with context, this is not a small improvement. It is a meaningful shift.

That is why I care so much about getting this right. The more powerful the technology becomes, the more careful we need to be about what we think it proves.

A Finding Is Not a Program

Every week I see another post from someone who built an AI security tool. It might be a multi-model harness, an autonomous pentesting agent, a pull request reviewer, or a scanner wired together from a model, a few tools, and a clever workflow. The system runs. It finds something real. The screenshot looks good. The explanation is convincing.

Then the comments start.

Why do we need vendors anymore? Why can’t our security team just build this? Why are we still paying consultants? Why do we have all these AppSec tools? Why don’t we just use Kiro for our SAST work?

Fair questions, but the conversation skipped quite a few important steps along the way.

Finding the first vulnerability is not the finish line. It is the starting line.

A demo asks, “Can this thing find a bug?”

A security program has to ask something much harder: can we depend on this system over time?

That question changes everything.

That is a very different problem than producing a good demo.

The Three-Week Trap

I have seen the same pattern play out enough times that it is starting to feel familiar.

An engineer spends a few weeks building an internal harness. The harness calls a model, looks at code, produces findings, and some of those findings are real. Everyone gets excited. There is something remarkable about seeing a system reason its way into a vulnerability that would have taken a human a long time to find.

Then leadership sees the output and draws a very tempting conclusion: if we built this in three weeks, why are we spending so much money on products, vendors, consultants, and specialists?

That conclusion is understandable. It is also dangerous.

Here are the questions leadership should be asking instead:

Can it find the same issue tomorrow?
Can it find the same issue a hundred times in a row?
Can it classify findings consistently?
Can it track them over time?
Can it survive model changes, rate limits, and outages?
Can it survive production traffic?
Can it survive cost constraints?
Can it survive edge cases?
Is any of this provable?
How are we measuring that proof?
How will we know when we get it wrong?
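
The first two questions can be turned into a number. The sketch below is one way to do it, assuming a hypothetical scan callable that returns stable finding IDs on each run. It reports, for every finding the system is expected to catch, the fraction of runs that actually caught it.

```python
from collections import Counter
from typing import Callable, Iterable


def detection_rates(
    scan: Callable[[], Iterable[str]],
    expected: set[str],
    runs: int = 100,
) -> dict[str, float]:
    """Run `scan` repeatedly; return per expected finding the share of runs that reported it."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = Counter()
    for _ in range(runs):
        reported = set(scan())            # one full scan of the same, unchanged code
        hits.update(reported & expected)  # count only findings we expected to see
    return {finding: hits[finding] / runs for finding in sorted(expected)}
```

A rate of 1.0 means the finding showed up in every run. Anything lower is a repeatability gap that a single successful demo would never reveal.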

Once you start asking those questions, the real engineering cost begins to surface.

AI makes this trap easier to fall into because the output looks polished. The report has structure. The language is confident. The vulnerability may be legitimate. It feels like a product. But a finding that looks right is not the same thing as a system you can trust.

The Work Nobody Wants to Demo

The industry is understandably focused on scanners because scanners are easy to explain. Point this at your target and it finds problems. That is a clean story. It fits in a demo. It works in a launch post.

But the moment you get detection working, an entirely new set of challenges appears.

Vulnerability identity is a real problem. Deduplication is a real problem. Regression testing is a real problem. False negative analysis is a real problem. Benchmarking, evidence quality, observability, cost control, noise suppression, triage workflows, and knowing when a system has drifted are all difficult engineering challenges.

These are not side quests. They are the difference between “we found a bug” and “we can run a security program around this.”

Security teams do not just need findings. They need assurance.

AI Makes Measurement Harder, Not Optional

Agentic systems introduce a kind of variability that security teams are not used to dealing with at this level.

The same code can produce different investigation paths across runs. The model might inspect different files, endpoints, or tools. It might prioritize different evidence, or explain the issue in a different way. One run might describe a vulnerability as an authorization bypass. Another might call the same underlying issue insecure direct object reference. Another might frame it as broken access control. The titles change. The severity can change. The evidence chain can change. The recommended fix can change.

The code did not change. The vulnerability did not change. The model’s path through the problem changed.
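
One common way to cope with this is to give each finding an identity that ignores the parts the model rewrites on every run. The sketch below is an illustration under stated assumptions, not a description of any particular product. The Finding fields and the CANONICAL_CLASS mapping are hypothetical, and the fingerprint is built only from the file path, the enclosing symbol, and a normalized weakness class.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical mapping from model-chosen labels to one canonical weakness class.
CANONICAL_CLASS = {
    "authorization bypass": "access-control",
    "insecure direct object reference": "access-control",
    "broken access control": "access-control",
}


@dataclass(frozen=True)
class Finding:
    title: str   # model-written, varies between runs
    label: str   # model-chosen category, varies between runs
    path: str    # repo-relative file path
    symbol: str  # enclosing function or handler name


def fingerprint(finding: Finding) -> str:
    """Derive a stable ID that ignores the title and the exact label wording."""
    label = finding.label.strip().lower()
    weakness = CANONICAL_CLASS.get(label, label)  # unknown labels fall back to themselves
    key = "\x1f".join((finding.path, finding.symbol, weakness))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def deduplicate(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding seen for each fingerprint."""
    unique: dict[str, Finding] = {}
    for finding in findings:
        unique.setdefault(fingerprint(finding), finding)
    return list(unique.values())
```

Line numbers are left out of the key on purpose, since they shift with unrelated edits. A hand-maintained label mapping like this one will miss labels it has never seen, which is part of why vulnerability identity is a hard problem and not a solved one.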

That is fine when you are experimenting. It is much harder when you are trying to operate a security program.

This is why measurement cannot be treated as an afterthought. With AI security systems, measurement is part of the product. If you cannot evaluate the system, you cannot safely depend on the system.

Production & Hard Lessons

One of the mistakes I made early was assuming that as models got better, the surrounding engineering would get much easier. Better reasoning would mean fewer orchestration problems. Larger context windows would mean less complexity. Stronger models would smooth out the rough edges.

While better models did lead to incrementally better analysis, I can confidently say that the engineering did not get easier.

Rate limits still matter. Cost still matters. Latency still matters. Tool failures still matter. Agentic steering still matters. Caching still matters. Observability still matters. Evaluation still matters. Repeatability still matters. None of that goes away because the model got smarter.

Production traffic has a way of finding every assumption you made in the prototype and politely setting it on fire.

Real repositories are messy. Build systems fail in ways no one put in the happy-path demo. Monorepos punish naive assumptions. Frameworks get customized until they barely resemble the documentation. Engineering teams do things that make perfect sense for their business and absolutely no sense for your tool’s architecture. Extraneous model usage in a prototype is easy to ignore. Extraneous model usage across thousands of scans can become a budget problem in minutes.

A missed tool call in a demo is a bug to fix later. A missed tool call in production can mean a missed vulnerability. A flaky workflow in a side project is tolerable. A flaky workflow that developers and security teams depend on becomes a trust problem.

And once a security tool loses trust, getting it back is hard.

The Question Leaders Should Be Asking

When someone tells me they have built an internal AI security platform, I don’t ask which model they are using, which agentic framework and tools they’re employing, or even about their design.

What I want to know is how they are measuring it.

Not in the abstract. Not “we looked at some results and they seemed good.” I mean actual evaluation.

What is the test set, what are the conditions, and how is it measured?
What vulnerabilities should the system find? What does it consistently miss?
How repeatable are the results? How are findings fingerprinted? How are duplicates handled?
What happens when the model changes? What happens when you tweak your agentic flow?
What happens when the tools fail or you hit rate limits, errors, and outages?
How do you know whether the system improved? How do you know it did not quietly get worse?
What is the token cost per vulnerability, at scale, over a period of time?
How are you measuring and managing those cost fluctuations?
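
Several of these questions reduce to simple arithmetic once there is a labeled test set. The sketch below assumes each evaluation run is summarized as a set of finding fingerprints plus a token count. The EvalResult type and the function names are illustrative, not an established API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    found: frozenset[str]  # fingerprints reported by this run
    tokens_used: int       # total model tokens spent on this run


def recall(result: EvalResult, expected: frozenset[str]) -> float:
    """Share of the expected findings that the run reported."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(result.found & expected) / len(expected)


def regressions(
    baseline: EvalResult, candidate: EvalResult, expected: frozenset[str]
) -> frozenset[str]:
    """Expected findings the baseline caught and the candidate missed."""
    return (baseline.found & expected) - candidate.found


def tokens_per_finding(result: EvalResult, expected: frozenset[str]) -> float:
    """Token cost per confirmed finding; infinite if nothing was confirmed."""
    confirmed = len(result.found & expected)
    return result.tokens_used / confirmed if confirmed else float("inf")
```

Comparing a baseline and a candidate on all three numbers matters because they can disagree. A new model or a tweaked agent flow can hold recall steady and lower cost while quietly dropping a finding the old version caught, and only the regressions set shows that.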

These are the very serious questions any AI security tool should be able to answer. The industry has gotten very good at capability demonstrations.

What we have not gotten nearly good enough at is capability validation.

This Is Not About Vendors Winning

Some companies absolutely should build their own AI security capabilities. Some have the headcount, talent, time, and budget to do it. Some internal teams will build things vendors have not thought of yet.

I encourage people to experiment, to learn, and to build.

I would NOT, however, recommend making critical business and security decisions based on unvalidated assumptions.

The danger isn’t the demo. The danger is believing the demo represents a mature capability. The danger is cutting expertise before you’ve validated what that expertise was doing. The danger is restructuring processes and teams around confidence that hasn’t been earned.

Because if we’re wrong, the cost isn’t a bad LinkedIn post. The cost is burned-out defenders. The cost is missed vulnerabilities. The cost is security teams trapped between old tools they can’t replace and new tools they can’t trust. The cost is breaches.

So yes, please, build! It is the best education you can receive so that you can have truly informed conversations on the subject. Just remember that the first vulnerability feels like a breakthrough. The next ten thousand are where engineering starts. Adjust business decisions accordingly.

Proof Is the Part That Matters

The question underneath all of this is not whether AI is significantly better at detecting vulnerabilities. It is. Full stop. That argument is over.

The question is whether we can build systems around AI that are consistent enough, measurable enough, observable enough, and trustworthy enough to support real security work. That is the part that will separate useful products from impressive demos.

That work is less exciting than the first screenshot. It is also where the value is.

If we get this wrong, security teams will end up stuck between old tools they no longer trust and new tools they cannot yet depend on. Leaders will make budget decisions based on confidence that was never validated. And the vulnerabilities we miss will not care how good the demo looked.

That is the measurement problem.

The first vulnerability is where the excitement starts. The proof is where the security starts.

Ken Johnson

Co-founder & CTO