Everyone is building scanners. Almost nobody is building proof.
Lately, teams keep getting asked some variation of the same question: “Why don’t you just use <insert latest AI tool/model> for your <SAST/DAST/vendor> stuff?”
Sometimes it is Kiro. Sometimes it is Claude Code. Sometimes it is Codex. Sometimes it is whatever new cyber model just dropped with a slick name, a leaderboard screenshot, and a benchmark score that looks like it was designed to end budget conversations on sight.
The name changes, but the assumption is always the same: if AI can find vulnerabilities, why are we still using security consultants, running a bug bounty program, or buying security products?
It is a fair question. It is also one of those questions that sounds simpler than it is.
The problem is not that the technology is fake. The problem is almost the opposite. The technology is real enough, impressive enough, and useful enough that it can make smart people believe they are farther along than they actually are. A good demo can create a lot of confidence very quickly. In security, confidence that has not been earned is where things start getting expensive.
Agents Are Amazing
Let’s get something out of the way. Agents are awesome. I am not skeptical of AI agents. I think they are one of the most important advances application security has seen in years.
They can pseudo-reason through code in ways older tools could not. They can investigate instead of just pattern-match. They can follow execution paths, inspect surrounding context, use tools, connect evidence, and often uncover vulnerabilities that traditional approaches either miss or bury under noise. For anyone who has spent years watching AppSec tools struggle with context, this is not a small improvement. It is a meaningful shift.
That is why I care so much about getting this right. The more powerful the technology becomes, the more careful we need to be about what we think it proves.
A Finding Is Not a Program
Every week I see another post from someone who built an AI security tool. It might be a multi-model harness, an autonomous pentesting agent, a pull request reviewer, or a scanner wired together from a model, a few tools, and a clever workflow. The system runs. It finds something real. The screenshot looks good. The explanation is convincing.
Then the comments start.
Why do we need vendors anymore? Why can’t our security team just build this? Why are we still paying consultants? Why do we have all these AppSec tools? Why don’t we just use Kiro for our SAST work?
Fair questions, but the conversation has skipped quite a few important steps along the way.
Finding the first vulnerability is not the finish line. It is the starting line.
A demo asks, “Can this thing find a bug?”
A security program has to ask something much harder: can we depend on this system over time?
That question changes everything.
Depending on a system over time is a very different problem from producing a good demo.
The Three-Week Trap
I have seen the same pattern play out enough times that it is starting to feel familiar.
An engineer spends a few weeks building an internal harness. The harness calls a model, looks at code, produces findings, and some of those findings are real. Everyone gets excited. There is something remarkable about seeing a system reason its way into a vulnerability that would have taken a human a long time to find.
Then leadership sees the output and draws a very tempting conclusion: if we built this in three weeks, why are we spending so much money on products, vendors, consultants, and specialists?
That conclusion is understandable. It is also dangerous.
Here are the questions leadership should be asking instead:
- Can it find the same issue tomorrow?
- Can it find the same issue a hundred times in a row?
- Can it classify findings consistently?
- Can it track them over time?
- Can it survive model changes, rate limits, and outages?
- Can it survive production traffic?
- Can it survive cost constraints?
- Can it survive edge cases?
- Is any of this provable?
- How are we measuring that proof?
- How will we know when we get it wrong?
Once you start asking those questions, the real engineering cost begins to surface.
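As a rough illustration of the "hundred times in a row" question, here is a minimal sketch that runs a scanner repeatedly and reports how often each known issue shows up. Everything in it is an assumption made for the example: `make_flaky_scanner` is a hypothetical stand-in for a real harness, and the issue IDs are invented.

```python
import random
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set


def detection_rates(
    scan: Callable[[], Iterable[str]],
    expected_ids: Set[str],
    runs: int = 100,
) -> Dict[str, float]:
    """Run `scan` repeatedly and report the fraction of runs that found each expected issue."""
    hits: Counter = Counter()
    for _ in range(runs):
        found = set(scan())
        for issue_id in expected_ids & found:
            hits[issue_id] += 1
    return {issue_id: hits[issue_id] / runs for issue_id in sorted(expected_ids)}


def make_flaky_scanner(seed: int) -> Callable[[], List[str]]:
    """Hypothetical scanner: always finds one issue, finds another about 60% of the time."""
    rng = random.Random(seed)

    def scan() -> List[str]:
        found = ["sqli-login"]
        if rng.random() < 0.6:
            found.append("idor-invoice")
        return found

    return scan


if __name__ == "__main__":
    expected = {"sqli-login", "idor-invoice", "ssrf-webhook"}
    rates = detection_rates(make_flaky_scanner(seed=7), expected)
    for issue_id, rate in rates.items():
        print(f"{issue_id}: found in {rate:.0%} of runs")
```

A single successful run would have hidden two of the three outcomes here: the issue that is found only some of the time, and the one that is never found at all.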
AI makes this trap easier to fall into because the output looks polished. The report has structure. The language is confident. The vulnerability may be legitimate. It feels like a product. But a finding that looks right is not the same thing as a system you can trust.
The Work Nobody Wants to Demo
The industry is understandably focused on scanners because scanners are easy to explain. Point this at your target and it finds problems. That is a clean story. It fits in a demo. It works in a launch post.
But the moment you get detection working, an entirely new set of challenges appears.
Vulnerability identity is a real problem. Deduplication is a real problem. Regression testing is a real problem. False negative analysis is a real problem. Benchmarking, evidence quality, observability, cost control, noise suppression, triage workflows, and knowing when a system has drifted are all difficult engineering challenges.
These are not side quests. They are the difference between “we found a bug” and “we can run a security program around this.”
Security teams do not just need findings. They need assurance.
AI Makes Measurement Harder, Not Optional
Agentic systems introduce a kind of variability that security teams are not used to dealing with at this level.
The same code can produce different investigation paths across runs. The model might inspect different files, endpoints, or tools. It might prioritize different evidence, or explain the issue in a different way. One run might describe a vulnerability as an authorization bypass. Another might call the same underlying issue insecure direct object reference. Another might frame it as broken access control. The titles change. The severity can change. The evidence chain can change. The recommended fix can change.
The code did not change. The vulnerability did not change. The model’s path through the problem changed.
That is fine when you are experimenting. It is much harder when you are trying to operate a security program.
This is why measurement cannot be treated as an afterthought. With AI security systems, measurement is part of the product. If you cannot evaluate the system, you cannot safely depend on the system.
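One common way to keep a vulnerability's identity stable when titles, severities, and explanations vary between runs is to fingerprint findings on what does not change: the normalized vulnerability class and the code location. The sketch below is one possible approach, not a description of any particular product. The `Finding` fields and the `CANONICAL_CLASS` mapping are assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical mapping from free-form model labels to one canonical class.
CANONICAL_CLASS = {
    "authorization bypass": "broken-access-control",
    "insecure direct object reference": "broken-access-control",
    "idor": "broken-access-control",
    "broken access control": "broken-access-control",
}


@dataclass(frozen=True)
class Finding:
    title: str       # free-form, varies between runs
    vuln_class: str  # label as reported by the model
    file_path: str
    symbol: str      # function or route handler where the issue lives
    severity: str    # may also vary between runs


def fingerprint(finding: Finding) -> str:
    """Hash only the stable parts: canonical class plus code location."""
    label = finding.vuln_class.strip().lower()
    canonical = CANONICAL_CLASS.get(label, label)
    key = "|".join([canonical, finding.file_path.strip(), finding.symbol.strip()])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def dedupe(findings: Iterable[Finding]) -> Dict[str, Finding]:
    """Keep the first finding seen for each fingerprint."""
    unique: Dict[str, Finding] = {}
    for finding in findings:
        unique.setdefault(fingerprint(finding), finding)
    return unique
```

Title and severity are deliberately left out of the hash, so three differently worded reports about the same handler collapse into one tracked issue. The trade-off is that the mapping table needs maintenance, and a refactor that moves the code will change the fingerprint.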
Production & Hard Lessons
One of the mistakes I made early was assuming that as models got better, the surrounding engineering would get much easier. Better reasoning would mean fewer orchestration problems. Larger context windows would mean less complexity. Stronger models would smooth out the rough edges.
While it is true that better models led to incrementally better analysis, I can confidently say that the engineering did not get easier.
Rate limits still matter. Cost still matters. Latency still matters. Tool failures still matter. Agentic steering still matters. Caching still matters. Observability still matters. Evaluation still matters. Repeatability still matters. None of that goes away because the model got smarter.
Production traffic has a way of finding every assumption you made in the prototype and politely setting it on fire.
Real repositories are messy. Build systems fail in ways no one put in the happy-path demo. Monorepos punish naive assumptions. Frameworks get customized until they barely resemble the documentation. Engineering teams do things that make perfect sense for their business and absolutely no sense for your tool’s architecture. Extraneous model usage in a prototype is easy to ignore. Extraneous model usage across thousands of scans can become a budget problem in minutes.
A missed tool call in a demo is a bug to fix later. A missed tool call in production can mean a missed vulnerability. A flaky workflow in a side project is tolerable. A flaky workflow that developers and security teams depend on becomes a trust problem.
And once a security tool loses trust, getting it back is hard.
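One way to keep a failed tool call from silently turning into a missed vulnerability is to make "incomplete" a first-class outcome, distinct from "clean." The sketch below assumes a scan is a series of named steps that each return a list of finding IDs. `ScanReport` and `run_step` are hypothetical names, and the retry policy is simple exponential backoff.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScanReport:
    findings: List[str] = field(default_factory=list)
    failed_steps: List[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        # A scan with failed steps is never reported as clean.
        if self.failed_steps:
            return "incomplete"
        return "findings" if self.findings else "clean"


def run_step(
    name: str,
    step: Callable[[], List[str]],
    report: ScanReport,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> None:
    """Run one step with retries; record the step as failed if every attempt raises."""
    for attempt in range(attempts):
        try:
            report.findings.extend(step())
            return
        except Exception:  # broad on purpose for this sketch; narrow it in real code
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    report.failed_steps.append(name)
```

With this shape, a scan that hit a rate limit halfway through reports `incomplete` rather than an empty, reassuring result.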
The Question Leaders Should Be Asking
When someone tells me they have built an internal AI security platform, I don’t ask which model they are using, which agentic framework and tools they’re employing, or even how it is designed.
What I want to know is how they are measuring it.
Not in the abstract. Not “we looked at some results and they seemed good.” I mean actual evaluation.
- What is the test set, what are the conditions, and how is it measured?
- What vulnerabilities should the system find? What does it consistently miss?
- How repeatable are the results? How are findings fingerprinted? How are duplicates handled?
- What happens when the model changes? What happens when you tweak your agentic flow?
- What happens when the tools fail or you hit rate limits, errors, and outages?
- How do you know whether the system improved? How do you know it did not quietly get worse?
- What is the token cost per vulnerability, at scale, over a period of time?
- How are you measuring and managing those cost fluctuations?
These are serious questions, and any team running an AI security tool should be able to answer them. The industry has gotten very good at capability demonstrations.
What we have not gotten nearly good enough at is capability validation.
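To make a few of those questions concrete, here is a minimal sketch of scoring a run against a labeled test set and comparing it with a previous run. It assumes findings have already been reduced to stable identifiers (plain strings here) and that token usage per run is known. `RunResult`, `evaluate`, and `regressions` are illustrative names, and the metrics are ordinary precision and recall.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class RunResult:
    label: str
    reported: FrozenSet[str]  # identifiers of findings the system reported
    tokens_used: int


def evaluate(run: RunResult, ground_truth: Set[str]) -> Dict[str, float]:
    """Score one run against the set of issues known to exist in the test set."""
    true_pos = run.reported & ground_truth
    recall = len(true_pos) / len(ground_truth) if ground_truth else 0.0
    precision = len(true_pos) / len(run.reported) if run.reported else 0.0
    tokens_per_tp = run.tokens_used / len(true_pos) if true_pos else float("inf")
    return {"recall": recall, "precision": precision, "tokens_per_true_positive": tokens_per_tp}


def regressions(baseline: RunResult, candidate: RunResult, ground_truth: Set[str]) -> Set[str]:
    """Known issues the baseline found that the candidate no longer finds."""
    return (baseline.reported & ground_truth) - candidate.reported


if __name__ == "__main__":
    truth = {"a", "b", "c", "d"}
    before = RunResult("model-v1", frozenset({"a", "b", "c", "x"}), tokens_used=400_000)
    after = RunResult("model-v2", frozenset({"a", "b", "d"}), tokens_used=300_000)
    print(evaluate(before, truth))
    print(evaluate(after, truth))
    print("regressions:", regressions(before, after, truth))
```

In this made-up comparison, the newer run has the same recall, better precision, and a lower cost per true positive, yet it no longer finds issue "c". Aggregate scores alone would not show that, which is why the regression check is tracked separately.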
This Is Not About Vendors Winning
Some companies absolutely should build their own AI security capabilities. Some have the headcount, talent, time, and budget to do it. Some internal teams will build things vendors have not thought of yet.
I encourage people to experiment, to learn, and to build.
I would NOT, however, recommend making critical business and security decisions based on unvalidated assumptions.
The danger isn’t the demo. The danger is believing the demo represents a mature capability. The danger is cutting expertise before you've validated what that expertise was doing. The danger is restructuring processes and teams around confidence that hasn't been earned.
Because if we're wrong, the cost isn't a bad LinkedIn post. The cost is burned-out defenders. The cost is missed vulnerabilities. The cost is security teams trapped between old tools they can't replace and new tools they can't trust. The cost is breaches.
So yes, please, build! It is the best education you can receive so that you can have truly informed conversations on the subject. Just remember that the first vulnerability feels like a breakthrough. The next ten thousand are where engineering starts. Adjust business decisions accordingly.
Proof Is the Part That Matters
The question underneath all of this is not whether AI is significantly better than older tools at detecting vulnerabilities. It is. Full stop. That argument is over.
The question is whether we can build systems around AI that are consistent enough, measurable enough, observable enough, and trustworthy enough to support real security work. That is the part that will separate useful products from impressive demos.
That work is less exciting than the first screenshot. It is also where the value is.
If we get this wrong, security teams will end up stuck between old tools they no longer trust and new tools they cannot yet depend on. Leaders will make budget decisions based on confidence that was never validated. And the vulnerabilities we miss will not care how good the demo looked.
That is the measurement problem.
The first vulnerability is where the excitement starts. The proof is where the security starts.