Just Clear a Day: What We Learned Running an AI Security Hackathon

Jaime Arze

June 05, 2026

Security teams are professionally trained to be skeptical of new technology. That is the job. We assess risk, we model emerging threats, we assume the worst and plan for it. So when AI showed up, our first instinct was to treat it like everything else: as a new attack surface to defend.

That instinct is right, but it is also incomplete.

While most teams stay focused on tactical work and modeling today's threat landscape, they continue to drown in work. Alert queues that never empty. New sources of signal. CVEs that need triaging. Evidence that needs to be pulled. The same technology we spend our days scrutinizing is also the thing that could give teams hours or even days back in their week.

So we stopped debating it and we ran a hackathon. One day. Four teams. No slides. Just build something that solves a real problem you have right now. They did. And the gap between “we should use AI for this” and “here it is, running” turned out to be a lot smaller than anyone expected.

Why a hackathon, and why now

There is a particular kind of organizational paralysis around AI adoption. Everyone agrees it matters. Everyone has written some prompts. Written a few papers. But few teams have shipped anything their people actually use day to day.

A hackathon cuts through that, and it does it cheaply. You are not committing to a roadmap or a budget line. You are committing to a single day and a simple premise: pick a problem you understand intimately, and see how far you can get toward solving it before the clock runs out.

What you get back is disproportionate to what you put in. You learn which problems are actually tractable with current tooling and which are not. You find the people on your team who are quietly excellent at this. And you replace abstract anxiety with concrete evidence, which is the only thing that ever really moves an organization.

The value is the same whether you have ten people or one hundred. The mechanics scale down cleanly. A three-person team can run a focused afternoon sprint and walk away with something real.

How to run one that actually works

We learned a few things, some of them the hard way. If you are going to do this, here is what matters.

Scope is everything. The single biggest predictor of whether a team shipped was whether they picked a narrow problem and went deep. The temptation to build the entire platform is real and it will kill your demo every time. Force teams to define their minimum viable product in the proposal, before they write a line of code. "A script that enriches fifty indicators and outputs a threat brief" is a good scope. "An autonomous SOC" is not.

Make people propose before they build. We required a short proposal a week out: target user, the friction point, the impact metric, the tools. This does two things. It forces clarity of thought before anyone touches a keyboard, and it lets leadership flag scope creep or missing access while there is still time to fix it.

Confirm tool access before the clock starts, not after. Nothing wastes a sprint faster than discovering at 10am that nobody has the API key. Build a prep window into the schedule and treat it as mandatory.
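As a rough illustration of the proposal and access checks described above, here is a minimal sketch of how a proposal could be reviewed a week out. The `Proposal` fields mirror the items we asked for, but the class, the `review` function, and the tool names are hypothetical; we did not use this code on the day.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """One team's pre-hackathon proposal (hypothetical structure)."""
    team: str
    target_user: str
    friction_point: str
    impact_metric: str
    mvp: str
    # Tool name -> has access been confirmed for this team?
    tools: dict[str, bool] = field(default_factory=dict)


def review(proposal: Proposal) -> list[str]:
    """Return the issues leadership should raise before the clock starts."""
    issues = []
    for name in ("target_user", "friction_point", "impact_metric", "mvp"):
        if not getattr(proposal, name).strip():
            issues.append(f"missing {name}")
    if not proposal.tools:
        issues.append("no tools listed")
    for tool, confirmed in proposal.tools.items():
        if not confirmed:
            issues.append(f"access not confirmed: {tool}")
    return issues


if __name__ == "__main__":
    draft = Proposal(
        team="example-team",
        target_user="threat intel analyst",
        friction_point="manual indicator enrichment",
        impact_metric="",
        mvp="A script that enriches fifty indicators and outputs a threat brief",
        tools={"llm_api_key": False, "issue_tracker": True},
    )
    for issue in review(draft):
        print(issue)
```

An empty list means the team is ready to start; anything else is a conversation to have during the prep window rather than at 10am on the day.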

Build in a midpoint checkpoint. Our single most useful agenda item was a thirty-minute status check halfway through the morning. Every team showed what was running, named their biggest blocker, and got a scope decision from their sponsor on the spot. The teams that were behind did not catch up by working harder. They caught up by cutting scope, and the checkpoint is where that call got made.

Stop building before you think you should. We enforced a hard stop ninety minutes before demos. New features after that point are not features. They are risks. The time goes to rehearsal and making sure the thing does not break when someone else runs it.

Judge on the right things. We scored on three criteria: impact, feasibility, and innovation, weighted in that order. Impact came first deliberately. A clever tool that solves a fake problem is worthless. Feasibility came second because "great idea, impossible to ship" is the most common hackathon failure mode. Innovation came last so teams would not over-rotate on complexity at the expense of actually solving something.
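To make the weighting concrete, here is a minimal sketch of a weighted scorecard. Only the ordering of the criteria comes from our rubric; the specific weights (0.5, 0.3, 0.2), the 1 to 5 scale, and the team names are assumptions chosen for illustration.

```python
# Hypothetical weights: only the ordering (impact > feasibility > innovation)
# reflects the rubric described above; the numbers are illustrative.
WEIGHTS = {"impact": 0.5, "feasibility": 0.3, "innovation": 0.2}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 1-5) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank(teams: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (team, score) pairs, highest weighted score first."""
    scored = ((team, weighted_score(s)) for team, s in teams.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    teams = {
        "clever-but-niche": {"impact": 2, "feasibility": 3, "innovation": 5},
        "boring-but-useful": {"impact": 5, "feasibility": 4, "innovation": 2},
    }
    for team, score in rank(teams):
        print(f"{team}: {score:.2f}")
```

With weights like these, a project that scores top marks on innovation but low on impact still ranks below a plain tool that solves a real problem, which is the behavior the ordering is meant to produce.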

A clean structure for a single day looks roughly like this: a short kickoff to lock scope and confirm assignments, a first build sprint, a midpoint checkpoint, a second longer sprint, a hard stop for prep and a dry run, then demos and judging. Eight hours is comfortable. Six is doable if your teams come in prepared.

What we built

We ran teams of three, drawn from across the security org and paired deliberately so that people who do not normally work together had to. Much of the credit goes to that diversity of thought. Here is what they shipped.

Atlas, an ownership resolver

Problem: As in most organizations at scale, our data lives across multiple systems. Answering detailed product questions across a broad portfolio can mean piecing together context from several places. That institutional knowledge is valuable, but the difficulty of surfacing it on demand can slow down cross-team hand-offs or time-sensitive analysis during an investigation.

Architecture: A RAG-based conversational agent that indexes four sources (our source-control, documentation, team-chat, and issue-tracking systems) and reconciles them with a confidence model. When at least three of four sources agree on an owner, the answer is high confidence. When they conflict, Atlas surfaces the conflict alongside its best guess and flags it for human review. It returns structured answers with citations back to the originating system, and a nightly job regenerates canonical product cards in markdown that the team can review and commit back to the architecture artifacts repo as a ratified source of truth. Two interfaces were planned, a CLI and a Slack bot, with the CLI delivered on the day.
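The agreement rule above can be sketched in a few lines. This is not Atlas's actual code: the `OwnershipAnswer` type, the `resolve_owner` function, the source keys, and the team names are all hypothetical, and the sketch assumes that a single dissenting source is surfaced as a conflict but only sends the answer to human review when fewer than three sources agree.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OwnershipAnswer:
    product: str
    owner: Optional[str]   # best guess, or None if no source had data
    confidence: str        # "high" or "low"
    needs_review: bool
    citations: dict[str, str] = field(default_factory=dict)  # sources backing the owner
    conflicts: dict[str, str] = field(default_factory=dict)  # sources that disagree


def resolve_owner(product: str, claims: dict[str, str], min_agree: int = 3) -> OwnershipAnswer:
    """Reconcile per-source ownership claims (source name -> claimed owner)."""
    if not claims:
        return OwnershipAnswer(product, None, "low", needs_review=True)
    # The most frequently claimed owner is the best guess.
    best, votes = Counter(claims.values()).most_common(1)[0]
    high = votes >= min_agree
    return OwnershipAnswer(
        product=product,
        owner=best,
        confidence="high" if high else "low",
        needs_review=not high,
        citations={src: o for src, o in claims.items() if o == best},
        conflicts={src: o for src, o in claims.items() if o != best},
    )


if __name__ == "__main__":
    answer = resolve_owner(
        "billing-api",
        {
            "source_control": "team-payments",
            "documentation": "team-payments",
            "team_chat": "team-payments",
            "issue_tracker": "team-platform",
        },
    )
    print(answer)
```

Keeping the dissenting sources in the answer, even when confidence is high, means the citations show a reviewer exactly which system is out of date.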

Stack: A frontier LLM for entity resolution and Q&A, drawing on our source-control, documentation, team-chat, and issue-tracking systems for repo and ownership data, subject-matter-expert lists, behavioral ownership signals (who actually answers questions about which products), and risk assignments.

They shipped a working CLI returning cited answers, and they did it a person down, with a teammate out sick. The Slack bot and web UI are next.

Vendor Assessment Tool, a faster third-party risk workflow

Problem: Third-party risk assessments are slow and almost entirely manual. An analyst reads vendor documentation, cross-references it against questionnaires, maps controls, and writes a structured report, every step done by hand across disconnected files.

Architecture: A Claude Code CLI tool that ingests vendor documentation and questionnaire responses, reasons across them, and produces a structured third-party risk (TPR) assessment report. The analyst runs the tool; Claude Code handles the document analysis, gap identification, and report generation, with Hyperproof TPRM supplying the control framework context. The story worth telling here is the scoping. The team originally proposed IRIS, an ambitious eight-agent threat analysis pipeline spanning six source systems, and during the sprint they cut it down to this focused vendor assessment slice. That was the right call, and it is exactly the instinct a hackathon is meant to teach.

Stack: Claude Code on the CLI, Hyperproof TPRM for vendor and control context, and local vendor documentation and questionnaire files as inputs.

The narrow version works. They ran a complete live assessment in the demo, and the analyst who owns this workflow could use it the next day.

EvidenceRelay, compliance made easy

Problem: Audit evidence collection is a high-volume, repetitive task. An evidence request lands as a ticket, and an engineer pivots into the right evidence source for that control, runs a query, captures the result, and attaches it back. It is exactly the kind of structured, well-understood workflow where giving an analyst a faster path is a clear win.

Architecture: A human-triggered, multi-agent pipeline. A SecOps engineer opens the ticket and explicitly invokes the agent; it never runs automatically. The agent reads the ticket, identifies the correct evidence source, runs the query, and pulls the result. Then a second agent in an auditor role independently validates that output and assigns a confidence score before anything is written. If confidence is low, it populates partial evidence and explicitly flags what still needs human input, with no silent guessing. On completion it writes a cited comment with the tool, query, raw result, and timestamp, sets a ready-for-review label, and reassigns to a human who reviews and closes. Evidence then syncs to the GRC platform through the existing integration, with every field that sync depends on preserved.
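The control flow of that pipeline can be sketched as follows. This is a simplified illustration, not the team's implementation: the two agents are stand-in callables (`collect` and `audit`), the 0.8 confidence threshold is an assumed value, and the `Evidence` and `TicketUpdate` types are hypothetical. It also assumes the ready-for-review label is set in both the high and low confidence cases, with the open items carried separately.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; the real value is not stated above


@dataclass
class Evidence:
    tool: str
    query: str
    raw_result: str


@dataclass
class TicketUpdate:
    comment: str
    label: str
    assignee: str
    open_items: list[str] = field(default_factory=list)


def run_evidence_request(
    ticket_text: str,
    invoked_by: str,
    collect: Callable[[str], Evidence],
    audit: Callable[[str, Evidence], tuple[float, list[str]]],
    now: Optional[datetime] = None,
) -> TicketUpdate:
    """Collect evidence, have a second agent audit it, and draft the ticket update."""
    if not invoked_by:
        # Human-triggered only: refuse to run without a named invoker.
        raise ValueError("the pipeline must be explicitly invoked by a human")
    evidence = collect(ticket_text)                # first agent: pick source, run query
    confidence, gaps = audit(ticket_text, evidence)  # second agent: independent check
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    lines = [
        f"Tool: {evidence.tool}",
        f"Query: {evidence.query}",
        f"Raw result: {evidence.raw_result}",
        f"Collected at: {timestamp}",
        f"Auditor confidence: {confidence:.2f}",
    ]
    open_items: list[str] = []
    if confidence < CONFIDENCE_THRESHOLD:
        # No silent guessing: list exactly what a human still has to supply.
        open_items = list(gaps)
        lines.append("PARTIAL EVIDENCE. Human input still needed:")
        lines.extend(f"- {gap}" for gap in open_items)
    return TicketUpdate(
        comment="\n".join(lines),
        label="ready-for-review",
        assignee=invoked_by,  # hand back to the human who reviews and closes
        open_items=open_items,
    )
```

Passing the agents in as plain callables keeps the orchestration testable without any live connection to the issue tracker or the evidence sources.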

Stack: A frontier agent SDK, an MCP connection to our issue tracker, and direct API and MCP connections to our cloud security, EDR, and SIEM platforms. No new cloud infrastructure; it runs against existing tool APIs.

None of these are production systems. They are proofs of concept built in a day. But several are close enough that the real question is no longer "is this possible" but "what would it take to ship it." That question is where the actual roadmap value lives, and we are already working through it.

The takeaway

The most important thing we walked away with was not a tool. It was the evidence that the barrier to using this technology is lower than people assume. The gap is mostly permission and time, not skill. Our people already understood the problems intimately. They just needed a day and a mandate to go solve them.

The threat side of AI is real and we will keep doing the work required. But the opportunity side is just as real, and security teams do not have the luxury of sitting it out. The defender with a tool doing the first pass on five hundred alerts is a fundamentally different defender than the one doing it by hand.

If you lead a team and you have not done this yet, the advice is simple. Clear a day. Give people the space. You will be surprised what they build when you get out of the way.
