
vulnhawk - AI-powered SAST scanner

Static analysis scanner that finds auth bypass, IDOR, and business logic bugs that Semgrep and CodeQL miss. Ships as a free GitHub Action covering Python, JS/TS, Go, PHP, and Ruby.


VulnHawk's framing of the SAST gap is the right one. Traditional static analysers - Semgrep, CodeQL, SonarQube - excel at known-bad patterns. They struggle with the bug class that defines real-world web app security: when 19 endpoints check authorisation and the 20th doesn't, there's no pattern to match against. The vulnerability is the absence of a pattern.

VulnHawk targets exactly that gap. For every code chunk it analyses, it pulls in related code from elsewhere in the codebase as context - other handlers in the same directory, the auth middleware, the guard patterns. The AI then has the comparison surface it needs to spot the inconsistency. That enrichment step is the whole differentiator.
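To make the "absence of a pattern" idea concrete, here is a minimal sketch of the comparison in plain Python. It is not VulnHawk's implementation: the handler inventory, the guard names, and the 75% threshold are all hypothetical, chosen only to show why the signal comes from looking at sibling handlers rather than at any single one.

```python
from collections import Counter

# Hypothetical inventory: handler name -> guards applied to it.
HANDLERS = {
    "get_invoice": {"require_login", "require_owner"},
    "update_invoice": {"require_login", "require_owner"},
    "delete_invoice": {"require_login", "require_owner"},
    "export_invoice": {"require_login"},  # the odd one out
}


def find_guard_outliers(handlers, threshold=0.75):
    """Return {handler: missing guards} for guards that most siblings apply."""
    counts = Counter(g for guards in handlers.values() for g in guards)
    # A guard counts as "expected" when at least `threshold` of handlers use it.
    expected = {g for g, n in counts.items() if n / len(handlers) >= threshold}
    return {
        name: sorted(expected - guards)
        for name, guards in handlers.items()
        if expected - guards
    }


print(find_guard_outliers(HANDLERS))
```

Nothing in `export_invoice` looks wrong in isolation; it only stands out once the other three handlers are in view. A real scanner has to infer the guard inventory from code, which is the hard part this sketch skips.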

Quick start

pip install vulnhawk

Pick a backend. The Claude Code and Codex backends are the headline path because they're free for existing subscribers - VulnHawk pipes prompts through your local CLI rather than charging you per token:

# Free with a Claude Code Max or Team subscription
vulnhawk scan ./src -b claude-code

# Free with ChatGPT Pro or Plus
vulnhawk scan ./src -b codex

# Anthropic API (paid)
export ANTHROPIC_API_KEY=sk-ant-...
vulnhawk scan ./src

# OpenAI API (paid)
vulnhawk scan ./src -b openai -m gpt-4o

# Ollama - free, local, fully private
vulnhawk scan ./src -b ollama -m llama3.1

No config files, no rules to write, no database to build.

Per-scan cost on the API backends is real but small - around $0.50-$2.00 per scan on Claude API for ~100 files; $1-$4 on OpenAI. Ollama is the fully air-gapped path.
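For budgeting, the quoted ranges translate into a monthly figure with simple multiplication. The sketch below assumes every scan costs as much as a full ~100-file scan and that the team runs 200 scans a month; both numbers are hypothetical, and incremental PR scans would likely come in lower.

```python
# Per-scan cost ranges quoted above, in USD, for roughly 100 files.
COST_PER_SCAN = {"claude-api": (0.50, 2.00), "openai": (1.00, 4.00)}


def monthly_cost(backend, scans_per_month):
    """Return the (low, high) monthly spend in USD for a backend."""
    low, high = COST_PER_SCAN[backend]
    return (low * scans_per_month, high * scans_per_month)


# Hypothetical team: 10 scans per working day over 20 working days.
print(monthly_cost("claude-api", 200))  # (100.0, 400.0)
print(monthly_cost("openai", 200))      # (200.0, 800.0)
```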

What it actually finds

The vulnerability classes the README lists:

  • Missing authorisation on 1-of-N endpoints - the bug is the absence of a check. No regex catches it.
  • IDOR / BOLA - requires understanding that the user ID in the JWT should match the ID in the URL.
  • Payment amount manipulation - business logic. The amount field shouldn't be trusted from the client.
  • Inconsistent input validation - 5 handlers sanitise, the 6th doesn't. Cross-file comparison required.
  • Stored input misuse - input saved safely, but eval()'d or raw-SQL'd three files away.
  • Race conditions in state updates - concurrent balance modifications without locking.

Semgrep and CodeQL can, in principle, catch some of these if you write the right custom rule for your codebase, but in practice few teams do. VulnHawk does the cross-file comparison automatically.
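The IDOR / BOLA item in the list above is easiest to see side by side. The following is an illustrative sketch, not code from VulnHawk or its README: the `Claims` class stands in for a decoded JWT payload, and the `ORDERS` store and both handlers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claims:
    """Hypothetical decoded JWT payload."""
    user_id: int
    is_admin: bool = False


# Hypothetical data store keyed by order ID.
ORDERS = {101: {"owner_id": 1, "total": 40}, 102: {"owner_id": 2, "total": 75}}


def get_order_vulnerable(claims, order_id):
    # The caller is authenticated, but ownership is never checked: an IDOR.
    return ORDERS[order_id]


def get_order_checked(claims, order_id):
    order = ORDERS[order_id]
    # The ownership comparison that the vulnerable handler is missing.
    if order["owner_id"] != claims.user_id and not claims.is_admin:
        raise PermissionError("order belongs to another user")
    return order
```

Both handlers are syntactically unremarkable, which is why pattern matching has little to grab onto. The difference is a single comparison between the token's user ID and the resource's owner.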

Scan modes

vulnhawk scan ./src                   # full scan
vulnhawk scan ./src --mode auth       # auth bypass, missing checks, session flaws
vulnhawk scan ./src --mode injection  # SQLi, command injection, SSTI, XSS
vulnhawk scan ./src --mode secrets    # hardcoded keys, tokens, passwords
vulnhawk scan ./src --mode config     # debug mode, permissive CORS, insecure cookies
vulnhawk scan ./src --mode crypto     # weak hashing, hardcoded keys, bad RNG

For most web apps, --mode auth is the highest-signal subset. Run it first.

Output formats

vulnhawk scan ./src -o json -f results.json     # JSON
vulnhawk scan ./src -o sarif -f results.sarif   # SARIF (GitHub Code Scanning)
vulnhawk scan ./src -o markdown -f report.md    # Markdown report

SARIF is the one to standardise on - GitHub Code Scanning accepts it directly, so findings show up in the repository's Security tab, and it lets you chain VulnHawk with other tools (more on that below).
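Because SARIF is plain JSON, a results file is easy to post-process. The sketch below counts results per severity level and rule using only the standard `runs[].results[]` layout from SARIF 2.1.0. The sample log and its rule IDs are made up for illustration; they are not actual VulnHawk output.

```python
import json
from collections import Counter


def summarise_sarif(sarif):
    """Count results per (level, ruleId) across all runs in a SARIF log."""
    summary = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Simplification: fall back to "warning" when the result has no
            # level, ignoring any rule-level default configuration.
            level = result.get("level", "warning")
            summary[(level, result.get("ruleId", "unknown"))] += 1
    return summary


# Hypothetical minimal SARIF log; rule IDs are illustrative only.
SAMPLE = json.loads("""
{"version": "2.1.0", "runs": [{"tool": {"driver": {"name": "example-scanner"}},
  "results": [
    {"ruleId": "missing-authz", "level": "error", "message": {"text": "a"}},
    {"ruleId": "missing-authz", "level": "error", "message": {"text": "b"}},
    {"ruleId": "weak-hash", "message": {"text": "c"}}
  ]}]}
""")

for (level, rule), count in sorted(summarise_sarif(SAMPLE).items()):
    print(f"{level:8} {rule:15} {count}")
```

To run it against a real file, load it with `json.load(open("results.sarif"))` and pass the result to `summarise_sarif`.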

SARIF input - the chained-tools workflow

This is the underrated capability. VulnHawk can ingest SARIF from Semgrep, CodeQL, Snyk, or any SARIF-emitting tool, and use those findings as additional context:

semgrep --config auto ./src -o semgrep.sarif --sarif
vulnhawk scan ./src --sarif-input semgrep.sarif

What that enables:

  • Validation - VulnHawk checks whether other tools' findings are real or false positives.
  • Discovery - finds related vulnerabilities near flagged locations.
  • Multi-step attack chains - connects findings across tools into actual exploit paths.
  • Fix verification - checks whether suggested fixes address the actual root cause.

The recommended layering, in the project's own words:

  1. Semgrep - fast, deterministic gatekeeping on known-bad patterns.
  2. CodeQL - deep taint tracking across complex call chains.
  3. VulnHawk - business logic, auth gaps, IDOR, and inconsistencies that rules can't express.

It's a complementary layer, not a replacement.

GitHub Action

VulnHawk runs as a baseline scan on the default branch and incrementally on every PR. The recommended config:

name: VulnHawk Security Scan
on:
  push:
    branches: [main, master]
  pull_request:

permissions:
  security-events: write
  contents: read

jobs:
  vulnhawk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: momenbasel/vulnhawk@main
        with:
          target: '.'
          backend: 'claude-code'
          claude-code-oauth-token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          severity: 'medium'
          fail-on-findings: 'true'

For the Claude Code backend in CI, get your OAuth token with claude config get oauth_token on your local machine, then add it as a GitHub Actions secret. Findings flow into the Security > Code Scanning tab via SARIF.

Languages

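| Language   | Extensions | Framework detection                        |
|------------|------------|--------------------------------------------|
| Python     | .py        | Django, Flask, FastAPI                     |
| JavaScript | .js .jsx   | Express, Fastify, Next.js                  |
| TypeScript | .ts .tsx   | Express, NestJS, Fastify                   |
| Go         | .go        | net/http handlers                          |
| Java       | .java      | class/method splitting                     |
| PHP        | .php       | Laravel routes, classes, traits, interfaces |
| Ruby       | .rb .erb   | Rails routes, classes, modules             |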

The framework detection is the part to verify if your stack is exotic - VulnHawk uses it to know what an "endpoint" is for the purposes of the cross-comparison step. Generic Python or generic Go works; framework-aware chunking is what makes the auth-comparison story land.
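As a rough illustration of what framework-aware chunking involves, the sketch below uses Python's standard `ast` module to find Flask-style route handlers and list the other decorators on each. This is an assumption about the general technique, not a description of VulnHawk's parser, and the sample source is hypothetical.

```python
import ast

# Hypothetical Flask-style module. It is only parsed, never executed.
SOURCE = '''
@app.route("/invoices/<int:invoice_id>")
@login_required
def get_invoice(invoice_id):
    return load(invoice_id)

@app.route("/invoices/<int:invoice_id>/export")
def export_invoice(invoice_id):
    return export(invoice_id)

def helper():
    pass
'''


def decorator_name(node):
    """Return the dotted name of a decorator, dropping any call arguments."""
    if isinstance(node, ast.Call):
        node = node.func
    return ast.unparse(node)  # requires Python 3.9+


def find_endpoints(source):
    """Map each route-decorated function to its other decorators."""
    endpoints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = [decorator_name(d) for d in node.decorator_list]
            if any(n.endswith(".route") for n in names):
                endpoints[node.name] = [
                    n for n in names if not n.endswith(".route")
                ]
    return endpoints


print(find_endpoints(SOURCE))
```

Here `helper` is ignored because it has no route decorator, and `export_invoice` comes back with an empty guard list. On an unsupported framework the route decorators would not be recognised, so nothing would be treated as an endpoint, which is why the framework list matters.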

How it actually works

Codebase -> Discover -> Chunk -> Enrich -> Analyze -> Validate -> Report

The Enrich step is where the magic lives. For each chunk, VulnHawk includes other functions and routes from the same directory plus auth middleware and guard patterns from across the codebase. The LLM gets the comparison surface; the comparison is the analysis.
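A minimal sketch of what an enrichment step along these lines could look like, assuming chunks carry a file path and a name. The `Chunk` type, the `AUTH_HINTS` keyword list, and the sibling limit are hypothetical; the project's real selection logic may differ.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Chunk:
    """Hypothetical unit of analysis: one function or route."""
    path: str
    name: str
    code: str = ""


# Assumed heuristic: path fragments that suggest auth-related code.
AUTH_HINTS = ("auth", "guard", "permission", "middleware")


def enrich(target, all_chunks, max_siblings=5):
    """Collect comparison context: same-directory siblings plus auth code."""
    directory = PurePosixPath(target.path).parent
    siblings = [
        c for c in all_chunks
        if c != target and PurePosixPath(c.path).parent == directory
    ][:max_siblings]
    auth = [
        c for c in all_chunks
        if c != target and c not in siblings
        and any(hint in c.path.lower() for hint in AUTH_HINTS)
    ]
    return {"target": target, "siblings": siblings, "auth": auth}


CHUNKS = [
    Chunk("api/invoices.py", "get_invoice"),
    Chunk("api/invoices.py", "export_invoice"),
    Chunk("api/users.py", "get_user"),
    Chunk("core/auth_middleware.py", "require_owner"),
    Chunk("core/db.py", "connect"),
]

context = enrich(CHUNKS[1], CHUNKS)
print([c.name for c in context["siblings"]], [c.name for c in context["auth"]])
```

The resulting bundle is what would be placed in the prompt alongside the target chunk. Capping the sibling count is one way to keep prompt size, and therefore cost, bounded.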

.vulnhawkignore (gitignore syntax) excludes paths from scanning. vulnhawk info ./src is the dry-run preview - files, chunks, language breakdown - without paying for the LLM call.

When to reach for it

  • Web apps with auth, IDOR, or business-logic surface area where rule-based tools have stopped finding new bugs.
  • CI/CD pipelines that already have Semgrep or CodeQL and want a complementary layer for the gaps.
  • Security reviews of unfamiliar codebases where writing custom rules isn't worth the effort.
  • Teams with a Claude Code or Codex subscription that want SAST without a separate paid tool.

When not to

  • Air-gapped environments where you can't run any LLM. Ollama partially mitigates this if you have a local GPU; otherwise this isn't the right tool.
  • Workloads where you specifically need deterministic, reproducible results across runs. AI reasoning isn't reproducible by default; this is a real trade-off.
  • Tiny scripts where pattern-based scanners catch everything you care about.

Trade-offs and limits

Code chunks are sent to the configured LLM provider unless you use Ollama. Treat that the same way you'd treat sending code to any other cloud security service - it's a normal trade-off but worth being explicit about.

Source-available license rather than full open source - free for everyone (individuals, teams, startups, enterprises) for internal use, but you can't sell it, offer it as a competing service, or redistribute forks as products. Forks for upstream PRs are explicitly allowed.

LLM-based analysis is non-deterministic, and that matters in practice. Findings on the same code can vary slightly run-to-run; the project leans on validation (a second LLM pass) to filter false positives, but expect some volatility. The right way to use it: gate on new findings between runs, not on absolute counts.
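Gating on new findings can be done by diffing two SARIF logs. The sketch below is one possible approach, not a VulnHawk feature: it fingerprints each result by rule ID and file URI, an assumed and deliberately coarse key that ignores line numbers so that unrelated edits do not make old findings look new. The sample logs are hypothetical.

```python
def fingerprints(sarif):
    """Return a set of (ruleId, file URI) pairs for every result in the log."""
    found = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            uri = (
                locations[0]
                .get("physicalLocation", {})
                .get("artifactLocation", {})
                .get("uri", "")
            )
            found.add((result.get("ruleId", "unknown"), uri))
    return found


def new_findings(baseline, current):
    """Findings present in the current run but not in the baseline."""
    return fingerprints(current) - fingerprints(baseline)


def make_log(*pairs):
    """Build a minimal hypothetical SARIF log from (ruleId, uri) pairs."""
    results = [
        {
            "ruleId": rule,
            "locations": [
                {"physicalLocation": {"artifactLocation": {"uri": uri}}}
            ],
        }
        for rule, uri in pairs
    ]
    return {"version": "2.1.0", "runs": [{"results": results}]}


baseline = make_log(("missing-authz", "api/invoices.py"))
current = make_log(
    ("missing-authz", "api/invoices.py"), ("idor", "api/orders.py")
)
print(new_findings(baseline, current))
```

In CI, a wrapper script could exit non-zero when `new_findings` is non-empty. The trade-off of the coarse key is that a second instance of the same rule in an already-flagged file goes unnoticed.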
