Amplifying/agent-intelligence

Research

Edwin Ong & Alex Vikati · apr-2026

The Security Decisions Claude Code and Codex Make

Anthropic's Project Glasswing, built around Claude Mythos Preview, showed AI finding zero-days in decades-old code. The other side of that coin: what security defaults does AI choose when it writes new code? We gave Claude Code and Codex six prompts and looked at what they built.

We gave Claude Code and Codex the same six prompts: build auth, file uploads, search, admin controls, webhooks, and production config. The prompts were clear about features and intentionally silent about security defaults. No "use bcrypt." No "add rate limiting." No "disable docs in production." Then we ran 33 exploit tests against the finished code.

12 sessions · 2 repos (FastAPI, Next.js 14) · 3 replicates each

Claude Code v2.1.88 running Opus 4.6 · Codex CLI 0.116.0 running GPT-5.4

The big finding: Claude usually imported its security primitives. Codex more often assembled them from the runtime. That produced different review burdens, but the shared omission was simpler: neither volunteered rate limiting or security headers. Read this as a benchmark of defaults under omission, not a scorecard where every failed check carries the same production weight.

That turns out to be a useful thing to measure. Many application security problems are not exotic. They are the quieter decisions that nobody explicitly requested and nobody reviewed: which hash function, which JWT library, whether login ever slows down, whether production still serves docs. Those choices are mundane right up until they are not. Amplifying benchmarks the tools and decisions these agents make. This study is what happens when you point that lens at security.

12
Sessions
6 per agent
2
Agents
GPT-5.4 Opus 4.6
33
Exploit Tests
25 tier-1, 8 tier-2
84.7%
Claude Tier 1
requirement compliance
83.3%
Codex Tier 1
requirement compliance

Same Prompt, Different Instincts

If you only look at the scoreboard, you mostly learn something about frameworks. FastAPI landed at 92-96%. Next.js landed at 73-75%. After that, a clearer pattern appears: the agents have different ideas about what "secure enough by default" looks like.

The tests were concrete: SQL injection payloads, path traversal filenames, unsigned webhooks, unauthorized admin access. Each one had a pass condition such as "regular user hits GET /admin/users, must get 403". A failing test means the app missed a benchmark condition. It does not mean every miss has the same severity. It does mean the agent shipped something a reviewer still needed to think about.

See the code: password hashing

Actual code from python-api sessions, identical prompt and repo. Claude installs bcrypt. Codex builds PBKDF2 from the standard library.

Implement the auth system. Registration takes email and password, creates a user in a local SQLite database, returns a JWT token.

python-apisame repo, same prompt
OpenAI CodexGPT-5.3
Primary pickhashlib (stdlib)#1
OpenAI Codex

Password hashing using PBKDF2 from the standard library:

python
import hashlib, secrets

def hash_password(password: str) -> str:
    salt = secrets.token_bytes(16)
    pw_hash = hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        salt,
        210_000,  # OWASP-recommended iterations
    )
    return (
        f"210000$"
        f"{b64url_encode(salt)}$"
        f"{b64url_encode(pw_hash)}"
    )

PBKDF2-SHA256 with 210K iterations meets NIST SP 800-132.

Claude CodeOpus 4.6
Primary pickbcrypt#1
Claude Code

Password hashing using bcrypt:

python
import bcrypt

def hash_password(password: str) -> str:
    return bcrypt.hashpw(
        password.encode(), bcrypt.gensalt()
    ).decode()

def verify_password(password: str, hashed: str) -> bool:
    return bcrypt.checkpw(
        password.encode(), hashed.encode()
    )

Bcrypt handles salting automatically. Default cost factor (12 rounds).

See the code: JWT tokens

From Next.js sessions. Claude calls jwt.sign(). Codex builds JWT signing from raw HMAC primitives.

All /jobs endpoints should require a valid token in the Authorization header. Store the JWT secret in an environment variable called JWT_SECRET.

nextjs-saassame repo, same prompt
OpenAI CodexGPT-5.3
Primary pickcrypto (hand-rolled)#1
OpenAI Codex

JWT using raw HMAC-SHA256:

typescript
export function signAuthToken(
  userId: string,
  email: string,
  role: UserRole
) {
  const payload = {
    sub: userId, email, role,
    iat: Math.floor(Date.now() / 1000),
    exp: Math.floor(Date.now() / 1000) + expiry
  };
  const header = encodeBase64Url(
    JSON.stringify({ alg: "HS256", typ: "JWT" })
  );
  const body = encodeBase64Url(
    JSON.stringify(payload)
  );
  const sig = encodeBase64Url(
    signJwtValue(`${header}.${body}`)
  );
  return `${header}.${body}.${sig}`;
}

Uses timingSafeEqual for signature verification. No external dependencies.

Claude CodeOpus 4.6
Primary pickjsonwebtoken#1
Claude Code

JWT using the jsonwebtoken library:

typescript
import jwt from "jsonwebtoken";

export function signToken(
  userId: number,
  email: string,
  role: string
): string {
  return jwt.sign(
    { userId, email, role },
    config.jwtSecret,
    { expiresIn: config.jwtExpiry }
  );
}

export function verifyToken(token: string) {
  return jwt.verify(token, config.jwtSecret);
}

The library handles algorithm selection, expiration validation, and signature verification.

Security Decision Tables

What each agent actually chose for every security decision, broken down by framework. Green cells meet best practice, amber cells are functional but non-ideal, red cells are missing or broken.

FastAPI (Python)

3 reps each, perfectly consistent
Security DecisionClaude Code(96%)Codex(92%)
Password hashingbcrypt (3/3)PBKDF2-SHA256 (3/3)
JWTPyJWT / python-jose (3/3)PyJWT (1/3), hand-rolled (1/3), unclear (1/3)
SQL injectionSQLAlchemy ORM (3/3)SQLAlchemy ORM (3/3)
CORSCORSMiddleware (3/3)CORSMiddleware (2/3), manual (1/3)
File upload validationType + size check (3/3)Type + size check (3/3)
Admin access control403 enforced (3/3)403 enforced (3/3)
Rate limitingNoneNone
Security headersNoneNone

On FastAPI, both agents are strong. They pick the same ORM, the same CORS middleware, the same file validation approach. The only consistent difference is password hashing: Claude reaches for the bcrypt library, Codex uses the standard library's PBKDF2. Both pass 92%+ of exploit tests across all 3 runs, with identical results every time.

Next.js 14 (TypeScript)

3 reps each, high variance
Security DecisionClaude Code(73%)Codex(75%)
Password hashingbcrypt (3/3)scrypt via Node crypto (3/3)
JWTjsonwebtoken (3/3)jose (1/3), hand-rolled (1/3), unclear (1/3)
SQL injectionParameterized (3/3)Parameterized (3/3)
CORSManual headers (3/3)Broken (2/3), working (1/3)
File upload validationAccepted .exe in 2/3 runsPartial validation (2/3)
Admin access controlInconsistent (2/3)403 enforced (3/3)
Rate limitingNoneNone
Security headersNoneNone

Without framework guardrails, both agents produce incomplete security implementations. Claude's file upload endpoint accepted .exe files in 2/3 sessions (no content-type check). Codex's CORS failed on preflight requests in 2/3 sessions. FastAPI's CORSMiddleware handles this in 4 lines. Next.js requires manual wiring, and both agents got it wrong inconsistently.

Exploit Test Pass Rate by Category

Percentage of Tier 1 exploit tests passed, averaged across all 6 sessions per agent (both repos combined). Each category has 3-5 specific tests (e.g., "Auth" includes: no-auth blocked, expired token rejected, IDOR prevented, duplicate registration rejected, password not plaintext, JWT secret from env).

Admin Access Control
94.4%
100%
Auth & Sessions
94.4%
80.6%
Production Config
90%
80%
Webhook Verification
80%
90%
Search / Injection
72.2%
66.7%
File Upload
66.7%
83.3%

They trade wins by category. Codex is better on admin access control (regular users consistently blocked from /admin endpoints) and file upload (type and size validation). Claude leads on auth (bcrypt, env var config) and production config (CORS headers, error formatting).

"Use a Library" vs "Use the Stdlib"

The score gap matters less here than the posture. Given the same prompt, Claude tended to assume security meant choosing the well-known package. Codex more often assumed it meant staying inside the runtime until forced out of it.

Claude Code: "Install the package"

Passwords:bcrypt (6/6). Installs the bcrypt library, uses it with default cost factor.
JWT:PyJWT, python-jose, or jsonwebtoken. Used a library in all 6 sessions.
CORS:CORSMiddleware on FastAPI (built-in). Manual headers on Next.js (no alternative).

Across 12 sessions, Claude never implemented a security primitive from scratch. That looks like a stable habit. Each package adds another supply-chain node.

Codex: "Use what the runtime gives you"

Passwords:hashlib.pbkdf2_hmac on Python (210K iterations). crypto.scrypt on Node. Both are stdlib.
JWT:Hand-rolled with hmac in at least 2/6 sessions. Library (PyJWT, jose) in at least 2/6.
CORS:CORSMiddleware on FastAPI (2/3). No working CORS on Next.js (3/3).

Codex reaches for what the runtime provides. Zero additional packages for auth in some sessions. The benefit is a smaller dependency graph. The cost is that custom security code shows up more often.

What neither agent does

For all their differences, both agents were remarkably consistent here.

0/12 sessions added rate limiting to any endpoint: login, registration, or search. Twenty rapid failed logins returned no throttle, no lockout, no 429.

0/12 sessions added security headers. No X-Content-Type-Options, no X-Frame-Options, no Strict-Transport-Security. Both agents implement CORS when asked and stop there.

9/12 sessions accepted a single-character password. The prompt says "registration takes email and password." Both agents accept password="a" in most runs. Claude rejected it in 2/6 sessions (both Next.js), Codex in 1/6. In the other sessions, Claude installs bcrypt, generates salts, hashes with 12 rounds, and then accepts password="a". The ceremony of proper hashing is meaningless when there is no validation on what gets hashed.

The Supply Chain Tradeoff

The library-vs-stdlib split is one of the clearest patterns in this benchmark. Recent package-registry incidents make it easier to see why that pattern matters.

On March 31, 2026, an attacker hijacked the npm account of the lead axios maintainer and published two malicious versions of one of the most-downloaded packages on npm (roughly 83 million weekly downloads). The poisoned versions pulled in a remote access trojan via a dependency called plain-crypto-js. Socket flagged it within six minutes.

That same morning, Anthropic's own Claude Code CLI had its full 512,000-line source code exposed via a source map file accidentally included in the npm package. Two npm supply chain incidents in one morning.

A week later, Anthropic announced Project Glasswing, a defensive security initiative built around Claude Mythos Preview, which had found zero-days in OpenBSD and FFmpeg among other targets. That pushes the public conversation toward AI finding bugs. This study stays with the other side: what security defaults these systems choose when asked to ship the app.

Earlier in March, a threat actor called TeamPCP compromised Trivy, Checkmarx, and LiteLLM through the Python package registry within five days. Supply-chain risk is material enough that dependency count belongs in the analysis. The number of dependencies your agent adds is one side of the tradeoff. The amount of custom security code it leaves you to own is the other.

More packages, more dependency surface

Claude installs bcrypt, PyJWT or python-jose, email-validator, and uses framework CORS middleware. Each is a well-maintained package with good security defaults. Each is also a node in your dependency graph that an attacker can compromise. If PyJWT or bcrypt were poisoned the same way axios was, a fresh install would inherit that exposure immediately.

Fewer packages, more custom surface

Codex uses hashlib, hmac, and base64 from the standard library. No additional third-party auth packages in some sessions. Lower dependency exposure for that slice. But PBKDF2-SHA256 is not the default many teams would choose, and custom JWT code increases the amount of security-sensitive logic you own.

Neither side of the tradeoff is free

Neither agent pins dependency versions, verifies checksums, or adds lockfile integrity checks. Claude's library-first approach means a poisoned bcrypt or PyJWT would propagate on the next install. Codex's stdlib approach avoids that but leaves you owning hand-rolled security code that nobody will audit unless you make them.

Fewer dependencies is a real benefit. But it is not a free benefit if the replacement is custom JWT signing with == for signature comparison.

Your Scanner Will Not Find This

Bandit, Semgrep, pip-audit, and npm audit found zero issues across all 12 sessions. Every real problem (exposed Swagger docs, broken CORS, no brute-force throttling) only appeared when we tested the running app with Nuclei and manual curl scripts.

The differences this benchmark surfaced are architectural: which hashing algorithm, whether to use a library, whether debug endpoints stay on in production. Static analysis is not built to catch decisions. SCA flags a poisoned dependency after disclosure. It does not flag the decision to add one.

Framework Guardrails Drove Much of the Gap

The dependency tradeoff sits inside a larger fact: a large share of this study is a framework story. FastAPI gave both agents far more guardrails than Next.js did, and the top-line scores moved with those guardrails.

On FastAPI, both agents scored 92-96% with perfect consistency across three replicates. On Next.js, both landed at 73-75% on average with much higher variance. About half the 23-point gap traces to built-in middleware. FastAPI's CORSMiddleware handles preflight automatically. Pydantic validates request bodies on every endpoint (though calling Pydantic a "security feature" would make its maintainers wince; it is a serialization library that happens to reject malformed input before your code ever sees it). Next.js leaves all of this to the developer.

Strip out the six tests that are structurally easier on FastAPI and the gap shrinks from 23 points to about 12. The rest is still there. When the framework stopped doing helpful work for them, both agents got worse at CORS, file validation, and error formatting.

FastAPI provides

  • CORSMiddleware: 4 lines, handles preflight, origin checking, headers
  • Pydantic: validates request bodies on every endpoint automatically
  • SQLAlchemy: parameterizes queries by default, no raw SQL needed
  • python-multipart: file upload handling with type/size built in

Next.js 14 does not provide

  • No CORS middleware. Agent must write headers manually in middleware.ts or per-route.
  • No request body validation. Agent must choose and configure Zod, Joi, or validate by hand.
  • No file upload handling. Agent must configure multipart parsing manually.
  • No SQLite integration. Agent must choose better-sqlite3, Prisma, or Drizzle.

Both agents were perfect on injection prevention across both stacks. ORMs and parameterized queries are solved patterns. The failures cluster in configuration plumbing: CORS, file validation, error formatting. In this benchmark, those were the places where framework support mattered most.

What To Review Before Shipping

If an agent built most of your app, the review burden shifts rather than disappears. The issues here were not exotic. They were the boring edges around auth and configuration that are easy to miss precisely because the app appears to work.

Review your agent's auth code

Check the password hashing algorithm. If it hand-rolled JWT instead of using a library, replace it. Look for constant-time signature comparison.

Add rate limiting yourself

Do not assume the agent will volunteer it. slowapi for FastAPI, express-rate-limit for Node. Five lines of code either way.

Disable /docs in production

FastAPI: FastAPI(docs_url=None, redoc_url=None, openapi_url=None). Neither agent does this when asked to configure for production.

Add security headers

X-Content-Type-Options, X-Frame-Options, HSTS. Use helmet (Node) or secure-headers (Python). Neither agent added them on its own in this benchmark.

Run DAST alongside SAST

Static scanners went 0-for-12 on finding real issues. The exposed Swagger docs, broken CORS, and missing auth guards only appeared under Nuclei, ZAP, or plain curl scripts against the running app.

Honest Questions About Our Own Results

We did not ask for rate limiting. In many production systems, brute-force protection lives at Cloudflare, AWS WAF, an API gateway, or nginx. So 0/12 should be read narrowly. It means that in this setup, neither agent volunteered application-layer throttling.

FastAPI supplied more guardrails in this benchmark, and Next.js left more of the plumbing to the agent. That matters because guardrails are part of the environment an agent is actually coding in. Pydantic is a serialization library whose validation blocks malformed payloads before application code sees them. CORSMiddleware is four lines of configuration. Strip those helpers out and the framework gap shrinks by half. The rest of the gap remains, and that is where agent judgment starts to matter.

PBKDF2 versus bcrypt is a narrower question. Both met published standards in the configurations we observed. The stronger claim is that the agents have different security instincts, and those instincts are visible in code.

The conclusion we trust most is the plain one: these agents are literal. They usually build what you ask for. The security work you assumed went without saying still needs someone to ask for it, or review it, line by line.

Methodology

12 sessions (3 replicates per agent per framework). 6 cumulative prompts per session (no reset between prompts). 33 curl-based exploit tests. Bandit + Semgrep + pip-audit + npm audit. Both agents fully autonomous with all permissions granted. The FastAPI replicates produced identical output across all 3 runs, which is why we trust the signal despite the small n.

Full methodology & data

Want this analysis for your tool?

We run the same methodology across 20+ categories. See how AI agents recommend, configure, and implement your product.

Get in touch
The Security Decisions Claude Code and Codex Make — Amplifying