Research
Systematic analysis of how AI systems make decisions — from product recommendations to developer tool choices.
March 2026
OpenAI announced plans to acquire Astral, maker of Ruff and uv. We ran 630 benchmarks across 7 Python tooling categories. Both agents recommend Astral tools at nearly identical rates — a 4-percentage-point (pp) gap for Ruff and a 0.4pp gap for uv. That's notable given Bun's 50pp gap in the same framework.
We gave Claude Code (Opus 4.6) and OpenAI Codex (GPT-5.3) the same prompts across 12 tool categories and 5 repos. The resulting 1,452 analyzable tool picks reveal how your AI coding agent shapes what you ship — including ownership-linked gaps, platform leans, and a universal build-it-yourself default.
February 2026
A systematic survey of 2,430 Claude Code responses across 3 models, 4 project types, and 20 tool categories. What does the most popular AI coding agent pick when you ask it to choose a tool?
We asked Google AI Mode and ChatGPT 792 product questions. The results reveal 47% cross-platform disagreement, Shopping Graph bias, and significant output drift.