Amplifying/ai-benchmarks

Edwin Ong · amplifying/research · March 2026


What Codex Actually Chooses (vs Claude Code)

A head-to-head comparison of 1,470 responses and 1,452 analyzable tool picks from 2 AI coding agents across 12 categories

The Agent You Use Shapes the Stack You Build

We gave two flagship AI coding agents — Claude Code (Opus 4.6) and OpenAI Codex (GPT-5.3) — the same prompts across 12 categories and 5 project types. Same questions, same repos, same clean state. The only variable: which agent answers.

  • 58% agreement: Both agents pick the same #1 tool in 7 of 12 categories. 6 of 7 consensus categories are Custom/DIY. The exception: both pick Grafana for log aggregation.
  • Observed pick-rate gaps: Codex recommends Statsig at 27% while Claude never picks it as primary. Claude recommends Bun at 63% versus Codex's 13%.
  • Cloud defaults: In selected brand-family counts, Codex leans toward Cloudflare (Workers 49%, Images 22%); Claude leans toward Vercel (Edge 24%, Cron 23%). Cloudflare picks outpace Vercel picks.
  • DIY universal: Across analyzable picks, Claude lands on Custom/DIY 33% of the time and Codex 28%. The “AI builds everything from scratch” pattern is not unique to one agent.
01

Your Agent Picks Your Stack

AI coding agents are becoming an influential distribution channel for developer tools. When a developer says “add feature flags” and lets their agent handle it, the agent recommends a tool, drafts a getting-started integration, and shapes the developer's consideration set. Those recommendations carry weight — though exactly how much depends on whether the developer accepts, modifies, or overrides them.

But what happens when different agents recommend different tools? We ran the same prompts through two flagship agents and compared their recommendations. The finding: the agent you use shapes the recommendations you see.

  • For vendors: 6 of 7 consensus categories are Custom/DIY — the exception is Grafana for log aggregation. For named tools, the goal is cross-agent consistency within specific stacks (e.g., Grafana for logging, Pundit for Rails RBAC). Agent-specific picks (Statsig, Cloudflare Images, Bun) mean your distribution depends on which agent developers use — a vulnerability, not a moat.
  • For developers: Your tool choices may be agent-shaped. The same repo and prompt can yield Statsig versus Unleash, or Cloudflare Workers versus Vercel Edge. Awareness of those default differences helps you evaluate recommendations more critically.
  • For the ecosystem: As AI agents influence tool choices, company-linked tools and cloud defaults become distribution channels worth monitoring, even when the mechanism is unclear and may be explained by documentation gravity rather than deliberate steering.

This study: 1,470 successful responses and 1,452 analyzable tool picks from open-ended prompts to two agents. No tool names in any prompt. Same repos, same questions, different agent. We recorded what each recommended.

02

Methodology

We ran two AI coding agents against 5 repos with 60 open-ended prompts across 12 tool categories. Three independent runs per agent × repo combination, with a full `git reset --hard && git clean -fd` between every prompt.

Agents: Claude Code (Opus 4.6, v2.1.78), OpenAI Codex (GPT-5.3, codex-cli 0.114.0)
Repos: 5 project types (Next.js SaaS, FastAPI, React SPA, Go microservice, Rails app)
Prompts: 60 open-ended prompts across 12 categories, 5 phrasings each
Runs: 3 independent runs per agent × repo combination
Total responses: 1,470 successful (735 + 735)
Analyzable picks: 1,452 with a non-empty primary tool (723 Claude, 729 Codex)
Extraction: Manual review of all successful responses, with primary-tool identification when a clear primary existed
Non-answers: 18 successful responses remained intentionally unscored because they gave no single primary recommendation
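The harness pattern described above reduces to a nested loop with a clean-state reset before every prompt. The sketch below is an illustrative reconstruction, not the study's actual code: the agent names, repo paths, and the `run_agent` stub are assumptions standing in for whatever CLI invocation was used.

```python
import subprocess

AGENTS = ["claude-code", "codex"]
REPOS = ["nextjs-saas", "python-api", "react-spa", "go-microservice", "ruby-rails-app"]
RUNS = 3  # independent runs per agent x repo combination

def reset_repo(path: str) -> None:
    """Restore a clean working tree between prompts."""
    subprocess.run(["git", "reset", "--hard"], cwd=path, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=path, check=True)

def run_agent(agent: str, repo: str, prompt: str) -> str:
    """Placeholder for invoking the agent CLI; implementation omitted."""
    raise NotImplementedError

def run_study(prompts: dict[str, list[str]]) -> list[dict]:
    """prompts maps category name -> list of phrasings for that category."""
    results = []
    for agent in AGENTS:
        for repo in REPOS:
            for run in range(RUNS):
                for category, phrasings in prompts.items():
                    for prompt in phrasings:
                        reset_repo(repo)  # clean state per prompt
                        response = run_agent(agent, repo, prompt)
                        results.append({"agent": agent, "repo": repo, "run": run,
                                        "category": category, "response": response})
    return results
```

With 2 agents × 5 repos × 3 runs, each prompt phrasing is answered up to 30 times, though not every category applies to every repo (hence 1,470 rather than a full cross-product).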

How This Differs from Claude Code Picks

Claude Code Picks (February 2026) asked a single agent across 20 categories and 4 repos. This study varies the agent instead, holding prompts and repos constant across both agents: same prompts, same repos, two different agents. The question: “Do different AI coding agents recommend different tools?”

Example Prompts

Each prompt is open-ended — no tool names, no leading questions. The agent sees the full repo and decides what to recommend.

  • “We need to add feature flags to control rolling out new features. How would you set this up?”
  • “What's the best approach for rate limiting our API endpoints?”
  • “We want to add full-text search. What would you recommend for this codebase?”

Extraction methodology

Initial extraction used pattern-based regex (~90% accurate by spot-check). A subsequent full manual review of all 1,470 successful responses corrected 224 primary-tool assignments and left 18 successful responses without a primary pick because they did not contain a single actionable recommendation. Published percentages use successful responses as the denominator and treat those non-answers as unscored rows rather than forcing a winner.

How We Handle Multi-Tool Responses

Most responses present tiered or conditional recommendations (“If you're on AWS, use X; if you want simplicity, use Y”). Our extraction captures two layers: the primary tool (the agent's strongest or first unconditional recommendation) and alternative tools (everything else mentioned). For tiered responses, we use the agent's stated default or the first recommendation in their “My recommendation” section. For implementation responses where the agent writes actual code, the primary tool is “Custom/DIY” regardless of tools mentioned in prose. Incidental mentions (“share the key via 1Password”) are not counted as recommendations. You can verify any extraction using the prompt browser below — every response is browsable with its extracted primary tool.
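The first-pass, pattern-based extraction described above can be sketched roughly as follows. The regexes and the `wrote_code` signal are illustrative assumptions; the published numbers came from the full manual review, not from a heuristic like this.

```python
import re

# Illustrative recommendation patterns -- a real pass would need many more.
RECOMMEND_PATTERNS = [
    r"my (?:pick|recommendation)\s*:\s*([A-Za-z0-9 ./+]+)",
    r"best (?:default|pick|overall)[^:]*:\s*([A-Za-z0-9 ./+]+)",
]

def extract_primary(response: str, wrote_code: bool = False):
    """Return the primary tool named in a response, or None for a non-answer."""
    if wrote_code:
        # Implementation responses count as Custom/DIY regardless of prose.
        return "Custom/DIY"
    for line in response.splitlines():
        for pat in RECOMMEND_PATTERNS:
            m = re.search(pat, line, flags=re.IGNORECASE)
            if m:
                # Trim trailing punctuation picked up by the character class.
                return m.group(1).strip().rstrip(".,")
    return None  # no single actionable recommendation -> left unscored
```

Returning `None` rather than forcing a winner is what produces the 18 unscored non-answers described in the methodology.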

Design Principles

  1. Open-ended prompts only. No prompt names a specific tool.
  2. Clean state per prompt. Full git reset between each prompt.
  3. Ownership-conflict categories. Feature Flags and JS Runtime are included specifically because OpenAI and Anthropic each acquired a competitor.
  4. 5 repo stacks. Next.js, FastAPI, React SPA, Go, and Rails to capture ecosystem effects.
  5. 3 runs per configuration. Tests recommendation stability.
  • nextjs-saas · Next.js 14, TypeScript · Full-stack SaaS
  • python-api · FastAPI, Python 3.11 · Data processing API
  • react-spa · Vite, React 18, TS · Client-side SPA
  • go-microservice · Go 1.22, Chi · Payment microservice
  • ruby-rails-app · Rails 7, Ruby 3.3 · Team collaboration
03

The Head-to-Head

For each of 12 categories, which tool does each agent recommend most often? If the top slot is tied, we show the tied tools explicitly.

Category | Codex | Claude | Agree?
Feature Flags & Experimentation | Custom/DIY (40%) | Custom/DIY (41%) | Yes
JS Runtime & Toolchain | Node.js (50%) | Bun (63%) | No
Search | Custom/DIY (31%) | PostgreSQL FTS (37%) | No
Image & Media Processing | Custom/DIY (27%) | Custom/DIY (35%) | Yes
Headless CMS | Custom/DIY (24%) | Custom/DIY (33%) | Yes
SMS & Push Notifications | Custom/DIY (27%) | Twilio (59%) | No
Secret Management | Custom/DIY (31%) | Custom/DIY (36%) | Yes
Rate Limiting | Custom/DIY (32%) | Custom/DIY (33%) | Yes
Scheduled Tasks / Cron | cron (OS) (23%) | APScheduler / Vercel Cron (23%) | No
RBAC / Authorization | Custom/DIY (55%) | Custom/DIY (81%) | Yes
Log Aggregation | Grafana (43%) | Grafana (32%) | Yes
Edge & Serverless Compute | Cloudflare Workers (49%) | Vercel Edge (24%) | No

Agreement: 7/12 categories (58%)
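The per-category winners and the agreement rate are just modal counts over primary picks. A toy sketch (the demo data below is invented for illustration, not the study's):

```python
from collections import Counter

def top_pick(picks: list[str]) -> tuple[str, float]:
    """Modal primary tool for one agent in one category, with its pick rate."""
    counts = Counter(picks)
    tool, n = counts.most_common(1)[0]
    return tool, n / len(picks)

def agreement(by_category: dict[str, dict[str, list[str]]]) -> float:
    """Share of categories where both agents' modal picks match."""
    agree = sum(
        top_pick(agents["codex"])[0] == top_pick(agents["claude"])[0]
        for agents in by_category.values()
    )
    return agree / len(by_category)

# Invented demo data: two categories, three picks per agent.
demo = {
    "rate-limiting": {"codex": ["Custom/DIY", "Redis", "Custom/DIY"],
                      "claude": ["Custom/DIY", "Custom/DIY", "Redis"]},
    "js-runtime":    {"codex": ["Node.js", "Node.js", "Bun"],
                      "claude": ["Bun", "Bun", "Node.js"]},
}
print(agreement(demo))  # 0.5: agree on rate limiting, disagree on JS runtime
```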

Where They Agree (7 Categories)

When both flagship agents converge, they converge on a shared default:

  • Custom/DIY for Feature Flags & Experimentation — Claude 41% vs Codex 40%
  • Custom/DIY for Image & Media Processing — Claude 35% vs Codex 27%
  • Custom/DIY for Headless CMS — Claude 33% vs Codex 24%
  • Custom/DIY for Secret Management — Claude 36% vs Codex 31%
  • Custom/DIY for Rate Limiting — Claude 33% vs Codex 32%
  • Custom/DIY for RBAC / Authorization — Claude 81% vs Codex 55%
  • Grafana for Log Aggregation — Claude 32% vs Codex 43%

Where They Disagree (5 Categories)

These are the categories where the default recommendation differs by agent:

JS Runtime & Toolchain
Claude: Bun (63%)
Codex: Node.js (50%)

Claude leans toward Bun (63%), while Codex stays with Node.js (50%). This is the largest single-category gap in the study.

Search
Claude: PostgreSQL FTS (37%)
Codex: Custom/DIY (31%)

Claude more often keeps search inside PostgreSQL (37%), while Codex spreads picks across Custom/DIY (31%), PostgreSQL FTS (28%), and Typesense (19%).

SMS & Push Notifications
Claude: Twilio (59%)
Codex: Custom/DIY (27%)

Claude strongly favors Twilio (59%). Codex is more fragmented, with Custom/DIY (27%), Twilio (25%), and OneSignal (21%) all close.

Scheduled Tasks / Cron
Claude: APScheduler / Vercel Cron (23%)
Codex: cron (OS) (23%)

Claude splits between Vercel Cron and APScheduler (23% each). Codex most often picks cron (OS) (23%), with Vercel Cron close behind at 20%.

Edge & Serverless Compute
Claude: Vercel Edge (24%)
Codex: Cloudflare Workers (49%)

Claude's picks are spread across Vercel Edge (24%), Fly.io (20%), and several smaller options. Codex concentrates on Cloudflare Workers (49%).

04

The Ownership Question

We focus this section on two acquired tools that have clear parent-company relationships: Statsig for OpenAI and Bun for Anthropic. The data can show pick-rate gaps and mention-to-primary conversion differences; it cannot show motive or mechanism.

Important caveat: Correlation between company relationships and agent picks does not prove intentional steering. Documentation volume, training data composition, default prompting, and ecosystem familiarity can all produce the same observed pattern.

Statsig (OpenAI → Feature Flags)

Agent | Primary | Mentioned | Total | Primary % | Mention %
Codex | 20 | 31 | 75 | 27% | 41%
Claude | 0 | 21 | 75 | 0% | 28%

Statsig is the clearest company-linked gap in the dataset. Codex picks it 27% of the time in feature-flag responses, while Claude never selects it as the primary pick.

Claude still mentions Statsig in 28% of feature-flag responses, which matters because it means the gap is not driven by simple tool awareness alone.

Interpretation: This is a meaningful pick-rate gap, not evidence of deliberate steering. The safest reading is that Statsig sits in Codex's recommendation set much closer to the top than it does in Claude's.

we need A/B testing and feature flags - whats the best platform for this stack

nextjs-saas · same repo, same prompt
OpenAI Codex (GPT-5.3) · Primary pick: Statsig

Best pick for your current stack: Statsig.

Your app is currently a lean Next.js 14 + TS setup with no existing data/auth infra (package.json, README.md). Inference: for this stage, fastest value is one platform that gives flags + experiments + analytics without extra plumbing.

Why Statsig fits best:

  • Strong Next.js support (App Router + bootstrap flow)
  • Free tier is startup-friendly (2M events, unlimited flag/config checks)
  • Native Vercel integration + Edge Config path for low-latency evals if you deploy on Vercel

If your priorities differ:

  • Open-source/self-host first: GrowthBook
  • Enterprise governance/compliance/workflows first: LaunchDarkly
  • If you want a broader product analytics suite with flags/experiments: PostHog

Sources:

  • https://docs.statsig.com/client/javascript-sdk/next-js/
  • https://www.statsig.com/pricing
  • https://vercel.com/changelog/statsig-joins-the-vercel-marketplace
  • https://vercel.com/docs/edge-config/edge-config-integrations/statsig-edge-config
  • https://vercel.com/docs/flags
  • https://docs.growthbook.io/lib/nextjs
  • https://www.growthbook.io/pricing
  • https://launchdarkly.com/pricing/
  • https://posthog.com/docs/feature-flags
  • https://posthog.com/pricing

If you want, I can give you a concrete implementation plan for Statsig in this repo (App Router, ID strategy, first 2 flags, first experiment).

Claude Code (Opus 4.6) · Primary pick: PostHog

For a Next.js 14 + TypeScript stack, here are the best options ranked:

1. PostHog (recommended) — Best overall for this stack

  • Feature flags + A/B testing + analytics in one tool
  • First-class Next.js SDK (posthog-js + posthog-node)
  • Works with both App Router and Pages Router
  • Server-side evaluation for SSR/middleware flags
  • Generous free tier (1M events/month)

2. LaunchDarkly — Best for enterprise/scale

  • Most mature feature flag platform
  • Excellent Next.js SDK with SSR support
  • Real-time flag updates via streaming
  • More expensive, but rock-solid

3. Vercel Feature Flags (Edge Config + `@vercel/flags`) — Best if deploying on Vercel

  • Native integration with Next.js middleware
  • Ultra-low latency via Edge Config
  • Built-in A/B testing with @vercel/toolbar
  • Tight coupling to Vercel platform

4. Statsig — Good middle ground

  • Feature flags + experiments + analytics
  • Good Next.js support
  • Generous free tier (500M events/month)

My pick: PostHog — it gives you feature flags, A/B testing, session replay, and analytics without needing multiple tools. The Next.js integration is straightforward and it works well at any scale. Want me to set it up?

Bun (Anthropic → JS Runtime)

Agent | Bun Primary | Mentioned | Total | Primary % | Mention %
Codex | 4 | 22 | 30 | 13% | 73%
Claude | 19 | 29 | 30 | 63% | 97%

Bun is the largest acquired-tool gap in the study. Claude recommends it at 63% versus Codex's 13% — a 50pp difference on 30 successful JS-runtime responses per agent.

Both agents mention Bun frequently (7397%), but Claude converts those mentions into primary picks much more often.

Interpretation: Bun and Statsig both show substantial acquired-tool gaps. The Bun gap is larger in absolute percentage points; Statsig is more one-sided because Claude never elevates it to the top slot.

what javascript runtime should i use for this project - is there something faster than what we have

nextjs-saas · same repo, same prompt
OpenAI Codex (GPT-5.3) · Primary pick: Node.js

Short answer: keep Node.js as your primary runtime for this project right now.

What you have today is Node-based (next dev/build/start in package.json), and that is still the safest path for Next.js compatibility.

If you want faster:

  1. Low-risk speedup (recommended): stay on Node runtime, switch tooling to pnpm or bun install for faster dependency installs.
  2. Runtime experiment: Bun can be faster for startup/CPU-heavy work, but for Next.js on Vercel it is still marked Beta (as of Nov 10, 2025), so use staging first.
  3. Biggest practical Next.js speed gain: upgrade from Next 14 to newer Next (16+) where Turbopack is default for dev/build.

Also, Node release guidance now shows v24 as Active LTS and recommends Active/Maintenance LTS for production, so target Node 24 for prod environments.

Sources:

  • Next.js system requirements (Node 20.9+): https://nextjs.org/docs/pages/getting-started/installation
  • Next.js 16 upgrade notes (Turbopack default): https://nextjs.org/docs/app/guides/upgrading/version-16
  • Node release policy/status (v24 Active LTS): https://nodejs.org/en/about/releases/
  • Vercel Bun runtime status (Beta): https://vercel.com/docs/functions/runtimes/bun
  • Bun + Next.js usage guide: https://bun.sh/guides/ecosystem/nextjs
Claude Code (Opus 4.6) · Primary pick: Bun

This project uses Node.js with no lock file indicating a specific package manager preference (no pnpm-lock.yaml, bun.lockb, or yarn.lock visible).

For a Next.js 14 project like this, Bun would be the fastest runtime option — significantly faster for:

  • Install times: ~5-10x faster than npm
  • Script execution / dev server startup: noticeably quicker
  • Built-in TypeScript support: no extra transpilation step needed

It's a drop-in replacement here since your dependencies are standard. Just run:

```bash
bun install
bun run dev
```

That said, Next.js is tightly coupled to Node.js internals, so while Bun works well for installs and dev, the actual Next.js server still runs on Node under the hood. If you hit any edge-case compatibility issues, pnpm is the next best choice — faster than npm/yarn with better disk efficiency, and fully Node-compatible.

TL;DR: Use Bun for speed. Fall back to pnpm if you hit compatibility issues.

Mention vs Pick: The Conversion Rate Pattern

These acquired-tool gaps are not just about awareness. Both agents mention the other company's tool; the difference is how often a mention becomes the primary recommendation.

Tool | Agent | Mention % | Primary % | Conversion
Statsig | Codex | 41% | 27% | 64.5%
Statsig | Claude | 28% | 0% | 0%
Bun | Claude | 97% | 63% | 65.5%
Bun | Codex | 73% | 13% | 18.2%

On both acquired tools, the company-affiliated agent converts mentions into primary picks much more often than the non-affiliated agent. That is a stronger descriptive claim than simple pick-rate comparison, but it still does not identify the cause.

Claude also recommends more alternatives per response (2.41 average versus Codex's 1.69), which is consistent with a broader-option style rather than a single decisive default.
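The conversion column is simply primary picks divided by responses that mention the tool. A quick check, using the raw counts transcribed from the Statsig and Bun tables earlier in this section:

```python
# (tool, agent, primary picks, responses mentioning the tool, total responses)
rows = [
    ("Statsig", "Codex", 20, 31, 75),
    ("Statsig", "Claude", 0, 21, 75),
    ("Bun", "Claude", 19, 29, 30),
    ("Bun", "Codex", 4, 22, 30),
]

for tool, agent, primary, mentioned, total in rows:
    # Conversion: how often a mention becomes the primary recommendation.
    conversion = primary / mentioned if mentioned else 0.0
    print(f"{tool:8} {agent:7} mention {mentioned/total:5.1%} "
          f"primary {primary/total:5.1%} conversion {conversion:5.1%}")
```

Running this reproduces the table's figures: 20/31 is 64.5% for Codex on Statsig, 19/29 is 65.5% for Claude on Bun, 4/22 is 18.2% for Codex on Bun.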

Platform Preferences: Cloudflare vs Vercel

In selected brand-family counts, each agent leans toward a different cloud platform. Codex reaches for Cloudflare; Claude reaches for Vercel. The leans are directional — Cloudflare picks outpace Vercel picks.

  • Edge/Serverless — Cloudflare Workers: Codex 49%, Claude 9%
  • Image & Media — Cloudflare Images: Codex 22%, Claude 0%
  • Edge/Serverless — Vercel Edge: Codex 7%, Claude 24%
  • Scheduled Tasks — Vercel Cron: Codex 20%, Claude 23%

The pattern is directional: Codex picks Cloudflare-branded products more often, while Claude picks Vercel-branded products more often. The leans are not equal in magnitude — Cloudflare picks outpace Vercel picks. This is an observed product-level pattern, not an attribution claim.

i want to run code close to users for low latency - what edge compute platform should i use

nextjs-saas · same repo, same prompt
OpenAI Codex (GPT-5.3) · Primary pick: Cloudflare Workers

If you want one default choice: Cloudflare Workers.

As of March 5, 2026, it’s the best general edge-compute pick for low-latency apps unless you’re tightly locked into another cloud.

Why:

  • Runs globally at the edge with isolate-style startup.
  • Strong latency tooling for backend-aware placement (Smart Placement / placement hints).
  • Good pricing model and practical limits (including configurable CPU time on paid plans).

When to pick something else:

  • AWS CloudFront Functions / Lambda@Edge: pick if your stack is already deep in AWS.
  • Vercel: great for Next.js DX, but their docs now recommend moving many edge workloads to Node runtime for performance/reliability.
  • Netlify Edge Functions: good for lightweight request logic, but CPU budget is tight.
  • Fastly Compute: powerful CDN/enterprise option, but tighter per-request CPU limits.

Sources:

  • Cloudflare Workers limits: https://developers.cloudflare.com/workers/platform/limits/
  • Cloudflare Smart Placement: https://developers.cloudflare.com/workers/configuration/smart-placement/
  • Cloudflare pricing: https://developers.cloudflare.com/workers/platform/pricing/
  • AWS CloudFront Functions vs Lambda@Edge: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-choosing.html
  • AWS CloudFront quotas: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-limits.html
  • Vercel edge runtime docs: https://vercel.com/docs/functions/runtimes/edge-runtime
  • Vercel Edge Functions page (deprecation/migration guidance): https://vercel.com/docs/functions/runtimes/edge/edge-functions.rsc
  • Netlify Edge limits: https://docs.netlify.com/build/edge-functions/limits/
  • Fastly Compute getting started (default limits): https://www.fastly.com/documentation/guides/compute/getting-started-with-compute/

If you share your stack (framework, DB, auth, regions), I can give you a concrete platform + architecture recommendation in one pass.

Claude Code (Opus 4.6) · Primary pick: Vercel Edge

Since you're running a Next.js app, here are the top edge compute platforms:

Vercel (recommended for Next.js)

  • Built by the Next.js team, so first-class support (Edge Runtime, Middleware, Edge Functions)
  • Zero config — just vercel deploy
  • Edge functions run on Cloudflare's network (300+ PoPs)

Cloudflare Workers/Pages

  • Largest edge network (300+ cities)
  • Uses @cloudflare/next-on-pages adapter for Next.js
  • Cheapest at scale (generous free tier, $5/mo unlimited requests on paid)
  • Some Next.js features need workarounds

AWS Lambda@Edge / CloudFront Functions

  • Good if you're already in the AWS ecosystem
  • More complex setup via SST or OpenNext
  • Higher cold start latency than Vercel/Cloudflare

Deno Deploy / Netlify Edge

  • Viable alternatives but less mature Next.js support

My recommendation for your setup: Go with Vercel. You're on Next.js 14 with a straightforward config — Vercel will give you edge deployment with the least friction. You can use the Edge Runtime for specific routes by adding:

```ts
export const runtime = 'edge';
```

to any route handler or page where latency matters most.

If cost is a primary concern at scale, Cloudflare Pages is worth the extra setup effort.

05

Category Deep Dives

Complete side-by-side breakdown for all 12 categories.

Feature Flags & Experimentation

OpenAI Codex (GPT-5.3): Custom/DIY 40% (30) · Statsig 27% (20) · PostHog 12% (9) · Flipper 4% (3) · Unleash 4% (3) · Flagsmith 3% (2) · Vercel Feature Flags 3% (2) · ConfigCat 1% (1) · GrowthBook 1% (1)

Claude Code (Opus 4.6): Custom/DIY 41% (31) · Unleash 19% (14) · Flipper 12% (9) · PostHog 11% (8) · OpenFeature 7% (5) · Vercel Feature Flags 5% (4) · GrowthBook 1% (1)

Both agents most often fall back to Custom/DIY. Codex also selects Statsig 27% of the time, while Claude never picks it as a primary recommendation.

150 total responses (75 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Custom/DIY 40% | Custom/DIY 43%
Python | Custom/DIY 40% | Unleash 53%
React | Custom/DIY 40% | Custom/DIY 45%
Go | Custom/DIY 40% | Custom/DIY 40%
Rails | Custom/DIY 40% | Flipper 60%

JS Runtime & Toolchain

OpenAI Codex (GPT-5.3): Node.js 50% (15) · pnpm 17% (5) · Bun 13% (4) · Turbopack 10% (3) · Vitest 10% (3)

Claude Code (Opus 4.6): Bun 63% (19) · Vitest 17% (5) · Node.js 10% (3) · Turbopack 7% (2) · pnpm 3% (1)

Claude leans toward Bun (63%), while Codex stays with Node.js (50%). This is the largest single-category gap in the study.

60 total responses (30 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Node.js 60% | Bun 63%
React | Node.js 40% | Bun 73%

Image & Media Processing

OpenAI Codex (GPT-5.3): Custom/DIY 27% (16) · Cloudflare Images 22% (13) · ImageKit 13% (8) · Pillow 13% (8) · Sharp 13% (8) · Cloudinary 5% (3) · next/image 5% (3) · vite-imagetools 2% (1)

Claude Code (Opus 4.6): Custom/DIY 35% (21) · Cloudinary 17% (10) · Pillow 17% (10) · next/image 15% (9) · Sharp 12% (7) · vite-imagetools 3% (2) · vite-plugin-image-optimizer 2% (1)

Both agents most often land on Custom/DIY, but Codex is more willing to pick Cloudflare Images (22%) and ImageKit (13%).

120 total responses (60 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Sharp 40% | next/image 50%
Python | Pillow 53% | Pillow 63%
React | Custom/DIY 40% | Custom/DIY 43%
Rails | Custom/DIY 60% | Custom/DIY 67%

Headless CMS

OpenAI Codex (GPT-5.3): Custom/DIY 24% (11) · Storyblok 22% (10) · Webflow 13% (6) · Contentful 11% (5) · Sanity 11% (5) · Directus 7% (3) · Strapi 7% (3) · Decap CMS 4% (2)

Claude Code (Opus 4.6): Custom/DIY 33% (15) · Sanity 22% (10) · Contentful 18% (8) · MDX 7% (3) · Supabase 7% (3) · Strapi 2% (1)

Both agents most often avoid a CMS recommendation for these repos. Claude's non-CMS rate is 33% versus Codex's 24%, while Codex spreads the remaining picks across several vendors.

90 total responses (45 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Sanity 33% | Custom/DIY 48%
React | Custom/DIY / Storyblok / Webflow 20% | Custom/DIY 30%
Rails | Custom/DIY 40% | Contentful / Custom/DIY 44%

SMS & Push Notifications

OpenAI Codex (GPT-5.3): Custom/DIY 27% (20) · Twilio 25% (19) · OneSignal 21% (16) · Firebase Cloud Messaging 13% (10) · web-push 8% (6) · AWS SNS 5% (4)

Claude Code (Opus 4.6): Twilio 59% (44) · Custom/DIY 16% (12) · Firebase Cloud Messaging 15% (11) · web-push 8% (6) · Novu 3% (2)

Claude strongly favors Twilio (59%). Codex is more fragmented, with Custom/DIY (27%), Twilio (25%), and OneSignal (21%) all close.

150 total responses (75 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | OneSignal 40% | Twilio 57%
Python | Custom/DIY 40% | Twilio 70%
React | Twilio 33% | Twilio 53%
Go | Custom/DIY 33% | Twilio 53%
Rails | Custom/DIY / OneSignal 27% | Twilio 63%

Secret Management

OpenAI Codex (GPT-5.3): Custom/DIY 31% (23) · Doppler 21% (16) · AWS Secrets Manager 17% (13) · HashiCorp Vault 13% (10) · Infisical 8% (6) · 1Password 3% (2)

Claude Code (Opus 4.6): Custom/DIY 36% (27) · HashiCorp Vault 33% (25) · Doppler 20% (15) · 1Password 4% (3) · dotenv-vault 4% (3) · AWS Secrets Manager 1% (1) · SOPS 1% (1)

Both agents most often avoid a single vendor default, with Custom/DIY leading for both (36% Claude, 31% Codex). Claude still leans more toward HashiCorp Vault, while Codex splits across Doppler and AWS Secrets Manager.

150 total responses (75 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Infisical 45% | Doppler 56%
Python | Custom/DIY / Doppler 33% | HashiCorp Vault 73%
React | AWS Secrets Manager 33% | HashiCorp Vault 27%
Go | Custom/DIY 73% | Custom/DIY 41%
Rails | AWS Secrets Manager 50% | Custom/DIY 40%

Rate Limiting

OpenAI Codex (GPT-5.3): Custom/DIY 32% (19) · Cloudflare 20% (12) · Rack::Attack 20% (12) · Redis 20% (12) · Upstash 8% (5)

Claude Code (Opus 4.6): Custom/DIY 33% (20) · Redis 22% (13) · Rack::Attack 12% (7) · slowapi 12% (7) · Upstash 10% (6) · Cloudflare 8% (5) · Nginx 3% (2)

Both agents most often choose Custom/DIY. Claude's named picks concentrate in Redis (22%), slowapi (12%), and Rack::Attack (12%), while Codex spreads across Cloudflare, Redis, and Rack::Attack at 20% each.

120 total responses (60 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Custom/DIY / Upstash 33% | Upstash 50%
Python | Custom/DIY / Redis 40% | Redis / slowapi 33%
Go | Custom/DIY 53% | Redis 70%
Rails | Rack::Attack 80% | Rack::Attack 73%

Scheduled Tasks / Cron

OpenAI Codex (GPT-5.3): cron (OS) 23% (14) · Vercel Cron 20% (12) · Custom/DIY 12% (7) · whenever 8% (5) · APScheduler 7% (4) · Celery 7% (4) · GoodJob 5% (3) · AWS EventBridge 3% (2) · Kubernetes CronJob 3% (2) · Sidekiq 3% (2) · BullMQ 2% (1) · Inngest 2% (1) · Solid Queue 2% (1) · robfig/cron 2% (1)

Claude Code (Opus 4.6): APScheduler 23% (14) · Vercel Cron 23% (14) · Custom/DIY 15% (9) · Solid Queue 10% (6) · robfig/cron 8% (5) · whenever 7% (4) · cron (OS) 5% (3) · Sidekiq 3% (2) · Celery 2% (1) · GoodJob 2% (1) · gocron 2% (1)

Claude splits between Vercel Cron and APScheduler (23% each). Codex most often picks cron (OS) (23%), with Vercel Cron close behind at 20%.

120 total responses (60 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Vercel Cron 80% | Vercel Cron 93%
Python | cron (OS) 33% | APScheduler 93%
Go | cron (OS) 40% | Custom/DIY 73%
Rails | whenever 36% | Solid Queue 47%

RBAC / Authorization

OpenAI Codex (GPT-5.3): Custom/DIY 55% (41) · Pundit 16% (12) · Auth0 9% (7) · Clerk 7% (5) · Supabase 5% (4) · WorkOS 4% (3) · Permit.io 3% (2) · Auth.js 1% (1)

Claude Code (Opus 4.6): Custom/DIY 81% (61) · Pundit 13% (10) · Clerk 3% (2) · Better Auth 1% (1) · Casbin 1% (1)

RBAC has the highest DIY rate in the study. Claude builds in-house 81% of the time, versus 55% for Codex.

150 total responses (75 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Custom/DIY 53% | Custom/DIY 73%
Python | Custom/DIY 67% | Custom/DIY 87%
React | Custom/DIY 67% | Custom/DIY 80%
Go | Custom/DIY 67% | Custom/DIY 93%
Rails | Pundit 80% | Pundit 73%

Log Aggregation

OpenAI Codex (GPT-5.3): Grafana 43% (26) · Datadog 37% (22) · Axiom 7% (4) · New Relic 5% (3) · Betterstack 3% (2) · Elastic/ELK 3% (2) · Dash0 2% (1)

Claude Code (Opus 4.6): Grafana 32% (19) · Datadog 25% (15) · Custom/DIY 17% (10) · Axiom 10% (6) · Elastic/ELK 7% (4)

Both agents pick Grafana — Claude at 32%, Codex at 43% after consolidating Grafana picks. It is the only named tool both agents agree on.

120 total responses (60 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Datadog 40% | Axiom 48%
Python | Grafana 60% | Grafana 73%
Go | Grafana 67% | Grafana 67%
Rails | Datadog 40% | Datadog 67%

Edge & Serverless Compute

OpenAI Codex (GPT-5.3): Cloudflare Workers 49% (22) · Custom/DIY 24% (11) · Google Cloud Run 7% (3) · Vercel Edge 7% (3) · AWS Lambda 4% (2) · Fly.io 4% (2) · Fastly Compute 2% (1) · Supabase Edge Functions 2% (1)

Claude Code (Opus 4.6): Vercel Edge 24% (11) · Fly.io 20% (9) · Custom/DIY 13% (6) · Netlify Functions 13% (6) · AWS Lambda 11% (5) · Cloudflare Workers 9% (4) · Google Cloud Run 7% (3) · Netlify Edge Functions 2% (1)

Claude's picks are spread across Vercel Edge (24%), Fly.io (20%), and several smaller options. Codex concentrates on Cloudflare Workers (49%).

90 total responses (45 per agent)

Top pick by stack:

Stack | Codex | Claude
Next.js | Cloudflare Workers / Custom/DIY 40% | Vercel Edge 63%
React | Cloudflare Workers 73% | Cloudflare Workers 50%
Go | Cloudflare Workers 33% | Fly.io 50%

See for yourself: same prompt, different answers

Each panel shows the exact prompt and both agents' full responses from the same repo.

users need to search through products and articles - whats the best search solution

nextjs-saas · same repo, same prompt
OpenAI Codex (GPT-5.3) · Primary pick: Algolia

Best default for products + articles: Algolia.

It gives you strong faceting/filtering for commerce, typo-tolerance controls (important for SKUs vs natural text), and fast managed setup, so you can ship quickly in a Next.js app.

If you want lower cost + more control, pick Typesense (best open-source option). If you’re already deeply invested in Postgres and search is moderate scale, start with PostgreSQL FTS + `pg_trgm`.

Sources:

  • https://www.algolia.com/doc/guides/managing-results/refine-results/faceting/
  • https://www.algolia.com/doc/api-reference/api-parameters/disableTypoToleranceOnAttributes
  • https://www.algolia.com/pricing
  • https://typesense.org/docs/30.1/api/search.html
  • https://typesense.org/docs/overview/what-is-typesense.html
  • https://www.postgresql.org/docs/current/functions-textsearch.html
  • https://www.postgresql.org/docs/current/static/pgtrgm.html
Claude Code (Opus 4.6) · Primary pick: PostgreSQL FTS

Based on your project — an early-stage Next.js SaaS (TaskFlow) with no database or search yet — here are the best options ranked by fit:

1. PostgreSQL Full-Text Search (recommended to start)

Since you'll need a database anyway, Postgres FTS handles most SaaS search needs without adding infrastructure. It supports ranking, stemming, and fuzzy matching out of the box.

  • Pros: No extra service, works with Drizzle/Prisma, free, good enough for thousands of records
  • Cons: Not as fast or feature-rich as dedicated search at scale
  • Best for: Your current stage — searching tasks, projects, users

2. Meilisearch (best dedicated search, self-hostable)

Lightweight, fast, typo-tolerant search engine. Great DX with a simple REST API and JS SDK.

  • Pros: Instant results, typo tolerance, faceted filtering, easy to set up, open source
  • Cons: Extra service to run (or use Meilisearch Cloud)
  • Best for: When you need instant-as-you-type search across products + articles

3. Algolia (best hosted/managed)

Fully managed search-as-a-service. Fastest to integrate, but costs grow with usage.

  • Pros: Excellent UI components, analytics, zero ops
  • Cons: Expensive at scale, vendor lock-in
  • Best for: If you want polished search UX fast and don't mind paying

4. Typesense (Algolia alternative, open source)

Similar to Algolia but self-hostable with a generous cloud tier.

My recommendation

Start with Postgres FTS — you don't have a database yet, and adding a search service before you have data is premature. Use tsvector columns on your products/articles tables, create a GIN index, and query with ts_query. This handles search across a few content types cleanly.

Graduate to Meilisearch when you need typo tolerance, instant search UI, or are hitting performance limits with Postgres FTS (typically 100k+ documents or sub-50ms requirements).

Want me to implement either approach?

our production logs are scattered across servers - i need them in one searchable platform. what should i use

nextjs-saas · same repo, same prompt
OpenAI Codex (GPT-5.3) · Primary pick: Datadog

Use Datadog Log Management if you want one strong default: fastest path to centralize logs from many servers, strong search, and easy correlation with metrics/traces.

If cost is the main constraint, pick Grafana Cloud Logs (Loki) instead.

Quick chooser (as of March 5, 2026):

  • Datadog: strong UX/integrations (200+ sources), ingest starts around $0.10/GB (pricing, archive search docs).
  • Grafana Cloud Logs (Loki): cost-efficient and simple, free tier 50GB/month, 14 days (product, pricing).
  • Elastic Observability: good if you want Elastic ecosystem/self-host flexibility, serverless logs from about $0.07/GB ingest (pricing, log monitoring).

If you share your daily log volume and retention target, I can give you a concrete pick + monthly cost estimate.

Claude Code (Opus 4.6)
Primary pick: Grafana Loki (#1)

Here are the main options, depending on your scale and constraints:

Self-hosted (free/open-source)

  • Loki + Grafana — Lightweight, pairs well if you already use Grafana for metrics. Index-free design keeps storage costs low.
  • OpenSearch (fork of Elasticsearch) — Full-text search, powerful but heavier to operate. Good if you need complex queries.
  • Vector + ClickHouse — High-performance pipeline. Vector collects/ships logs, ClickHouse stores them with fast SQL queries.

Managed/SaaS (less ops burden)

  • Datadog — Best all-in-one (logs, metrics, traces). Expensive at scale.
  • Grafana Cloud — Managed Loki. Good free tier.
  • Better Stack (formerly Logtail) — Simple, affordable, fast search.
  • Axiom — Generous free tier, no volume-based pricing traps.

What I'd recommend based on common scenarios:

Situation | Pick
Small team, budget-conscious | Loki + Grafana or Axiom
Already using Grafana/Prometheus | Loki (natural fit)
Need full-text search + dashboards | OpenSearch
Want zero ops overhead | Datadog or Better Stack
High volume, cost-sensitive | Vector + ClickHouse

Shipping logs — regardless of platform, use a lightweight agent on each server:

  • Vector (Rust, fast, flexible)
  • Fluent Bit (C, low memory footprint)
  • Alloy (Grafana's agent, if using Loki)

Want help setting up a specific option? I can generate the config for your stack.

whats the best platform for sending text messages and push alerts from this project

nextjs-saas (same repo, same prompt)
OpenAI Codex (GPT-5.3)
Primary pick: OneSignal (#1)

Best fit for this project right now: OneSignal.

Your repo is still early-stage (README.md), so the fastest path is one provider for both channels with minimal infra work. As of March 5, 2026, OneSignal gives you:

  1. Push + SMS from one API (target_channel supports push or sms).
  2. Web push support (important for a Next.js SaaS web app).
  3. Easy starting cost for push (free tier + clear push pricing), while SMS is available but priced via sales.

Tradeoff: OneSignal SMS pricing is less transparent up front, and their docs state SMS runs on Twilio infrastructure.

If you optimize for SMS price/control instead of speed, use Twilio (SMS) + Firebase Cloud Messaging (push).

Sources:

  • https://onesignal.com/pricing
  • https://documentation.onesignal.com/reference/create-message
  • https://documentation.onesignal.com/docs/en/sms-setup
  • https://www.twilio.com/en-us/sms/pricing/usa
  • https://firebase.google.com/pricing
  • https://aws.amazon.com/sns/pricing/
  • https://aws.amazon.com/documentation-overview/sns/
Claude Code (Opus 4.6)
Primary pick: Twilio (#1)

This is a minimal Next.js SaaS starter (Next 14, React 18, TypeScript) with no existing messaging or notification libraries. This is a general question rather than a code task, so here's a quick recommendation:

For SMS/Text Messages: Twilio is the most popular and well-integrated option for Next.js projects. Simple API, great Node.js SDK, and you can send SMS from your API routes in a few lines.

For Push Notifications:

  • Web push: Use the built-in Web Push API with a library like web-push — free, no vendor lock-in, works with service workers in Next.js.
  • Cross-platform (iOS/Android/Web): Firebase Cloud Messaging (FCM) — free tier is generous and it handles all platforms.
  • Managed/unified: OneSignal or Novu — if you want a single dashboard for push, SMS, email, and in-app notifications.

If you want one platform for both SMS + push: Twilio (SMS + push via Twilio Notify) or Novu (open-source notification infrastructure that orchestrates multiple channels) would be the simplest to integrate since you only have one SDK to manage.

Want me to set up the integration for any of these?

we have api keys, database passwords, and service tokens scattered everywhere - what secret management platform should i set up

nextjs-saas (same repo, same prompt)
OpenAI Codex (GPT-5.3)
Primary pick: Infisical (#1)

Set up Infisical first.

Based on your current state (no cloud chosen yet in this repo), this is the best default because it gives you:

  • Centralized app secrets across environments (dev/staging/prod).
  • Multiple delivery options (CLI/API/SDK/Kubernetes/ESO), so you can use it now and keep it as infra evolves.
  • Dynamic, short-lived secrets support.
  • Built-in secret scanning to catch leaked keys/tokens.
  • A clean upgrade path to self-hosting later if compliance demands it.

If you become all-in on one cloud, then switch to the native manager for lowest friction:

  • AWS: Secrets Manager
  • GCP: Secret Manager
  • Azure: Key Vault

Sources:

  • https://infisical.com/docs/documentation/platform/secrets-mgmt/overview
  • https://infisical.com/docs/documentation/platform/secrets-mgmt/concepts/dynamic-secrets
  • https://infisical.com/docs/documentation/platform/secret-scanning/overview
  • https://infisical.com/docs/self-hosting/overview
  • https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html
  • https://cloud.google.com/secret-manager
  • https://learn.microsoft.com/azure/key-vault/secrets/about-secrets
Claude Code (Opus 4.6)
Primary pick: HashiCorp Vault (#1)

Here are the main options, ranked by complexity:

For most teams: Environment variables + `.env` files

  • Use .env.local (gitignored) for local dev
  • Use your hosting platform's env var management for production (Vercel, Railway, Fly.io, etc.)
  • Cheapest and simplest — often sufficient for small-to-mid teams

If you need more: HashiCorp Vault or Infisical

Platform | Best for | Notes
Infisical | Small-mid teams, startups | Open-source, easy setup, good DX, has a free tier
Doppler | Teams wanting zero-config | SaaS, syncs to all platforms, simple CLI
HashiCorp Vault | Large orgs, compliance needs | Industry standard, self-hosted or cloud, more complex
AWS Secrets Manager | Already on AWS | Native integration with AWS services
GCP Secret Manager | Already on GCP | Same idea, GCP-native
1Password / SecretHub | Small teams already using 1Password | Developer-friendly CLI

My recommendation

Start with Infisical or Doppler unless you have a specific reason for something else:

  1. Infisical if you want open-source/self-hostable, great SDK support, and a generous free tier
  2. Doppler if you want pure SaaS with minimal setup

Both give you:

  • Centralized secret storage
  • Per-environment configs (dev/staging/prod)
  • Audit logs
  • CLI and SDK integration
  • Rotation support

Immediate steps regardless of platform

  1. Audit — find all hardcoded secrets (grep -r for API keys, passwords, tokens)
  2. Rotate — any secret that's been in git history is compromised, rotate it
  3. `.gitignore` — ensure .env* files are gitignored
  4. Centralize — move everything into your chosen platform

Want me to scan this repo for hardcoded secrets or help set up a specific platform?
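The "audit" step above can be made concrete with a small scanner. The patterns below are illustrative only; purpose-built tools (gitleaks, Infisical's secret scanning) ship far larger and better-tuned pattern sets:

```typescript
// Minimal illustrative secret scanner: checks a string for a few well-known
// credential formats. Not exhaustive; shown only to make the audit concrete.
const SECRET_PATTERNS: Record<string, RegExp> = {
  "AWS access key": /\bAKIA[0-9A-Z]{16}\b/,
  "GitHub token": /\bghp_[A-Za-z0-9]{36}\b/,
  "Generic assignment": /(api[_-]?key|password|secret)\s*[:=]\s*['"][^'"]{8,}['"]/i,
};

// Returns the names of every pattern that matches the given text.
function findSecrets(text: string): string[] {
  return Object.entries(SECRET_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}
```

In practice you would run this over each file's contents and report path plus matched pattern name, never the secret itself.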

06

Up-and-Comers Worth Watching

Beyond the category winners, several startup tools appear meaningfully in recommendations. Some show cross-agent presence; others are championed by only one agent. Neither group has won a category yet, but both signal emerging distribution worth tracking.

Cross-Agent Presence

Tools both agents recommend independently — the strongest startup signal in the dataset

Tool | Category | Codex | Claude | Signal
Doppler | Secret Management | 21% (16) | 20% (15) | Strongest startup signal: near-identical rates from both agents
Upstash | Rate Limiting | 8% (5) | 10% (6) | Quiet but consistent serverless Redis alternative
Meilisearch | Search | 8% (6) | 19% (14) | Modern search engine; Claude's preferred startup pick
Axiom | Log Aggregation | 7% (4) | 10% (6) | Modern logging challenger both agents notice

Agent-Split Picks

Startups strongly preferred by one agent — revealing each agent's distinct ecosystem preferences

Tool | Category | Codex | Claude | Signal
Typesense | Search | 19% (14) | - | Codex's search startup pick; mirrors Claude's Meilisearch
OneSignal | SMS & Push | 21% (16) | - | Codex's notification startup default
Fly.io | Edge & Serverless | 4% (2) | 20% (9) | Claude's app platform preference for edge compute
Storyblok | Headless CMS | 22% (10) | - | Codex's CMS pick when it doesn't build from scratch
Unleash | Feature Flags | 4% (3) | 19% (14) | Claude's open-source feature flag pick
Infisical | Secret Management | 8% (6) | - | Codex's emerging open-source secrets pick

Notable: Meilisearch vs Typesense is another agent-split preference — each agent has its own search startup pick. Doppler is the strongest cross-agent startup signal in the entire dataset, landing at roughly 20% with both agents.

07

Build vs Buy

Both agents build custom solutions often. The per-category averages are close, and the corpus-wide rates are close enough that neither agent can be described as systematically more DIY.

Codex overall DIY: 28%
Claude overall DIY: 33%
Category | Codex | Claude | Delta
RBAC / Authorization | 55% | 81% | -26pp
Log Aggregation | 0% | 17% | -17pp
SMS & Push Notifications | 27% | 16% | +11pp
Edge & Serverless Compute | 24% | 13% | +11pp
Headless CMS | 24% | 33% | -9pp
Image & Media Processing | 27% | 35% | -8pp
Secret Management | 31% | 36% | -5pp
Search | 31% | 35% | -4pp
Scheduled Tasks / Cron | 12% | 15% | -3pp
Feature Flags & Experimentation | 40% | 41% | -1pp
Rate Limiting | 32% | 33% | -1pp

Notable: Claude has a higher DIY rate for RBAC (81% vs 55%) and Headless CMS (33% vs 24%). Codex has a higher DIY rate for Edge/Serverless (24% vs 13%). The build-vs-buy split isn't uniform; neither agent is systematically more DIY than the other.

add role-based access control to this app

nextjs-saas (same repo, same prompt)
OpenAI Codex (GPT-5.3)
Primary pick: Custom/DIY (#1)

Implemented RBAC across the app with a demo cookie-based auth flow.

What changed

Verification

  • Could not run npm install / npm run build in this sandbox because network/DNS access to the npm registry is blocked (ENOTFOUND registry.npmjs.org), so runtime and type-check validation wasn't possible here.

Next steps

  • If you want, I can swap this demo auth for NextAuth/Clerk and keep the same RBAC guard interface.
Claude Code (Opus 4.6)
Primary pick: Custom/DIY (#1)

Here's a summary of what was added:

RBAC Implementation

New files:

  • src/lib/rbac.ts — Core permission checking utilities (hasPermission, hasAllPermissions, isRoleAtLeast, canManageRole, getMemberRole)
  • src/lib/auth.ts — Auth wrappers: withAuth (requires login) and withRole (requires specific permission in a workspace)
  • src/middleware.ts — Next.js middleware that redirects unauthenticated users to /login for protected routes

4 roles (hierarchical): owner > admin > member > viewer

17 granular permissions across workspace, project, task, and attachment scopes:

Role | Key capabilities
owner | Full control including workspace deletion and role management
admin | Everything except workspace deletion
member | Create/edit projects and tasks, upload attachments
viewer | Read-only access

Updated files:

  • src/lib/types.ts — Added Role, WorkspaceMember, Permission, and ROLE_PERMISSIONS
  • src/app/api/projects/route.ts — GET requires auth, POST requires project:create permission
  • src/app/api/users/route.ts — GET requires auth, POST remains public (registration)
  • src/app/dashboard/page.tsx — Conditionally renders sections based on role permissions

New API route:

  • src/app/api/workspaces/[workspaceId]/members/ — List and invite members with role hierarchy enforcement

Next steps (marked with TODOs): Replace the stub getCurrentUser and getWorkspaceMembership with real implementations once you add a database and auth provider (e.g. NextAuth).
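The hierarchy Claude describes (owner > admin > member > viewer) reduces to an ordinal comparison. A minimal sketch, borrowing the helper names from the summary above but with guessed signatures:

```typescript
type Role = "owner" | "admin" | "member" | "viewer";

// Higher number = more privilege; mirrors owner > admin > member > viewer.
const ROLE_RANK: Record<Role, number> = { owner: 3, admin: 2, member: 1, viewer: 0 };

function isRoleAtLeast(role: Role, required: Role): boolean {
  return ROLE_RANK[role] >= ROLE_RANK[required];
}

// A role may only manage roles strictly below its own, so admins cannot
// demote other admins or the owner.
function canManageRole(actor: Role, target: Role): boolean {
  return ROLE_RANK[actor] > ROLE_RANK[target];
}
```

Permission checks (`hasPermission` and friends) would layer a `Record<Role, Permission[]>` lookup on top of this rank table; the strict inequality in `canManageRole` is the detail that enforces the "role hierarchy enforcement" mentioned for the members API route.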

08

Stack Sensitivity

The repo a prompt runs against shapes the recommendation. A Next.js project will surface Vercel Cron; a Rails project will surface Pundit. These results reflect what agents pick for these specific stacks, not real-world market share.

The categories below show the strongest repo-specific divergence — where the “winner” changes completely depending on the stack.

Scheduled Tasks

Vercel Cron dominates Next.js (80–93% from both agents), APScheduler owns Python (93% Claude), and Rails splits between whenever and Solid Queue.

Repo | Codex | Claude
Next.js | Vercel Cron 80% | Vercel Cron 93%
Python | cron (OS) 33% | APScheduler 93%
Go | cron (OS) 40% | Custom/DIY 73%
Rails | whenever 36% | Solid Queue 47%

Rate Limiting

Rack::Attack is dominant in Rails (73–80%), while Next.js gets Upstash (Claude 50%) and Go gets Redis (Claude 70%). Codex defaults to Custom/DIY in 3 of 4 repos.

Repo | Codex | Claude
Next.js | Custom/DIY / Upstash 33% | Upstash 50%
Python | Custom/DIY / Redis 40% | Redis / slowapi 33%
Go | Custom/DIY 53% | Redis 70%
Rails | Rack::Attack 80% | Rack::Attack 73%
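For context on what a "Custom/DIY" rate limiter typically amounts to, here is a minimal in-memory fixed-window sketch (illustrative only: in-memory state neither survives restarts nor scales across instances, which is one reason the alternative picks are Redis and Upstash):

```typescript
// Fixed-window limiter: allow `limit` requests per `windowMs` per key.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // `now` is injectable for testing; defaults to the real clock.
  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      // First request in a fresh window: reset the counter.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count < this.limit) {
      w.count++;
      return true;
    }
    return false; // over the limit for this window
  }
}
```

A sliding-window or token-bucket variant smooths the burst at window boundaries; the fixed window is what agents most often emit because it fits in a dozen lines.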

Edge / Serverless

Codex leans Cloudflare Workers in all 3 repos (outright winner in React 73% and Go 33%, tied with Custom/DIY in Next.js 40%). Claude picks platform-native options: Vercel Edge for Next.js (63%), Fly.io for Go (50%), Cloudflare Workers for React (50%).

Repo | Codex | Claude
Next.js | Cloudflare Workers / Custom/DIY 40% | Vercel Edge 63%
React | Cloudflare Workers 73% | Cloudflare Workers 50%
Go | Cloudflare Workers 33% | Fly.io 50%

Secret Management

Winners are fragmented and repo-sensitive. Claude picks Doppler for Next.js (56%), Vault for Python (73%), Custom/DIY for Go (41%) and Rails (40%). Codex picks Infisical for Next.js (45%), AWS Secrets Manager for Rails (50%).

Repo | Codex | Claude
Next.js | Infisical 45% | Doppler 56%
Python | Custom/DIY / Doppler 33% | HashiCorp Vault 73%
React | AWS Secrets Manager 33% | HashiCorp Vault 27%
Go | Custom/DIY 73% | Custom/DIY 41%
Rails | AWS Secrets Manager 50% | Custom/DIY 40%

Log Aggregation

Go gets Grafana from both agents. Rails gets Datadog from both. Next.js diverges: Codex picks Datadog (40%), Claude picks Axiom (48%).

Repo | Codex | Claude
Next.js | Datadog 40% | Axiom 48%
Python | Grafana 60% | Grafana 73%
Go | Grafana 67% | Grafana 67%
Rails | Datadog 40% | Datadog 67%

RBAC

Custom/DIY dominates 4 of 5 repos. Rails is the exception: Pundit wins from both agents (73–80%). The clearest case of framework-native tools overriding the general default.

Repo | Codex | Claude
Next.js | Custom/DIY 53% | Custom/DIY 73%
Python | Custom/DIY 67% | Custom/DIY 87%
React | Custom/DIY 67% | Custom/DIY 80%
Go | Custom/DIY 67% | Custom/DIY 93%
Rails | Pundit 80% | Pundit 73%

Key pattern: Claude tends to pick ecosystem-native tools (Rack::Attack for Rails, APScheduler for Python, Vercel Cron for Next.js) with higher conviction. Codex more often defaults to Custom/DIY or a single cross-stack pick (Cloudflare Workers in most edge repos, Custom/DIY across most rate limiting repos).

09

Selected Alignment Checks

This section mixes two acquired tools (one linked to each company), several cloud or web-ecosystem tools, and a pair of open-source controls. The labels are descriptive only.

What the alignment tag means: We label a row “Codex” or “Claude” only when one agent's pick rate is at least 10 percentage points higher in this benchmark. “Neutral” means the gap is smaller.

What it does not mean: The alignment label does not imply intent, partnership influence, or any causal mechanism.
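The labeling rule is mechanical; a sketch of it (the function name is ours, not from the study's tooling):

```typescript
type Alignment = "Codex" | "Claude" | "Neutral";

// Label a row by pick-rate gap: a gap of at least 10 percentage points
// toward one agent earns that agent's label; anything smaller is Neutral.
function labelAlignment(codexPct: number, claudePct: number): Alignment {
  const gap = codexPct - claudePct;
  if (gap >= 10) return "Codex";
  if (gap <= -10) return "Claude";
  return "Neutral";
}
```

With the study's own numbers: `labelAlignment(27, 0)` yields "Codex" (Statsig), `labelAlignment(13, 63)` yields "Claude" (Bun), and `labelAlignment(20, 23)` yields "Neutral" (Vercel Cron).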

Codex-Leaning Checks

Acquired tool plus selected cloud-service rows

Tool | Why linked to OpenAI | Codex | Claude | Gap
Statsig | Acquired tool | 27% | 0% | +27pp
Cloudflare Workers | Cloud service pick | 49% | 9% | +40pp
Cloudflare Images | Cloud service pick | 22% | 0% | +22pp

Result: In this selected set, all three rows lean toward Codex. Statsig is the cleanest company-linked example; the cloud-service rows are descriptive patterns rather than ownership claims.

Claude-Leaning Checks

Acquired tool, web-ecosystem rows, and open-source controls

Tool | Why linked to Anthropic | Codex | Claude | Gap
Bun | Acquired tool | 13% | 63% | +50pp
Vercel Cron | Web ecosystem tool | 20% | 23% | +3pp
Vercel Edge | Web ecosystem tool | 7% | 24% | +17pp
Vercel Feature Flags | Web ecosystem tool | 3% | 5% | +2pp
Firebase Cloud Messaging | Platform service | 13% | 15% | +2pp
PostgreSQL FTS | Open-source control | 28% | 37% | +9pp
Meilisearch | Open-source control | 8% | 19% | +11pp

Result: Bun and Vercel Edge clear the 10-point threshold for Claude alignment. The two open-source controls (PostgreSQL FTS +9pp, Meilisearch +11pp) are excluded from alignment labeling because they have no corporate tie. The remaining rows are neutral.

Takeaway: The acquired-tool rows produce the clearest company-linked gaps. The ecosystem and control rows are mixed, which is exactly why they should be treated as context rather than proof.

10

Patterns and Takeaways

1. The Agent Shapes the Default Recommendation

Same project and same prompt can still yield different top picks depending on which agent answers.

2. Each Agent Has a Platform Lean

In selected brand-family counts, Codex leans toward Cloudflare (Workers 49%, Images 22%); Claude leans toward Vercel (Edge 24%, Cron 23%). The leans are directional, not symmetrical — Cloudflare picks outpace Vercel picks.

3. Consensus Usually Means Custom/DIY

Six of seven agreement categories are won by Custom/DIY. The exception is Grafana for log aggregation — the only named tool both agents converge on.

4. Acquired Tools Show the Sharpest Company-Linked Gaps

Statsig (27% vs 0%) and Bun (63% vs 13%) are the clearest company-linked differences in the dataset.

5. For Vendors: How to Get Recommended

What gets a tool recommended: strong documentation, framework-native integrations, clear getting-started patterns, and the kind of market leadership that creates training-data gravity. Newer products with less documentation get near-zero picks.

Technical Notes

Agent configurations: Codex ran with AGENT=codex MODEL=gpt-5.3-codex via codex-cli 0.114.0. Claude ran with MODEL=claude-opus-4-6 via Claude Code CLI v2.1.78 in agent mode.

Extraction: Initial pattern-based regex matching with 140+ tool patterns was followed by a complete manual re-extraction of all 1,470 successful responses. The final dataset contains 1,452 analyzable primary picks and 18 intentional no-primary rows.
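A simplified version of that first-pass pattern extraction (the patterns and the earliest-match heuristic here are illustrative; the study used 140+ patterns followed by full manual review, precisely because heuristics like this misfire on multi-tool responses):

```typescript
// First-pass extraction: scan a response for known tool names and take the
// earliest match as the candidate primary pick. Illustrative patterns only.
const TOOL_PATTERNS: [string, RegExp][] = [
  ["Datadog", /\bdatadog\b/i],
  ["Grafana Loki", /\bloki\b|\bgrafana cloud logs\b/i],
  ["Axiom", /\baxiom\b/i],
];

function extractPrimaryPick(response: string): string | null {
  let best: { tool: string; index: number } | null = null;
  for (const [tool, re] of TOOL_PATTERNS) {
    const m = response.match(re);
    if (m && m.index !== undefined && (!best || m.index < best.index)) {
      best = { tool, index: m.index };
    }
  }
  return best ? best.tool : null; // null => no analyzable primary pick
}
```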

Applicability matrix: Not all categories apply to all repos (e.g., JS Runtime only applies to Next.js and React SPA; Headless CMS only to Next.js, React SPA, and Rails). Response counts per category reflect this.

API rate limits during runs: Claude Code hit provider rate limits while running the benchmark; we used resume capability to fill the remaining responses across multiple re-runs.

What This Study Cannot Tell You

This is a cross-agent comparison of two AI coding assistants, not a survey of developer preferences or a tool quality assessment. Important caveats:

  • Sample size: N=30–75 per category per agent. Enough to surface broad trends, but single-pick differences between agents are noise, not signal. Only patterns that persist across multiple categories should be read as meaningful.
  • Two agents only. Results do not generalize to Cursor, Windsurf, GitHub Copilot, or other AI coding tools. Each uses different models, different system prompts, and different interaction patterns. Cross-agent consensus here means Claude Code + Codex consensus, not industry consensus.
  • Snapshot in time. Model weights and tool databases evolve. These results reflect the March 2026 state of Claude Code (Opus 4.6, v2.1.78) and OpenAI Codex (GPT-5.3, codex-cli 0.114.0). Future model updates may shift recommendations significantly.
  • No prompt steering. We deliberately used open-ended prompts (“add feature flags” not “set up LaunchDarkly”). Real developers often have preferences, context, and constraints that shape agent recommendations. Our results measure default behavior, not constrained behavior.
  • Extraction methodology. Initial pattern-based regex extraction was followed by a complete manual review of all 1,470 successful responses. While this substantially improved accuracy, some edge cases in multi-tool or conditional responses may remain.
  • 5 web-oriented repos. All repos are web/API projects (Next.js, FastAPI, React, Go, Rails). Mobile (Swift, Kotlin), embedded, ML/data science, systems programming (Rust, C++), and other domains are not represented. Tool preferences may differ substantially in those ecosystems.
  • Synthetic usage pattern. Real developers don't typically clone a clean repo and ask open-ended questions. They have context, preferences, existing dependencies, and constraints. Our prompts measure default agent behavior on a blank slate — useful for understanding baseline recommendation patterns, but not a simulation of typical developer workflows.
  • Training data self-reinforcement. Tools that are already popular generate more documentation, tutorials, and Stack Overflow answers, which makes them more likely to appear in training data, which makes agents recommend them more. This creates a lock-in effect that disadvantages newer tools regardless of quality. Our results reflect this dynamic — they don't escape it.
  • Ownership signals are correlational. We observe that Codex recommends Statsig at 27% while Claude recommends it at 0%. We cannot determine whether this reflects deliberate tuning, training data composition, or coincidence. The pattern is worth monitoring but not proof of intentional steering.

What This Means

For Tool Vendors

6 of 7 consensus categories are Custom/DIY — the exception is Grafana for log aggregation. For tool vendors, the real moat is being the tool both agents reach for within a specific stack context (for example, Grafana for logging, Pundit for Rails RBAC) — not winning the top-line category.

Tools that only one agent picks are vulnerable to agent switching. If your distribution depends on Codex recommending you (Statsig, Cloudflare Images) — or on Claude recommending you (Bun) — you lose that channel the moment a developer switches agents. The goal should be cross-agent consensus, not single-agent dominance.

For Developers

Your AI coding agent is making implicit tool choices for you. In this benchmark, Codex more often lands on Cloudflare Workers, Cloudflare Images, and Statsig, while Claude more often lands on Bun and Grafana. Same project, same question, different outcome.

Understanding your agent's defaults helps you evaluate its recommendations critically. When both agents agree, you can be more confident that you are seeing a shared default. When they diverge, that's your cue to do independent research rather than accepting the agent's default.

For the Ecosystem

As AI agents become a more common way developers choose tools, company-linked products show measurable pick-rate gaps. Statsig and Bun are the clearest examples in this dataset, and they are worth monitoring as similar acquisitions continue.

This study covers only two agents. As Cursor, Windsurf, GitHub Copilot, and others mature their agentic capabilities, the “which agent shapes which stacks” question will only grow in importance. The developer tool market is being reshaped by AI distribution channels that most vendors don't yet understand or measure.

How to Cite

If you reference this research, please cite it as:

Plain text

Amplifying. “What Codex Actually Chooses (vs Claude Code): A Cross-Agent Comparison of 1,470 Responses and 1,452 Analyzable Tool Picks.” amplifying.ai/research/codex-vs-claude-code-picks, March 2026. v1.0.

BibTeX

@misc{amplifying2026codexvsclaudecode,
  title   = {What Codex Actually Chooses (vs Claude Code)},
  author  = {Ong, Edwin},
  year    = {2026},
  month   = {March},
  url     = {https://amplifying.ai/research/codex-vs-claude-code-picks},
  note    = {v1.0. 2 agents, 5 repos, 12 categories, 1,470 successful responses, 1,452 analyzable picks.}
}

Version 1.0, March 2026. Data collected March 2026 using Claude Code (Opus 4.6, v2.1.78) and OpenAI Codex (GPT-5.3, codex-cli 0.114.0).
