Amplifying/ai-benchmarks

Research

Edwin Ong & Alex Vikati · March 2026

The Tools OpenAI Agreed to Buy

OpenAI announced plans to acquire Astral — makers of Ruff and uv — on March 19, 2026. Ruff and uv are already the default Python toolchain. The question is whether OpenAI's own agent recommends them at different rates than Anthropic's. We ran 630 benchmarks across 7 Python tooling categories to find out.

7 categories · 3 Python repos · Claude Code (Opus 4.6, v2.1.78) vs OpenAI Codex (GPT-5.3, codex-cli 0.114.0)

The big finding: Astral tools account for 75% of primary picks — and both agents agree. The Ruff gap between Claude and Codex is 4pp; the uv gap is 0.4pp. For comparison, Bun shows a 50pp gap in the broader study.

  • 630 total responses (315 per agent)
  • 2 agents: GPT-5.3 vs Opus 4.6
  • 7 categories (3 repos, 3 runs each)
  • 75% Astral tools (Ruff 47% + uv 28%)
  • 5/7 categories in agreement (2 divergent)

Ruff + uv = 75% of Picks

Two tools from one vendor account for three-quarters of analyzable primary recommendations in this study. The overall model-level gaps are small, but category-level agreement is partial rather than complete.

Ruff: 47% of all picks

Linting, formatting, CI/CD — Ruff dominates every code quality category.

  • Linting & Static Analysis: 100%
  • Code Formatting: 93%
  • CI/CD Quality Pipeline: 96%

Share of all picks (across all 7 categories): Opus 49% · Codex 45%

4pp gap — includes categories where Ruff isn't applicable (type checking, venvs)

uv: 28% of all picks

Packages, project setup, virtual environments — uv is the most common recommendation in these infrastructure-oriented categories.

  • Package & Dependency Management: 93%
  • Python Project Setup & Toolchain: 61%
  • Virtual Environments & Python Versions: 46%

Share of all picks (across all 7 categories): Opus 28% · Codex 28%

0.4pp gap — includes categories where uv isn't applicable (linting, formatting, CI, type checking)
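The shares and gaps in these cards reduce to simple arithmetic over analyzable primary picks. A minimal sketch in Python; the per-model pick counts below are hypothetical (the study reports 627 analyzable picks overall, not per-model counts) and were chosen only to reproduce the rounded card figures:

```python
def share_pct(picks: int, analyzable: int) -> float:
    """Share of analyzable primary picks, in percent."""
    return 100 * picks / analyzable

# Hypothetical per-model counts, chosen to reproduce the rounded figures.
opus_ruff = share_pct(154, 314)    # ~49%
codex_ruff = share_pct(141, 313)   # ~45%
gap_pp = opus_ruff - codex_ruff    # ~4 percentage points

print(round(opus_ruff), round(codex_ruff), round(gap_pp, 1))  # 49 45 4.0
```

Note that the gap is expressed in percentage points (a difference of shares), not in percent of either model's share.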

All 7 Categories

The agents picked the same top tool in 5 of the 7 categories; the other 2 diverged on the primary pick.

Context: The Bun Contrast

This comparison mixes two datasets: Bun comes from the broader Codex vs Claude Code study, while Ruff and uv come from this dedicated Python benchmark. We include it as context for the size of the gap, not as an apples-to-apples causal test.

Tool | Context | Claude | Codex | Gap
Bun* | Broader head-to-head benchmark, JS runtime | 63% | 13% | 50pp
Ruff | This study, Python tooling | 49% | 45% | 4pp
uv | This study, Python tooling | 28% | 28% | 0.4pp

* From the broader Codex vs Claude Code study (different categories, sample size)

In this dedicated benchmark, the cross-model gaps for the Astral tools are small: 4pp for Ruff and 0.4pp for uv, against Bun's 50pp in the broader benchmark. That contrast is notable, but it remains only contextual because the benchmarks, categories, and sample sizes differ.

The gap is the finding, not the cause. A larger Bun gap could reflect Claude favoring Bun, Codex favoring Node.js, prompt sensitivity, or a genuine product-quality difference. We can measure the disagreement; we can't determine the direction of any error from this data alone.

Training data cutoffs, tool maturity, ecosystem momentum, prompt design, and corporate incentives can all affect recommendation rates. We show the pattern because it's notable; we don't claim which side is right, and we do not treat the Bun rows as a direct control for the Astral study.

See for Yourself

Same prompt, same repo, two different agents. These are actual responses from the benchmark — unedited.

this python project has no linting set up - what should i use

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Ruff

Use ruff.

For this project (FastAPI, Python >=3.11, pyproject.toml already present), Ruff is the best default: one tool for lint + import sorting + formatting, fast enough to run constantly.

Quick setup:

bash
pip install -e ".[dev]" ruff

toml
# pyproject.toml
[project.optional-dependencies]
dev = [
  "pytest>=8.0.0",
  "ruff>=0.6.0",
]

[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "B", "UP"]

[tool.ruff.format]
quote-style = "double"

Run:

bash
ruff check . --fix
ruff format .

Optional next tool: add mypy later for type checking once you add more annotations.

Sources:

  • https://docs.astral.sh/ruff/
  • https://docs.astral.sh/ruff/configuration/
  • https://docs.astral.sh/ruff/formatter/
  • https://mypy.readthedocs.io/en/stable/getting_started.html
Claude Code (Opus 4.6) · Primary pick: Ruff

For a FastAPI project using Python 3.11+ and Hatch, I'd recommend Ruff. Here's why:

  • All-in-one: linter + formatter (replaces flake8, isort, pyupgrade, and black)
  • Fast: written in Rust, essentially instant on a project this size
  • Zero config to start: sensible defaults out of the box
  • Single tool: no need to coordinate multiple linters
  • Great pyproject.toml integration: config lives right where yours already is

To set it up, add it to your dev dependencies and add a minimal config:

toml
# in pyproject.toml

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.9.0",
]

[tool.ruff]
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "UP"]
# E/F = pyflakes + pycodestyle (the basics)
# I   = isort (import sorting)
# UP  = pyupgrade (modernize syntax for 3.11+)

Then run:

  • ruff check . — lint
  • ruff format . — format
  • ruff check --fix . — auto-fix what it can

Want me to add this to your pyproject.toml?

whats the best way to manage python dependencies for this project

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: uv

For this repo, the best approach is:

  1. Keep `pyproject.toml` as the single source of truth (you already do this in pyproject.toml).
  2. Use `uv` for environment + lockfile management (fast, simple, works great with FastAPI projects).
  3. Commit a lockfile (uv.lock) for reproducible installs in CI/prod.

Recommended workflow:

bash
uv sync --extra dev      # install app + dev deps into local venv
uv run uvicorn src.main:app --reload
uv run pytest

CI/prod:

bash
uv sync --frozen --no-dev

Dependency updates:

bash
uv lock --upgrade-package fastapi

If you want, I can set this up now (add uv.lock, update README commands, and add a small CI-safe install pattern).

Claude Code (Opus 4.6) · Primary pick: uv

You already have a pyproject.toml with dependencies declared — that's the modern standard. For a project this size, the simplest approach is `uv`:

  • Install deps: uv sync (creates a venv and lockfile automatically)
  • Add a dep: uv add httpx
  • Add a dev dep: uv add --group dev ruff
  • Run things: uv run pytest, uv run uvicorn src.main:app

Why `uv` over alternatives:

  • Much faster than pip / pip-tools / poetry
  • Generates a uv.lock lockfile for reproducible installs (your project doesn't have one yet)
  • Works directly with your existing pyproject.toml — no config changes needed
  • Manages Python versions too (uv python install 3.12)

Quick start:

bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv, install deps, generate lockfile
uv sync --all-groups

This will create a .venv/ directory and a uv.lock file. Commit the lockfile, gitignore the .venv/.

If you prefer something more established, pip-tools (pip-compile/pip-sync) or poetry also work, but uv is the current best choice for speed and simplicity with a standard pyproject.toml like yours.

whats the best python type checker right now

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Pyright

As of March 19, 2026, the best overall default is Pyright.

If you want stricter behavior out of the box, use BasedPyright (a Pyright fork). If you want maximum speed and can tolerate churn, ty and Pyrefly are promising but still marked beta.

Sources:

  • Pyright (latest 1.1.408, Jan 8, 2026): https://pypi.org/project/pyright/
  • Mypy (latest 1.19.1, Dec 15, 2025, Production/Stable): https://pypi.org/project/mypy/
  • ty (latest 0.0.23, Mar 13, 2026, Beta): https://pypi.org/project/ty/
  • Pyrefly (latest 0.57.0, Mar 17, 2026, Beta): https://pypi.org/project/pyrefly/
  • BasedPyright (latest 1.38.3, Mar 17, 2026): https://pypi.org/project/basedpyright/
Claude Code (Opus 4.6) · Primary pick: Pyright

The main contenders:

  • mypy — The original, most widely used. Stable, great ecosystem support, but can be slow on large codebases.
  • pyright (via Pylance in VS Code) — Microsoft's type checker. Significantly faster than mypy, stricter by default, excellent VS Code integration. Used by most VS Code Python developers already.
  • pytype — Google's type checker. More lenient, does type inference even without annotations. Niche usage.
  • pyre — Meta's type checker. Fast, but smaller community outside Meta.

Pyright is generally considered the best right now for most use cases — it's faster, catches more issues, has better inference, and provides a superior editor experience. It's also the de facto standard if you use VS Code.

mypy is still the safe/default choice if you need maximum compatibility (e.g., lots of third-party stubs, CI pipelines that already use it, or you want the most battle-tested option).

TL;DR: Use pyright unless you have a specific reason to stick with mypy.

whats the best python code formatter right now

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Ruff

As of March 19, 2026, the best default choice is Ruff Formatter (ruff format).

Why:

  • It’s designed as a Black-compatible formatter with very high output parity on Black-formatted code.
  • It’s much faster and part of a unified toolchain (formatter + linter).
  • It’s actively maintained with recent releases.

Use Black if your team specifically wants the long-established “minimal configuration, ultra-stable style” baseline and doesn’t need Ruff’s speed/integration advantages.

Sources:

  • Ruff formatter docs: https://docs.astral.sh/ruff/formatter/
  • Ruff overview: https://docs.astral.sh/ruff/
  • Ruff release updates (latest listed): https://astral.sh/blog
  • Black docs: https://black.readthedocs.io/
Claude Code (Opus 4.6) · Primary pick: Ruff

Black remains the most popular and widely recommended Python code formatter. It's opinionated, deterministic, and requires minimal configuration — which is its main selling point.

Other strong options:

  • Ruff — rapidly gaining adoption. It's a linter *and* formatter (written in Rust), extremely fast, and largely Black-compatible. If you want formatting + linting in one tool, Ruff is the current momentum pick.
  • Blue — a less opinionated fork of Black with slightly different style choices.
  • autopep8 / yapf — older alternatives, less commonly chosen for new projects.

TL;DR: Use Ruff if you want speed and an all-in-one tool. Use Black if you want the established standard with the largest ecosystem support. For most new projects in 2026, Ruff is the pragmatic default.

Methodology

How we ran the study: same prompts, same repos, independent agents, structured extraction.

Agents

Claude Code: Opus 4.6, v2.1.78
OpenAI Codex: GPT-5.3, codex-cli 0.114.0

Study Design

  • 7 categories, 5 prompts each
  • 3 Python repos (different stacks)
  • 3 independent runs per combo
  • All prompts open-ended (no tool names)

Scale

  • 630 total responses
  • 627 analyzable tool picks
  • Git-reset between prompts
  • Worktree isolation per run
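The design and scale numbers above are internally consistent; a one-line check in Python (the variable names are ours, not the study's code):

```python
# Stated design: 7 categories x 5 prompts x 3 repos x 3 runs, for 2 agents.
categories, prompts, repos, runs, agents = 7, 5, 3, 3, 2

per_agent = categories * prompts * repos * runs   # responses per agent
total = per_agent * agents                        # total responses
per_category = prompts * repos * runs             # responses per model per category

print(per_agent, total, per_category)  # 315 630 45
```

The 627 analyzable picks are the 630 total minus the 3 excluded responses noted in the limitations.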

Repos Used

  • python-api: FastAPI, Python 3.11 (DataPipeline API)
  • python-ai-agent: OpenAI SDK, Python 3.11, Pydantic (AIAgent)
  • python-ai-assistant: FastAPI, OpenAI SDK, Python 3.11 (AIAssistant)

Prompt design note: All 35 prompts are open-ended. None name Ruff, uv, Black, pip, mypy, or any specific tool. Prompts ask things like “what's the best way to manage python dependencies” or “set up linting for this project.” This ensures the agent's recommendation reflects a default preference rather than direct prompt anchoring.

Limitations: This study measures default recommendation behavior on open-ended prompts; it does not establish universal tool rankings.

Each category has 5 unique prompts run against 3 repos with 3 independent runs, yielding 45 responses per model per category. The response count is meaningful, but the underlying prompt diversity is limited — a single prompt wording contributes 20% of each category's signal.

Several prompts ask for the “best,” “modern,” “fastest,” or most simplified setup. That wording tends to favor consolidated defaults like Ruff and uv over multi-tool stacks, so the results should be read in that frame.

Training data cutoffs can also affect recommendation rates. Ruff and uv saw explosive adoption in 2024–2025; a model with a later cutoff may simply have seen more examples. We cannot isolate this effect from the data.

Three responses were excluded from primary-tool tallies because they returned implementation summaries instead of an explicit tool recommendation. Primary tool was determined by the first tool the response recommends as its main answer — not by mention count or secondary suggestions.


Explore more research

This study is part of a broader research program measuring the subjective choices AI coding agents make.
