Amplifying/ai-benchmarks

Research

Edwin Ong & Alex Vikati · March 2026

The Tools OpenAI Agreed to Buy

OpenAI announced plans to acquire Astral — makers of Ruff and uv — on March 19, 2026. Ruff and uv are already the default Python toolchain. The question is whether OpenAI's own agent recommends them at different rates than Anthropic's. We ran 630 benchmarks across 7 Python tooling categories to find out.

7 categories · 3 Python repos · Claude Code (Opus 4.6, v2.1.78) vs OpenAI Codex (GPT-5.3, codex-cli 0.114.0)

The big finding: Astral tools account for 75% of primary picks — and both agents agree. The Ruff gap between Claude and Codex is 4pp; the uv gap is 0.4pp. For comparison, Bun shows a 50pp gap in the broader study.

  • 630 total responses (315 per agent)
  • 2 agents: GPT-5.3 vs Opus 4.6
  • 7 categories (3 repos, 3 runs each)
  • 75% Astral tools (Ruff 47% + uv 28%)
  • 5/7 categories in agreement (2 divergent)

Ruff + uv = 75% of Picks

Two tools from one vendor account for three-quarters of analyzable primary recommendations in this study. The overall model-level gaps are small, but category-level agreement is partial rather than complete.

Ruff: 47% of all picks

Linting, formatting, CI/CD — Ruff dominates every code quality category.

  • Linting & Static Analysis: 100%
  • Code Formatting: 93%
  • CI/CD Quality Pipeline: 96%

Share of all picks (across all 7 categories): Opus 49% · Codex 45%

4pp gap — includes categories where Ruff isn't applicable (type checking, venvs)

uv: 28% of all picks

Packages, project setup, virtual environments — uv is the most common recommendation in these infrastructure-oriented categories.

  • Package & Dependency Management: 93%
  • Python Project Setup & Toolchain: 61%
  • Virtual Environments & Python Versions: 46%

Share of all picks (across all 7 categories): Opus 28% · Codex 28%

0.4pp gap — includes categories where uv isn't applicable (linting, formatting, CI, type checking)
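The shares and gaps in these cards reduce to simple arithmetic over analyzable primary picks. A minimal sketch in Python; the per-model pick counts below are hypothetical (the study reports 627 analyzable picks overall, not per-model counts) and were chosen only to reproduce the rounded card figures:

```python
def share_pct(picks: int, analyzable: int) -> float:
    """Share of analyzable primary picks, in percent."""
    return 100 * picks / analyzable

# Hypothetical per-model counts, chosen to reproduce the rounded figures.
opus_ruff = share_pct(154, 314)    # ~49%
codex_ruff = share_pct(141, 313)   # ~45%
gap_pp = opus_ruff - codex_ruff    # ~4 percentage points

print(round(opus_ruff), round(codex_ruff), round(gap_pp, 1))  # 49 45 4.0
```

Note that the gap is expressed in percentage points (a difference of shares), not in percent of either model's share.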

All 7 Categories

The agents picked the same top tool in 5 of the 7 categories; the other 2 diverged on the primary pick.

Context: The Bun Contrast

This comparison mixes two datasets: Bun comes from the broader Codex vs Claude Code study, while Ruff and uv come from this dedicated Python benchmark. We include it as context for the size of the gap, not as an apples-to-apples causal test.

Tool | Context | Claude | Codex | Gap
Bun* | Broader head-to-head benchmark, JS runtime | 63% | 13% | 50pp
Ruff | This study, Python tooling | 49% | 45% | 4pp
uv | This study, Python tooling | 28% | 28% | 0.4pp

* From the broader Codex vs Claude Code study (different categories, sample size)

In this dedicated benchmark, the cross-model gaps for the Astral tools are small: 4pp for Ruff and 0.4pp for uv, against Bun's 50pp in the broader benchmark. That contrast is notable, but it remains only contextual because the benchmarks, categories, and sample sizes differ.

The gap is the finding, not the cause. A larger Bun gap could reflect Claude favoring Bun, Codex favoring Node.js, prompt sensitivity, or a genuine product-quality difference. We can measure the disagreement; we can't determine the direction of any error from this data alone.

Training data cutoffs, tool maturity, ecosystem momentum, prompt design, and corporate incentives can all affect recommendation rates. We show the pattern because it's notable; we don't claim which side is right, and we do not treat the Bun rows as a direct control for the Astral study.

See for Yourself

Same prompt, same repo, two different agents. These are actual responses from the benchmark — unedited.

this python project has no linting set up - what should i use

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Ruff

Use ruff.

For this project (FastAPI, Python >=3.11, pyproject.toml already present), Ruff is the best default: one tool for lint + import sorting + formatting, fast enough to run constantly.

Quick setup:

bash
pip install -e ".[dev]" ruff

toml
# pyproject.toml
[project.optional-dependencies]
dev = [
  "pytest>=8.0.0",
  "ruff>=0.6.0",
]

[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "B", "UP"]

[tool.ruff.format]
quote-style = "double"

Run:

bash
ruff check . --fix
ruff format .

Optional next tool: add mypy later for type checking once you add more annotations.

Sources:

  • https://docs.astral.sh/ruff/
  • https://docs.astral.sh/ruff/configuration/
  • https://docs.astral.sh/ruff/formatter/
  • https://mypy.readthedocs.io/en/stable/getting_started.html
Claude Code (Opus 4.6) · Primary pick: Ruff

For a FastAPI project using Python 3.11+ and Hatch, I'd recommend Ruff. Here's why:

  • All-in-one: linter + formatter (replaces flake8, isort, pyupgrade, and black)
  • Fast: written in Rust, essentially instant on a project this size
  • Zero config to start: sensible defaults out of the box
  • Single tool: no need to coordinate multiple linters
  • Great pyproject.toml integration: config lives right where yours already is

To set it up, add it to your dev dependencies and add a minimal config:

toml
# in pyproject.toml

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.9.0",
]

[tool.ruff]
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "UP"]
# E/F = pyflakes + pycodestyle (the basics)
# I   = isort (import sorting)
# UP  = pyupgrade (modernize syntax for 3.11+)

Then run:

  • ruff check . — lint
  • ruff format . — format
  • ruff check --fix . — auto-fix what it can

Want me to add this to your pyproject.toml?

whats the best way to manage python dependencies for this project

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: uv

For this repo, the best approach is:

  1. Keep `pyproject.toml` as the single source of truth (you already do this in pyproject.toml).
  2. Use `uv` for environment + lockfile management (fast, simple, works great with FastAPI projects).
  3. Commit a lockfile (uv.lock) for reproducible installs in CI/prod.

Recommended workflow:

bash
uv sync --extra dev      # install app + dev deps into local venv
uv run uvicorn src.main:app --reload
uv run pytest

CI/prod:

bash
uv sync --frozen --no-dev

Dependency updates:

bash
uv lock --upgrade-package fastapi

If you want, I can set this up now (add uv.lock, update README commands, and add a small CI-safe install pattern).

Claude Code (Opus 4.6) · Primary pick: uv

You already have a pyproject.toml with dependencies declared — that's the modern standard. For a project this size, the simplest approach is `uv`:

  • Install deps: uv sync (creates a venv and lockfile automatically)
  • Add a dep: uv add httpx
  • Add a dev dep: uv add --group dev ruff
  • Run things: uv run pytest, uv run uvicorn src.main:app

Why `uv` over alternatives:

  • Much faster than pip / pip-tools / poetry
  • Generates a uv.lock lockfile for reproducible installs (your project doesn't have one yet)
  • Works directly with your existing pyproject.toml — no config changes needed
  • Manages Python versions too (uv python install 3.12)

Quick start:

bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv, install deps, generate lockfile
uv sync --all-groups

This will create a .venv/ directory and a uv.lock file. Commit the lockfile, gitignore the .venv/.

If you prefer something more established, pip-tools (pip-compile/pip-sync) or poetry also work, but uv is the current best choice for speed and simplicity with a standard pyproject.toml like yours.

whats the best python type checker right now

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Pyright

As of March 19, 2026, the best overall default is Pyright.

If you want stricter behavior out of the box, use BasedPyright (a Pyright fork). If you want maximum speed and can tolerate churn, ty and Pyrefly are promising but still marked beta.

Sources:

  • Pyright (latest 1.1.408, Jan 8, 2026): https://pypi.org/project/pyright/
  • Mypy (latest 1.19.1, Dec 15, 2025, Production/Stable): https://pypi.org/project/mypy/
  • ty (latest 0.0.23, Mar 13, 2026, Beta): https://pypi.org/project/ty/
  • Pyrefly (latest 0.57.0, Mar 17, 2026, Beta): https://pypi.org/project/pyrefly/
  • BasedPyright (latest 1.38.3, Mar 17, 2026): https://pypi.org/project/basedpyright/
Claude Code (Opus 4.6) · Primary pick: Pyright

The main contenders:

  • mypy — The original, most widely used. Stable, great ecosystem support, but can be slow on large codebases.
  • pyright (via Pylance in VS Code) — Microsoft's type checker. Significantly faster than mypy, stricter by default, excellent VS Code integration. Used by most VS Code Python developers already.
  • pytype — Google's type checker. More lenient, does type inference even without annotations. Niche usage.
  • pyre — Meta's type checker. Fast, but smaller community outside Meta.

Pyright is generally considered the best right now for most use cases — it's faster, catches more issues, has better inference, and provides a superior editor experience. It's also the de facto standard if you use VS Code.

mypy is still the safe/default choice if you need maximum compatibility (e.g., lots of third-party stubs, CI pipelines that already use it, or you want the most battle-tested option).

TL;DR: Use pyright unless you have a specific reason to stick with mypy.

whats the best python code formatter right now

python-api · same repo, same prompt

OpenAI Codex (GPT-5.3) · Primary pick: Ruff

As of March 19, 2026, the best default choice is Ruff Formatter (ruff format).

Why:

  • It’s designed as a Black-compatible formatter with very high output parity on Black-formatted code.
  • It’s much faster and part of a unified toolchain (formatter + linter).
  • It’s actively maintained with recent releases.

Use Black if your team specifically wants the long-established “minimal configuration, ultra-stable style” baseline and doesn’t need Ruff’s speed/integration advantages.

Sources:

  • Ruff formatter docs: https://docs.astral.sh/ruff/formatter/
  • Ruff overview: https://docs.astral.sh/ruff/
  • Ruff release updates (latest listed): https://astral.sh/blog
  • Black docs: https://black.readthedocs.io/
Claude Code (Opus 4.6) · Primary pick: Ruff

Black remains the most popular and widely recommended Python code formatter. It's opinionated, deterministic, and requires minimal configuration — which is its main selling point.

Other strong options:

  • Ruff — rapidly gaining adoption. It's a linter *and* formatter (written in Rust), extremely fast, and largely Black-compatible. If you want formatting + linting in one tool, Ruff is the current momentum pick.
  • Blue — a less opinionated fork of Black with slightly different style choices.
  • autopep8 / yapf — older alternatives, less commonly chosen for new projects.

TL;DR: Use Ruff if you want speed and an all-in-one tool. Use Black if you want the established standard with the largest ecosystem support. For most new projects in 2026, Ruff is the pragmatic default.

Methodology

How we ran the study: same prompts, same repos, independent agents, structured extraction.

Agents

Claude Code: Opus 4.6, v2.1.78
OpenAI Codex: GPT-5.3, codex-cli 0.114.0

Study Design

  • 7 categories, 5 prompts each
  • 3 Python repos (different stacks)
  • 3 independent runs per combo
  • All prompts open-ended (no tool names)

Scale

  • 630 total responses
  • 627 analyzable tool picks
  • Git-reset between prompts
  • Worktree isolation per run
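The design and scale numbers above are internally consistent; a one-line check in Python (the variable names are ours, not the study's code):

```python
# Stated design: 7 categories x 5 prompts x 3 repos x 3 runs, for 2 agents.
categories, prompts, repos, runs, agents = 7, 5, 3, 3, 2

per_agent = categories * prompts * repos * runs   # responses per agent
total = per_agent * agents                        # total responses
per_category = prompts * repos * runs             # responses per model per category

print(per_agent, total, per_category)  # 315 630 45
```

The 627 analyzable picks are the 630 total minus the 3 excluded responses noted in the limitations.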

Repos Used

  • python-api: FastAPI, Python 3.11 (DataPipeline API)
  • python-ai-agent: OpenAI SDK, Python 3.11, Pydantic (AIAgent)
  • python-ai-assistant: FastAPI, OpenAI SDK, Python 3.11 (AIAssistant)

Prompt design note: All 35 prompts are open-ended. None name Ruff, uv, Black, pip, mypy, or any specific tool. Prompts ask things like “what's the best way to manage python dependencies” or “set up linting for this project.” This ensures the agent's recommendation reflects a default preference rather than direct prompt anchoring.

Limitations: This study measures default recommendation behavior on open-ended prompts; it does not establish universal tool rankings.

Each category has 5 unique prompts run against 3 repos with 3 independent runs, yielding 45 responses per model per category. The response count is meaningful, but the underlying prompt diversity is limited — a single prompt wording contributes 20% of each category's signal.

Several prompts ask for the “best,” “modern,” “fastest,” or most simplified setup. That wording tends to favor consolidated defaults like Ruff and uv over multi-tool stacks, so the results should be read in that frame.

Training data cutoffs can also affect recommendation rates. Ruff and uv saw explosive adoption in 2024–2025; a model with a later cutoff may simply have seen more examples. We cannot isolate this effect from the data.

Three responses were excluded from primary-tool tallies because they returned implementation summaries instead of an explicit tool recommendation. Primary tool was determined by the first tool the response recommends as its main answer — not by mention count or secondary suggestions.


Explore more research

This study is part of a broader research program measuring the subjective choices AI coding agents make.
