TestSprite launches public AI coding agent contest with open referee

5 hours ago

TestSprite has launched CoderCup, a public competition that had leading AI coding agents build the same app under the same clock and scored them with an open source verifier. The first event is complete, and the results are meant to show how quality, regressions and recovery matter more than raw speed in real software work.

Why it matters:
- CoderCup is designed to measure how AI coding agents actually perform in software development, not just how many tests they pass.
- The competition highlights quality, consistency and regressions, which matter when agents run unsupervised.
- TestSprite says the format shows companies can use a verification layer to ship higher-quality software without defaulting to the most expensive model.

What happened:
- TestSprite launched CoderCup, a public competition where AI coding agents built and deployed the same web app under one clock.
- The first event included frontier agents such as Anthropic’s Claude Code, OpenAI Codex and Google’s Antigravity.
- TestSprite published full results and per-phase scores at codercup.ai.
- The first event is complete, and all phases have been scored.

The details:
- TestSprite’s open source CLI served as the neutral referee.
- The test suite includes hundreds of tests, is open source and accepts community pull requests.
- Each verdict links to the evidence behind the score.
- Anyone can clone the competition and rerun a phase to check the result.
- Each agent received the same brief: build and ship a deployable web app across 10 phases.
- Each phase had a shared time budget of about 60 minutes.
- At the end of each phase, the agent filed a “ready for scoring” review.
- TestSprite scored the deployed app’s live URL and ran that phase’s test plans against it.
- A passing test on the agent’s own machine counted for nothing.
- No agent graded its own work, and no score was hidden.
- CoderCup’s headline composites clustered tightly, from 0.85 to 0.79.
- The ranking is quality-only; cost and speed were tracked but excluded.
- Correctness was not the biggest factor in the score.
- Getting a feature right the first time, and not breaking what already worked, carried more weight than raw pass rate.
- Claude Code won on consistency.
- Codex and Antigravity were the quickest to finish each phase, with cumulative times under 100 minutes.
- Kimi was the slowest, at around 350 minutes.
- Kimi also posted the highest correctness score in the field at 0.89 and the lowest total cost.
- Regressions ranged from 31 to 57 across the field.
- Before scoring, TestSprite recorded each agent’s setup, model version, CLI version, allowed tools and time budget in a machine-readable manifest.
- After scoring, the transcript, deployed app and verdict were published.
- Clicking any leaderboard number leads to the supporting evidence.
- The open source test suite can be cloned and rerun by anyone.
- One phase required every agent to build the app’s predictions feature.
- Every agent predicted Mexico would beat South Africa in the opener.
- Two agents picked the exact same 2-0 scoreline.
- The tournament matches will settle those predictions over the summer.
- The live leaderboard, deployed applications, Code Sheets and per-phase scores are available at codercup.ai.
- The competition repository is available on GitHub.
- TestSprite also pointed to a companion post announcing the CLI release.
- TestSprite released its CLI today under the Apache 2.0 license.
- The CLI is the same verification engine that powers CoderCup.
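
The article does not say how CoderCup counts a regression or how first-try success is measured, so the following is only a minimal sketch of the idea under stated assumptions. It assumes a regression is a test that passed the last time it ran and fails in a later phase, and that first-try success is the share of tests that passed on their first run. The names `PhaseResult`, `count_regressions`, `first_try_rate` and `pass_rate`, along with the sample test IDs, are hypothetical and are not part of the TestSprite CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseResult:
    """Hypothetical record of one scored phase: test id -> passed on the live URL."""
    phase: int
    outcomes: dict[str, bool]


def count_regressions(phases: list[PhaseResult]) -> int:
    """Count tests that passed the last time they ran and fail in a later phase.

    Assumption: this is one plausible definition of a regression, not
    necessarily the one CoderCup uses.
    """
    regressions = 0
    last_seen: dict[str, bool] = {}
    for result in sorted(phases, key=lambda r: r.phase):
        for test_id, passed in result.outcomes.items():
            if last_seen.get(test_id) is True and not passed:
                regressions += 1
            last_seen[test_id] = passed
    return regressions


def first_try_rate(phases: list[PhaseResult]) -> float:
    """Share of tests that passed the first time they were run."""
    first: dict[str, bool] = {}
    for result in sorted(phases, key=lambda r: r.phase):
        for test_id, passed in result.outcomes.items():
            first.setdefault(test_id, passed)  # keep only the first outcome
    return sum(first.values()) / len(first) if first else 0.0


def pass_rate(phases: list[PhaseResult]) -> float:
    """Raw pass rate over every test run in every phase."""
    runs = [passed for r in phases for passed in r.outcomes.values()]
    return sum(runs) / len(runs) if runs else 0.0


# Made-up sample data for illustration only; not CoderCup results.
example = [
    PhaseResult(1, {"login": True, "signup": False}),
    PhaseResult(2, {"login": False, "signup": True, "predict": True}),
    PhaseResult(3, {"login": True, "signup": True, "predict": False}),
]

if __name__ == "__main__":
    print("regressions:", count_regressions(example))
    print("first-try rate:", round(first_try_rate(example), 3))
    print("raw pass rate:", round(pass_rate(example), 3))
```

Keeping the three measures separate mirrors the article's point that an agent can have a high raw pass rate while still breaking features that worked earlier. How CoderCup combines these measures into a composite is not described in the source.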

Between the lines:
- The competition is as much about trust as performance.
- TestSprite is using open evidence, open tests and replayable results to make AI agent benchmarks harder to game.
- The results suggest faster agents are not always better agents, and cheaper models can win on quality when verification is strict.
- The soccer framing is a marketing hook, but the underlying point is software reliability.

What’s next:
- The agents’ tournament predictions will be resolved over the summer.
- TestSprite plans a final prediction-accuracy wrap-up after the tournament’s closing match.
- TestSprite says CoderCup’s repository, test plans and results remain open for review today.
- The company’s broader push centers on the open source CLI and its verification workflow for AI coding agents and human engineers alike.

Disclaimer: This article was produced by AGP Wire with the assistance of artificial intelligence based on original source content and has been refined to improve clarity, structure, and readability. This content is provided on an “as is” basis. While care has been taken in its preparation, it may contain inaccuracies or omissions, and readers should consult the original source and independently verify key information where appropriate. This content is for informational purposes only and does not constitute legal, financial, investment, or other professional advice.
