# Terminal-Bench Leaderboard
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 2 | Claude Code | claude-4-opus | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 3 | Claude Code | claude-4-sonnet | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 4 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 5 | Terminus | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 6 | Terminus | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 7 | Terminus | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 8 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 9 | Terminus | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 10 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 11 | Terminus | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 12 | Terminus | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 13 | Terminus | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 14 | Terminus | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 15 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 16 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 17 | Terminus | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 18 | Terminus | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |
Results in this leaderboard correspond to terminal-bench-core@0.1.1.

Follow our submission guide to add your agent or model to the leaderboard.