First Steps
Running the benchmark and testing your own agent.
Run the benchmark
First, set the API key for each model provider you plan to use. LiteLLM reads these keys from environment variables, so refer to the LiteLLM documentation to find which variable corresponds to your model.
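As a quick sanity check before a run, you can confirm that the relevant key is exported. This is a minimal sketch: check_key is a hypothetical helper, not part of the Terminal-Bench CLI, and ANTHROPIC_API_KEY is used on the assumption that you plan to run an Anthropic model.

```shell
# Hypothetical helper (not part of Terminal-Bench): report whether an
# environment variable is exported and non-empty.
check_key() {
    # $1 is the name of the environment variable to check
    if [ -z "$(printenv "$1")" ]; then
        echo "$1 is not set"
        return 1
    fi
    echo "$1 is set"
}

# Assumption: an Anthropic model is being used, so LiteLLM looks for ANTHROPIC_API_KEY.
check_key ANTHROPIC_API_KEY || echo "Export the key before running tb"
```

Swap in the variable name that LiteLLM documents for your provider if you are using a different model.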
The Terminal-Bench CLI can run many different benchmarks and versions of benchmarks.
To see the available benchmarks, run tb datasets list.
For example, to run the latest, pre-release terminal-bench-core dataset with the Terminus agent and Claude Sonnet 4:
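A minimal sketch of such an invocation follows. Only the --agent flag is confirmed by this page. The --dataset and --model flag names, the name==version dataset syntax, the use of head to select the latest pre-release version, and the terminus agent identifier are all assumptions that you should verify with tb run --help and tb datasets list.

```shell
# Assumed values; confirm flag names with `tb run --help`
# and dataset names with `tb datasets list`.
DATASET="terminal-bench-core==head"          # "head" is assumed to mean the latest pre-release version
AGENT="terminus"                             # assumed identifier for the Terminus agent
MODEL="anthropic/claude-sonnet-4-20250514"   # LiteLLM-style name for Claude Sonnet 4

# Dry run: print the command for review.
# Remove the leading `echo` to actually start the benchmark.
echo tb run --dataset "$DATASET" --agent "$AGENT" --model "$MODEL"
```

Printing the command first makes it easy to compare the flags against the help output before starting a long benchmark run.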
Run tb run --help to see the available commands and options.
Testing a custom agent
Terminal-Bench comes with many popular agents available out of the box. Run tb run --help to see the available agents.
To test a custom agent, implement the BaseAgent interface. If your agent can be installed as a package (e.g. Claude Code or Codex), you can implement the AbstractInstalledAgent interface instead.
Once you have implemented your agent, run it with the CLI by passing the --agent-import-path flag in place of the --agent flag.
For example:
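The sketch below shows what a custom-agent run might look like. The module path my_agents.custom and the class name MyAgent are hypothetical placeholders, the module.path:ClassName format for the import path is an assumption, and the --dataset flag and its value are likewise assumed. Check tb run --help for the exact syntax.

```shell
# Hypothetical module and class; replace with the import path of your own agent.
# The "module.path:ClassName" format is an assumption.
AGENT_IMPORT_PATH="my_agents.custom:MyAgent"
DATASET="terminal-bench-core==head"          # assumed dataset syntax; see `tb datasets list`

# Dry run: print the command for review.
# Remove the leading `echo` to actually start the benchmark.
# --agent-import-path is passed in place of --agent.
echo tb run --dataset "$DATASET" --agent-import-path "$AGENT_IMPORT_PATH"
```

The module must be importable from the environment in which you run tb, for example by running the command from the directory that contains the package.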
Getting help
Please reach out to us on Discord if you have any questions or feedback.