Contributor Spotlight July 8th 2025: Ivan Bercovich

Terminal-Bench is only possible because of our community. As an open-source project, we rely on our contributors to help us build tasks, fix bugs, implement new features, and port other benchmarks into our harness.

With this in mind, we're going to start posting “contributor spotlights” like this to highlight outstanding Terminal-Bench community members.

About Ivan

Our first spotlight is for Ivan Bercovich. Ivan dropped out of his PhD in Financial Math at UCSB in 2010 and went on to lead the technical team at Graphiq, which was acquired by Amazon in 2017. Graphiq became the question-understanding and answering engine of Alexa, a sort of precursor to LLMs based on knowledge graphs and statistical methods.

After Amazon, Ivan started a VC firm called ScOp, but has continued his technical journey. This year, he began making small research contributions to projects at UCSB and CMU. Through discussions with other researchers, he became interested in DevOps/SRE agents, and from there landed on Terminal-Bench.

When asked why he chose to work on Terminal-Bench, Ivan said, “Terminal-Bench embraces building difficult tasks that can take hours to design, which I find refreshing. The framework is also extremely versatile, which allowed me to convert personal projects into benchmarks. I definitely got hooked!”

install-windows-xp

Ivan has contributed several challenging tasks to Terminal-Bench. The most recent is install-windows-xp: a task where the agent is asked to download and install Windows XP inside an emulator in an Ubuntu container. The task is challenging because it requires using complicated tools (QEMU, xorriso, VNC, and netstat) and web browsing to locate the appropriate ISO.
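
To give a feel for the tooling involved, here is a minimal sketch of the kind of QEMU invocations an agent might assemble for this sort of installation. It is not the task's reference solution: the file names (winxp.iso, xp.qcow2), the memory size, the disk size, and the VNC display number are all assumptions chosen for illustration, and the helper functions are hypothetical. The script only builds and prints the command lines; it does not run QEMU.

```python
from typing import List


def qemu_img_create_cmd(disk_path: str, size: str = "10G") -> List[str]:
    """Build the argv that creates an empty qcow2 disk image (size is an assumption)."""
    return ["qemu-img", "create", "-f", "qcow2", disk_path, size]


def qemu_install_cmd(
    iso_path: str,
    disk_path: str,
    memory_mb: int = 512,
    vnc_display: int = 1,
) -> List[str]:
    """Build the argv that boots a 32-bit guest from an installer ISO."""
    return [
        "qemu-system-i386",
        "-m", str(memory_mb),       # guest RAM in MiB
        "-hda", disk_path,          # hard disk image the OS is installed onto
        "-cdrom", iso_path,         # installer ISO attached as the CD-ROM drive
        "-boot", "d",               # boot from the CD-ROM first
        "-vnc", f":{vnc_display}",  # VNC display N listens on TCP port 5900 + N
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(" ".join(qemu_img_create_cmd("xp.qcow2")))
    print(" ".join(qemu_install_cmd("winxp.iso", "xp.qcow2")))
```

Even with the commands in hand, the agent still has to find a usable ISO, drive the installer through VNC, and confirm that the guest is actually running, which is where most of the difficulty lies.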

In this recording, Terminus with GPT-4.1 ultimately fails to perform the installation. It attempts to locate a public ISO, fails, and then asks the user to download the ISO for it to use. It then polls the filesystem, waiting in vain for the user to take action, before finally giving up.

This task shows that despite significant advances in LLM capabilities, agents still struggle to operate fully autonomously in demanding scenarios.

Want to help?

We're always looking for interesting new tasks to help shape the future of agentic terminal use. If you have an idea for a task, drop us a note in the Discord.