Evaluation#

Overview#

We have five types of evaluation tasks:

  • Software Engineering: SWE-bench Verified, SWE-rebench (WIP)

  • Terminal: Terminal-Bench

  • Context Management: LoCoMo

  • General LLM: MMLU-Pro, GSM8K

  • Agent: Tau-2 (WIP)

Build SWE-bench Verified Tasks#

This script converts the original SWE-bench Verified instances from the Hugging Face dataset into an agent-readable task format. The generated tasks are written under external/swebench-verified/tasks.

python tools/swegym/build_swe_tasks.py --swebench
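
For orientation, here is a minimal sketch of the conversion this script performs, assuming the public princeton-nlp/SWE-bench_Verified dataset on Hugging Face; the output layout is illustrative and not necessarily the exact format build_swe_tasks.py emits.

from pathlib import Path

from datasets import load_dataset

# Load the upstream dataset (assumption: princeton-nlp/SWE-bench_Verified).
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

out_root = Path("external/swebench-verified/tasks")
for inst in ds:
    # One task directory per instance, e.g. astropy__astropy-12907.
    task_dir = out_root / inst["instance_id"]
    task_dir.mkdir(parents=True, exist_ok=True)
    # The issue text becomes the agent-readable task statement
    # (the file name task.md is illustrative).
    (task_dir / "task.md").write_text(inst["problem_statement"])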

Benchmark#

Some examples of how to evaluate the different benchmarks are shown below. Each command pairs a benchmark with an interaction mode (executive_only, naive, or interactive) and an evaluation criterion (rule-based, classified-pytest, string-match, llm-judge, or naive).

  • Terminal-Bench

# single-eval
autopilot evaluate \
--benchmark terminal_bench \
--task hello-world \
--interaction executive_only \
--eval-criteria rule-based \
--no-log-to-github --no-log-to-mongodb

# batch-eval
autopilot evaluate \
--parallel 16 \
--benchmark terminal_bench \
--interaction executive_only \
--eval-criteria rule-based \
--no-log-to-github --no-log-to-mongodb

  • SWE-bench Verified

# single-eval
autopilot evaluate \
--benchmark swebench_verified \
--task astropy__astropy-12907 \
--interaction executive_only \
--eval-criteria classified-pytest \
--no-log-to-github --no-log-to-mongodb

# batch-eval
autopilot evaluate \
--parallel 16 \
--benchmark swebench_verified \
--interaction executive_only \
--eval-criteria classified-pytest \
--no-log-to-github --no-log-to-mongodb

  • MMLU-Pro

# single-eval
autopilot evaluate \
--benchmark mmlu_pro \
--task 1 \
--interaction executive_only \
--eval-criteria string-match \
--no-log-to-github --no-log-to-mongodb

# single-eval with naive interaction
autopilot evaluate \
--benchmark mmlu_pro \
--task 1 \
--interaction naive \
--cache-level naive \
--eval-criteria naive --no-log-to-github --no-log-to-mongodb

# batch-eval
autopilot evaluate \
--benchmark mmlu_pro \
--parallel 20 \
--interaction executive_only \
--eval-criteria string-match \
--no-log-to-github --no-log-to-mongodb

# batch-eval with naive interaction
autopilot evaluate \
--benchmark mmlu_pro \
--interaction naive \
--parallel 20 \
--eval-criteria naive \
--cache-level naive \
--no-log-to-github --no-log-to-mongodb

  • LoCoMo

# single-eval
autopilot evaluate \
--benchmark locomo \
--task 0_1 \
--interaction executive_only \
--eval-criteria llm-judge \
--no-log-to-github --no-log-to-mongodb

# batch-eval
autopilot evaluate \
--benchmark locomo \
--parallel 16 \
--interaction executive_only \
--eval-criteria llm-judge \
--no-log-to-github --no-log-to-mongodb

  • GSM8K

# single-eval
autopilot evaluate \
--benchmark gsm8k \
--task 1 \
--interaction executive_only \
--eval-criteria string-match \
--no-log-to-github --no-log-to-mongodb

# single-eval with naive interaction
autopilot evaluate \
--benchmark gsm8k \
--task 1 \
--interaction naive \
--eval-criteria naive \
--cache-level naive \
--no-log-to-github --no-log-to-mongodb

# batch-eval
autopilot evaluate \
--benchmark gsm8k \
--parallel 20 \
--interaction executive_only \
--eval-criteria string-match \
--no-log-to-github --no-log-to-mongodb

# batch-eval with naive interaction
autopilot evaluate \
--benchmark gsm8k \
--interaction naive \
--parallel 20 \
--eval-criteria naive \
--cache-level naive \
--no-log-to-github --no-log-to-mongodb

Advanced Features#

We also provide some advanced features for more flexible evaluation.

Batch Eval with Parallel Models#

Passing --model more than once evaluates each listed model on the same benchmark in a single batch run.

# batch-eval comparing two models
autopilot evaluate \
--benchmark terminal_bench \
--parallel 16 \
--model "devstral" \
--model "devstral-sft-0904" \
--interaction executive_only \
--eval-criteria rule-based \
--no-log-to-github --no-log-to-mongodb

Reloading Tasks#

A previously recorded run can be reloaded by passing its run identifier to --reload.

# reload a recorded run with interactive interaction
autopilot evaluate \
--benchmark swe_bench \
--task sglang-grep-tree-1 \
--reload sglang-grep-tree-1-20250707-144905-306e:1:5 \
--interaction interactive \
--no-log-to-github --no-log-to-mongodb

Single Eval on Reload Task#

# single-eval on a reloaded task
autopilot evaluate \
--benchmark swe_bench \
--task sglang-grep-tree-1 \
--reload sglang-grep-tree-1-20250707-144905-306e:1:5 \
--reload-eval-path external/reload-config/grep-tree-reload-eval.yml \
--interaction executive_only \
--eval-criteria classified-pytest \
--no-log-to-github --no-log-to-mongodb

where external/reload-config/grep-tree-reload-eval.yml stores the evaluation configuration for the reloaded run.
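
For reference, a hypothetical sketch of what this config might contain (the actual schema may differ); each entry pairs a task with the reload identifier used on the command line:

# grep-tree-reload-eval.yml (hypothetical sketch; the actual schema may differ)
tasks:
  - name: sglang-grep-tree-1
    reload: sglang-grep-tree-1-20250707-144905-306e:1:5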

Batch Eval on Reload Tasks#

# batch-eval over all reload tasks in the config
autopilot evaluate \
--benchmark swe_bench \
--reload-eval-path external/reload-config/grep-tree-reload-eval.yml \
--interaction executive_only \
--eval-criteria classified-pytest \
--no-log-to-github --no-log-to-mongodb

In this batch mode, all reload tasks listed in external/reload-config/grep-tree-reload-eval.yml will be evaluated.