MMLU-Pro Task Design#
Overview#
MMLU-Pro evaluates language understanding across 14 professional domains with more than 12,000 multiple-choice questions. Each question has up to 10 answer choices.
Task Generation#
MMLU-Pro tasks must be generated from the HuggingFace dataset before they can be used for evaluation. We provide a tool script that downloads and converts the tasks; see tools/mmlu_pro/README.md for details.
Tool Script Usage#
Run the task conversion script to download and convert the MMLU-Pro dataset:
python tools/mmlu_pro/build_mmlu_pro_task.py
The script will:
- Download the MMLU-Pro dataset from HuggingFace (TIGER-Lab/MMLU-Pro)
- Load both the test and validation splits
- Create a directory for each task at external/MMLU-Pro/tasks/{index}/
- Save the instance data as instance.json in each task directory
The script processes the entire test set; progress is displayed via a progress bar.
Example logs:
data/test-00000-of-00001.parquet: 100%|███████████████████████████████████████████| 4.15M/4.15M [00:02<00:00, 1.55MB/s]
data/validation-00000-of-00001.parquet: 100%|█████████████████████████████████████| 42.9k/42.9k [00:01<00:00, 40.9kB/s]
Generating test split: 100%|███████████████████████████████████████████| 12032/12032 [00:00<00:00, 364824.92 examples/s]
Generating validation split: 100%|████████████████████████████████████████████| 70/70 [00:00<00:00, 59122.29 examples/s]
100%|███████████████████████████████████████████████████████████████████| 12032/12032 [00:01<00:00, 7786.72it/s]
Once all test instances have been processed, the output tasks appear in external/MMLU-Pro/tasks/.
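For reference, the conversion logic is roughly equivalent to the sketch below (assuming the HuggingFace datasets and tqdm packages are installed; the actual implementation lives in tools/mmlu_pro/build_mmlu_pro_task.py and may differ in detail):
import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm

OUTPUT_ROOT = Path("external/MMLU-Pro/tasks")

# Download both splits of the dataset from HuggingFace.
dataset = load_dataset("TIGER-Lab/MMLU-Pro")

# Write one task directory per test instance, indexed by position.
for index, instance in enumerate(tqdm(dataset["test"])):
    task_dir = OUTPUT_ROOT / str(index)
    task_dir.mkdir(parents=True, exist_ok=True)
    with open(task_dir / "instance.json", "w") as f:
        json.dump(instance, f, indent=2)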
Task Format#
Each MMLU-Pro task represents a single multiple-choice question. Tasks are stored locally in the repository under external/MMLU-Pro/tasks/.
Task Structure#
Task Name: Integer index as string (e.g., “0”, “1”, “42”)
Directory Layout:
external/MMLU-Pro/tasks/
├── 0/instance.json
├── 1/instance.json
├── 2/instance.json
...
└── N/instance.json # Where N is the index of the last test instance (total count minus one)
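To sanity-check the generated layout, the task directories can be enumerated in numeric order. This snippet is illustrative only and not part of the framework:
from pathlib import Path

tasks_root = Path("external/MMLU-Pro/tasks")
# Task names are integer indices stored as directory names ("0", "1", ...).
task_names = sorted((p.name for p in tasks_root.iterdir() if p.is_dir()), key=int)
print(f"{len(task_names)} tasks; first: {task_names[0]}, last: {task_names[-1]}")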
Data Format#
Inside instance.json, each instance is formatted as:
{
"question_id":70,
"question":"Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.",
"options":[
"Safe practices, Fear, Jealousy, Trivial",
"Unsafe practices, Distress, Joy, Trivial",
"Safe practices, Wants, Jealousy, Trivial",
"Safe practices, Distress, Fear, Trivial",
"Unsafe practices, Wants, Jealousy, Serious",
"Safe practices, Distress, Jealousy, Serious",
"Safe practices, Wants, Fear, Serious",
"Unsafe practices, Wants, Fear, Trivial",
"Unsafe practices, Distress, Fear, Serious"
],
"answer":"I",
"answer_index":8,
"cot_content":"",
"category":"business",
"src":"ori_mmlu-business_ethics"
}
The task data is loaded from the local instance.json file at external/MMLU-Pro/tasks/{task_name}/instance.json. Note that the question_id field is not the same as the task name (the task directory index).
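A minimal loading sketch, assuming the directory layout above (load_instance is a hypothetical helper for illustration, not the framework's API):
import json
from pathlib import Path

def load_instance(task_name: str, root: str = "external/MMLU-Pro/tasks") -> dict:
    """Read instance.json for the given task directory."""
    with open(Path(root) / task_name / "instance.json") as f:
        return json.load(f)

instance = load_instance("0")
# "answer" is the ground-truth letter; "answer_index" is its position in "options".
print(instance["answer"], instance["options"][instance["answer_index"]])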
Task Implementation & Configuration#
- Task Type: autopilot/evaluation/tasks/mmlu_pro.py
- Dataset Loading: loaded from local JSON files at external/MMLU-Pro/tasks/{task_name}/instance.json
- Prompt Template: autopilot/prompts/mmlu_pro_prompt.py
Docker Environment#
Docker Files Location: tools/mmlu_pro/
├── build_image/Dockerfile   # Base image (`mmlupro.common:latest`)
├── Dockerfile               # Task-specific image extending base
├── docker-compose.yaml      # Container orchestration (patched at runtime to use bridge network)
└── test_script.sh           # Evaluation script that reads `/testbed/answer.txt`
Container Details:
- Shared Image: all tasks use the same minimal environment built from the same Dockerfile (see tools/mmlu_pro/Dockerfile)
- Container Name Format: mmlu_pro_{session_name}, where session_name is composed of {task_name}-{timestamp}-{session-specific-random-hex-string}
- Working Directory: /testbed/ (where the agent writes its answer)
- Eval Script Path: copied to /tmp/tasks/shared_scripts/test_script.sh in the container
- Task Directory: /tmp/tasks/{task_name}/ in the container
- Image Build: built automatically on first use via the launch_container() method
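A sketch of the naming scheme described above; the timestamp format and hex length here are illustrative assumptions:
import secrets
import time

def make_container_name(task_name: str) -> str:
    timestamp = time.strftime("%Y%m%d-%H%M%S")   # exact format is an assumption
    random_hex = secrets.token_hex(4)             # session-specific random hex string
    session_name = f"{task_name}-{timestamp}-{random_hex}"
    return f"mmlu_pro_{session_name}"

print(make_container_name("42"))   # e.g. mmlu_pro_42-20250101-120000-a1b2c3d4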
Prompt Format#
Questions are formatted with the prompt template defined in autopilot/prompts/mmlu_pro_prompt.py.
When constructing the input prompt for a task, the answer choices are formatted as lettered options:
A. Option 1
B. Option 2
C. Option 3
...
J. Option 10
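A sketch of the option formatting; the real template lives in autopilot/prompts/mmlu_pro_prompt.py and this helper is purely illustrative:
import string

def format_options(options: list[str]) -> str:
    # MMLU-Pro questions have up to 10 choices, so letters run A..J.
    letters = string.ascii_uppercase
    return "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))

print(format_options(["Option 1", "Option 2", "Option 3"]))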
Environment Variables#
The following environment variables are configured for each task container:
- MMLU_PRO_TASK_BUILD_CONTEXT_DIR: points to tools/mmlu_pro/
- MMLU_PRO_TASK_DOCKER_CLIENT_IMAGE_NAME: set to "mmlu_pro"
- MMLU_PRO_TASK_DOCKER_NAME_PREFIX: container name
- MMLU_PRO_TEST_DIR: path to test files
- MMLU_PRO_TASK_LOGS_PATH / MMLU_PRO_CONTAINER_LOGS_PATH: /logs
- MMLU_PRO_TASK_DOCKER_CLIENT_CONTAINER_NAME: container name
- TEST_DIR: /tmp/tasks/{task_name}/tests
- TASK_DIR: /tmp/tasks/{task_name}
- COMPOSE_PROJECT_NAME: container name
- COMPOSE_DOCKER_CLI_BUILD / DOCKER_BUILDKIT: Docker build settings
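Assembled as a plain dictionary, the environment looks roughly like the following sketch; values marked as illustrative are assumptions, and the exact wiring is done by the task implementation:
def build_container_env(task_name: str, container_name: str) -> dict:
    return {
        "MMLU_PRO_TASK_BUILD_CONTEXT_DIR": "tools/mmlu_pro/",
        "MMLU_PRO_TASK_DOCKER_CLIENT_IMAGE_NAME": "mmlu_pro",
        "MMLU_PRO_TASK_DOCKER_NAME_PREFIX": container_name,
        "MMLU_PRO_TEST_DIR": f"/tmp/tasks/{task_name}/tests",   # path to test files (value illustrative)
        "MMLU_PRO_TASK_LOGS_PATH": "/logs",
        "MMLU_PRO_CONTAINER_LOGS_PATH": "/logs",
        "MMLU_PRO_TASK_DOCKER_CLIENT_CONTAINER_NAME": container_name,
        "TEST_DIR": f"/tmp/tasks/{task_name}/tests",
        "TASK_DIR": f"/tmp/tasks/{task_name}",
        "COMPOSE_PROJECT_NAME": container_name,
        "COMPOSE_DOCKER_CLI_BUILD": "1",                         # Docker build settings (illustrative values)
        "DOCKER_BUILDKIT": "1",
    }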
Evaluation Success Criteria#
The success criterion is string-matched evaluation (string-match):
- The agent writes a single letter to /testbed/answer.txt
- The evaluation script (test_script.sh) reads the file content using cat /testbed/answer.txt
- The selected answer choice is compared against the corresponding ground-truth answer letter
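A sketch of the comparison, assuming the ground-truth letter comes from the instance's "answer" field; the actual check is performed by test_script.sh:
def is_correct(answer_file_content: str, ground_truth_letter: str) -> bool:
    # The agent writes a single letter such as "I"; normalize before comparing.
    return answer_file_content.strip().upper() == ground_truth_letter.strip().upper()

print(is_correct("I\n", "I"))   # True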
References#
Official Repository: github.com/TIGER-AI-Lab/MMLU-Pro