LoCoMo Task Design#
Overview#
LoCoMo (Long Conversation Memory) evaluates long-term conversational memory across extended multi-session dialogues. Tasks test cross-session information retrieval and temporal reasoning.
Data Loading#
LoCoMo tasks are loaded from a pre-existing local dataset file. No separate generation step is required.
Dataset Location: external/locomo/data/locomo10.json
The dataset is loaded automatically when initializing any LoCoMo task instance via LocomoTask.load_locomo().
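As a rough illustration of what loading amounts to, the sketch below reads the dataset file and returns its list of conversations. It is not the actual LocomoTask.load_locomo() implementation; the helper name load_conversations and the type check it performs are assumptions made for this example.

```python
import json
from pathlib import Path

# Path taken from this page; adjust it if your checkout differs.
DEFAULT_DATASET_PATH = Path("external/locomo/data/locomo10.json")


def load_conversations(path=DEFAULT_DATASET_PATH):
    """Hypothetical helper: read the LoCoMo dataset file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # The dataset is documented as a JSON array of conversation objects.
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data
```

Calling load_conversations() with no argument reads the documented default location relative to the current working directory.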
Task Format#
Task Structure#
Task Name Format: <conversation_id>_<question_id>
Both IDs are zero-based numeric indices
Examples:
0_0, 0_1, 5_3, 9_12
Format: Conversation index (0-9) + underscore + question index within that conversation
LocomoTask.task_names() generates the task names
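The naming scheme can be sketched with a few small helpers. These functions are hypothetical and written only to illustrate the <conversation_id>_<question_id> format; they are not the body of LocomoTask.task_names().

```python
def make_task_name(conversation_idx, question_idx):
    """Build a task name such as "5_3" from two zero-based indices."""
    return f"{conversation_idx}_{question_idx}"


def parse_task_name(task_name):
    """Split a task name back into (conversation_idx, question_idx)."""
    conv, question = task_name.split("_")
    return int(conv), int(question)


def all_task_names(dataset):
    """Enumerate one name per QA pair, in dataset order."""
    return [
        make_task_name(c, q)
        for c, entry in enumerate(dataset)
        for q in range(len(entry["qa"]))
    ]
```

For instance, parse_task_name("9_12") yields conversation index 9 and question index 12.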
Data Format#
The dataset is a JSON array where each element represents one multi-turn conversation with multiple QA pairs. For example:
{
  "sample_id": "conv-26",
  "qa": [
    {
      "question": "When did Caroline go to the LGBTQ support group?",
      "answer": "7 May 2023",
      "evidence": [
        "D1:3"
      ],
      "category": 2
    },
    // ...
  ],
  "conversation": {
    "speaker_a": "Caroline",
    "speaker_b": "Melanie",
    "session_1_date_time": "1:56 pm on 8 May, 2023",
    "session_1": [
      {
        "speaker": "Caroline",
        "dia_id": "D1:1",
        "text": "Hey Mel! Good to see you! How have you been?"
      },
      {
        "speaker": "Melanie",
        "dia_id": "D1:2",
        "text": "Hey Caroline! Good to see you! I'm swamped with the kids & work. What's up with you? Anything new?"
      },
      // ...
    ],
    "session_2_date_time": "1:14 pm on 25 May, 2023",
    "session_2": [
      {
        "speaker": "Melanie",
        "dia_id": "D2:1",
        "text": "Hey Caroline, since we last chatted, I've had a lot of things happening to me. I ran a charity race for mental health last Saturday \u2013 it was really rewarding. Really made me think about taking care of our minds."
      },
      // ...
    ],
  },
  "observation": {
    // ...
  },
  "session_summary": {
    // ...
  }
}
The two key fields of each entry are conversation and qa.
conversation: Contains the sessions (session_<num>) and their timestamps (session_<num>_date_time). The numbers <num> give the chronological order of the sessions.
  speaker_a, speaker_b: Names of the two conversation participants
  session_N: Array of chat messages for session N (N starts from 1). Each turn contains:
    speaker: Name of the speaker
    dia_id: Dialog ID (format: “D<session>:<turn>”, e.g., “D1:3”)
    text: Content of the dialog
  session_N_date_time: Timestamp string (format: “HH:MM am/pm on DD Month, YYYY”, e.g., “1:56 pm on 8 May, 2023”)
qa: Array of question-answer pairs about this conversation
  question: The question to answer
  answer: Ground truth answer
  evidence: References to relevant parts (e.g., [“D1:3”] refers to session 1, turn 3)
  category: Question category type
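To make the link between evidence and dia_id concrete, the following sketch resolves each evidence reference to the dialog turn it points at. The helper names are hypothetical, and the code assumes every evidence entry is a single well-formed ID such as "D1:3"; real data may need more lenient parsing.

```python
import re

_DIA_ID = re.compile(r"^D(\d+):(\d+)$")


def parse_dia_id(dia_id):
    """Split "D1:3" into (session_number, turn_number)."""
    match = _DIA_ID.match(dia_id)
    if match is None:
        raise ValueError(f"unrecognized dialog ID: {dia_id!r}")
    return int(match.group(1)), int(match.group(2))


def find_evidence_turns(entry):
    """Map each QA question to the dialog turns its evidence IDs point at."""
    conversation = entry["conversation"]
    resolved = {}
    for qa in entry["qa"]:
        turns = []
        for dia_id in qa.get("evidence", []):
            session_num, _ = parse_dia_id(dia_id)
            session = conversation.get(f"session_{session_num}", [])
            # Match on dia_id rather than list position, in case turns are missing.
            turns.extend(t for t in session if t.get("dia_id") == dia_id)
        resolved[qa["question"]] = turns
    return resolved
```

Looking turns up by their dia_id instead of by list index keeps the lookup correct even if a session's turn list is not contiguous.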
Task Implementation & Configuration#
Task type: autopilot/evaluation/tasks/locomo.py
Data Loading: load_locomo() reads locomo10.json at class initialization
Prompt Template: autopilot/prompts.py
Docker Environment#
Docker Files Location: tools/locomo/
build_image/Dockerfile
Dockerfile # Task-specific image extending base
docker-compose.yaml # Container orchestration (patched at runtime to use bridge network)
test_script.sh # Evaluation script
Container Details:
Shared Image: All tasks use the same minimal environment built from the same Dockerfile (see tools/locomo/Dockerfile)
Container Name Format: locomo_{session_name}, where session_name is composed of {task_name}-{timestamp}-{session-specific-random-hex-string}
Working Directory: /testbed/ (where conversation.txt is written)
Eval Script Path: Copied to /tmp/tasks/shared_scripts/test_script.sh in the container
LLM Judge Path: /tmp/tasks/shared_scripts/llm_judge.py
Image Build: Automatically built on first use via the launch_container() method
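The container naming scheme can be illustrated as follows. This is only a sketch: the helper names are hypothetical, and both the timestamp representation (integer Unix seconds) and the length of the random hex suffix are assumptions, since this page does not specify them.

```python
import secrets
import time


def make_session_name(task_name, timestamp=None, suffix=None):
    """Compose {task_name}-{timestamp}-{random hex string}."""
    if timestamp is None:
        timestamp = int(time.time())  # assumption: integer Unix seconds
    if suffix is None:
        suffix = secrets.token_hex(4)  # assumption: 8 hex characters
    return f"{task_name}-{timestamp}-{suffix}"


def make_container_name(session_name):
    """Prefix the session name as described under Container Name Format."""
    return f"locomo_{session_name}"
```

With explicit arguments, make_container_name(make_session_name("0_1", 1700000000, "ab12cd34")) gives locomo_0_1-1700000000-ab12cd34.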
File Locations in Container:
Processed conversation: /testbed/conversation.txt (written by LocomoTask.write_conversation())
Agent’s answer: /testbed/answer.txt
Gold answer: /tmp/tasks/answer.txt
Prompt Format#
The conversation is processed chronologically by session timestamp using LocomoTask.process_conversation, then formatted with the prompt template in autopilot/prompts/locomo_prompt.py
Processing Steps:
Extract all sessions from conversation data
Sort sessions by timestamp (converted to sortable format)
Format each session with timestamp header
Messages shown as {speaker_name}: {text}
Sessions separated by double newlines
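The processing steps above can be sketched as a single function. This is an illustration under stated assumptions, not the code of LocomoTask.process_conversation: the function name, the strptime pattern, and the use of the raw timestamp string as the session header are all choices made for this example.

```python
import re
from datetime import datetime

# Assumed pattern for strings such as "1:56 pm on 8 May, 2023".
TIMESTAMP_FORMAT = "%I:%M %p on %d %B, %Y"
_SESSION_KEY = re.compile(r"^session_(\d+)$")


def format_conversation(conversation):
    """Return the sessions as text, ordered by their timestamps."""
    sessions = []
    for key, turns in conversation.items():
        # Keep only "session_<num>" keys; skip speakers and *_date_time keys.
        if _SESSION_KEY.match(key) is None:
            continue
        raw_time = conversation[f"{key}_date_time"]
        sort_key = datetime.strptime(raw_time, TIMESTAMP_FORMAT)
        sessions.append((sort_key, raw_time, turns))
    sessions.sort(key=lambda item: item[0])

    blocks = []
    for _, raw_time, turns in sessions:
        lines = [raw_time]  # assumption: the header is the timestamp itself
        lines.extend(f"{turn['speaker']}: {turn['text']}" for turn in turns)
        blocks.append("\n".join(lines))
    # Sessions are separated by double newlines.
    return "\n\n".join(blocks)
```

Sorting on the parsed datetime rather than on the raw string matters because the textual form ("1:14 pm on 25 May, 2023") does not sort chronologically.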
Environment Variables#
The following environment variables are configured for each task container:
LOCOMO_TASK_BUILD_CONTEXT_DIR: Points to tools/locomo/
LOCOMO_TASK_DOCKER_CLIENT_IMAGE_NAME: Set to "locomo"
LOCOMO_TASK_DOCKER_NAME_PREFIX: Container name
LOCOMO_TEST_DIR: Path to test files
LOCOMO_TASK_LOGS_PATH / LOCOMO_CONTAINER_LOGS_PATH: /logs
LOCOMO_TASK_DOCKER_CLIENT_CONTAINER_NAME: Container name
TEST_DIR: /tmp/tasks/{task_name}/tests
TASK_DIR: /tmp/tasks/{task_name}
COMPOSE_PROJECT_NAME: Container name
COMPOSE_DOCKER_CLI_BUILD / DOCKER_BUILDKIT: Docker build settings
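For illustration, the variables whose values this page spells out could be assembled as below. The function build_task_env is hypothetical, and the build context is shown as the relative path from this page even though the real value may be absolute.

```python
def build_task_env(task_name, container_name):
    """Collect the documented variables whose values this page states."""
    return {
        # Assumption: relative path as written here; may be absolute in practice.
        "LOCOMO_TASK_BUILD_CONTEXT_DIR": "tools/locomo/",
        "LOCOMO_TASK_DOCKER_CLIENT_IMAGE_NAME": "locomo",
        "LOCOMO_TASK_DOCKER_NAME_PREFIX": container_name,
        "LOCOMO_TASK_DOCKER_CLIENT_CONTAINER_NAME": container_name,
        "LOCOMO_TASK_LOGS_PATH": "/logs",
        "LOCOMO_CONTAINER_LOGS_PATH": "/logs",
        "TEST_DIR": f"/tmp/tasks/{task_name}/tests",
        "TASK_DIR": f"/tmp/tasks/{task_name}",
        "COMPOSE_PROJECT_NAME": container_name,
        # LOCOMO_TEST_DIR, COMPOSE_DOCKER_CLI_BUILD and DOCKER_BUILDKIT are
        # omitted because this page does not state their values.
    }
```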
Evaluation Success Criteria#
Success is judged by an LLM (llm-judge) with the following flow:
Agent receives the full conversation history + question via LOCOMO_PROMPT
Agent generates an answer and writes it to /testbed/answer.txt
Evaluation script (test_script.sh) invokes the LLM judge to compare answers
LLM judge compares the agent's answer to the ground truth answer and provides a conclusion (e.g., CORRECT or WRONG)
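The judging step can be sketched as follows. Nothing here is taken from llm_judge.py or test_script.sh: the prompt wording, the function names, and the ask_llm callable (any function that takes a prompt string and returns the model's reply) are assumptions made to show the shape of the flow.

```python
from pathlib import Path


def build_judge_prompt(question, gold_answer, agent_answer):
    """Hypothetical prompt; the real judge's wording is not shown on this page."""
    return (
        "Decide whether the candidate answer matches the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {agent_answer}\n"
        "Reply with exactly one word: CORRECT or WRONG."
    )


def parse_verdict(judge_reply):
    """Return True only when the reply starts with CORRECT."""
    words = judge_reply.strip().upper().split()
    return bool(words) and words[0].strip(".,:;!") == "CORRECT"


def evaluate(question, agent_path, gold_path, ask_llm):
    """Compare the agent's answer file with the gold answer via the judge."""
    agent_file = Path(agent_path)
    if not agent_file.exists():
        return False  # the agent never wrote an answer
    agent_answer = agent_file.read_text(encoding="utf-8").strip()
    gold_answer = Path(gold_path).read_text(encoding="utf-8").strip()
    if not agent_answer:
        return False
    reply = ask_llm(build_judge_prompt(question, gold_answer, agent_answer))
    return parse_verdict(reply)
```

Passing the judge in as a callable keeps the sketch independent of any particular LLM client and makes the flow easy to exercise with a stub.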
References#
Official Repository: github.com/snap-research/locomo
Project Page: snap-research.github.io/locomo