LoCoMo Task Design#

Overview#

LoCoMo (Long Conversation Memory) evaluates long-term conversational memory across extended multi-session dialogues. Tasks test cross-session information retrieval and temporal reasoning.

Data Loading#

LoCoMo tasks are loaded from a pre-existing local dataset file. No separate generation step is required.

Dataset Location: external/locomo/data/locomo10.json

The dataset is loaded automatically when initializing any LoCoMo task instance via LocomoTask.load_locomo().
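
Loading boils down to reading one JSON file. A minimal sketch (the function name mirrors `load_locomo()`, but the signature and default path here are assumptions, not the actual implementation):

```python
import json

def load_locomo(path="external/locomo/data/locomo10.json"):
    """Read the LoCoMo dataset: a JSON array of conversation records,
    each with 'sample_id', 'qa', and 'conversation' fields."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The return value is a plain Python list, so `load_locomo()[0]["qa"]` gives the QA pairs of the first conversation.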

Task Format#

Task Structure#

Task Name Format: <conversation_id>_<question_id>

  • Both IDs are zero-based numeric indices

  • Examples: 0_0, 0_1, 5_3, 9_12

  • Format: Conversation index (0-9) + underscore + question index within that conversation

  • LocomoTask.task_names() is used for task name formation
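
The naming scheme above can be sketched as a nested enumeration over the dataset (a hypothetical standalone version of `task_names()`; the real method lives on `LocomoTask`):

```python
def task_names(dataset):
    """Enumerate task names as <conversation_id>_<question_id>,
    both zero-based: conversation index first, then question index."""
    names = []
    for conv_idx, record in enumerate(dataset):
        for q_idx, _qa in enumerate(record.get("qa", [])):
            names.append(f"{conv_idx}_{q_idx}")
    return names
```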

Data Format#

The dataset is a JSON array where each element represents one multi-turn conversation with multiple QA pairs. For example:

{
  "sample_id": "conv-26",
  "qa": [
    {
      "question":"When did Caroline go to the LGBTQ support group?",
      "answer":"7 May 2023",
      "evidence":[
          "D1:3"
      ],
      "category":2
    },
    // ...
  ],
  "conversation": {
    "speaker_a": "Caroline",
    "speaker_b": "Melanie",
    "session_1_date_time": "1:56 pm on 8 May, 2023",
    "session_1": [
      {
          "speaker":"Caroline",
          "dia_id":"D1:1",
          "text":"Hey Mel! Good to see you! How have you been?"
      },
      {
          "speaker":"Melanie",
          "dia_id":"D1:2",
          "text":"Hey Caroline! Good to see you! I'm swamped with the kids & work. What's up with you? Anything new?"
      },
      // ...
    ],
    "session_2_date_time":"1:14 pm on 25 May, 2023",
    "session_2": [
      {
          "speaker":"Melanie",
          "dia_id":"D2:1",
          "text":"Hey Caroline, since we last chatted, I've had a lot of things happening to me. I ran a charity race for mental health last Saturday \u2013 it was really rewarding. Really made me think about taking care of our minds."
      },
      // ...
    ],
  },
  "observation": {
    // ...
  },
  "session_summary": {
    // ...
  }
}

The two key fields of each conversation record are conversation and qa.

conversation: Contains the list of sessions (session_<num>) and their timestamps (session_<num>_date_time). The numbers <num> give the chronological order of the sessions.

  • speaker_a, speaker_b: Names of the two conversation participants

  • session_N: Array of chat messages for session N (N starts from 1)

    • Each turn contains:

      • speaker: Name of the speaker

      • dia_id: Dialog ID (format: “D<session_num>:<turn_num>”, e.g., “D1:3” is turn 3 of session 1)

      • text: Content of the dialog

  • session_N_date_time: Timestamp string (format: “HH:MM am/pm on DD Month, YYYY”)
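
The timestamp strings follow a fixed pattern, so they can be parsed with `datetime.strptime` (a sketch; the format string is inferred from the examples in the dataset above):

```python
from datetime import datetime

def parse_session_time(ts):
    """Parse a session timestamp like '1:56 pm on 8 May, 2023'
    into a datetime object suitable for chronological sorting."""
    return datetime.strptime(ts, "%I:%M %p on %d %B, %Y")
```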

qa: Array of question-answer pairs about this conversation

  • question: The question to answer

  • answer: Ground truth answer

  • evidence: References to relevant parts (e.g., [“D1:3”] refers to session 1, turn 3)

  • category: Question category type
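
Since an evidence ID encodes its session number, resolving it to the underlying turn is a direct lookup. A hypothetical helper (not part of the actual task code):

```python
def resolve_evidence(conversation, dia_id):
    """Return the turn referenced by a dialog ID like 'D1:3'.

    The digits between 'D' and ':' name the session; the turn is then
    found by matching 'dia_id' within that session's message array.
    """
    session_num = dia_id[1:].split(":")[0]  # 'D1:3' -> '1'
    for turn in conversation.get(f"session_{session_num}", []):
        if turn["dia_id"] == dia_id:
            return turn
    return None  # evidence ID not found
```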

Task Implementation & Configuration#

  • Task type: autopilot/evaluation/tasks/locomo.py

  • Data Loading: load_locomo() reads locomo10.json at class initialization

  • Prompt Template: autopilot/prompts.py

Docker Environment#

Docker Files Location: tools/locomo/

build_image/Dockerfile
Dockerfile  # Task-specific image extending base
docker-compose.yaml  # Container orchestration (patched at runtime to use bridge network)
test_script.sh  # Evaluation script

Container Details:

  • Shared Image: All tasks share a single minimal environment built from the same Dockerfile (see tools/locomo/Dockerfile)

  • Container Name Format: locomo_{session_name}, where session_name is composed of {task_name}-{timestamp}-{session-specific-random-hex-string}

  • Working Directory: /testbed/ (where conversation.txt is written)

  • Eval Script Path: Copied to /tmp/tasks/shared_scripts/test_script.sh in container

  • LLM Judge Path: /tmp/tasks/shared_scripts/llm_judge.py

  • Image Build: Automatically built on first use via launch_container() method

File Locations in Container:

  • Processed conversation: /testbed/conversation.txt (written by LocomoTask.write_conversation())

  • Agent’s answer: /testbed/answer.txt

  • Gold answer: /tmp/tasks/answer.txt

Prompt Format#

The conversation is processed chronologically by session timestamp using LocomoTask.process_conversation, then formatted with the prompt template in autopilot/prompts/locomo_prompt.py.

Processing Steps:

  1. Extract all sessions from conversation data

  2. Sort sessions by timestamp (converted to sortable format)

  3. Format each session with timestamp header

  4. Messages shown as {speaker_name}: {text}

  5. Sessions separated by double newlines
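
The five steps above can be sketched as follows (a simplified stand-in for `process_conversation`; the exact session header text is an assumption):

```python
from datetime import datetime

def process_conversation(conversation):
    """Render sessions in chronological order: each session gets a
    timestamp header, messages appear as '{speaker}: {text}', and
    sessions are separated by double newlines."""
    sessions = []
    for key, turns in conversation.items():
        # Step 1: pick out session_N keys, skipping the *_date_time entries.
        if key.startswith("session_") and not key.endswith("_date_time"):
            ts = conversation[f"{key}_date_time"]
            # Step 2: convert the timestamp into a sortable datetime.
            when = datetime.strptime(ts, "%I:%M %p on %d %B, %Y")
            sessions.append((when, ts, turns))
    sessions.sort(key=lambda s: s[0])
    blocks = []
    for _, ts, turns in sessions:
        # Steps 3-4: timestamp header followed by '{speaker}: {text}' lines.
        lines = [f"Session at {ts}:"]
        lines += [f"{t['speaker']}: {t['text']}" for t in turns]
        blocks.append("\n".join(lines))
    # Step 5: join sessions with double newlines.
    return "\n\n".join(blocks)
```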

Environment Variables#

The following environment variables are configured for each task container:

  • LOCOMO_TASK_BUILD_CONTEXT_DIR: Points to tools/locomo/

  • LOCOMO_TASK_DOCKER_CLIENT_IMAGE_NAME: Set to "locomo"

  • LOCOMO_TASK_DOCKER_NAME_PREFIX: Container name

  • LOCOMO_TEST_DIR: Path to test files

  • LOCOMO_TASK_LOGS_PATH / LOCOMO_CONTAINER_LOGS_PATH: /logs

  • LOCOMO_TASK_DOCKER_CLIENT_CONTAINER_NAME: Container name

  • TEST_DIR: /tmp/tasks/{task_name}/tests

  • TASK_DIR: /tmp/tasks/{task_name}

  • COMPOSE_PROJECT_NAME: Container name

  • COMPOSE_DOCKER_CLI_BUILD / DOCKER_BUILDKIT: Docker build settings
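
A sketch of how this environment might be assembled per container (a hypothetical helper; it only covers the values spelled out in the list above and omits the build settings):

```python
def build_task_env(task_name, container_name, build_context="tools/locomo/"):
    """Assemble per-container environment variables for a LoCoMo task."""
    return {
        "LOCOMO_TASK_BUILD_CONTEXT_DIR": build_context,
        "LOCOMO_TASK_DOCKER_CLIENT_IMAGE_NAME": "locomo",
        "LOCOMO_TASK_DOCKER_NAME_PREFIX": container_name,
        "LOCOMO_TASK_DOCKER_CLIENT_CONTAINER_NAME": container_name,
        "COMPOSE_PROJECT_NAME": container_name,
        "LOCOMO_TASK_LOGS_PATH": "/logs",
        "LOCOMO_CONTAINER_LOGS_PATH": "/logs",
        "TEST_DIR": f"/tmp/tasks/{task_name}/tests",
        "TASK_DIR": f"/tmp/tasks/{task_name}",
    }
```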

Evaluation Success Criteria#

Success is judged by an LLM judge (llm-judge) using the following flow:

  1. Agent receives full conversation history + question via LOCOMO_PROMPT

  2. Agent generates answer and writes it to /testbed/answer.txt

  3. Evaluation script (test_script.sh) invokes LLM judge to compare answers

  4. LLM judge compares the agent's answer to the ground truth answer and provides a verdict (e.g., CORRECT or WRONG)
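
On the harness side, turning the judge's free-text response into a pass/fail result might look like this (a hypothetical sketch; it assumes the judge emits CORRECT or WRONG somewhere in its output, as described above):

```python
def parse_judge_verdict(judge_output):
    """Map the LLM judge's output to a boolean pass/fail.

    Negative markers are checked first because 'CORRECT' is a
    substring of 'INCORRECT'.
    """
    text = judge_output.upper()
    if "WRONG" in text or "INCORRECT" in text:
        return False
    if "CORRECT" in text:
        return True
    raise ValueError("no verdict found in judge output")
```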

References#