Introduction
We propose Reptile, a terminal agent that operates under an extended REPL (Read-Eval-Print-Learn Loop) protocol, where human feedback is seamlessly integrated into the agent’s execution loop.
Unlike traditional REPL (Read-Eval-Print Loop) environments that focus solely on code evaluation, our protocol emphasizes the iterative cycle of human-agent collaboration, transforming the terminal from a passive command executor into an interactive learning environment.
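To make the extended loop concrete, here is a minimal sketch of one agent session under this protocol; the callables (`llm_propose`, `run_in_terminal`, `ask_user`, `log_step`) and the `Verdict` structure are illustrative placeholders, not Reptile’s actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    done: bool = False              # the user declares the task finished
    feedback: Optional[str] = None  # feedback given under the USER role
    edited: Optional[str] = None    # a rewrite of the command under the ASSISTANT role

def repl_loop(task: str,
              llm_propose: Callable[[list], str],
              run_in_terminal: Callable[[str], str],
              ask_user: Callable[[str], Verdict],
              log_step: Callable[[list, Verdict], None]) -> list:
    """One session under the extended Read-Eval-Print-Learn loop."""
    history = [("user", task)]
    while True:
        command = llm_propose(history)              # Read: the model proposes the next command
        history.append(("assistant", command))
        verdict = ask_user(command)                 # the human inspects the step
        if verdict.feedback is not None:            # feedback under the USER role
            history.append(("user", verdict.feedback))
            log_step(history, verdict)              # Learn: keep the corrected branch
            continue                                # let the model revise its proposal
        if verdict.edited is not None:              # edit under the ASSISTANT role
            history[-1] = ("assistant", verdict.edited)
            command = verdict.edited
        output = run_in_terminal(command)           # Eval: run it in the stateful terminal
        print(output)                               # Print: surface the result
        log_step(history, verdict)                  # Learn: log the step for training
        history.append(("user", output))
        if verdict.done:
            return history
```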

What Makes Reptile Special?
Compared with other CLI agents (e.g., Claude Code and Mini SWE-Agent), Reptile stands out for the following reasons:
Terminal-only, beyond bash-only: Execution is simple and stateful, which is more efficient than bash-only agents because you don’t need to re-specify the environment in every command (a minimal contrast is sketched after this list). It also doesn’t require the complicated MCP protocol: just a plain terminal tool under the REPL protocol.
Human-in-the-Loop Learning: Users can inspect every step and provide immediate feedback, i.e., give feedback under the USER role or edit the LLM generation under the ASSISTANT role.
As noted in the post from the Mini SWE-Agent team, implementing stateful shell sessions presents significant challenges. We address this by detecting when the TTY enters non-canonical mode.
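As a quick illustration of why statefulness matters, the sketch below contrasts one-shot shell calls (where the working directory and environment are lost between commands) with a long-lived shell session. It is a generic standard-library Python illustration, not Reptile’s own backend.

```python
import subprocess

# Bash-only tools typically spawn a fresh shell per command, so state such as
# the working directory or exported variables does not survive between calls.
print(subprocess.run("cd /tmp && pwd", shell=True, capture_output=True, text=True).stdout.strip())
print(subprocess.run("pwd", shell=True, capture_output=True, text=True).stdout.strip())  # back to the original cwd

# A stateful session (sketched here as a single long-lived bash process) keeps
# that state across turns, so later commands see the earlier cd and export.
shell = subprocess.Popen(["bash"], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
out, _ = shell.communicate("cd /tmp\nexport FOO=bar\npwd\necho $FOO\n")
print(out)  # the cd and export persist within the session
```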
Screenshots: Terminal UI (`autopilot run`), Web UI (`autopilot gradio`), Batch Evaluation (`autopilot evaluate`), and Trajectory Viewer.
This blog focuses on the workflow and benchmarking.
See the TTY-use blog for technical details on how the terminal backend works.
See the on-policy annotation blog for annotation details on SWE tasks.
Our Insights in Building General Agents
Workflow: Build a universal action space for the LLM, reserving specialized workflows only for high-risk operations.
Evaluation: Beyond end-to-end benchmarks, measure learning efficiency on meta-actions such as inspecting a file the right way, which makes optimization easier to track.
Annotation: Correct the agent’s behaviour with clever annotation (e.g., using PDB debugging for coding tasks), which benefits from stateful re-runs and on-policy prediction.
First Target Milestone

“Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.” (METR, https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/)
Why Terminal Agent?
Promising testbed for Agent Learning
- Wide Applicability: Spans everyday tasks to professional workflows (software engineering, DevOps, containerization) through a single universal interface.
- Native LLM Compatibility: Terminal protocols are inherently understood through pretraining—no prompt engineering needed, unlike heavyweight protocols like MCP.
- Core Research Challenges: Naturally encompasses long-horizon reasoning, context management, error recovery, and compositional tool use.
Native Universal Protocol
The Unix terminal has always been the universal text interface between human and machine. We believe it can serve the same role for AI agents.
At its core, the terminal is a text-based REPL protocol with half a century of history and refinement:
- Interpreters: `bash`, `python`, `node`, `perl`, `ruby`, and countless others
- Debuggers: `gdb`, `pdb`, `lldb` for interactive debugging
- Development tools: `git`, `make`, `docker` for workflows
- System utilities: thousands of battle-tested Unix tools
This mature ecosystem means agents can leverage decades of tooling without reinventing the wheel.
TTY Implementation Details
We build our LLM agent on top of https://github.com/sail-sg/tty-use, which implements the REPL protocol for using the terminal interactively. A key challenge we solve is detecting that the foreground process has finished its current job and is waiting for the next interaction.
For technical details on tty-use, please see our posts at https://terminal-agent.github.io/blog/tool/ and https://x.com/mavenlin/status/1977758827366817929.
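The core idea can be summarized in a short sketch. This is an illustration of the heuristic rather than the tty-use code itself, and it assumes the foreground program is readline-based (bash, python, gdb, pdb, ...) and therefore clears the terminal’s ICANON flag while it waits at a prompt.

```python
import os
import pty
import select
import termios

def read_until_ready(master_fd: int, quiet: float = 0.2) -> str:
    """Read child output until it has been quiet for a moment AND the pty is
    in non-canonical mode, i.e. the foreground process is prompting for input."""
    chunks = []
    while True:
        readable, _, _ = select.select([master_fd], [], [], quiet)
        if readable:
            try:
                data = os.read(master_fd, 4096)
            except OSError:                      # pty closed: the child exited
                break
            if not data:
                break
            chunks.append(data)
            continue
        lflag = termios.tcgetattr(master_fd)[3]  # local-mode flags of the pty
        if not (lflag & termios.ICANON):         # non-canonical => waiting at a prompt
            break
    return b"".join(chunks).decode(errors="replace")

pid, master_fd = pty.fork()
if pid == 0:                                     # child: run an interactive REPL
    os.execvp("python3", ["python3", "-q"])

print(read_until_ready(master_fd))               # the ">>> " prompt
os.write(master_fd, b"21 * 2\n")
print(read_until_ready(master_fd))               # echoed input, "42", next prompt
os.write(master_fd, b"exit()\n")
```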
How Does the REPL Enable Human-in-the-Loop Learning?
Human-in-the-loop isn’t just for runtime—it’s also central to our data collection strategy for further model training.

Please refer to https://reptile.github.io/blog/annotation/ for more annotation details and cases.
Data Collection Workflow

Our data pipeline:
- All branches are automatically logged with a checkpointing hook.
- User approval / disapproval is itself a meaningful signal.
- LLM output after feedback > LLM output before feedback.
- LLM output after an edit > LLM output before the edit.
- Feedback and edits are natural for the user.
- The more you use it, the more data you generate to make the model behave more like you.
Usage of the data
- Supervised finetuning.
- Preference optimization (pairing the output before feedback or an edit against the output after it; see the sketch below).
- RLHF (use the data to train a reward model, then run RL).
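As a concrete (and simplified) illustration of how such logs could feed preference optimization, the sketch below pairs the assistant turn recorded before a user’s feedback or edit with the turn after it, following the ordering stated above; the `Checkpoint` schema is hypothetical, not Reptile’s actual log format.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """Hypothetical record for one logged branch from the checkpointing hook."""
    context: list    # chat history up to the annotated turn
    before: str      # assistant output before feedback / before the edit
    after: str       # assistant output after feedback, or the user's edited turn

def to_preference_pairs(checkpoints: list) -> list:
    """Build DPO-style (chosen, rejected) pairs using
    'LLM after feedback > LLM before feedback' and 'after edit > before edit'."""
    pairs = []
    for ckpt in checkpoints:
        if ckpt.after.strip() == ckpt.before.strip():
            continue                     # plain approval without change: no pair
        pairs.append({
            "prompt": ckpt.context,
            "chosen": ckpt.after,        # post-feedback / post-edit turn
            "rejected": ckpt.before,     # the original model turn
        })
    return pairs
```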
Benchmarking

We annotate tasks on SWE-Gym. After training with 200 interactions, Devstral-2505-22B improves:
- Terminal-Bench: 11.3% -> 18.9%
- SWE-Bench-Verified: 18.6% -> 32.8%
Looking Forward
We are actively working on several exciting directions:
1. Terminal Gym for RL Training
We aim to build a Terminal Gym that provides a structured environment for reinforcement learning. This includes (1) precise reward modeling, (2) robust and scalable dockerized environments, and (3) easy-to-hard task sets.
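To give a flavor of what such an environment could expose, here is a hypothetical Gym-style interface sketch; the class name, fields, and reward placeholder are illustrative assumptions, not an existing Reptile API.

```python
from dataclasses import dataclass, field

@dataclass
class TerminalGymEnv:
    """Hypothetical sketch of a Gym-style terminal environment."""
    task_id: str
    history: list = field(default_factory=list)

    def reset(self) -> str:
        """Start (or restart) the dockerized task container and return the first screen."""
        self.history.clear()
        return f"[task {self.task_id}] $ "           # placeholder observation

    def step(self, keystrokes: str) -> tuple:
        """Send keystrokes to the terminal; return (observation, reward, done)."""
        self.history.append(keystrokes)
        observation = f"$ {keystrokes}\n..."         # placeholder terminal output
        reward, done = 0.0, False                    # a precise reward model plugs in here
        return observation, reward, done
```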
2. Advanced Learning Algorithms
We are exploring offline RL, imitation learning, and other techniques to improve sample efficiency for extra-long agent trajectories (>30K tokens) and ultimately reduce the need for human supervision.
Open Source & Community
Reptile is open source and we welcome contributions, whether you’re interested in:
- Adding new benchmarks and evaluation tasks
- Improving the hook system
- Contributing training data
- Building integrations with other tools like training/inference backends
- Research discussion and resource collaboration
Visit our GitHub repository: https://github.com/terminal-agent/reptile
We are inspired by excellent community work such as terminal-bench and mini-SWE-agent. We thank the community for their efforts and valuable insights!
Conclusion
The terminal has been humanity’s interface to computers for 50 years. With Reptile, it becomes the interface between humans and AI agents. Reptile represents a new paradigm for terminal agents: one that embraces human collaboration rather than trying to eliminate it.
By extending the familiar REPL protocol with a learning layer, we create a system that:
- Leverages the mature Unix ecosystem without reinvention
- Provides transparency and control through human-in-the-loop interaction
- Scales naturally to complex, multi-step tasks
Citation
If you find Reptile useful in your research or applications, please cite:
@misc{reptile2025,
  title={Reptile: Terminal-Agent with Human-in-the-loop Learning},
  author={Dou, Longxu and Du, Cunxiao and Li, Shenggui and Wang, Tianduo and Zhang, Tianjie and Liu, Tianyu and Chen, Xianwei and Tang, Chenxia and Zhao, Yuanheng and Lin, Min},
  year={2025},
  url={https://github.com/terminal-agent/reptile},
  note={GitHub repository}
}
Fun fact: The name “Reptile” has a dual meaning: it refers to the REPL (Read-Eval-Print-Learn Loop) workflow in terminal interactions, and it also pays homage to OpenAI’s Reptile meta-learning algorithm (2018), which pioneered few-shot adaptation. Like its namesake, our Reptile learns to quickly adapt to new tasks, but through human-in-the-loop collaboration rather than pure algorithmic optimization. Both share the same philosophy: learning efficiently from minimal examples to master diverse tasks.
Reference: Nichol, A., Achiam, J., and Schulman, J. On First-Order Meta-Learning Algorithms. arXiv:1803.02999, 2018.



