RDD automatically decomposes long-horizon robot demonstrations into sub-tasks by retrieving similar segments from a prior database via ANN search and finding the optimal partition with dynamic programming. Training-free — it relies only on pre-trained visual encoders (LIV, CLIP, R3M).
## Paper Info

| Field | Value |
|---|---|
| Title | RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks |
RDD addresses planner–visuomotor dataset misalignment in hierarchical VLA frameworks. Instead of relying on human annotations or heuristics (like UVD) for sub-task decomposition, RDD:

1. Embeds all frames of a demonstration using a pre-trained visual encoder (LIV/CLIP/R3M)
2. Retrieves similar sub-task segments from a prior database via approximate nearest neighbor (ANN) search
3. Decomposes the demonstration via dynamic programming that finds the optimal partition maximizing retrieval similarity, subject to segment-length constraints
**Key insight:** sub-task decomposition is formulated as an optimal partitioning problem in which the score of each candidate segment is its similarity to the nearest prior sub-task. This achieves near-oracle performance (only a 0.2% success-rate gap vs. expert annotations) with linear time complexity.
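The optimal-partitioning idea can be made concrete with a small dynamic program. This is an illustrative sketch, not RDD's actual `max_sum_partition` implementation: the function name, the `score(i, j)` callback, and the back-pointer bookkeeping are ours. For a fixed `max_len`, runtime is O(T · max_len), i.e. linear in the demonstration length T.

```python
def max_sum_partition_sketch(T, score, min_len=1, max_len=None):
    """Partition frames [0, T) into contiguous segments maximizing the
    summed per-segment score, subject to length constraints.

    score(i, j) scores the candidate segment covering frames i..j-1
    (in RDD's setting: similarity to its nearest retrieved prior sub-task).
    Returns (best_total_score, list of (start, end) boundaries).
    """
    if max_len is None:
        max_len = T
    NEG = float("-inf")
    dp = [NEG] * (T + 1)     # dp[t]: best total score over frames [0, t)
    back = [None] * (T + 1)  # back[t]: start of the last segment ending at t
    dp[0] = 0.0
    for t in range(1, T + 1):
        for length in range(min_len, min(max_len, t) + 1):
            cand = dp[t - length] + score(t - length, t)
            if cand > dp[t]:
                dp[t], back[t] = cand, t - length
    # Walk back-pointers to recover the segment boundaries
    cuts, t = [], T
    while t > 0:
        cuts.append((back[t], t))
        t = back[t]
    return dp[T], cuts[::-1]
```

Each `dp[t]` only looks back at most `max_len` frames, which is where the linear-time behavior comes from.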
For evaluation, decomposition quality is measured by computing per-segment IoU against ground-truth annotations.
## Code Integration Guide

### Minimal Imports

```python
from rdd.algorithms import max_sum_partition, rdd_score
from rdd.datasets.rlbench import RLBenchAnnoySearcher
from rdd.embed import uvd_embed, subtask_embeds_to_feature
```
### Decompose a New Demonstration

```python
import numpy as np

# 1. Embed frames
embeds = uvd_embed(
    frame_paths,         # list of image file paths
    preprocessor="liv",  # or "clip", "r3m"
    device="cuda",
)  # -> (T, D) np.ndarray

# 2. Load prior sub-task database + ANN index
searcher = RLBenchAnnoySearcher(
    searcher_path="data/vec_databases/franka/train/index.ann",
    vec_database_path="data/vec_databases/franka/train",
    include_views=["front_rgb"],
    n_trees=10,
    distance_measure="angular",
)

# 3. Run optimal partitioning
score, segments = max_sum_partition(
    u=list(range(len(embeds))),
    score_func=rdd_score,
    min_len=2,
    max_len=100,
    searcher=searcher,
    embeds=embeds,
    mode="ood",
    alpha=0.0,  # forced to 0 in ood mode
    beta=0.1,
)
# segments: list of lists, e.g. [[0, 1, ..., 29], [30, 31, ..., 74], ...]
```
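Once `segments` is available, each sub-task can be summarized as a feature vector by concatenating its start- and end-frame embeddings, which is the default sub-task feature format used by the prior database (`(2*D*V,)`; with a single view, `V = 1`, this is `2*D`). The helper below is a hand-rolled illustration of that step, not the package's `subtask_embeds_to_feature`, whose exact signature is not shown here.

```python
import numpy as np

def segments_to_features(embeds, segments):
    """Build one feature per sub-task segment by concatenating the
    start- and end-frame embeddings (single-view case).

    embeds:   (T, D) array of per-frame embeddings
    segments: list of frame-index lists, e.g. [[0, 1, 2], [3, 4]]
    returns:  (S, 2*D) array, one row per segment
    """
    feats = [np.concatenate([embeds[seg[0]], embeds[seg[-1]]]) for seg in segments]
    return np.stack(feats)
```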
## Data Format

| Field | Format | Description |
|---|---|---|
| Frames | PNG images, any resolution | Resized internally by encoder |
| Sub-task annotations (`info.txt`) | One integer per line | Ending frame index of each sub-task |
| Embeddings | `(T, D)` float32 `np.ndarray` | D depends on encoder (LIV=1024, R3M=2048) |
| Sub-task feature (default) | `(2*D*V,)` float32 | Concat of start + end frame embeddings across views |
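The `info.txt` layout can be parsed in a few lines. This sketch assumes the per-line integers are inclusive ending frame indices (as the table states); the helper name is illustrative and not part of the `rdd` package.

```python
def parse_subtask_annotations(lines):
    """Turn info.txt lines (one ending frame index per sub-task) into
    (start, end) index pairs, with ends inclusive: each sub-task starts
    one frame after the previous one ends.
    """
    ends = [int(s) for s in lines if s.strip()]
    starts = [0] + [e + 1 for e in ends[:-1]]
    return list(zip(starts, ends))
```

For example, a file containing `29`, `74`, `99` describes three sub-tasks spanning frames 0–29, 30–74, and 75–99.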
- Data conversion scripts (AgiBotWorld, RoboCerebra, video→frames)
- `resources/`: demo videos + sub-task annotation examples
## Key Results

- Only a 0.2% success-rate gap compared to expert-annotated (oracle) decompositions
- Outperforms UVD heuristic decomposition on both simulation (RLBench) and real-world tasks
- Training-free: no model training required, only pre-trained visual encoders + ANN indexing
- Linear time complexity via the optimal partitioning DP algorithm
## Tips & Gotchas

- RDD is training-free: there are no custom weights to train. The "prior" is built from your annotated sub-task dataset via `build_vec_database.py`.
- The `mode` parameter matters: `ood` (end-frame only, `alpha=0`) works better for out-of-distribution tasks; `default` (start + end, with length penalty) is for in-distribution tasks.
- UVD and LIV must be cloned into `3rdparty/` via `setup_rdd_env.sh`; they are not pip-installable.
- The server architecture (FastAPI) is designed for integration with a higher-level planner that sends decomposition requests.
- `beta` controls how much RDD defers to UVD's heuristic boundaries: set it higher (0.25) for tasks where UVD works well, lower (0.1) for novel tasks.
- Multi-view support: views are concatenated in embedding space, increasing the feature dimension proportionally.
- `info.txt` annotation format: each line is the ending frame index of a sub-task (not the starting index).
- Annoy indexes are not updatable; rebuild with `build_vec_database.py` when adding new sub-tasks.