Robotics 63
★ MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang
Vision-language-action models (VLAs) have shown generalization capabilities
in robotic manipulation tasks by inheriting from vision-language models (VLMs)
and learning action generation. Most VLA models focus on interpreting vision
and language to generate actions, whereas robots must perceive and interact
within the spatial-physical world. This gap highlights the need for a
comprehensive understanding of robotic-specific multisensory information, which
is crucial for achieving complex and contact-rich control. To this end, we
introduce a multisensory language-action (MLA) model that collaboratively
perceives heterogeneous sensory modalities and predicts future multisensory
objectives to facilitate physical world modeling. Specifically, to enhance
perceptual representations, we propose an encoder-free multimodal alignment
scheme that innovatively repurposes the large language model itself as a
perception module, directly interpreting multimodal cues by aligning 2D images,
3D point clouds, and tactile tokens through positional correspondence. To
further enhance MLA's understanding of physical dynamics, we design a future
multisensory generation post-training strategy that enables MLA to reason about
semantic, geometric, and interaction information, providing more robust
conditions for action generation. In evaluation, the MLA model outperforms the
previous state-of-the-art 2D and 3D VLA methods by 12% and 24%, respectively,
on complex, contact-rich real-world tasks, while also demonstrating improved
generalization to unseen configurations. Project website:
https://sites.google.com/view/open-mla
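To make the positional-correspondence idea above concrete, here is a loose sketch (not the authors' code) in which 2D patch, 3D point, and tactile tokens that map to the same spatial cell share a position id before entering a decoder-only language model; all dimensions, projections, and the grid mapping below are our assumptions.

```python
# Minimal sketch of encoder-free alignment via positional correspondence:
# tokens from different modalities that refer to the same spatial location
# share a position id, so the LLM's positional encoding ties them together.
# All shapes and projection layers are hypothetical.
import torch
import torch.nn as nn

d_model = 512
proj_2d = nn.Linear(768, d_model)     # hypothetical patch-feature dim
proj_3d = nn.Linear(3 + 64, d_model)  # hypothetical point-feature dim
proj_tac = nn.Linear(32, d_model)     # hypothetical taxel-feature dim

def align_tokens(img_feats, img_pos, pc_feats, pc_pos, tac_feats, tac_pos):
    """Concatenate modality tokens; spatially corresponding tokens reuse the
    same integer position id (each *_pos maps a token to a shared grid cell)."""
    tokens = torch.cat(
        [proj_2d(img_feats), proj_3d(pc_feats), proj_tac(tac_feats)], dim=1)
    pos_ids = torch.cat([img_pos, pc_pos, tac_pos], dim=1)  # shared ids
    return tokens, pos_ids  # fed to a decoder-only LM with positional encoding
```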
☆ Benchmarking Egocentric Visual-Inertial SLAM at City Scale ICCV 2025
Anusha Krishnan, Shaohui Liu, Paul-Edouard Sarlin, Oscar Gentilhomme, David Caruso, Maurizio Monge, Richard Newcombe, Jakob Engel, Marc Pollefeys
Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard
sensors is critical for wearable devices capturing egocentric data, which
exhibits specific challenges, such as a wider diversity of motions and
viewpoints, prevalent dynamic visual content, or long sessions affected by
time-varying sensor calibration. While recent progress on SLAM has been swift,
academic research is still driven by benchmarks that do not reflect these
challenges or do not offer sufficiently accurate ground truth poses. In this
paper, we introduce a new dataset and benchmark for visual-inertial SLAM with
egocentric, multi-modal data. We record hours and kilometers of trajectories
through a city center with glasses-like devices equipped with various sensors.
We leverage surveying tools to obtain control points as indirect pose
annotations that are metric, centimeter-accurate, and available at city scale.
This makes it possible to evaluate extreme trajectories that involve walking at
night or traveling in a vehicle. We show that state-of-the-art systems
developed by academia are not robust to these challenges and we identify
components that are responsible for this. In addition, we design tracks with
different levels of difficulty to ease in-depth analysis and evaluation of less
mature approaches. The dataset and benchmark are available at
https://www.lamaria.ethz.ch.
comment: ICCV 2025
★ OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi
A dominant paradigm for teaching humanoid robots complex skills is to
retarget human motions as kinematic references to train reinforcement learning
(RL) policies. However, existing retargeting pipelines often struggle with the
significant embodiment gap between humans and robots, producing physically
implausible artifacts like foot-skating and penetration. More importantly,
common retargeting methods neglect the rich human-object and human-environment
interactions essential for expressive locomotion and loco-manipulation. To
address this, we introduce OmniRetarget, an interaction-preserving data
generation engine based on an interaction mesh that explicitly models and
preserves the crucial spatial and contact relationships between an agent, the
terrain, and manipulated objects. By minimizing the Laplacian deformation
between the human and robot meshes while enforcing kinematic constraints,
OmniRetarget generates kinematically feasible trajectories. Moreover,
preserving task-relevant interactions enables efficient data augmentation, from
a single demonstration to different robot embodiments, terrains, and object
configurations. We comprehensively evaluate OmniRetarget by retargeting motions
from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of
trajectories that achieve better kinematic constraint satisfaction and contact
preservation than widely used baselines. Such high-quality data enables
proprioceptive RL policies to successfully execute long-horizon (up to 30
seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained
with only 5 reward terms and simple domain randomization shared by all tasks,
without any learning curriculum.
comment: Project website: https://omniretarget.github.io
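The Laplacian-deformation objective mentioned in the abstract can be illustrated with a toy interaction-mesh energy; the graph construction below is generic, and the robot's kinematic constraints, which OmniRetarget enforces, are omitted.

```python
# Toy sketch of the interaction-mesh objective: penalize the change in
# Laplacian coordinates (each vertex relative to its neighbors) between the
# human-scene mesh and the retargeted robot-scene mesh. Kinematic feasibility
# constraints from the paper are not modeled here.
import numpy as np

def graph_laplacian(n, edges):
    Lap = np.zeros((n, n))
    for i, j in edges:
        Lap[i, i] += 1; Lap[j, j] += 1
        Lap[i, j] -= 1; Lap[j, i] -= 1
    return Lap

def deformation_energy(Lap, human_pts, robot_pts):
    # || L x_robot - L x_human ||^2, summed over all coordinates
    return np.sum((Lap @ robot_pts - Lap @ human_pts) ** 2)

Lap = graph_laplacian(3, [(0, 1), (1, 2)])
human = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
robot = np.array([[0.0, 0.0], [1.1, 0.2], [2.0, 0.0]])
print(deformation_energy(Lap, human, robot))
```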
☆ TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
Designing dense rewards is crucial for reinforcement learning (RL), yet in
robotics it often demands extensive manual effort and lacks scalability. One
promising solution is to view task progress as a dense reward signal, as it
quantifies the degree to which actions advance the system toward task
completion over time. We present TimeRewarder, a simple yet effective reward
learning method that derives progress estimation signals from passive videos,
including robot demonstrations and human videos, by modeling temporal distances
between frame pairs. We then demonstrate how TimeRewarder can supply step-wise
proxy rewards to guide reinforcement learning. In our comprehensive experiments
on ten challenging Meta-World tasks, we show that TimeRewarder dramatically
improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10
tasks with only 200,000 interactions per task with the environment. This
approach outperformed previous methods and even the manually designed
environment dense reward on both the final success rate and sample efficiency.
Moreover, we show that TimeRewarder pretraining can exploit real-world human
videos, highlighting its potential as a scalable path to rich reward
signals from diverse video sources.
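A minimal sketch of the frame-pair temporal-distance idea, assuming a simple MLP encoder and MSE regression; the paper's architecture and training details may differ.

```python
# Learn a normalized temporal distance between frame pairs from passive
# videos, then use predicted per-step progress as a proxy reward.
import torch
import torch.nn as nn

class TemporalDistance(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, 1)  # predicts (t_j - t_i) / T

    def forward(self, frame_i, frame_j):
        zi, zj = self.enc(frame_i), self.enc(frame_j)
        return self.head(torch.cat([zi, zj], dim=-1)).squeeze(-1)

model = TemporalDistance()

def loss_fn(frames, i, j, T):
    # frames: (N, C, H, W) video tensor; i, j: index tensors of sampled pairs
    target = (j - i).float() / T
    return nn.functional.mse_loss(model(frames[i], frames[j]), target)

def step_reward(obs_prev, obs_curr):
    # RL-time proxy reward: predicted progress between consecutive frames
    with torch.no_grad():
        return model(obs_prev.unsqueeze(0), obs_curr.unsqueeze(0)).item()
```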
☆ Graphite: A GPU-Accelerated Mixed-Precision Graph Optimization Framework
We present Graphite, a GPU-accelerated nonlinear graph optimization
framework. It provides a CUDA C++ interface to enable the sharing of code
between a realtime application, such as a SLAM system, and its optimization
tasks. The framework supports techniques to reduce memory usage, including
in-place optimization, support for multiple floating point types and
mixed-precision modes, and dynamically computed Jacobians. We evaluate Graphite
on well-known bundle adjustment problems and find that it achieves similar
performance to MegBA, a solver specialized for bundle adjustment, while
maintaining generality and using less memory. We also apply Graphite to global
visual-inertial bundle adjustment on maps generated from stereo-inertial SLAM
datasets, and observe speedups of up to 59x compared to a CPU baseline. Our
results indicate that our solver enables faster large-scale optimization on
both desktop and resource-constrained devices.
★ The Trajectory Bundle Method: Unifying Sequential-Convex Programming and Sampling-Based Trajectory Optimization
Kevin Tracy, John Z. Zhang, Jon Arrizabalaga, Stefan Schaal, Yuval Tassa, Tom Erez, Zachary Manchester
We present a unified framework for solving trajectory optimization problems
in a derivative-free manner through the use of sequential convex programming.
Traditionally, nonconvex optimization problems are solved by forming and
solving a sequence of convex optimization problems, where the cost and
constraint functions are approximated locally through Taylor series expansions.
This presents a challenge for functions where differentiation is expensive or
unavailable. In this work, we present a derivative-free approach to form these
convex approximations by computing samples of the dynamics, cost, and
constraint functions and letting the solver interpolate between them. Our
framework includes sample-based trajectory optimization techniques like
model-predictive path integral (MPPI) control as a special case and generalizes
them to enable features like multiple shooting and general equality and
inequality constraints that are traditionally associated with derivative-based
sequential convex programming methods. The resulting framework is simple,
flexible, and capable of solving a wide variety of practical motion planning
and control problems.
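A minimal sketch of the sample-and-interpolate step, assuming a generic dynamics function f and stage cost, with cvxpy solving the convex subproblem; the paper's multiple-shooting structure and constraint handling are not reproduced.

```python
# Bundle idea: sample the dynamics and cost around a nominal control, then
# let a convex program interpolate between the samples with simplex weights.
import numpy as np
import cvxpy as cp

def bundle_step(f, cost, x, u_nom, n_samples=16, sigma=0.1):
    U = u_nom + sigma * np.random.randn(n_samples, u_nom.size)  # control samples
    X_next = np.stack([f(x, u) for u in U])          # sampled dynamics
    J = np.array([cost(x, u) for u in U])            # sampled costs
    alpha = cp.Variable(n_samples, nonneg=True)      # simplex weights
    x_next = X_next.T @ alpha                        # interpolated next state
    prob = cp.Problem(cp.Minimize(J @ alpha),
                      [cp.sum(alpha) == 1])          # plus task constraints on x_next
    prob.solve()
    return U.T @ alpha.value, x_next.value           # interpolated control, state
```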
☆ Radio-based Multi-Robot Odometry and Relative Localization
Radio-based methods such as Ultra-Wideband (UWB) and RAdio Detection And
Ranging (radar), which have traditionally seen limited adoption in robotics,
are experiencing a boost in popularity thanks to their robustness to harsh
environmental conditions and cluttered environments. This work proposes a
multi-robot UGV-UAV localization system that leverages the two technologies
with inexpensive and readily available sensors, such as Inertial Measurement
Units (IMUs) and wheel encoders, to estimate the relative position of an aerial
robot with respect to a ground robot. The first stage of the system pipeline
includes a nonlinear optimization framework to trilaterate the location of the
aerial platform based on UWB range data, and a radar pre-processing module with
loosely coupled ego-motion estimation which has been adapted for a multi-robot
scenario. Then, the pre-processed radar data as well as the relative
transformation are fed to a pose-graph optimization framework with odometry and
inter-robot constraints. The system, implemented for the Robot Operating
System (ROS 2) with the Ceres optimizer, has been validated in
Software-in-the-Loop (SITL) simulations and in a real-world dataset. The
proposed relative localization module outperforms state-of-the-art closed-form
methods which are less robust to noise. Our SITL environment includes a custom
Gazebo plugin for generating realistic UWB measurements modeled after real
data. Conveniently, the proposed factor graph formulation makes the system
readily extensible to full Simultaneous Localization And Mapping (SLAM).
Finally, all the code and experimental data are publicly available to support
reproducibility and to serve as a common open dataset for benchmarking.
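The UWB stage can be illustrated by a generic nonlinear least-squares trilateration; the anchor layout and range values below are invented for the example, and the paper's pipeline adds radar ego-motion and pose-graph terms on top.

```python
# Estimate the relative UAV position from UWB ranges to known anchor
# positions on the UGV via nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

anchors = np.array([[0.0, 0.0, 0.3],   # example UWB anchor positions (m)
                    [0.8, 0.0, 0.3],
                    [0.0, 0.8, 0.3],
                    [0.8, 0.8, 0.6]])
ranges = np.array([2.1, 1.9, 2.3, 2.0])  # example measured distances (m)

def residuals(p):
    return np.linalg.norm(anchors - p, axis=1) - ranges

sol = least_squares(residuals, x0=np.array([1.0, 1.0, 1.0]))
print("relative UAV position:", sol.x)
```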
☆ OceanGym: A Benchmark Environment for Underwater Embodied Agents
Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
We introduce OceanGym, the first comprehensive benchmark for ocean underwater
embodied agents, designed to advance AI in one of the most demanding real-world
environments. Unlike terrestrial or aerial domains, underwater settings present
extreme perceptual and decision-making challenges, including low visibility and
dynamic ocean currents, making effective agent deployment exceptionally
difficult. OceanGym encompasses eight realistic task domains and a unified
agent framework driven by Multi-modal Large Language Models (MLLMs), which
integrates perception, memory, and sequential decision-making. Agents are
required to comprehend optical and sonar data, autonomously explore complex
environments, and accomplish long-horizon objectives under these harsh
conditions. Extensive experiments reveal substantial gaps between
state-of-the-art MLLM-driven agents and human experts, highlighting the
persistent difficulty of perception, planning, and adaptability in ocean
underwater environments. By providing a high-fidelity, rigorously designed
platform, OceanGym establishes a testbed for developing robust embodied AI and
transferring these capabilities to real-world autonomous ocean underwater
vehicles, marking a decisive step toward intelligent agents capable of
operating in one of Earth's last unexplored frontiers. The code and data are
available at https://github.com/OceanGPT/OceanGym.
comment: Work in progress
☆ Memory-Efficient 2D/3D Shape Assembly of Robot Swarms
Mean-shift-based approaches have recently emerged as the most effective
methods for robot swarm shape assembly tasks. These methods rely on image-based
representations of target shapes to compute local density gradients and perform
mean-shift exploration, which constitute their core mechanism. However, such
image representations incur substantial memory overhead, which can become
prohibitive for high-resolution or 3D shapes. To overcome this limitation, we
propose a memory-efficient tree map representation that hierarchically encodes
user-specified shapes and is applicable to both 2D and 3D scenarios. Building
on this representation, we design a behavior-based distributed controller that
enables assignment-free shape assembly. Comparative 2D and 3D simulations
against a state-of-the-art mean-shift algorithm demonstrate one to two orders
of magnitude lower memory usage and two to three times faster shape entry while
maintaining comparable uniformity. Finally, we validate the framework through
physical experiments with 6 to 7 UAVs, confirming its real-world practicality.
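A quadtree toy model conveys the memory argument: uniform regions collapse to single nodes, so storage scales with shape complexity rather than raster resolution. The paper's tree map and controller are more elaborate; this sketch only illustrates the encoding idea.

```python
# Hierarchical shape map: uniform blocks become leaves, mixed blocks recurse.
import numpy as np

class QuadNode:
    __slots__ = ("value", "children")
    def __init__(self, value=None, children=None):
        self.value, self.children = value, children  # leaf value or 4 children

def build(img):
    if img.min() == img.max():                       # uniform block -> leaf
        return QuadNode(value=bool(img[0, 0]))
    h, w = img.shape
    return QuadNode(children=[build(img[:h//2, :w//2]), build(img[:h//2, w//2:]),
                              build(img[h//2:, :w//2]), build(img[h//2:, w//2:])])

def query(node, x, y, size):
    """Is normalized point (x, y) in [0,1)^2 inside the shape?"""
    if node.children is None:
        return node.value
    half = size / 2
    ix, iy = int(x >= half), int(y >= half)
    return query(node.children[iy * 2 + ix], x - ix * half, y - iy * half, half)

shape = np.zeros((64, 64), dtype=np.uint8); shape[16:48, 16:48] = 1
tree = build(shape)
print(query(tree, 0.5, 0.5, 1.0), query(tree, 0.05, 0.05, 1.0))  # True False
```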
☆ Learning from Hallucinating Critical Points for Navigation in Dynamic Environments
Generating large and diverse obstacle datasets to learn motion planning in
environments with dynamic obstacles is challenging due to the vast space of
possible obstacle trajectories. Inspired by hallucination-based data synthesis
approaches, we propose Learning from Hallucinating Critical Points (LfH-CP), a
self-supervised framework for creating rich dynamic obstacle datasets based on
existing optimal motion plans without requiring expensive expert demonstrations
or trial-and-error exploration. LfH-CP factorizes hallucination into two
stages: first identifying when and where obstacles must appear in order to
result in an optimal motion plan, i.e., the critical points, and then
procedurally generating diverse trajectories that pass through these points
while avoiding collisions. This factorization avoids generative failures such
as mode collapse and ensures coverage of diverse dynamic behaviors. We further
introduce a diversity metric to quantify dataset richness and show that LfH-CP
produces substantially more varied training data than existing baselines.
Experiments in simulation demonstrate that planners trained on LfH-CP datasets
achieve higher success rates than a prior hallucination method.
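The second stage can be pictured as interpolating through the critical points with randomized extra via-points; the sketch below uses a cubic spline and omits the paper's collision checking.

```python
# Generate a varied obstacle trajectory that passes exactly through given
# critical points (t_k, p_k); diversity comes from jittered extra via-points.
# Collision avoidance from the paper is not modeled here.
import numpy as np
from scipy.interpolate import CubicSpline

def sample_obstacle_trajectory(critical_times, critical_points,
                               jitter=0.2, n_extra=3):
    rng = np.random.default_rng()
    t_extra = rng.uniform(critical_times.min(), critical_times.max(), n_extra)
    p_extra = critical_points.mean(0) + rng.normal(
        scale=jitter, size=(n_extra, critical_points.shape[1]))
    t = np.concatenate([critical_times, t_extra])
    p = np.vstack([critical_points, p_extra])
    order = np.argsort(t)
    return CubicSpline(t[order], p[order])  # exact at criticals, varied elsewhere

traj = sample_obstacle_trajectory(np.array([0.0, 1.0, 2.0]),
                                  np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]]))
print(traj(1.0))  # passes through the critical point at t = 1.0
```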
☆ Analytic Conditions for Differentiable Collision Detection in Trajectory Optimization IROS
Optimization-based methods are widely used for computing fast, diverse
solutions for complex tasks such as collision-free movement or planning in the
presence of contacts. However, most of these methods require enforcing
non-penetration constraints between objects, resulting in a non-trivial and
computationally expensive problem. This makes the use of optimization-based
methods for planning and control challenging. In this paper, we present a
method to efficiently enforce non-penetration of sets while performing
optimization over their configuration, which is directly applicable to problems
like collision-aware trajectory optimization. We introduce novel differentiable
conditions with analytic expressions to achieve this. To enforce non-collision
between non-smooth bodies using these conditions, we introduce a method to
approximate polytopes as smooth semi-algebraic sets. We present several
numerical experiments to demonstrate the performance of the proposed method and
compare the performance with other baseline methods recently proposed in the
literature.
comment: 8 pages, 8 figures. Accepted to the IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS) 2025
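As a toy example of smoothing a non-smooth body into a semi-algebraic set (the paper derives its own, more general construction), a box can be replaced by a superellipse whose even-power level set converges to the box.

```python
# The unit box max(|x|, |y|) <= 1 approximated by the smooth semi-algebraic
# set x^(2k) + y^(2k) <= 1, which converges to the box as k grows.
import numpy as np

def superellipse_margin(p, k=8):
    """< 1 inside the smoothed box, > 1 outside; differentiable everywhere."""
    return np.sum(p ** (2 * k))

print(superellipse_margin(np.array([0.9, 0.9])))   # inside  (~0.37 for k=8)
print(superellipse_margin(np.array([1.05, 0.0])))  # outside (~2.18 for k=8)
```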
☆ Unwinding Rotations Reduces VR Sickness in Nonsimulated Immersive Telepresence
Filip Kulisiewicz, Basak Sakcak, Evan G. Center, Juho Kalliokoski, Katherine J. Mimnaugh, Steven M. LaValle, Timo Ojala
Immersive telepresence, when a user views the video stream of a $360^\circ$
camera in a remote environment using a Head Mounted Display (HMD), has great
potential to improve the sense of being in a remote environment. In most cases
of immersive robotic telepresence, the camera is mounted on a mobile robot
which increases the portion of the environment that the remote user can
explore. However, robot motions can induce unpleasant symptoms associated with
Virtual Reality (VR) sickness, degrading the overall user experience. Previous
research has shown that unwinding the rotations of the robot, that is,
decoupling the rotations that the camera undergoes due to robot motions from
what is seen by the user, can increase user comfort and reduce VR sickness.
However, that work considered a virtual environment and a simulated robot. In
this work, to test whether the same hypotheses hold when the video stream from
a real camera is used, we carried out a user study $(n=36)$ in which the
unwinding rotations method was compared against coupled rotations in a task
completed through a panoramic camera mounted on a robotic arm. Furthermore,
within an inspection task which involved translations and rotations in three
dimensions, we tested whether unwinding the robot rotations impacted the
performance of users. The results show that the users found the unwinding
rotations method to be more comfortable and preferable, and that a reduced
level of VR sickness can be achieved without a significant impact on task
performance.
comment: 24th IEEE International Symposium on Mixed and Augmented Reality
(ISMAR)
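The unwinding operation itself is simple rotation algebra: remove the robot-induced rotation from the orientation used to render the user's view. A sketch with scipy, not the study's implementation:

```python
# "Unwinding": the user-facing view orientation is the camera orientation
# with the measured robot rotation removed, so the 360-degree video stays
# rotationally stable while the user looks around freely.
from scipy.spatial.transform import Rotation as R

def unwound_view(world_from_camera: R, robot_rotation: R) -> R:
    return robot_rotation.inv() * world_from_camera

cam = R.from_euler("z", 90, degrees=True)    # camera yawed by robot motion
robot = R.from_euler("z", 90, degrees=True)  # measured robot yaw
print(unwound_view(cam, robot).as_euler("zyx", degrees=True))  # ~[0, 0, 0]
```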
☆ Real-time Velocity Profile Optimization for Time-Optimal Maneuvering with Generic Acceleration Constraints
The computation of time-optimal velocity profiles along prescribed paths,
subject to generic acceleration constraints, is a crucial problem in robot
trajectory planning, with particular relevance to autonomous racing. However,
the existing methods either support arbitrary acceleration constraints at high
computational cost or use conservative box constraints for computational
efficiency. We propose FBGA, a new Forward-Backward algorithm with Generic
Acceleration constraints, which achieves both high accuracy and low computation
time. FBGA performs forward and
backward passes to maximize the velocity profile in short, discretized path
segments, while satisfying user-defined performance limits. Tested on five
racetracks and two vehicle classes, FBGA handles complex, non-convex
acceleration constraints with custom formulations. Its maneuvers and lap times
closely match optimal control baselines (within $0.11\%$-$0.36\%$), while being
up to three orders of magnitude faster. FBGA maintains high accuracy even with
coarse discretization, making it well-suited for online multi-query trajectory
planning. Our open-source C++ implementation is available at:
https://anonymous.4open.science/r/FB_public_RAL.
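FBGA generalizes the classic forward-backward velocity pass, which for simple longitudinal limits looks like the following; the paper's contribution is handling generic, non-convex acceleration constraints within this scheme.

```python
# Classic forward-backward velocity profile: the forward pass enforces the
# acceleration limit, the backward pass the braking limit, both capped by
# the path-dependent speed limit v_max.
import numpy as np

def forward_backward(ds, v_max, a_acc, a_dec):
    n = len(v_max)
    v = v_max.copy()
    v[0] = 0.0
    for i in range(n - 1):                  # forward: acceleration limit
        v[i + 1] = min(v[i + 1], np.sqrt(v[i]**2 + 2 * a_acc * ds))
    v[-1] = 0.0
    for i in range(n - 2, -1, -1):          # backward: braking limit
        v[i] = min(v[i], np.sqrt(v[i + 1]**2 + 2 * a_dec * ds))
    return v

profile = forward_backward(ds=1.0, v_max=np.full(100, 20.0), a_acc=3.0, a_dec=6.0)
```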
☆ SDA-PLANNER: State-Dependency Aware Adaptive Planner for Embodied Task Planning
Embodied task planning requires agents to produce executable actions in a
closed-loop manner within the environment. With the progressively improving
capabilities of LLMs in task decomposition, planning, and generalization,
current embodied task planning methods adopt LLM-based architectures. However,
existing LLM-based planners remain limited in three aspects: fixed planning
paradigms, a lack of action-sequence constraints, and error-agnostic execution.
In this work, we propose SDA-PLANNER, which enables an adaptive planning
paradigm with state-dependency-aware and error-aware mechanisms for
comprehensive embodied
task planning. Specifically, SDA-PLANNER introduces a State-Dependency Graph to
explicitly model action preconditions and effects, guiding the dynamic
revision. To handle execution errors, it employs an error-adaptive replanning
strategy consisting of Error Backtrack and Diagnosis and Adaptive Action
SubTree Generation, which locally reconstructs the affected portion of the plan
based on the current environment state. Experiments demonstrate that
SDA-PLANNER consistently outperforms baselines in success rate and goal
completion, particularly under diverse error conditions.
☆ Kinodynamic Motion Planning for Mobile Robot Navigation across Inconsistent World Models RSS
Mobile ground robots lacking prior knowledge of an environment must rely on
sensor data to develop a model of their surroundings. In these scenarios,
consistent identification of obstacles and terrain features can be difficult
due to noise and algorithmic shortcomings, which can hinder motion planning
systems from generating safe motions. One particular difficulty to
overcome is when regions of the cost map switch between being marked as
obstacles and free space through successive planning cycles. One potential
solution to this, which we refer to as Valid in Every Hypothesis (VEH), is for
the planning system to plan motions that are guaranteed to be safe through a
history of world models. Another approach is to track a history of world
models, and adjust node costs according to the potential penalty of needing to
reroute around previously hazardous areas. This work discusses three major
iterations on this idea. The first iteration, called PEH, invokes a sub-search
for every node expansion that crosses through a divergence point in the world
models. The second and third iterations, called GEH and GEGRH respectively,
defer the sub-search until after an edge expands into the goal region. GEGRH
uses an additional step to revise the graph based on divergent nodes in each
world. Initial results showed that, although PEH and GEH find more optimistic
solutions than VEH, they are unable to generate solutions in less than one
second, which exceeds our requirements for field deployment. Analysis of
results from a field experiment in an unstructured, off-road environment on a
Clearpath Robotics Warthog UGV indicate that GEGRH finds lower cost
trajectories and has faster average planning times than VEH. Compared to
single-hypothesis (SH) search, where only the latest world model is considered,
GEGRH generates more conservative plans with a small increase in average
planning time.
comment: Presented at the Robotics: Science and Systems (RSS) 2025 Workshop on
Resilient Off-road Autonomous Robotics (ROAR)
☆ LLM-MCoX: Large Language Model-based Multi-robot Coordinated Exploration and Search
Autonomous exploration and object search in unknown indoor environments
remain challenging for multi-robot systems (MRS). Traditional approaches often
rely on greedy frontier assignment strategies with limited inter-robot
coordination. In this work, we introduce LLM-MCoX (LLM-based Multi-robot
Coordinated Exploration and Search), a novel framework that leverages Large
Language Models (LLMs) for intelligent coordination of both homogeneous and
heterogeneous robot teams tasked with efficient exploration and target object
search. Our approach combines real-time LiDAR scan processing for frontier
cluster extraction and doorway detection with multimodal LLM reasoning (e.g.,
GPT-4o) to generate coordinated waypoint assignments based on shared
environment maps and robot states. LLM-MCoX demonstrates superior performance
compared to existing methods, including greedy and Voronoi-based planners,
achieving 22.7% faster exploration times and 50% improved search efficiency in
large environments with 6 robots. Notably, LLM-MCoX enables natural
language-based object search capabilities, allowing human operators to provide
high-level semantic guidance that traditional algorithms cannot interpret.
☆ Anomaly detection for generic failure monitoring in robotic assembly, screwing and manipulation
Niklas Grambow, Lisa-Marie Fenner, Felipe Kempkes, Philip Hotz, Dingyuan Wan, Jörg Krüger, Kevin Haninger
Out-of-distribution states in robot manipulation often lead to unpredictable
robot behavior or task failure, limiting success rates and increasing risk of
damage. Anomaly detection (AD) can identify deviations from expected patterns
in data, which can be used to trigger failsafe behaviors and recovery
strategies. Prior work has applied data-driven AD to time series data in
specific robotic tasks, but its transferability across control strategies and
task types has not been shown. Leveraging time series data, such as
force/torque signals, makes it possible to directly capture robot-environment
interactions, which is crucial for manipulation and online failure detection. Their
broad availability, high sampling rates, and low dimensionality enable high
temporal resolution and efficient processing. As robotic tasks can have widely
varying signal characteristics and requirements, AD methods that can be applied
in the same way to a wide range of tasks are needed, ideally with good data
efficiency.
We examine three industrial robotic tasks, each presenting several anomalies.
Test scenarios in robotic cabling, screwing, and sanding are built, and
multimodal time series data is gathered. Several autoencoder-based methods are
compared, evaluating generalization across tasks and control methods (diffusion
policy, position, and impedance control). This allows us to validate the
integration of AD in complex tasks involving tighter tolerances and variation
from both the robot and its environment. Additionally, we evaluate data
efficiency, detection latency, and task characteristics which support robust
detection. The results indicate reliable detection, with AUROC exceeding 0.93,
of failures in the cabling and screwing tasks, such as incorrect or misaligned
parts and obstructed targets. In the polishing task, only severe failures were
reliably detected, while more subtle failure types remained undetected.
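The autoencoder family evaluated in the paper can be sketched generically: train on nominal force/torque windows and flag windows with high reconstruction error. Shapes and layer sizes below are assumptions.

```python
# Reconstruction-error anomaly detection over multimodal time series windows.
import torch
import torch.nn as nn

window, channels = 64, 6  # assumed: 6-axis F/T, 64-sample windows

ae = nn.Sequential(
    nn.Flatten(), nn.Linear(window * channels, 128), nn.ReLU(),
    nn.Linear(128, 16), nn.ReLU(),                     # bottleneck
    nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, window * channels))

def train_step(batch, opt):
    recon = ae(batch).view_as(batch)
    loss = nn.functional.mse_loss(recon, batch)        # nominal data only
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def is_anomalous(x, threshold):
    with torch.no_grad():
        err = nn.functional.mse_loss(ae(x).view_as(x), x, reduction="none")
        return err.mean(dim=(1, 2)) > threshold        # per-window decision
```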
☆ ISyHand: A Dexterous Multi-finger Robot Hand with an Articulated Palm
The rapid increase in the development of humanoid robots and customized
manufacturing solutions has brought dexterous manipulation to the forefront of
modern robotics. Over the past decade, several expensive dexterous hands have
come to market, but advances in hardware design, particularly in servo motors
and 3D printing, have recently facilitated an explosion of cheaper open-source
hands. Most hands are anthropomorphic to allow use of standard human tools, and
attempts to increase dexterity often sacrifice anthropomorphism. We introduce
the open-source ISyHand (pronounced easy-hand), a highly dexterous, low-cost,
easy-to-manufacture, on-joint servo-driven robot hand. Our hand uses
off-the-shelf Dynamixel motors, fasteners, and 3D-printed parts, can be
assembled within four hours, and has a total material cost of about 1,300 USD.
The ISyHand's unique articulated-palm design increases overall dexterity with
only a modest sacrifice in anthropomorphism. To demonstrate the utility of the
articulated palm, we use reinforcement learning in simulation to train the hand
to perform a classical in-hand manipulation task: cube reorientation. Our
novel, systematic experiments show that the simulated ISyHand outperforms the
two most comparable hands in early training phases, that all three perform
similarly well after policy convergence, and that the ISyHand significantly
outperforms a fixed-palm version of its own design. Additionally, we deploy a
policy trained on cube reorientation on the real hand, demonstrating its
ability to perform real-world dexterous manipulation.
comment: Accepted at IEEE Humanoids 2025
☆ Terrain-Awared LiDAR-Inertial Odometry for Legged-Wheel Robots Based on Radial Basis Function Approximation
Accurate odometry is essential for legged-wheel robots operating in
unstructured terrains such as bumpy roads and staircases. Existing methods
often suffer from pose drift due to their ignorance of terrain geometry. We
propose a terrain-aware LiDAR-Inertial odometry (LIO) framework that
approximates the terrain using Radial Basis Functions (RBF) whose centers are
adaptively selected and weights are recursively updated. The resulting smooth
terrain manifold enables "soft constraints" that regularize the odometry
optimization and mitigates the $z$-axis pose drift under abrupt elevation
changes during the robot's maneuvers. To ensure the LIO's real-time performance, we
further evaluate the RBF-related terms and calculate the inverse of the sparse
kernel matrix with GPU parallelization. Experiments on unstructured terrains
demonstrate that our method achieves higher localization accuracy than the
state-of-the-art baselines, especially in scenarios with continuous height
changes or with sparse features during abrupt height changes.
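The RBF terrain representation can be sketched with a one-shot regularized fit; the paper instead selects centers adaptively and updates weights recursively, with GPU-parallelized kernel inversion.

```python
# Fit Gaussian RBF weights to ground points so terrain height is a smooth
# function h(x, y); centers and length scale here are fixed for simplicity.
import numpy as np

def fit_rbf(centers, pts_xy, pts_z, length=0.5, reg=1e-3):
    d = np.linalg.norm(pts_xy[:, None, :] - centers[None, :, :], axis=-1)
    K = np.exp(-(d / length) ** 2)                  # Gaussian RBF kernel
    return np.linalg.solve(K.T @ K + reg * np.eye(len(centers)), K.T @ pts_z)

def height(centers, w, xy, length=0.5):
    d = np.linalg.norm(xy[None, :] - centers, axis=-1)
    return np.exp(-(d / length) ** 2) @ w
```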
☆ Side Scan Sonar-based SLAM for Autonomous Algae Farm Monitoring
The transition of seaweed farming to an alternative food source on an
industrial scale relies on automating its processes through smart farming,
equivalent to land agriculture. Key to this process are autonomous underwater
vehicles (AUVs) via their capacity to automate crop and structural inspections.
However, the current bottleneck for their deployment is ensuring safe
navigation within farms, which requires an accurate, online estimate of the AUV
pose and map of the infrastructure. To enable this, we propose an efficient
side scan sonar-based (SSS) simultaneous localization and mapping (SLAM)
framework that exploits the geometry of kelp farms by modeling structural
ropes in the back-end as sequences of individual landmarks from each SSS ping
detection, instead of combining detections into elongated representations. Our
method outperforms state-of-the-art solutions in hardware-in-the-loop (HIL)
experiments on a real AUV survey in a kelp farm. The framework and dataset can
be found at https://github.com/julRusVal/sss_farm_slam.
☆ Autonomous Multi-Robot Infrastructure for AI-Enabled Healthcare Delivery and Diagnostics
This research presents a multi-robot system for inpatient care, designed
using swarm intelligence principles and incorporating wearable health sensors,
RF-based communication, and AI-driven decision support. Within a simulated
hospital environment, the system adopts a leader-follower swarm configuration
to perform patient monitoring, medicine delivery, and emergency assistance. Due
to ethical constraints, live patient trials were not conducted; instead,
validation was carried out through controlled self-testing with wearable
sensors. The Leader Robot acquires key physiological parameters, including
temperature, SpO2, and heart rate, performs fall detection, and coordinates
other robots
when required. The Assistant Robot patrols corridors for medicine delivery,
while a robotic arm provides direct drug administration. The swarm-inspired
leader-follower strategy enhanced communication reliability and ensured
continuous monitoring, including automated email alerts to healthcare staff.
The system hardware was implemented using Arduino, Raspberry Pi, NRF24L01 RF
modules, and a HuskyLens AI camera. Experimental evaluation showed an overall
sensor accuracy above 94%, a 92% task-level success rate, and a 96%
communication reliability rate, demonstrating system robustness. Furthermore,
the AI-enabled decision support was able to provide early warnings of abnormal
health conditions, highlighting the potential of the system as a cost-effective
solution for hospital automation and patient safety.
comment: 11 pages, 5 figures, MSc dissertation submission draft, prepared for
conference/journal consideration
☆ Evolutionary Continuous Adaptive RL-Powered Co-Design for Humanoid Chin-Up Performance
Humanoid robots have seen significant advancements in both design and
control, with a growing emphasis on integrating these aspects to enhance
overall performance. Traditionally, robot design has followed a sequential
process, where control algorithms are developed after the hardware is
finalized. However, this can be myopic and prevent robots from fully exploiting
their hardware capabilities. Recent approaches advocate for co-design,
optimizing both design and control in parallel to maximize robotic
capabilities. This paper presents the Evolutionary Continuous Adaptive RL-based
Co-Design (EA-CoRL) framework, which combines reinforcement learning (RL) with
evolutionary strategies to enable continuous adaptation of the control policy
to the hardware. EA-CoRL comprises two key components: Design Evolution, which
explores the hardware choices using an evolutionary algorithm to identify
efficient configurations, and Policy Continuous Adaptation, which fine-tunes a
task-specific control policy across evolving designs to maximize performance
rewards. We evaluate EA-CoRL by co-designing the actuators (gear ratios) and
control policy of the RH5 humanoid for a highly dynamic chin-up task,
previously unfeasible due to actuator limitations. Comparative results against
state-of-the-art RL-based co-design methods show that EA-CoRL achieves a higher
fitness score and broader design-space exploration, highlighting the critical
role of continuous policy adaptation in robot co-design.
☆ Conflict-Based Search and Prioritized Planning for Multi-Agent Path Finding Among Movable Obstacles
This paper investigates Multi-Agent Path Finding Among Movable Obstacles
(M-PAMO), which seeks collision-free paths for multiple agents from their start
to goal locations among static and movable obstacles. M-PAMO arises in
logistics and warehouses where mobile robots are among unexpected movable
objects. Although Multi-Agent Path Finding (MAPF) and single-agent Path
planning Among Movable Obstacles (PAMO) were both studied, M-PAMO remains
under-explored. Movable obstacles lead to new fundamental challenges as the
state space, which includes both agents and movable obstacles, grows
exponentially with respect to the number of agents and movable obstacles. In
particular, movable obstacles often closely couple agents together spatially
and temporally. This paper makes a first attempt to address M-PAMO by adapting
and fusing the popular Conflict-Based Search (CBS) and Prioritized Planning (PP)
methods for MAPF with a recent single-agent PAMO planner called PAMO*. We
compare their performance with up to 20 agents and hundreds of movable
obstacles, and show the pros and cons of these approaches.
☆ Towards Human Engagement with Realistic AI Combat Pilots
We present a system that enables real-time interaction between human users
and agents trained to control fighter jets in simulated 3D air combat
scenarios. The agents are trained in a dedicated environment using Multi-Agent
Reinforcement Learning. A communication link is developed to allow seamless
deployment of trained agents into VR-Forces, a widely used defense simulation
tool for realistic tactical scenarios. This integration allows mixed
simulations where human-controlled entities engage with intelligent agents
exhibiting distinct combat behaviors. Our interaction model creates new
opportunities for human-agent teaming, immersive training, and the exploration
of innovative tactics in defense contexts.
comment: 13th International Conference on Human-Agent Interaction (HAI) 2025
☆ On the Conic Complementarity of Planar Contacts
We present a unifying theoretical result that connects two foundational
principles in robotics: the Signorini law for point contacts, which underpins
many simulation methods for preventing object interpenetration, and the center
of pressure (also known as the zero-moment point), a key concept used in, for
instance, optimization-based locomotion control. Our contribution is the planar
Signorini condition, a conic complementarity formulation that models general
planar contacts between rigid bodies. We prove that this formulation is
equivalent to enforcing the punctual Signorini law across an entire contact
surface, thereby bridging the gap between discrete and continuous contact
models. A geometric interpretation reveals that the framework naturally
captures three physical regimes (sticking, separating, and tilting) within a
unified complementarity structure. This leads to a principled extension of the
classical center of pressure, which we refer to as the extended center of
pressure. By establishing this connection, our work provides a mathematically
consistent and computationally tractable foundation for handling planar
contacts, with implications for both the accurate simulation of contact
dynamics and the design of advanced control and optimization algorithms in
locomotion and manipulation.
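For reference, the punctual Signorini law that the paper lifts to planar contacts is classically written as a scalar complementarity condition between the gap function and the normal force (a standard form, shown here for context):

```latex
% Punctual Signorini law at a point contact: non-negative gap g(q),
% non-negative normal force \lambda_n, and no force at a distance.
0 \;\le\; g(q) \;\perp\; \lambda_n \;\ge\; 0
\quad\Longleftrightarrow\quad
g(q) \ge 0, \quad \lambda_n \ge 0, \quad g(q)\,\lambda_n = 0.
```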
☆ Emotionally Expressive Robots: Implications for Children's Behavior toward Robot
The growing development of robots with artificial emotional expressiveness
raises important questions about their persuasive potential in children's
behavior. While research highlights the pragmatic value of emotional
expressiveness in human social communication, the extent to which robotic
expressiveness can or should influence empathic responses in children is
grounds for debate. In a pilot study with 22 children (aged 7-11) we begin to
explore the ways in which different levels of embodied expressiveness (body
only, face only, body and face) of two basic emotions (happiness and sadness)
displayed by an anthropomorphic robot (QTRobot) might modify children's
behavior in a child-robot cooperative turn-taking game. We observed that
children aligned their behavior to the robot's inferred emotional state.
However, higher levels of expressiveness did not result in increased alignment.
The preliminary results reported here provide a starting point for reflecting
on robotic expressiveness and its role in shaping children's social-emotional
behavior toward robots as social peers in the near future.
☆ S$^3$E: Self-Supervised State Estimation for Radar-Inertial System
Millimeter-wave radar for state estimation is gaining significant attention
for its affordability and reliability in harsh conditions. Existing
localization solutions typically rely on post-processed radar point clouds as
landmark points. Nonetheless, the inherent sparsity of radar point clouds,
ghost points from multi-path effects, and limited angle resolution in
single-chirp radar severely degrade state estimation performance. To address
these issues, we propose S$^3$E, a Self-Supervised State Estimator that employs
more richly informative radar signal spectra to bypass sparse points and fuses
complementary inertial information to achieve accurate localization. S$^3$E
fully explores the association between the exteroceptive radar and the
proprioceptive inertial sensor to achieve complementary benefits. To deal with
limited angle
resolution, we introduce a novel cross-fusion technique that enhances spatial
structure information by exploiting subtle rotational shift correlations across
heterogeneous data. The experimental results demonstrate our method achieves
robust and accurate performance without relying on localization ground truth
supervision. To the best of our knowledge, this is the first attempt to achieve
state estimation by fusing radar spectra and inertial data in a complementary
self-supervised manner.
☆ MUVLA: Learning to Explore Object Navigation via Map Understanding
In this paper, we present MUVLA, a Map Understanding Vision-Language-Action
model tailored for object navigation. It leverages semantic map abstractions to
unify and structure historical information, encoding spatial context in a
compact and consistent form. MUVLA takes the current and history observations,
as well as the semantic map, as inputs and predicts the action sequence based
on the description of the goal object. Furthermore, it amplifies supervision
through reward-guided return modeling based on dense short-horizon progress
signals, enabling the model to develop a detailed understanding of action value
for reward maximization. MUVLA employs a three-stage training pipeline:
learning map-level spatial understanding, imitating behaviors from
mixed-quality demonstrations, and reward amplification. This strategy allows
MUVLA to unify diverse demonstrations into a robust spatial representation and
generate more rational exploration strategies. Experiments on HM3D and Gibson
benchmarks demonstrate that MUVLA achieves strong generalization and learns
effective exploration behaviors even from low-quality or partially successful
trajectories.
☆ Towards Intuitive Human-Robot Interaction through Embodied Gesture-Driven Control with Woven Tactile Skins
ChunPing Lam, Xiangjia Chen, Chenming Wu, Hao Chen, Binzhi Sun, Guoxin Fang, Charlie C. L. Wang, Chengkai Dai, Yeung Yam
This paper presents a novel human-robot interaction (HRI) framework that
enables intuitive gesture-driven control through a capacitance-based woven
tactile skin. Unlike conventional interfaces that rely on panels or handheld
devices, the woven tactile skin integrates seamlessly with curved robot
surfaces, enabling embodied interaction and narrowing the gap between human
intent and robot response. Its woven design combines fabric-like flexibility
with structural stability and dense multi-channel sensing through the
interlaced conductive threads. Building on this capability, we define a
gesture-action mapping of 14 single- and multi-touch gestures that cover
representative robot commands, including task-space motion and auxiliary
functions. A lightweight convolution-transformer model designed for gesture
recognition in real time achieves near-100% accuracy, outperforming prior
baseline approaches. Experiments on robot arm tasks, including pick-and-place
and pouring, demonstrate that our system reduces task completion time by up to
57% compared with keyboard panels and teach pendants. Overall, our proposed
framework demonstrates a practical pathway toward more natural and efficient
embodied HRI.
☆ State Estimation for Compliant and Morphologically Adaptive Robots ICRA 2026
Locomotion robots with active or passive compliance can show robustness to
uncertain scenarios, which can be promising for agricultural, research and
environmental industries. However, state estimation for these robots is
challenging due to the lack of rigid-body assumptions and kinematic changes
from morphing. We propose a method to estimate typical rigid-body states
alongside compliance-related states, such as soft robot shape in different
morphologies and locomotion modes. Our neural network-based state estimator
uses a history of states and a mechanism to directly influence unreliable
sensors. We test our framework on the GOAT platform, a robot capable of passive
compliance and active morphing for extreme outdoor terrain. The network is
trained on motion capture data in a novel compliance-centric frame that
accounts for morphing-related states. Our method predicts shape-related
measurements within 4.2% of the robot's size, velocities within 6.3% and 2.4%
of the top linear and angular speeds, respectively, and orientation within 1.5
degrees. We also demonstrate a 300% increase in travel range during a motor
malfunction when using our estimator for closed-loop autonomous outdoor
operation.
comment: 8 pages, 10 figures, 1 table, submitted to ICRA 2026
☆ Preemptive Spatiotemporal Trajectory Adjustment for Heterogeneous Vehicles in Highway Merging Zones
Yuan Li, Xiaoxue Xu, Xiang Dong, Junfeng Hao, Tao Li, Sana Ullaha, Chuangrui Huang, Junjie Niu, Ziyan Zhao, Ting Peng
To address driver perception lag and the inefficient use of space-time
resources in expressway ramp merging areas, we build on a preemptive
spatiotemporal trajectory adjustment system and, from the perspective of
coordinating spatiotemporal resources, quantitatively analyze the appropriate
safe space-time distance for trajectory pre-preparation. The minimum safety gap
required for ramp vehicles to merge into the mainline is derived by accounting
for dual positioning error and spatiotemporal trajectory tracking error. We
then propose a merging control strategy for heterogeneous autonomous vehicles
that integrates vehicle type, driving intention, and safe spatiotemporal
distance, and we detail the merging strategies of ramp target vehicles and
mainline cooperative vehicles for different vehicle types. Simulations over the
full combination of traffic flow and speed scenarios are conducted.
Time-position-speed diagrams are compared to qualitatively analyze vehicle
operating characteristics and the dynamics of merging, while average speed and
average delay serve as evaluation indices to quantify the advantages of the
preemptive cooperative merging control strategy. The results show maximum
average delay improvements of 90.24% for mainline vehicles and 74.24% for ramp
vehicles. The proposed strategy effectively avoids potential vehicle conflicts
and emergency braking, improves driving safety in the merging area, and shows
significant advantages in driving stability and overall traffic efficiency.
☆ Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation
Zitong Bo, Yue Hu, Jinming Ma, Mingliang Zhou, Junhui Yin, Yachen Kang, Yuqi Liu, Tong Wu, Diyun Xiang, Hao Chen
Enabling robots to execute long-horizon manipulation tasks from free-form
language instructions remains a fundamental challenge in embodied AI. While
vision-language models (VLMs) have shown promise as high-level planners, their
deployment in the real world is hindered by two gaps: (i) the scarcity of
large-scale, sequential manipulation data that couples natural language with
multi-step action plans, and (ii) the absence of dense, interpretable rewards
for fine-tuning VLMs on planning objectives. To address these issues, we
propose REVER, a framework that empowers VLMs to generate and validate
long-horizon manipulation plans from natural language instructions in
real-world scenarios. Under REVER, we train and release RoboFarseer, a VLM
incentivized to emit chain-of-thought traces that perform temporal and spatial
reasoning, ensuring physically plausible and logically coherent plans. To
obtain training data, we leverage the Universal Manipulation Interface
framework to capture hardware-agnostic demonstrations of atomic skills. An
automated annotation engine converts each demonstration into a
vision-instruction-plan triplet. We introduce a verifiable reward that scores
the generated plan by its ordered bipartite matching overlap with the
ground-truth skill sequence. At run time, the fine-tuned VLM functions both as
a planner and as a monitor, verifying step-wise completion. RoboFarseer matches
or exceeds the performance of proprietary models that are orders of magnitude
larger, while on open-ended planning it surpasses the best baseline by more
than 40%. In real-world, long-horizon tasks, the complete system boosts overall
success by roughly 60% compared with the same low-level controller without the
planner. We will open-source both the dataset and the trained model upon
publication.
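One natural instantiation of an ordered bipartite matching overlap (our reading, not necessarily the paper's exact metric) is a longest-common-subsequence score between predicted and ground-truth skill sequences:

```python
# An order-preserving one-to-one matching between two skill sequences is a
# longest common subsequence; normalize it to [0, 1] for use as a reward.
def ordered_overlap_reward(pred: list[str], gt: list[str]) -> float:
    m, n = len(pred), len(gt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gt[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

print(ordered_overlap_reward(["grasp", "lift", "place"],
                             ["grasp", "move", "lift", "place"]))  # 0.75
```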
☆ Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies NeurIPS 2025
Existing imitation learning methods decouple perception and action, which
overlooks the causal reciprocity between sensory representations and action
execution that humans naturally leverage for adaptive behaviors. To bridge this
gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified
representation learning framework that explicitly models a dynamic interplay
between perception and action through probabilistic latent dynamics. DP-AG encodes
latent observations into a Gaussian posterior via variational inference and
evolves them using an action-guided SDE, where the Vector-Jacobian Product
(VJP) of the diffusion policy's noise predictions serves as a structured
stochastic force driving latent updates. To promote bidirectional learning
between perception and action, we introduce a cycle-consistent contrastive
loss that organizes the gradient flow of the noise predictor into a coherent
perception-action loop, enforcing mutually consistent transitions in both
latent updates and action refinements. Theoretically, we derive a variational
lower bound for the action-guided SDE, and prove that the contrastive objective
enhances continuity in both latent and action trajectories. Empirically, DP-AG
significantly outperforms state-of-the-art methods across simulation
benchmarks and real-world UR5 manipulation tasks. As a result, our DP-AG
offers a promising step toward bridging biological adaptability and artificial
policy learning.
comment: 42 pages, 17 figures, 39th Conference on Neural Information
Processing Systems (NeurIPS 2025)
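A rough sketch of the action-guided latent update: an Euler-Maruyama step whose drift is the VJP of the noise predictor. The choice of contraction vector (here, the predicted noise itself) and all step sizes are our assumptions, not details from the paper.

```python
# Latent SDE step driven by the Vector-Jacobian Product of the diffusion
# policy's noise predictor with respect to the latent.
import torch
from torch.autograd.functional import vjp

def latent_sde_step(z, action, noise_pred_fn, dt=0.01, sigma=0.05):
    # VJP of eps = noise_pred_fn(z, action) w.r.t. z; contraction vector is
    # chosen as the predicted noise itself purely for illustration.
    eps, drift = vjp(lambda zz: noise_pred_fn(zz, action), z,
                     v=noise_pred_fn(z, action))
    noise = sigma * torch.sqrt(torch.tensor(dt)) * torch.randn_like(z)
    return z + dt * drift + noise
```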
☆ SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding
Training expressive flow-based policies with off-policy reinforcement
learning is notoriously unstable due to gradient pathologies in the multi-step
action sampling process. We trace this instability to a fundamental connection:
the flow rollout is algebraically equivalent to a residual recurrent
computation, making it susceptible to the same vanishing and exploding
gradients as RNNs. To address this, we reparameterize the velocity network
using principles from modern sequential models, introducing two stable
architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which
utilizes a decoded velocity. We then develop a practical SAC-based algorithm,
enabled by a noise-augmented rollout, that facilitates direct end-to-end
training of these policies. Our approach supports both from-scratch and
offline-to-online learning and achieves state-of-the-art performance on
continuous control and robotic manipulation benchmarks, eliminating the need
for common workarounds like policy distillation or surrogate objectives.
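The gated-velocity idea can be sketched as a GRU-style gate wrapped around the velocity network, turning each flow step into a gated rather than plain residual update; layer sizes and the exact gating form below are assumptions.

```python
# Gated velocity: a sigmoid gate interpolates between keeping the current
# action and taking the residual flow step, taming gradient pathologies
# through the multi-step rollout.
import torch
import torch.nn as nn

class GatedVelocity(nn.Module):
    def __init__(self, act_dim, obs_dim, hidden=256):
        super().__init__()
        inp = act_dim + obs_dim + 1                    # action, obs, flow time t
        self.v = nn.Sequential(nn.Linear(inp, hidden), nn.SiLU(),
                               nn.Linear(hidden, act_dim))
        self.gate = nn.Sequential(nn.Linear(inp, act_dim), nn.Sigmoid())

    def step(self, a, obs, t, dt):
        x = torch.cat([a, obs, t.expand(a.shape[0], 1)], dim=-1)
        g = self.gate(x)                               # per-dimension gate
        return (1 - g) * a + g * (a + dt * self.v(x))  # gated residual update
```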
☆ Best of Sim and Real: Decoupled Visuomotor Manipulation via Learning Control in Simulation and Perception in Real
Sim-to-real transfer remains a fundamental challenge in robot manipulation
due to the entanglement of perception and control in end-to-end learning. We
present a decoupled framework that learns each component where it is most
reliable: control policies are trained in simulation with privileged state to
master spatial layouts and manipulation dynamics, while perception is adapted
only at deployment to bridge real observations to the frozen control policy.
Our key insight is that control strategies and action patterns are universal
across environments and can be learned in simulation through systematic
randomization, while perception is inherently domain-specific and must be
learned where visual observations are authentic. Unlike existing end-to-end
approaches that require extensive real-world data, our method achieves strong
performance with only 10-20 real demonstrations by reducing the complex
sim-to-real problem to a structured perception alignment task. We validate our
approach on tabletop manipulation tasks, demonstrating superior data efficiency
and out-of-distribution generalization compared to end-to-end baselines. The
learned policies successfully handle object positions and scales beyond the
training distribution, confirming that decoupling perception from control
fundamentally improves sim-to-real transfer.
comment: 10 pages, 6 figures
☆ TacRefineNet: Tactile-Only Grasp Refinement Between Arbitrary In-Hand Object Poses
Despite progress in both traditional dexterous grasping pipelines and recent
Vision-Language-Action (VLA) approaches, the grasp execution stage remains
prone to pose inaccuracies, especially in long-horizon tasks, which undermines
overall performance. To address this "last-mile" challenge, we propose
TacRefineNet, a tactile-only framework that achieves fine in-hand pose
refinement of known objects in arbitrary target poses using multi-finger
fingertip sensing. Our method iteratively adjusts the end-effector pose based
on tactile feedback, aligning the object to the desired configuration. We
design a multi-branch policy network that fuses tactile inputs from multiple
fingers along with proprioception to predict precise control updates. To train
this policy, we combine large-scale simulated data from a physics-based tactile
model in MuJoCo with real-world data collected from a physical system.
Comparative experiments show that pretraining on simulated data and fine-tuning
with a small amount of real data significantly improves performance over
simulation-only training. Extensive real-world experiments validate the
effectiveness of the method, achieving millimeter-level grasp accuracy using
only tactile input. To our knowledge, this is the first method to enable
arbitrary in-hand pose refinement via multi-finger tactile sensing alone.
Project website is available at https://sites.google.com/view/tacrefinenet
comment: 9 pages, 9 figures
☆ Boundary-to-Region Supervision for Offline Safe Reinforcement Learning NeurIPS 2025
Offline safe reinforcement learning aims to learn policies that satisfy
predefined safety constraints from static datasets. Existing
sequence-model-based methods condition action generation on symmetric input
tokens for return-to-go and cost-to-go, neglecting their intrinsic asymmetry:
return-to-go (RTG) serves as a flexible performance target, while cost-to-go
(CTG) should represent a rigid safety boundary. This symmetric conditioning
leads to unreliable constraint satisfaction, especially when encountering
out-of-distribution cost trajectories. To address this, we propose
Boundary-to-Region (B2R), a framework that enables asymmetric conditioning
through cost signal realignment. B2R redefines CTG as a boundary constraint
under a fixed safety budget, unifying the cost distribution of all feasible
trajectories while preserving reward structures. Combined with rotary
positional embeddings, it enhances exploration within the safe region.
Experimental results show that B2R satisfies safety constraints in 35 out of 38
safety-critical tasks while achieving superior reward performance over baseline
methods. This work highlights the limitations of symmetric token conditioning
and establishes a new theoretical and practical approach for applying sequence
models to safe RL. Our code is available at https://github.com/HuikangSu/B2R.
comment: NeurIPS 2025
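A heavily hedged sketch of one possible reading of the cost-signal realignment: feasible trajectories are relabeled with the remaining allowance under a single fixed budget instead of their own cost-to-go. This is our interpretation for illustration, not the authors' code.

```python
# Relabel CTG tokens against one fixed safety budget: every feasible
# trajectory is conditioned on the remaining allowance, unifying the cost
# distribution across feasible data.
import numpy as np

def realign_ctg(costs, budget):
    """costs: per-step costs of one trajectory; returns boundary CTG tokens."""
    cum = np.concatenate([[0.0], np.cumsum(costs)[:-1]])  # cost spent so far
    feasible = costs.sum() <= budget
    return (budget - cum) if feasible else None  # infeasible: handled separately

print(realign_ctg(np.array([1.0, 0.0, 2.0]), budget=5.0))  # [5. 4. 4.]
```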
☆ VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning
Si-Cheng Wang, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Ao-Qun Jin, Zeng-Guang Hou
Reinforcement learning (RL) is a promising avenue for post-training
vision-language-action (VLA) models, but practical deployment is hindered by
sparse rewards and unstable training. This work mitigates these challenges by
introducing action-chunked proximal policy optimization (PPO) combined with
behavior cloning on self-collected demonstrations. Aggregating consecutive
actions into chunks improves the temporal consistency of the policy and the
density of informative feedback. In addition, an auxiliary behavior cloning
loss is applied with a dynamically updated demonstration buffer that
continually collects high-quality task trials during training. The relative
weight between the action-chunked PPO objective and the self-behavior-cloning
auxiliary loss is adapted online to stabilize the post-training process.
Experiments on the MetaWorld benchmark indicate improved performance over
supervised fine-tuning, achieving a high success rate (0.93) and few steps to
success (42.17). These results demonstrate the viability of RL for VLA
post-training and help lay the groundwork for downstream VLA applications.
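The combined objective can be sketched as a PPO clipped surrogate over action chunks plus a weighted behavior-cloning term, with the weight adapted online; the names and the adaptation rule below are assumptions.

```python
# PPO clipped loss on chunk-level probability ratios plus a BC loss on
# buffered high-quality trials, with an online-adapted relative weight.
import torch

def combined_loss(ratio, advantage, bc_logprob, lam, clip=0.2):
    ppo = -torch.min(ratio * advantage,
                     torch.clamp(ratio, 1 - clip, 1 + clip) * advantage).mean()
    bc = -bc_logprob.mean()      # log-likelihood of buffered demo chunks
    return ppo + lam * bc

def adapt_lambda(lam, ppo_loss, bc_loss, target_ratio=1.0, lr=0.01):
    # One plausible rule: keep the two terms at a fixed magnitude ratio.
    return max(0.0, lam + lr * (abs(ppo_loss) * target_ratio - lam * abs(bc_loss)))
```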
☆ OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation
Xinda Xue, Junjun Hu, Minghua Luo, Xie Shichao, Jintao Chen, Zixun Xie, Quan Kuichen, Guo Wei, Mu Xu, Zedong Chu
Embodied navigation presents a core challenge for intelligent robots,
requiring the comprehension of visual environments, natural language
instructions, and autonomous exploration. Existing models often fall short in
offering a unified solution across diverse navigation paradigms, resulting in
low success rates and limited generalization. We introduce OmniNav, a unified
framework addressing instruct-goal, object-goal, point-goal navigation, and
frontier-based exploration within a single architecture. Our approach features
a lightweight, low-latency policy that accurately predicts continuous-space
waypoints (coordinates and orientations). This policy surpasses action-chunk
methods in precision and supports real-world deployment at control frequencies
up to 5 Hz. Architecturally, OmniNav employs a fast-slow system design: a fast
module generates waypoints using short-horizon visual context and subtasks,
while a slow module performs deliberative planning with long-horizon
observations and candidate frontiers to select subsequent subgoals and
subtasks. This collaboration enhances path efficiency and maintains trajectory
coherence, particularly in exploration and memory-intensive scenarios.
Crucially, we identify that the primary bottleneck is not merely navigation
policy learning, but a robust understanding of general instructions and
objects. To boost generalization, OmniNav integrates large-scale,
general-purpose training datasets, including those for image captioning and
visual recognition, into a joint multi-task regimen. This significantly
improves success rates and robustness. Extensive experiments confirm OmniNav's
state-of-the-art performance across various navigation benchmarks, with
real-world deployment further validating its efficacy. OmniNav provides
practical insights for embodied navigation, charting a scalable path towards
versatile, highly generalizable robotic intelligence.
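The fast-slow split can be pictured with a simple control loop; all interfaces
below are hypothetical placeholders for OmniNav's modules.

```python
def fast_slow_loop(fast_policy, slow_planner, get_obs, execute,
                   steps: int, replan_every: int = 10):
    """Illustrative fast-slow navigation loop; all interfaces are hypothetical.

    The slow module deliberates over long-horizon observations to choose the
    next subtask; the fast module maps short-horizon context plus that subtask
    to a continuous waypoint (x, y, yaw) at control rate.
    """
    history, subtask = [], None
    for t in range(steps):
        history.append(get_obs())
        if t % replan_every == 0:                      # slow, deliberative planning
            subtask = slow_planner(history)
        waypoint = fast_policy(history[-4:], subtask)  # fast, low-latency policy
        execute(waypoint)
```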
☆ Hierarchical Diffusion Motion Planning with Task-Conditioned Uncertainty-Aware Priors
We propose a novel hierarchical diffusion planner that embeds task and motion
structure directly in the noise model. Unlike standard diffusion-based planners
that use zero-mean, isotropic Gaussian noise, we employ a family of
task-conditioned structured Gaussians whose means and covariances are derived
from Gaussian Process Motion Planning (GPMP): sparse, task-centric key states
or their associated timings (or both) are treated as noisy observations to
produce a prior instance. We first generalize the standard diffusion process to
biased, non-isotropic corruption with closed-form forward and posterior
expressions. Building on this, our hierarchy separates prior instantiation from
trajectory denoising: the upper level instantiates a task-conditioned
structured Gaussian (mean and covariance), and the lower level denoises the
full trajectory under that fixed prior. Experiments on Maze2D goal-reaching and
KUKA block stacking show improved success rates, smoother trajectories, and
stronger task alignment compared to isotropic baselines. Ablation studies
indicate that explicitly structuring the corruption process offers benefits
beyond simply conditioning the neural network. Overall, our method concentrates
the prior's probability mass near feasible, smooth, and semantically meaningful
trajectories while maintaining tractability. Our project page is available at
https://hta-diffusion.github.io.
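One way to realize a biased, non-isotropic forward process consistent with the
abstract is sketched below; the closed-form choice of mean and covariance is an
assumption, not the authors' exact derivation.

```python
import numpy as np

def biased_forward_sample(x0, mu, Sigma_chol, alpha_bar_t, rng):
    """One plausible biased, non-isotropic forward corruption q(x_t | x_0).

    The process drifts toward a GPMP-derived prior mean `mu` (instead of zero)
    and injects noise shaped by the prior covariance, supplied here via its
    Cholesky factor. At alpha_bar_t -> 1 the sample equals x0; at
    alpha_bar_t -> 0 it is drawn from the structured prior N(mu, Sigma).
    """
    s = np.sqrt(alpha_bar_t)
    mean = s * x0 + (1.0 - s) * mu                    # biased toward the prior
    eps = rng.standard_normal(x0.shape)
    return mean + np.sqrt(1.0 - alpha_bar_t) * (Sigma_chol @ eps)
```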
★ dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, Yi Xu
Vision-Language-Action (VLA) models are emerging as a next-generation
paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages
a multimodal chain-of-thought to unify visual perception, language reasoning,
and robotic control in a single system. dVLA jointly optimizes perception,
language understanding, and action under a single diffusion objective, enabling
stronger cross-modal reasoning and better generalization to novel instructions
and objects. For practical deployment, we mitigate inference latency by
incorporating two acceleration strategies, a prefix attention mask and KV
caching, yielding a substantial speedup at test-time inference. We
evaluate dVLA in both simulation and the real world: on the LIBERO benchmark,
it achieves state-of-the-art performance with a 96.4% average success rate,
consistently surpassing both discrete and continuous action policies; on a real
Franka robot, it succeeds across a diverse task suite, including a challenging
bin-picking task that requires multi-step planning, demonstrating robust
real-world performance. Together, these results underscore the promise of
unified diffusion frameworks for practical, high-performance VLA robotics.
comment: technical report
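A plausible reading of the prefix-attention acceleration is that context tokens
are encoded once, bidirectionally, and KV-cached, while action tokens attend
causally; the mask construction below is our assumption of that layout.

```python
import torch

def prefix_attention_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Sketch of a prefix attention mask (True = may attend).

    Vision/language prefix tokens attend bidirectionally among themselves, so
    their keys/values can be computed once and KV-cached across denoising
    steps; the remaining (action) tokens see the full prefix and attend
    causally to each other. This layout is our assumption.
    """
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    mask[:prefix_len, :prefix_len] = True        # bidirectional prefix block
    for i in range(prefix_len, total_len):
        mask[i, :prefix_len] = True              # every token sees the prefix
        mask[i, prefix_len:i + 1] = True         # causal attention in the suffix
    return mask
```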
☆ Field Calibration of Hyperspectral Cameras for Terrain Inference
Nathaniel Hanson, Benjamin Pyatski, Samuel Hibbard, Gary Lvov, Oscar De La Garza, Charles DiMarzio, Kristen L. Dorsey, Taşkın Padır
Intra-class terrain differences such as water content directly influence a
vehicle's ability to traverse terrain, yet RGB vision systems may fail to
distinguish these properties. Evaluating a terrain's spectral content beyond
red-green-blue wavelengths to the near infrared spectrum provides useful
information for intra-class identification. However, accurate analysis of this
spectral information is highly dependent on ambient illumination. We
demonstrate a system architecture to collect and register multi-wavelength,
hyperspectral images from a mobile robot and describe an approach to
reflectance-calibrate cameras under varying illumination conditions. To
showcase the practical applications of our system, HYPER DRIVE, we demonstrate
the ability to calculate vegetative health indices and soil moisture content
from a mobile robot platform.
comment: Accepted to IEEE Robotics & Automation Letters
♻ ☆ Visual-auditory Extrinsic Contact Estimation
Robust manipulation often hinges on a robot's ability to perceive extrinsic
contacts, i.e., contacts between a grasped object and its surrounding environment.
However, these contacts are difficult to observe through vision alone due to
occlusions, limited resolution, and ambiguous near-contact states. In this
paper, we propose a visual-auditory method for extrinsic contact estimation
that integrates global scene information from vision with local contact cues
obtained through active audio sensing. Our approach equips a robotic gripper
with contact microphones and conduction speakers, enabling the system to emit
and receive acoustic signals through the grasped object to detect external
contacts. We train our perception pipeline entirely in simulation and transfer
it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a
real-to-sim audio hallucination technique, injecting real-world audio samples
into simulated scenes with ground-truth contact labels. The resulting
multimodal model accurately estimates both the location and size of extrinsic
contacts across a range of cluttered and occluded scenarios. We further
demonstrate that explicit contact prediction significantly improves policy
learning for downstream contact-rich manipulation tasks.
comment: 8 pages, 7 figures
♻ ☆ AuDeRe: Automated Strategy Decision and Realization in Robot Planning and Control via LLMs
Recent advancements in large language models (LLMs) have shown significant
promise in various domains, especially robotics. However, most prior LLM-based
work in robotic applications either directly predicts waypoints or applies LLMs
within fixed tool integration frameworks, offering limited flexibility in
exploring and configuring solutions best suited to different tasks. In this
work, we propose a framework that leverages LLMs to select appropriate planning
and control strategies based on task descriptions, environmental constraints,
and system dynamics. These strategies are then executed by calling the
available comprehensive planning and control APIs. Our approach employs
iterative LLM-based reasoning with performance feedback to refine the algorithm
selection. We validate our approach through extensive experiments across tasks
of varying complexity, from simple tracking to complex planning scenarios
involving spatiotemporal constraints. The results demonstrate that using LLMs
to determine planning and control strategies from natural language descriptions
significantly enhances robotic autonomy while reducing the need for extensive
manual tuning and expert knowledge. Furthermore, our framework maintains
generalizability across different tasks and notably outperforms baseline
methods that rely on LLMs for direct trajectory, control sequence, or code
generation.
comment: 8 pages, 14 figures, submitted to the 2026 American Control
Conference
♻ ☆ Robot Conga: A Leader-Follower Walking Approach to Sequential Path Following in Multi-Agent Systems
Coordinated path following in multi-agent systems is a key challenge in
robotics, with applications in automated logistics, surveillance, and
collaborative exploration. Traditional formation control techniques often rely
on time-parameterized trajectories and path integrals, which can result in
synchronization issues and rigid behavior. In this work, we address the problem
of sequential path following, where agents maintain fixed spatial separation
along a common trajectory, guided by a leader under centralized control. We
introduce Robot Conga, a leader-follower control strategy that updates each
agent's desired state based on the leader's spatial displacement rather than
time, assuming access to a global position reference, an assumption valid in
indoor environments equipped with motion capture, vision-based tracking, or UWB
localization systems. The algorithm was validated in simulation using both
TurtleBot3 and quadruped (Laikago) robots. Results demonstrate accurate
trajectory tracking, stable inter-agent spacing, and fast convergence, with all
agents aligning within 250 time steps (approx. 0.25 seconds) in the quadruped
case, and almost instantaneously in the TurtleBot3 implementation.
comment: 6 Pages, 8 Figures. Both authors have contributed equally
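The displacement-based update can be sketched directly: parameterize the common
path by arc length and place each follower a fixed arc length behind the
leader. The indexing details are a hypothetical reconstruction of the
abstract's idea, not the authors' code.

```python
import numpy as np

def follower_targets(path_xy: np.ndarray, leader_idx: int,
                     spacing: float, n_followers: int) -> np.ndarray:
    """Place each follower a fixed arc length behind the leader on the path.

    The common path is parameterized by cumulative arc length, so follower
    targets are driven by the leader's spatial displacement rather than time.
    """
    seg = np.linalg.norm(np.diff(path_xy, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # arc length at each vertex
    targets = []
    for k in range(1, n_followers + 1):
        s_k = max(s[leader_idx] - k * spacing, 0.0)      # desired arc length
        idx = min(int(np.searchsorted(s, s_k)), len(path_xy) - 1)
        targets.append(path_xy[idx])
    return np.array(targets)
```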
♻ ☆ Apple: Toward General Active Perception via Reinforcement Learning
Active perception is a fundamental skill that enables us humans to deal with
uncertainty in our inherently partially observable environment. For senses such
as touch, where the information is sparse and local, active perception becomes
crucial. In recent years, active perception has emerged as an important
research domain in robotics. However, current methods are often bound to
specific tasks or make strong assumptions, which limit their generality. To
address this gap, this work introduces APPLE (Active Perception Policy
Learning) - a novel framework that leverages reinforcement learning (RL) to
address a range of different active perception problems. APPLE jointly trains a
transformer-based perception module and decision-making policy with a unified
optimization objective, learning how to actively gather information. By design,
APPLE is not limited to a specific task and can, in principle, be applied to a
wide range of active perception problems. We evaluate two variants of APPLE
across different tasks, including tactile exploration problems from the Tactile
MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high
accuracies on both regression and classification tasks. These findings
underscore the potential of APPLE as a versatile and general framework for
advancing active perception in robotics.
comment: 16 pages, 13 figures; under review
♻ ☆ Find the Fruit: Zero-Shot Sim2Real RL for Occlusion-Aware Plant Manipulation
Autonomous harvesting in the open presents a complex manipulation problem. In
most scenarios, an autonomous system has to deal with significant occlusion and
requires interaction in the presence of large structural uncertainties (every
plant is different). Perceptual and modeling uncertainty make the design of
reliable manipulation controllers for harvesting challenging, resulting in poor
performance during deployment. We present a sim2real reinforcement learning
(RL) framework for occlusion-aware plant manipulation, where a policy is
learned entirely in simulation to reposition stems and leaves to reveal target
fruit(s). In our proposed approach, we decouple high-level kinematic planning
from low-level compliant control which simplifies the sim2real transfer. This
decomposition allows the learned policy to generalize across multiple plants
with different stiffness and morphology. In experiments with multiple
real-world plant setups, our system achieves up to 86.7% success in exposing
target fruits, demonstrating robustness to occlusion variation and structural
uncertainty.
comment: 9 Pages, 3 Figures, 1 Table
♻ ☆ DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion
Dvij Kalaria, Sudarshan S Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, Jonathan Chung-Kuan Huang
We introduce DreamControl, a novel methodology for learning autonomous
whole-body humanoid skills. DreamControl leverages the strengths of diffusion
models and Reinforcement Learning (RL): our core innovation is the use of a
diffusion prior trained on human motion data, which subsequently guides an RL
policy in simulation to complete specific tasks of interest (e.g., opening a
drawer or picking up an object). We demonstrate that this human motion-informed
prior allows RL to discover solutions unattainable by direct RL, and that
diffusion models inherently promote natural-looking motions, aiding in
sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1
robot across a diverse set of challenging tasks involving simultaneous lower
and upper body control and object interaction. Project website at
https://genrobo.github.io/DreamControl/
comment: https://genrobo.github.io/DreamControl/ (under submission)
♻ ☆ Multi Layered Autonomy and AI Ecologies in Robotic Art Installations
This paper presents Symbiosis of Agents, a large-scale installation by
Baoyang Chen (baoyangchen.com) that embeds AI-driven robots in an immersive,
mirror-lined arena, probing the tension between machine agency and artistic
authorship. Drawing on early cybernetics, rule-based conceptual art, and
seminal robotic works, it orchestrates fluid exchanges among robotic arms,
quadruped machines, their environment, and the public. A three-tier faith
system pilots the ecology: micro-level adaptive tactics, meso-level narrative
drives, and a macro-level prime directive. This hierarchy lets behaviors evolve
organically in response to environmental cues and even a viewer's breath,
turning spectators into co-authors of the unfolding drama. Framed by a
speculative terraforming scenario that recalls the historical exploitation of
marginalized labor, the piece asks who bears responsibility in AI-mediated
futures. Choreographed motion, AI-generated scripts, reactive lighting, and
drifting fog cast the robots as collaborators rather than tools, forging a
living, emergent artwork. Exhibited internationally, Symbiosis of Agents shows
how cybernetic feedback, robotic experimentation, and conceptual rule-making
can converge to redefine agency, authorship, and ethics in contemporary art.
♻ ☆ Ocean Diviner: A Diffusion-Augmented Reinforcement Learning Framework for AUV Robust Control in Underwater Tasks
Jingzehua Xu, Guanwen Xie, Weiyi Liu, Jiwei Tang, Ziteng Yang, Tianxiang Xing, Yiyuan Yang, Shuai Zhang, Xiaofan Li
Autonomous Underwater Vehicles (AUVs) are essential for marine exploration,
yet their control remains highly challenging due to nonlinear dynamics and
uncertain environmental disturbances. This paper presents a diffusion-augmented
Reinforcement Learning (RL) framework for robust AUV control, aiming to improve
the AUV's adaptability in dynamic underwater environments. The proposed framework
integrates two core innovations: (1) A diffusion-based action generation
framework that produces physically feasible and high-quality actions, enhanced
by a high-dimensional state encoding mechanism combining current observations
with historical states and actions through a novel diffusion U-Net
architecture, significantly improving long-horizon planning capacity for robust
control. (2) A sample-efficient hybrid learning architecture that synergizes
diffusion-guided exploration with RL policy optimization, where the diffusion
model generates diverse candidate actions and the RL critic selects the optimal
action, achieving higher exploration efficiency and policy stability in dynamic
underwater environments. Extensive simulation experiments validate the
framework's superior robustness and flexibility: it outperforms conventional
control methods in challenging marine conditions and offers enhanced
adaptability and reliability for AUV operations in underwater tasks. Finally,
we will soon release the code publicly to support future research in this area.
comment: Jingzehua Xu, Guanwen Xie and Weiyi Liu contributed equally to this
work
♻ ☆ LiDAR-BIND-T: Improved and Temporally Consistent Sensor Modality Translation and Fusion for Robotic Applications
This paper extends LiDAR-BIND, a modular multi-modal fusion framework that
binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space,
with mechanisms that explicitly enforce temporal consistency. We introduce
three contributions: (i) temporal embedding similarity that aligns consecutive
latent representations, (ii) a motion-aligned transformation loss that matches
displacement between predictions and ground truth LiDAR, and (iii) windowed
temporal fusion using a specialised temporal module. We further update the
model architecture to better preserve spatial structure. Evaluations on
radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial
coherence, yielding lower absolute trajectory error and better occupancy map
accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We
propose different metrics based on the Fréchet Video Motion Distance (FVMD)
and a correlation-peak distance metric providing practical temporal quality
indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or
LiDAR-BIND-T, maintains modular modality fusion while substantially enhancing
temporal stability, resulting in improved robustness and performance for
downstream SLAM.
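Contribution (i) admits a very compact realization; the form below is a minimal
sketch assuming per-frame latent codes, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def temporal_embedding_loss(latents: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of the temporal embedding similarity term (contribution i).

    `latents` holds per-frame latent codes with shape (T, D); penalizing low
    cosine similarity between consecutive codes encourages temporally smooth
    representations. The exact form and weighting in LiDAR-BIND-T may differ.
    """
    sim = F.cosine_similarity(latents[:-1], latents[1:], dim=-1)
    return (1.0 - sim).mean()
```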
♻ ☆ AgriCruiser: An Open Source Agriculture Robot for Over-the-row Navigation
Kenny Truong, Yongkyu Lee, Jason Irie, Shivam Kumar Panda, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, M. Khalid Jawed
We present the AgriCruiser, an open-source over-the-row agricultural robot
developed for low-cost deployment and rapid adaptation across diverse crops and
row layouts. The chassis provides an adjustable track width of 1.42 m to 1.57
m, along with a ground clearance of 0.94 m. The AgriCruiser achieves compact
pivot turns with radii of 0.71 m to 0.79 m, enabling efficient headland
maneuvers. The platform is designed for the integration of other
subsystems, and in this study, a precision spraying system was implemented to
assess its effectiveness in weed management. In twelve flax plots, a single
robotic spray pass reduced total weed populations (pigweed and Venice mallow)
by 24- to 42-fold compared to manual weeding in four flax plots, while also
causing less crop damage. Mobility experiments conducted on concrete, asphalt,
gravel, grass, and both wet and dry soil confirmed reliable traversal
consistent with torque sizing. The complete chassis can be constructed from
commodity T-slot extrusion with minimal machining, resulting in a bill of
materials costing approximately $5,000 - $6,000, which enables replication and
customization. These results demonstrate that low-cost, reconfigurable
over-the-row robots can achieve effective weed management with reduced crop
damage and labor requirements, while providing a versatile foundation for
phenotyping, sensing, and other agricultural applications. Design files and
implementation details are released to accelerate research and adoption of
modular agricultural robotics.
comment: GitHub: https://github.com/structuresComp/agri-cruiser
♻ ☆ Sequence Pathfinder for Multi-Agent Pickup and Delivery in the Warehouse
Multi-Agent Pickup and Delivery (MAPD) is a challenging extension of
Multi-Agent Path Finding (MAPF), where agents are required to sequentially
complete tasks with fixed-location pickup and delivery demands. Although
learning-based methods have made progress in MAPD, they often perform poorly in
warehouse-like environments with narrow pathways and long corridors when
relying only on local observations for distributed decision-making.
Communication learning can alleviate the lack of global information but
introduces high computational complexity due to point-to-point communication. To
address this challenge, we formulate MAPF as a sequence modeling problem and
prove that path-finding policies under sequence modeling possess
order-invariant optimality, ensuring their effectiveness in MAPD. Building on
this, we propose the Sequential Pathfinder (SePar), which leverages the
Transformer paradigm to achieve implicit information exchange, reducing
decision-making complexity from exponential to linear while maintaining
efficiency and global awareness. Experiments demonstrate that SePar
consistently outperforms existing learning-based methods across various MAPF
tasks and their variants, and generalizes well to unseen environments.
Furthermore, we highlight the necessity of integrating imitation learning in
complex maps like warehouses.
comment: Preprint Under Review
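The exponential-to-linear reduction comes from decoding agents autoregressively
instead of scoring joint actions; a minimal sketch with a hypothetical policy
interface standing in for SePar's Transformer:

```python
import torch

@torch.no_grad()
def sequential_decode(policy, obs_tokens, n_agents: int) -> list:
    """Decode agents one at a time instead of scoring joint actions.

    Conditioning each agent on the actions already chosen shrinks decision
    complexity from exponential (joint action space) to linear in the number
    of agents. `policy` is a hypothetical stand-in for SePar's Transformer.
    """
    actions = []
    for i in range(n_agents):
        logits = policy(obs_tokens, prev_actions=actions, agent=i)
        actions.append(int(torch.argmax(logits)))
    return actions
```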
♻ ☆ Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models
Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi
It is crucial to efficiently execute instructions such as "Find an apple and
a banana" or "Get ready for a field trip," which require searching for multiple
objects or understanding context-dependent commands. This study addresses the
challenging problem of determining which robot should be assigned to which part
of a task when each robot possesses different situational on-site
knowledge, specifically spatial concepts learned from the area designated to it
by the user. We propose a task planning framework that leverages large language
models (LLMs) and spatial concepts to decompose natural language instructions
into subtasks and allocate them to multiple robots. We designed a novel
few-shot prompting strategy that enables LLMs to infer required objects from
ambiguous commands and decompose them into appropriate subtasks. In our
experiments, the proposed method achieved 47/50 successful assignments,
outperforming random (28/50) and commonsense-based assignment (26/50).
Furthermore, we conducted qualitative evaluations using two actual mobile
manipulators. The results demonstrated that our framework could handle
instructions, including those involving ad hoc categories such as "Get ready
for a field trip," by successfully performing task decomposition, assignment,
sequential planning, and execution.
comment: Submitted to AROB-ISBC 2026 (Journal Track option)
♻ ☆ SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Humans continuously infer the states, goals, and behaviors of others by
perceiving their surroundings in dynamic, real-world social interactions.
However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based
scenarios, which differ significantly from real interactions. We
propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in
embodied multi-agent complex social interactions. This benchmark is based on
rich multimodal interaction data generated by the interaction environment SoMi,
covering diverse crafting goals and social relationships. Our framework
supports multi-level evaluation: (1) first-person evaluation provides
multimodal (visual, dialogue, action, etc.) input from a first-person
perspective during a task for real-time state inference, (2) third-person
evaluation provides complete third-person perspective video and text records
after a task for goal and behavior inference. This evaluation method allows for
a more comprehensive examination of a model's ToM capabilities from both the
subjective immediate experience and the objective global observation. We
constructed a challenging dataset containing 35 third-person perspective
videos, 363 first-person perspective images, and 1225 expert-annotated
multiple-choice questions (three options). On this dataset, we systematically
evaluated the performance of human subjects and several state-of-the-art large
vision-language models (LVLMs). The results show that LVLMs perform
significantly worse than humans on SoMi-ToM: the average accuracy gap between
humans and models is 40.1% in first-person evaluation and 26.4% in third-person
evaluation. This indicates that future LVLMs need to further improve their ToM
capabilities in embodied, complex social interactions.
comment: 24 pages, 6 figures
♻ ☆ ADPro: a Test-time Adaptive Diffusion Policy via Manifold-constrained Denoising and Task-aware Initialization for Robotic Manipulation
Diffusion policies have recently emerged as a powerful class of visuomotor
controllers for robot manipulation, offering stable training and expressive
multi-modal action modeling. However, existing approaches typically treat
action generation as an unconstrained denoising process, ignoring valuable a
priori knowledge about geometry and control structure. In this work, we propose
the Adaptive Diffusion Policy (ADP), a test-time adaptation method that
introduces two key inductive biases into the diffusion process. First, we embed a
geometric manifold constraint that aligns denoising updates with task-relevant
subspaces, leveraging the fact that the relative pose between the end-effector
and target scene provides a natural gradient direction, and guiding denoising
along the geodesic path of the manipulation manifold. Then, to reduce
unnecessary exploration and accelerate convergence, we propose an analytically
guided initialization: rather than sampling from an uninformative prior, we
compute a rough registration between the gripper and target scenes to propose a
structured initial noisy action. ADP is compatible with pre-trained diffusion
policies and requires no retraining, enabling test-time adaptation that tailors
the policy to specific tasks, thereby enhancing generalization across novel
tasks and environments. Experiments on RLBench, CALVIN, and real-world datasets
show that ADPro, an implementation of ADP, improves success rates,
generalization, and sampling efficiency, achieving up to 25% faster execution
and a 9-percentage-point gain over strong diffusion baselines.
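The manifold constraint can be approximated by projecting each denoising update
onto a task-relevant subspace; this is purely illustrative and simpler than
ADP's actual geodesic guidance on the manipulation manifold.

```python
import numpy as np

def constrained_denoise_step(x, dx, basis):
    """Project a raw denoising update onto a task-relevant subspace (sketch).

    `basis` is a (k, d) matrix with orthonormal rows spanning directions
    derived, e.g., from the end-effector-to-target relative pose; only the
    component of the update inside that subspace is applied.
    """
    return x + basis.T @ (basis @ dx)
```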
♻ ★ LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Lingdong Kong, Xiang Xu, Youquan Liu, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu
Recent advancements in vision foundation models (VFMs) have revolutionized
visual perception in 2D, yet their potential for 3D scene understanding,
particularly in autonomous driving applications, remains underexplored. In this
paper, we introduce LargeAD, a versatile and scalable framework designed for
large-scale 3D pretraining across diverse real-world driving datasets. Our
framework leverages VFMs to extract semantically rich superpixels from 2D
images, which are aligned with LiDAR point clouds to generate high-quality
contrastive samples. This alignment facilitates cross-modal representation
learning, enhancing the semantic consistency between 2D and 3D data. We
introduce several key innovations: (i) VFM-driven superpixel generation for
detailed semantic representation, (ii) a VFM-assisted contrastive learning
strategy to align multimodal features, (iii) superpoint temporal consistency to
maintain stable representations across time, and (iv) multi-source data
pretraining to generalize across various LiDAR configurations. Our approach
achieves substantial gains over state-of-the-art methods in linear probing and
fine-tuning for LiDAR-based segmentation and object detection. Extensive
experiments on 11 large-scale multi-sensor datasets highlight our superior
performance, demonstrating adaptability, efficiency, and robustness in
real-world autonomous driving scenarios.
comment: IEEE TPAMI 2025; 17 pages, 9 figures, 11 tables; Project Page at
https://ldkong.com/LargeAD
♻ ☆ OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata NeurIPS 2025
Accurate visual localization from aerial views is a fundamental problem with
applications in mapping, large-area inspection, and search-and-rescue
operations. In many scenarios, these systems require high-precision
localization while operating with limited resources (e.g., no internet
connection or GNSS/GPS support), making large image databases or heavy 3D
models impractical. Surprisingly, little attention has been given to leveraging
orthographic geodata as an alternative paradigm, which is lightweight and
increasingly available through free releases by governmental authorities (e.g.,
the European Union). To fill this gap, we propose OrthoLoC, the first
large-scale dataset comprising 16,425 UAV images from Germany and the United
States with multiple modalities. The dataset addresses domain shifts between
UAV imagery and geospatial data. Its paired structure enables fair benchmarking
of existing solutions by decoupling image retrieval from feature matching,
allowing isolated evaluation of localization and calibration performance.
Through comprehensive evaluation, we examine the impact of domain shifts, data
resolutions, and covisibility on localization accuracy. Finally, we introduce a
refinement technique called AdHoP, which can be integrated with any feature
matcher, improving matching by up to 95% and reducing translation error by up
to 63%. The dataset and code are available at:
https://deepscenario.github.io/OrthoLoC.
comment: Accepted at NeurIPS 2025
♻ ☆ Simulated Annealing for Multi-Robot Ergodic Information Acquisition Using Graph-Based Discretization
One of the goals of active information acquisition using multi-robot teams is
to keep the relative uncertainty in each region at the same level to maintain
identical acquisition quality (e.g., consistent target detection) in all the
regions. To achieve this goal, ergodic coverage can be used to assign the
number of samples according to the quality of observation, i.e., sampling noise
levels. However, the noise levels are unknown to the robots. Although this
noise can be estimated from samples, the estimates are unreliable at first and
can generate fluctuating values. The main contribution of this paper is to use
simulated annealing to generate the target sampling distribution, starting from
uniform and gradually shifting to an estimated optimal distribution, by varying
the coldness parameter of a Boltzmann distribution with the estimated sampling
entropy as energy. Simulation results show a substantial improvement of both
transient and asymptotic entropy compared to both uniform and direct-ergodic
searches. Finally, a demonstration is performed with a TurtleBot swarm system
to validate the physical applicability of the algorithm.
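The annealing schedule reduces to a one-line Boltzmann distribution; the sign
convention and schedule below are assumptions consistent with the abstract.

```python
import numpy as np

def target_distribution(entropy_est: np.ndarray, beta: float) -> np.ndarray:
    """Boltzmann target sampling distribution with coldness parameter `beta`.

    The estimated sampling entropy of each region plays the role of energy:
    beta = 0 yields the uniform distribution, and raising beta shifts mass
    toward the estimated optimum as the entropy estimates sharpen.
    """
    logits = -beta * entropy_est
    logits -= logits.max()                 # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

entropy_hat = np.array([0.5, 1.0, 2.0])
print(target_distribution(entropy_hat, beta=0.0))  # uniform: [0.333 0.333 0.333]
print(target_distribution(entropy_hat, beta=5.0))  # concentrated on region 0
```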
♻ ☆ Towards autonomous photogrammetric forest inventory using a lightweight under-canopy robotic drone
Väinö Karjalainen, Niko Koivumäki, Teemu Hakala, Jesse Muhojoki, Eric Hyyppä, Anand George, Juha Suomalainen, Eija Honkavaara
Drones are increasingly used in forestry to capture high-resolution remote
sensing data, supporting enhanced monitoring, assessment, and decision-making
processes. While operations above the forest canopy are already highly
automated, flying inside forests remains challenging, primarily relying on
manual piloting. In dense forests, relying on the Global Navigation Satellite
System (GNSS) for localization is not feasible. In addition, the drone must
autonomously adjust its flight path to avoid collisions. Recently, advancements
in robotics have enabled autonomous drone flights in GNSS-denied obstacle-rich
areas. In this article, a step towards autonomous forest data collection is
taken by building a prototype of a robotic under-canopy drone utilizing
state-of-the-art open source methods and validating its performance for data
collection inside forests. Specifically, the study focused on camera-based
autonomous flight under the forest canopy and photogrammetric post-processing
of the data collected with the low-cost onboard stereo camera. The autonomous
flight capability of the prototype was evaluated through multiple test flights
in boreal forests. The tree parameter estimation capability was studied by
performing diameter at breast height (DBH) estimation. The prototype
successfully carried out flights in selected challenging forest environments,
and the experiments showed promising performance in forest 3D modeling with a
miniaturized stereoscopic photogrammetric system. The DBH estimation achieved a
root mean square error (RMSE) of 3.33 - 3.97 cm (10.69 - 12.98 %) across all
trees. For trees with a DBH less than 30 cm, the RMSE was 1.16 - 2.56 cm (5.74
- 12.47 %). The results provide valuable insights into autonomous under-canopy
forest mapping and highlight the critical next steps for advancing lightweight
robotic drone systems for mapping complex forest environments.
comment: 35 pages, 11 Figures
♻ ☆ Track Any Motions under Any Disturbances
Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, Huaping Liu, He Wang, Li Yi
A foundational humanoid motion tracker is expected to be able to track
diverse, highly dynamic, and contact-rich motions. More importantly, it needs
to operate stably in real-world scenarios against various dynamics
disturbances, including terrains, external forces, and physical property
changes for general practical use. To achieve this goal, we propose Any2Track
(Track Any motions under Any disturbances), a two-stage RL framework to track
various motions under multiple disturbances in the real world. Any2Track
reformulates dynamics adaptability as an additional capability on top of basic
action execution and consists of two key components: AnyTracker and AnyAdapter.
AnyTracker is a general motion tracker with a series of careful designs to
track various motions within a single policy. AnyAdapter is a history-informed
adaptation module that endows the tracker with online dynamics adaptability to
overcome the sim2real gap and multiple real-world disturbances. We deploy
Any2Track on Unitree G1 hardware and achieve a successful sim2real transfer in
a zero-shot manner. Any2Track performs exceptionally well in tracking various
motions under multiple real-world disturbances.
♻ ☆ Long-Horizon Visual Imitation Learning via Plan and Code Reflection
Learning from long-horizon demonstrations with complex action sequences
presents significant challenges for visual imitation learning, particularly in
understanding temporal relationships of actions and spatial relationships
between objects. In this paper, we propose a new agent framework that
incorporates two dedicated reflection modules to enhance both plan and code
generation. The plan generation module produces an initial action sequence,
which is then verified by the plan reflection module to ensure temporal
coherence and spatial alignment with the demonstration video. The code
generation module translates the plan into executable code, while the code
reflection module verifies and refines the generated code to ensure correctness
and consistency with the generated plan. These two reflection modules jointly
enable the agent to detect and correct errors in both the plan generation and
code generation, improving performance in tasks with intricate temporal and
spatial dependencies. To support systematic evaluation, we introduce
LongVILBench, a benchmark comprising 300 human demonstrations with action
sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial
complexity across multiple task types. Experimental results demonstrate that
existing methods perform poorly on this benchmark, whereas our new framework
establishes a strong baseline for long-horizon visual imitation learning.
comment: 9 pages, 4 figures
♻ ☆ WorldGym: World Model as An Environment for Policy Evaluation
Evaluating robot control policies is difficult: real-world testing is costly,
and handcrafted simulators require manual effort to improve in realism and
generality. We propose a world-model-based policy evaluation environment
(WorldGym), an autoregressive, action-conditioned video generation model which
serves as a proxy for real-world environments. Policies are evaluated via Monte
Carlo rollouts in the world model, with a vision-language model providing
rewards. We evaluate a set of VLA-based real-robot policies in the world model
using only initial frames from real robots, and show that policy success rates
within the world model highly correlate with real-world success rates.
Moreover, we show that WorldGym is able to preserve relative policy rankings
across different policy versions, sizes, and training checkpoints. Due to
requiring only a single start frame as input, the world model further enables
efficient evaluation of robot policies' generalization ability on novel tasks
and environments. We find that modern VLA-based robot policies still struggle
to distinguish object shapes and can become distracted by adversarial facades
of objects. While generating highly realistic object interaction remains
challenging, WorldGym faithfully emulates robot motions and offers a practical
starting point for safe and reproducible policy evaluation before deployment.
comment: https://world-model-eval.github.io
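The Monte Carlo evaluation loop is straightforward to sketch, assuming callable
stand-ins for the world model, policy, and VLM reward:

```python
def evaluate_policy(world_model, policy, reward_vlm, start_frames,
                    horizon: int = 50, n_rollouts: int = 8) -> float:
    """Monte Carlo policy evaluation inside a video world model (sketch).

    `world_model(frame, action)` autoregressively predicts the next frame,
    `policy(frame)` produces an action, and `reward_vlm(frames)` returns 1 if
    the rollout looks successful. All three interfaces are hypothetical
    placeholders for WorldGym's components.
    """
    successes = 0
    for frame0 in start_frames:
        for _ in range(n_rollouts):
            frame, frames = frame0, [frame0]
            for _ in range(horizon):
                frame = world_model(frame, policy(frame))
                frames.append(frame)
            successes += int(reward_vlm(frames))
    return successes / (len(start_frames) * n_rollouts)
```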