MyArxiv
Robotics 52
☆ Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
comment: 10 pages, 6 figures
☆ Deep Neural Network Based Roadwork Detection for Autonomous Driving
Road construction sites create major challenges for both autonomous vehicles and human drivers due to their highly dynamic and heterogeneous nature. This paper presents a real-time system that detects and localizes roadworks by combining a YOLO neural network with LiDAR data. The system identifies individual roadwork objects while driving, merges them into coherent construction sites and records their outlines in world coordinates. The model training was based on an adapted US dataset and a new dataset collected from test drives with a prototype vehicle in Berlin, Germany. Evaluations on real-world road construction sites showed a localization accuracy below 0.5 m. The system can support traffic authorities with up-to-date roadwork data and could enable autonomous vehicles to navigate construction sites more safely in the future.
comment: 7 pages, 10 figures
☆ Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
comment: 15 pages, 5 figures, 2 tables. This work has been submitted to the IEEE for possible publication
☆ A virtual-variable-length method for robust inverse kinematics of multi-segment continuum robots
This paper proposes a new, robust method to solve the inverse kinematics (IK) of multi-segment continuum manipulators. Conventional Jacobian-based solvers, especially when initialized from neutral/rest configurations, often exhibit slow convergence and, in certain conditions, may fail to converge (deadlock). The Virtual-Variable-Length (VVL) method proposed here introduces fictitious variations of segment lengths during the solution iteration, conferring virtual axial degrees of freedom that alleviate adverse behaviors and constraints, thus enabling or accelerating convergence. Comprehensive numerical experiments were conducted to compare the VVL method against benchmark Jacobian-based and Damped Least Squares IK solvers. Across more than $1.8\times 10^6$ randomized trials covering manipulators with two to seven segments, the proposed approach achieved up to a 20$\%$ increase in convergence success rate over the benchmark and a 40-80$\%$ reduction in average iteration count under equivalent accuracy thresholds ($10^{-4}-10^{-8}$). While deadlocks are not restricted to workspace boundaries and may occur at arbitrary poses, our empirical study identifies boundary-proximal configurations as a frequent cause of failed convergence and shows that the VVL method mitigates such failures over a statistical sample of test cases.
comment: 8 pages, 6 figures, accepted for presentation in IEEE RoboSoft 2026, Kanazawa, Japan
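For context on the Jacobian-based baselines this abstract compares against, here is a minimal damped least-squares IK iteration for a planar serial arm. This is a generic sketch, not the paper's VVL method; the functions `fk` and `dls_ik`, the planar-arm model, and all parameter values are illustrative assumptions.

```python
import numpy as np

def fk(q, lengths):
    # Planar serial-arm forward kinematics: end-effector (x, y).
    angles = np.cumsum(q)
    return np.array([np.sum(lengths * np.cos(angles)),
                     np.sum(lengths * np.sin(angles))])

def dls_ik(target, q0, lengths, lam=0.1, tol=1e-6, max_iter=500):
    # Damped least-squares step: dq = J^T (J J^T + lam^2 I)^{-1} e.
    q = np.asarray(q0, dtype=float)
    for _ in range(max_iter):
        e = target - fk(q, lengths)
        if np.linalg.norm(e) < tol:
            return q, True
        # Finite-difference Jacobian of fk with respect to q.
        h = 1e-6
        J = np.zeros((2, len(q)))
        for i in range(len(q)):
            dq = np.zeros(len(q))
            dq[i] = h
            J[:, i] = (fk(q + dq, lengths) - fk(q, lengths)) / h
        step = J.T @ np.linalg.solve(J @ J.T + lam**2 * np.eye(2), e)
        n = np.linalg.norm(step)
        if n > 0.5:  # conservative step clipping for stability
            step *= 0.5 / n
        q = q + step
    return q, False
```

The damping term $\lambda^2 I$ is what distinguishes DLS from a plain pseudoinverse: it keeps steps bounded near singular configurations, at the cost of slower convergence — the trade-off the VVL method aims to improve on.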
☆ UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $\pi_{0.5}$, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA
☆ UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
comment: code has been released at https://github.com/xiaomi-research/unidrivevla
☆ PRO-SPECT: Probabilistically Safe Scalable Planning for Energy-Aware Coordinated UAV-UGV Teams in Stochastic Environments
We consider energy-aware planning for an unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) team operating in a stochastic environment. The UAV must visit a set of air points in minimum time while respecting energy constraints, relying on the UGV as a mobile charging station. Unlike prior work that assumed deterministic travel times or used fixed robustness margins, we model travel times as random variables and bound the probability of failure (energy depletion) across the entire mission to a user-specified risk level. We formulate the problem as a Mixed-Integer Program and propose PRO-SPECT, a polynomial-time algorithm that generates risk-bounded plans. The algorithm supports both offline planning and online re-planning, enabling the team to adapt to disturbances while preserving the risk bound. We provide theoretical results on solution feasibility and time complexity. We also demonstrate the performance of our method via numerical comparisons and simulations.
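The risk-bounding idea above — reserving enough energy that the probability of depletion stays below a user-specified level, with the budget split across the mission — can be illustrated with a toy margin computation. This is a sketch assuming Gaussian travel times and a union-bound risk split; the function names and parameters are hypothetical, and the paper's Mixed-Integer Program is more general.

```python
from statistics import NormalDist

def leg_energy_margin(mean_time, std_time, power_draw, eps):
    # Energy to reserve for one leg so that, under a Gaussian travel-time
    # model, P(consumption exceeds the reserve) <= eps.
    z = NormalDist().inv_cdf(1.0 - eps)
    return power_draw * (mean_time + z * std_time)

def mission_energy_margins(legs, eps):
    # Union bound (Boole's inequality): splitting the mission risk budget
    # eps evenly across legs keeps total failure probability <= eps.
    # Each leg is a (mean_time, std_time, power_draw) tuple.
    eps_leg = eps / len(legs)
    return [leg_energy_margin(m, s, p, eps_leg) for (m, s, p) in legs]
```

Tightening the mission-level risk (smaller `eps`) inflates every per-leg margin, which is how a fixed robustness margin differs from the probabilistic bound described in the abstract.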
☆ ROS 2-Based LiDAR Perception Framework for Mobile Robots in Dynamic Production Environments, Utilizing Synthetic Data Generation, Transformation-Equivariant 3D Detection and Multi-Object Tracking
Adaptive robots in dynamic production environments require robust perception capabilities, including 6D pose estimation and multi-object tracking. To address limitations in real-world data dependency, noise robustness, and spatiotemporal consistency, a LiDAR framework based on the Robot Operating System is proposed that integrates synthetic-data-trained transformation-equivariant 3D detection with center-pose-based multi-object tracking. Validated across 72 scenarios with motion capture technology, the framework yields an overall Intersection over Union of 62.6% for standalone pose estimation, rising to 83.12% with multi-object-tracking integration. Our LiDAR-based framework achieves a Higher Order Tracking Accuracy of 91.12%, advancing the robustness and versatility of LiDAR-based perception systems for industrial mobile manipulators.
comment: Accepted for publication at CIRP ICME 2025; will appear in Procedia CIRP
☆ Cross-Modal Visuo-Tactile Object Perception
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of estimating latent physical properties under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.
comment: 23 pages, 8 figures, 1 table. Submitted for review to journal
☆ HyVGGT-VO: Tightly Coupled Hybrid Dense Visual Odometry with Feed-Forward Models
Dense visual odometry (VO), which provides pose estimation and dense 3D reconstruction, serves as the cornerstone for applications ranging from robotics to augmented reality. Recently, feed-forward models have demonstrated remarkable capabilities in dense mapping. However, when these models are used in dense visual SLAM systems, their heavy computational burden restricts them to yielding sparse pose outputs at keyframes while still failing to achieve real-time pose estimation. In contrast, traditional sparse methods provide high computational efficiency and high-frequency pose outputs, but lack the capability for dense reconstruction. To address these limitations, we propose HyVGGT-VO, a novel framework that combines the computational efficiency of sparse VO with the dense reconstruction capabilities of feed-forward models. To the best of our knowledge, this is the first work to tightly couple a traditional VO framework with VGGT, a state-of-the-art feed-forward model. Specifically, we design an adaptive hybrid tracking frontend that dynamically switches between traditional optical flow and the VGGT tracking head to ensure robustness. Furthermore, we introduce a hierarchical optimization framework that jointly refines VO poses and the scale of VGGT predictions to ensure global scale consistency. Our approach achieves an approximately 5x processing speedup compared to existing VGGT-based methods, while reducing the average trajectory error by 85% on the indoor EuRoC dataset and 12% on the outdoor KITTI benchmark. Our code will be publicly available upon acceptance. Project page: https://geneta2580.github.io/HyVGGT-VO.io.
☆ CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
comment: Code available at: github.com/Lorenzo-0-0/CompassAD
☆ O-ConNet: Geometry-Aware End-to-End Inference of Over-Constrained Spatial Mechanisms
Deep learning has shown strong potential for scientific discovery, but its ability to model macroscopic rigid-body kinematic constraints remains underexplored. We study this problem on spatial over-constrained mechanisms and propose O-ConNet, an end-to-end framework that infers mechanism structural parameters from only three sparse reachable points while reconstructing the full motion trajectory, without explicitly solving constraint equations during inference. On a self-constructed Bennett 4R dataset of 42,860 valid samples, O-ConNet achieves Param-MAE 0.276 +/- 0.077 and Traj-MAE 0.145 +/- 0.018 (mean +/- std over 10 runs), outperforming the strongest sequence baseline (LSTM-Seq2Seq) by 65.1 percent and 88.2 percent, respectively. These results suggest that end-to-end learning can capture closed-loop geometric structure and provide a practical route for inverse design of spatial over-constrained mechanisms under extremely sparse observations.
comment: 8 pages, 5 figures
☆ Bridging Discrete Planning and Continuous Execution for Redundant Robot
Voxel-grid reinforcement learning is widely adopted for path planning in redundant manipulators due to its simplicity and reproducibility. However, direct execution through point-wise numerical inverse kinematics on 7-DoF arms often yields step-size jitter, abrupt joint transitions, and instability near singular configurations. This work proposes a bridging framework between discrete planning and continuous execution without modifying the discrete planner itself. On the planning side, step-normalized 26-neighbor Cartesian actions and a geometric tie-breaking mechanism are introduced to suppress unnecessary turns and eliminate step-size oscillations. On the execution side, a task-priority damped least-squares (TP-DLS) inverse kinematics layer is implemented. This layer treats end-effector position as a primary task, while posture and joint centering are handled as subordinate tasks projected into the null space, combined with trust-region clipping and joint velocity constraints. On a 7-DoF manipulator in random sparse, medium, and dense environments, this bridge raises planning success in dense scenes from about 0.58 to 1.00, shortens representative path length from roughly 1.53 m to 1.10 m, and reduces peak joint accelerations by over an order of magnitude while keeping end-effector error below 1 mm, substantially improving the continuous execution quality of voxel-based RL paths on redundant manipulators.
comment: 8 pages, 3 figures. Submitted to IFAC World Congress 2026
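The TP-DLS layer described in the abstract — a primary end-effector task, joint centering projected into the null space, and trust-region clipping — can be sketched as a single joint-velocity step. This is an illustrative reconstruction of the standard task-priority scheme, not the authors' code; the function name, gains, and limits are assumptions.

```python
import numpy as np

def tp_dls_step(J, e, q, q_center, lam=0.05, k_null=0.1, dq_max=0.1):
    # Primary task: drive end-effector error e to zero via a damped
    # least-squares pseudoinverse of the task Jacobian J.
    m, n = J.shape
    J_dls = J.T @ np.linalg.inv(J @ J.T + lam**2 * np.eye(m))
    dq_task = J_dls @ e
    # Secondary task: joint centering, projected into the (approximate)
    # null space of J so it barely disturbs the primary task.
    N = np.eye(n) - J_dls @ J
    dq_null = N @ (k_null * (q_center - q))
    dq = dq_task + dq_null
    # Trust-region clipping bounds the joint-space step.
    norm = np.linalg.norm(dq)
    if norm > dq_max:
        dq *= dq_max / norm
    return dq
```

With exact pseudoinverses the projector `N` makes the secondary motion invisible to the task; with damping it leaks only on the order of $\lambda^2$, which is the price paid for robustness near singularities.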
☆ Integrated Identification of Collaborative Robots for Robot Assisted 3D Printing Processes
In recent years, the integration of additive manufacturing (AM) and industrial robotics has opened new perspectives for the production of complex components, particularly in the automotive sector. Robot-assisted additive manufacturing processes overcome the dimensional and kinematic limitations of traditional Cartesian systems, enabling non-planar deposition and greater geometric flexibility. However, the increasing dynamic complexity of robotic manipulators introduces challenges related to precision, control, and error prediction. This work proposes a model-based approach equipped with an integrated identification procedure of the system's parameters, including the robot, the actuators, and the controllers. We show that the integrated modeling procedure yields a reliable dynamic model even in the presence of the sensory and programming limitations typical of collaborative robots. The manipulator's dynamic model is identified through an integrated five-step methodology: starting with geometric and inertial analysis, followed by identification of friction and controller parameters, and finally identification of the remaining parameters. The proposed procedure intrinsically ensures the physical consistency of the identified parameters. The identification approach is validated on a real-world case study involving a 6-Degrees-Of-Freedom (DoFs) collaborative robot used in a thermoplastic extrusion process. The close match between the experimental results from the actual robot and those from the identified model shows the potential enhancement of precision, control, and error prediction in Robot Assisted 3D Printing Processes.
☆ World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
comment: Project Website: https://world-action-verifier.github.io
☆ Ego-Grounding for Personalized Question-Answering in Egocentric Videos CVPR'26
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
comment: To appear at CVPR'26
☆ Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
☆ Global Geometry of Orthogonal Foliations in the Control Allocation of Signed-Quadratic Systems
This work formalizes the differential topology of redundancy resolution for systems governed by signed-quadratic actuation maps. By analyzing the minimally redundant case, the global topology of the continuous fiber bundle defining the nonlinear actuation null-space is established. The distribution orthogonal to these fibers is proven to be globally integrable and governed by an exact logarithmic potential field. This field foliates the actuator space, inducing a structural stratification of all orthants into transverse layers whose combinatorial sizes follow a strictly binomial progression. Within these layers, adjacent orthants are continuously connected via lower-dimensional strata termed reciprocal hinges, while the layers themselves are separated by boundary hyperplanes, or portals, that act as global sections of the fibers. This partition formally distinguishes extremal and transitional layers, which exhibit fundamentally distinct fiber topologies and foliation properties. Through this geometric framework, classical pseudo-linear static allocation strategies are shown to inevitably intersect singular boundary hyperplanes, triggering infinite-derivative kinetic singularities and fragmenting the task space into an exponential number of singularity-separated sectors. In contrast, allocators derived from the orthogonal manifolds yield continuously differentiable global sections with only a linear number of sectors for transversal layers, or can even form a single global diffeomorphism to the task space in the case of the two extremal layers, thus completely avoiding geometric rank-loss and boundary-crossing singularities. These theoretical results directly apply to the control allocation of propeller-driven architectures, including multirotor UAVs, marine, and underwater vehicles.
comment: Multimedia material attached
☆ Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks. Videos are available at our project website https://cccedric.github.io/poco/.
☆ Preferential Bayesian Optimization with Crash Feedback
Bayesian optimization is a popular black-box optimization method for parameter learning in control and robotics. It typically requires an objective function that reflects the user's optimization goal. However, in practical applications, this objective function is often inaccessible due to complex or unmeasurable performance metrics. Preferential Bayesian optimization (PBO) overcomes this limitation by leveraging human feedback through pairwise comparisons, eliminating the need for explicit performance quantification. When applying PBO to hardware systems, such as in quadcopter control, crashes can cause time-consuming experimental resets, wear and tear, or otherwise undesired outcomes. Standard PBO methods cannot incorporate feedback from such crashed experiments, resulting in the exploration of parameters that frequently lead to experimental crashes. We thus introduce CrashPBO, a user-friendly mechanism that enables users to both express preferences and report crashes during the optimization process. Benchmarking on synthetic functions shows that this mechanism reduces crashes by 63% and increases data efficiency. Through experiments on three robotics platforms, we demonstrate the wide applicability and transferability of CrashPBO, highlighting that it provides a flexible, user-friendly framework for parameter learning with human feedback on preferences and crashes.
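PBO's core input — pairwise comparisons instead of a scalar objective — can be illustrated with a minimal Bradley-Terry strength fit over observed duels. This is a generic sketch of preference learning, not the CrashPBO algorithm (which additionally models crash feedback in a Gaussian-process surrogate); the function name and update scheme are illustrative assumptions.

```python
def bradley_terry_scores(n_items, comparisons, n_iter=200):
    # comparisons: list of (winner, loser) index pairs.
    # Iterative minorize-maximize update for Bradley-Terry strengths:
    # P(i beats j) = w_i / (w_i + w_j).
    w = [1.0] * n_items
    for _ in range(n_iter):
        new = []
        for i in range(n_items):
            wins = sum(1 for a, b in comparisons if a == i)
            denom = 0.0
            for a, b in comparisons:
                if i in (a, b):
                    j = b if a == i else a
                    denom += 1.0 / (w[i] + w[j])
            new.append(wins / denom if denom > 0 else w[i])
        s = sum(new)
        w = [x * n_items / s for x in new]  # normalize scale
    return w
```

A PBO loop would fit such latent strengths (typically with a GP prior over parameters) and query the pair whose comparison is most informative next.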
☆ DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding-an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
comment: 11 pages, 4 figures; Project Website: https://drivedreamer-policy.github.io/
☆ Realistic Lip Motion Generation Based on 3D Dynamic Viseme and Coarticulation Modeling for Human-Robot Interaction
Realistic lip synchronization is essential for the natural human-robot non-verbal interaction of humanoid robots. Motivated by this need, this paper presents a lip motion generation framework based on 3D dynamic viseme and coarticulation modeling. By analyzing Chinese pronunciation theory, a 3D dynamic viseme library is constructed based on the ARKit standard, which offers coherent prior trajectories of lips. To resolve motion conflicts within continuous speech streams, a coarticulation mechanism is developed by incorporating initial-final (Shengmu-Yunmu) decoupling and energy modulation. After developing a strategy to retarget high-dimensional spatial lip motion to a 14-DOF lip actuation system of a humanoid head platform, the efficiency and accuracy of the proposed architecture are experimentally validated and demonstrated with quantitative ablation experiments using the metrics of the Pearson Correlation Coefficient (PCC) and the Mean Absolute Jerk (MAJ). This research offers a lightweight, efficient, and highly practical paradigm for the speech-driven lip motion generation of humanoid robots. The 3D dynamic viseme library and real-world deployment videos are available at https://github.com/yuesheng21/Phoneme-to-Lip-14DOF.
comment: 8 pages, 7 figures
☆ Analysis of Efficient Transmission Methods of Grid Maps for Intelligent Vehicles
Grid mapping is a fundamental approach to modeling the environment of intelligent vehicles or robots. Compared with object-based environment modeling, grid maps offer the distinct advantage of representing the environment without requiring any assumptions about objects, such as type or shape. For grid-map-based approaches, the environment is divided into cells, each containing information about its respective area, such as occupancy. This representation of the entire environment is crucial for achieving higher levels of autonomy. However, it has the drawback that modeling the scene at the cell level results in inherently large data sizes. Patched grid maps tackle this issue to a certain extent by adapting cell sizes in specific areas. Nevertheless, the data sizes of patched grid maps are still too large for novel distributed processing setups or vehicle-to-everything (V2X) applications. Our work builds on a patch-based grid-map approach and investigates the size problem from a communication perspective. To address this, we propose a patch-based communication pipeline that leverages existing compression algorithms to transmit grid-map data efficiently. We provide a comprehensive analysis of this pipeline for both intra-vehicle and V2X-based communication. The analysis is verified for these use cases with two real-world experiment setups. Finally, we summarize recommended guidelines for the efficient transmission of grid-map data in intelligent transportation systems.
comment: Accepted for 2026 IEEE Intelligent Vehicles Symposium (IV) - DOI will be added after publication
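To make the compression side of such a pipeline concrete, here is a minimal sketch that packs a synthetic occupancy patch with Python's zlib; the patch layout, cell encoding, and compression level are illustrative assumptions, not the grid-map format used in the paper:

```python
import zlib

def occupancy_patch(size=64):
    """Synthetic occupancy patch: mostly free cells (0) with one
    occupied rectangular block (1) -- a stand-in for real map data."""
    grid = bytearray(size * size)
    for r in range(20, 30):
        for c in range(20, 40):
            grid[r * size + c] = 1
    return bytes(grid)

def compress_patch(patch, level=6):
    """Lossless compression of one patch before transmission."""
    return zlib.compress(patch, level)

raw = occupancy_patch()
packed = compress_patch(raw)
ratio = len(raw) / len(packed)  # structured maps compress very well
```

Because occupancy grids are spatially redundant, general-purpose lossless codecs already shrink patches substantially, which is the effect the paper's communication analysis quantifies for intra-vehicle and V2X links.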
☆ Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
comment: 18 pages, 6 figures, 4 tables
☆ Hi-LOAM: Hierarchical Implicit Neural Fields for LiDAR Odometry and Mapping
LiDAR Odometry and Mapping (LOAM) is a pivotal technique for embodied-AI applications such as autonomous driving and robot navigation. Most existing LOAM frameworks either depend on supervision signals or lack reconstruction fidelity, leaving them unable to depict the details of large-scale complex scenes. To overcome these limitations, we propose a multi-scale implicit neural localization and mapping framework using a LiDAR sensor, called Hi-LOAM. Hi-LOAM receives LiDAR point clouds as the input data modality and learns and stores hierarchical latent features in multiple levels of hash tables based on an octree structure; during mapping, these multi-scale latent features are decoded into signed distance values through shallow Multilayer Perceptrons (MLPs). For pose estimation, we rely on a correspondence-free, scan-to-implicit matching paradigm to estimate the optimal pose and register the current scan into the submap. The entire training process is conducted in a self-supervised manner, which waives model pre-training and manifests generalizability when applied to diverse environments. Extensive experiments on multiple real-world and synthetic datasets demonstrate the superior performance of Hi-LOAM, in terms of effectiveness and generalization capabilities, compared to existing state-of-the-art methods.
comment: This manuscript is the accepted version of a paper in IEEE Transactions on Multimedia
☆ OpenGo: An OpenClaw-Based Robotic Dog with Real-Time Skill Switching
Adaptation to complex tasks and multiple scenarios remains a significant challenge for a single robot agent. The ability to acquire, organize, and switch between a wide range of skills in real time, particularly in dynamic environments, has become a fundamental requirement for embodied intelligence. We introduce OpenGo, an OpenClaw-powered embodied robotic dog capable of switching skills in real time according to the scene and task instructions. Specifically, the agent is equipped with (1) a customizable skill library with easy skill import and autonomous skill validation, (2) a dispatcher that selects and invokes different skills according to task prompts or language instructions, and (3) a self-learning framework that fine-tunes skills based on task completion and human feedback. We deploy the agent on Unitree's Go2 robotic dog and validate its capabilities in autonomous self-checking and skill switching. In addition, by integrating Feishu-platform communication, we enable natural-language guidance and human feedback, allowing inexperienced users to control the robotic dog through simple instructions.
comment: 11 pages, 6 figures
☆ 3-D Relative Localization for Multi-Robot Systems with Angle and Self-Displacement Measurements
Realizing relative localization by leveraging inter-robot local measurements is a challenging problem, especially in the presence of measurement noise. Motivated by this challenge, in this paper we propose a novel and systematic 3-D relative localization framework based on inter-robot interior angle and self-displacement measurements. Initially, we propose a linear relative localization theory comprising a distributed linear relative localization algorithm and sufficient conditions for localizability. According to this theory, robots can determine their neighbors' relative positions and orientations in a purely linear manner. Subsequently, to deal with measurement noise, we present an advanced Maximum a Posteriori (MAP) estimator that addresses three primary challenges. First, the MAP problem is commonly formulated as an optimization problem whose inherent non-convexity can result in local optima. To address this issue, we reformulate the linear computation process of the linear relative localization algorithm as a Weighted Total Least Squares (WTLS) optimization problem on manifolds. The optimal solution of the WTLS problem is more accurate and can serve as the initial value when solving the optimization problem associated with the MAP estimator, thereby reducing the risk of falling into local optima. The second challenge is the lack of knowledge of the prior probability density of the robots' relative positions and orientations at the initial time, which is required as an input to the MAP estimator. To deal with it, we combine the WTLS with a Neural Density Estimator (NDE). Third, to prevent the set of relative positions and orientations to be estimated from growing as the robots continuously move, a marginalization mechanism is designed, ensuring that the computational cost remains constant.
comment: 29 pages, 28 figures
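The gain from total least squares over ordinary least squares, which motivates the WTLS step above, can be illustrated in the simplest scalar case. This unweighted, Euclidean sketch is far simpler than the paper's WTLS on manifolds; the closed-form eigen-decomposition assumes a nondegenerate 2x2 moment matrix:

```python
import math

def tls_slope(a, b):
    """Total least squares fit of b ~ x * a when BOTH a and b are noisy.
    The solution direction is the eigenvector of the 2x2 moment matrix
    [a b]^T [a b] for its smallest eigenvalue; the slope is read off it."""
    saa = sum(v * v for v in a)
    sab = sum(u * v for u, v in zip(a, b))
    sbb = sum(v * v for v in b)
    # closed-form smallest eigenvalue of [[saa, sab], [sab, sbb]]
    tr, det = saa + sbb, saa * sbb - sab * sab
    lam_min = tr / 2 - math.sqrt(tr * tr / 4 - det)
    # eigenvector (v0, v1) for lam_min (assumes sab != 0)
    v0, v1 = sab, lam_min - saa
    return -v0 / v1
```

Unlike ordinary least squares, which attributes all error to b, TLS minimizes orthogonal residuals, which is why it yields better initial values when the regressors themselves come from noisy measurements.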
☆ A Graph Neural Network Approach for Solving the Ranked Assignment Problem in Multi-Object Tracking
Associating measurements with tracks is a crucial step in Multi-Object Tracking (MOT) to guarantee the safety of autonomous vehicles. To manage the exponentially growing number of track hypotheses, truncation becomes necessary. In the $δ$-Generalized Labeled Multi-Bernoulli ($δ$-GLMB) filter, this truncation typically involves the ranked assignment problem, solved by Murty's algorithm or the Gibbs sampling approach, which suffer from high complexity and limited accuracy, respectively. Motivated by these limitations, this paper addresses the ranked assignment problem arising from data association tasks with an approach that employs Graph Neural Networks (GNNs). The proposed Ranked Assignment Prediction Graph Neural Network (RAPNet) uses bipartite graphs to model the problem, harnessing the computational capabilities of deep learning. The conclusive evaluation compares RAPNet with Murty's algorithm and the Gibbs sampler, showing accuracy improvements over the Gibbs sampler.
comment: 2024 IEEE Intelligent Vehicles Symposium (IV)
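For intuition, the ranked assignment problem asks for the k lowest-cost one-to-one track-to-measurement assignments. A brute-force sketch over permutations, tractable only for tiny matrices, which is exactly why Murty's algorithm, Gibbs sampling, or a learned predictor like RAPNet are needed at scale:

```python
from itertools import permutations

def ranked_assignments(cost, k):
    """Return the k lowest-cost one-to-one assignments of rows (tracks)
    to columns (measurements) for a square cost matrix.
    Brute force: enumerate all n! permutations and sort by total cost."""
    n = len(cost)
    ranked = sorted(
        (sum(cost[i][p[i]] for i in range(n)), p)
        for p in permutations(range(n))
    )
    return ranked[:k]
```

The k-best list (not just the single optimum) is what the $δ$-GLMB filter consumes, since it keeps the k most likely association hypotheses rather than one.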
☆ Bridging Large-Model Reasoning and Real-Time Control via Agentic Fast-Slow Planning
Large foundation models enable powerful reasoning for autonomous systems, but mapping semantic intent to reliable real-time control remains challenging. Existing approaches either (i) let Large Language Models (LLMs) generate trajectories directly - brittle, hard to verify, and latency-prone - or (ii) adjust Model Predictive Control (MPC) objectives online - mixing slow deliberation with fast control and blurring interfaces. We propose Agentic Fast-Slow Planning, a hierarchical framework that decouples perception, reasoning, planning, and control across natural timescales. The framework contains two bridges. Perception2Decision compresses scenes into ego-centric topologies using an on-vehicle Vision-Language Model (VLM) detector, then maps them to symbolic driving directives in the cloud with an LLM decision maker - reducing bandwidth and delay while preserving interpretability. Decision2Trajectory converts directives into executable paths: Semantic-Guided A* embeds language-derived soft costs into classical search to bias solutions toward feasible trajectories, while an Agentic Refinement Module adapts planner hyperparameters using feedback and memory. Finally, MPC tracks the trajectories in real time, with optional cloud-guided references for difficult cases. Experiments in CARLA show that Agentic Fast-Slow Planning improves robustness under perturbations, reducing lateral deviation by up to 45% and completion time by over 12% compared to pure MPC and an A*-guided MPC baseline. Code is available at https://github.com/cjychenjiayi/icra2026_AFSP.
comment: 8 pages, 12 figures
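The idea behind Semantic-Guided A*, adding language-derived soft costs to the classical search, can be sketched on a grid; the cost structure and function names below are illustrative assumptions, not the paper's implementation:

```python
import heapq

def semantic_astar(grid, soft_cost, start, goal):
    """Grid A* where each cell carries a nonnegative, language-derived
    soft cost added to the unit step cost. High soft costs bias paths
    away from semantically undesirable regions without forbidding them.
    grid[r][c] == 1 marks hard obstacles. Returns (path cost, path)."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # admissible
    frontier = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
                ng = g + 1.0 + soft_cost[r][c]
                if ng < best.get((r, c), float("inf")):
                    best[(r, c)] = ng
                    heapq.heappush(
                        frontier, (ng + h((r, c)), ng, (r, c), path + [(r, c)]))
    return None
```

Because soft costs are nonnegative, the Manhattan heuristic on step cost alone remains admissible, so the search stays optimal while the semantic terms reshape which optimum is found.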
☆ AURA: Multimodal Shared Autonomy for Real-World Urban Navigation
Long-horizon navigation in complex urban environments relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision-Language AI agent and a human operator collaborate on maneuvering the mobile machine, presents a promising solution to address these issues. However, existing shared autonomy methods often require humans and AI to operate within the same action space, leading to high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi-modal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder to align various human instructions with visual and spatial context. To facilitate training, we construct MM-CoS, a large-scale dataset comprising teleoperation and vision-language descriptions. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation. Moreover, under similar takeover conditions, our shared autonomy framework reduces the frequency of takeovers by more than 44%. A demo video and more details are provided on the project page.
comment: 17 pages, 18 figures, 4 tables, conference
☆ Smooth Feedback Motion Planning with Reduced Curvature
Feedback motion planning over cell decompositions provides a robust method for generating collision-free robot motion with formal guarantees. However, existing algorithms often produce paths with unnecessary bending, leading to slower motion and higher control effort. This paper presents a computationally efficient method to mitigate this issue for a given simplicial decomposition. A heuristic is introduced that systematically aligns and assigns local vector fields to produce more direct trajectories, complemented by a novel geometric algorithm that constructs a maximal star-shaped chain of simplexes around the goal. This creates a large "funnel" in which an optimal, direct-to-goal control law can be safely applied. Simulations demonstrate that our method generates measurably more direct paths, reducing total bending by an average of 91.40% and LQR control effort by an average of 45.47%. Furthermore, comparative analysis against sampling-based and optimization-based planners confirms the time efficacy and robustness of our approach. While the proposed algorithms work over any finite-dimensional simplicial complex embedded in the collision-free subset of the configuration space, the practical application focuses on low-dimensional (d ≤ 3) configuration spaces, where simplicial decomposition is computationally tractable.
comment: Accepted for publication in IEEE Robotics and Automation Letters
☆ F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling CVPR 2026
We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
comment: Accepted to the CVPR 2026 SPAR-3D Workshop
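The visibility-aware aggregation step can be sketched as a count-weighted mean. The abstract does not specify the exact rule, so the weighting below is one natural reading, with hypothetical names and scalar attributes standing in for the per-Gaussian appearance parameters:

```python
def aggregate(updates, counts):
    """Visibility-aware server aggregation sketch: average each client's
    per-Gaussian attribute update, weighted by how often that client
    observed the Gaussian. updates[c][g] is client c's value for
    Gaussian g; counts[c][g] is its observation count."""
    n_clients, n_gauss = len(updates), len(updates[0])
    out = []
    for g in range(n_gauss):
        total = sum(counts[c][g] for c in range(n_clients))
        if total == 0:
            out.append(None)  # no client saw this Gaussian; keep old value
            continue
        out.append(
            sum(updates[c][g] * counts[c][g] for c in range(n_clients)) / total)
    return out
```

Weighting by observation frequency down-weights clients that barely glimpsed a Gaussian, which is how partial observability across independently exploring agents is resolved.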
☆ Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior CVPR 2026
In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state there exists a feasible action neighborhood (FAN), rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of the FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvements in sample efficiency and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation.
comment: Accepted by CVPR 2026
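One plausible instantiation of such a Gaussian prior is a per-dimension negative log-likelihood penalty centred on the preferred action; the exact loss used in the paper is not given in the abstract, so the form and names below are illustrative assumptions:

```python
import math

def fan_regularizer(pred, preferred, sigma):
    """Gaussian negative log-likelihood of a predicted action under a
    prior centred on the preferred action. sigma sets the tolerance of
    the feasible action neighborhood per dimension: the penalty is flat
    near the preferred action and grows quadratically outside it."""
    nll = 0.0
    for p, mu, s in zip(pred, preferred, sigma):
        nll += 0.5 * ((p - mu) / s) ** 2 + math.log(s * math.sqrt(2 * math.pi))
    return nll
```

Added to the usual imitation or RL objective, a term like this keeps the output distribution unimodal and smooth around the demonstrated action instead of penalizing every near-equivalent action equally hard.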
☆ AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation
A central challenge in mobile manipulation is preserving multiple plausible action modes while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action distributions rather than collapsing to one solution. In practice, however, full iterative denoising is costly at control time. Action chunking helps amortize inference, yet it also creates partially open-loop behavior, allowing small mismatches to accumulate into drift. We present AnchorVLA, a diffusion-based VLA policy for mobile manipulation built on the core insight that when sampling begins near a plausible solution manifold, extensive denoising is unnecessary to recover multimodal, valid actions. AnchorVLA combines a lightweight VLA adaptation backbone with an anchored diffusion action head, which denoises locally around anchor trajectories using a truncated diffusion schedule. This retains multimodal action generation while reducing inference cost for closed-loop control. Crucially, to mitigate chunking-induced drift, we introduce a test-time self-correction mechanism via a lightweight residual correction module that makes high-frequency, per-step adjustments during rollout. Across diverse mobile manipulation tasks, AnchorVLA improves success and stability under disturbances and distribution shifts while maintaining low-latency inference. The source code is made available at https://github.com/jason-lim26/AnchorVLA.
☆ Robust Autonomous Control of a Magnetic Millirobot in In Vitro Cardiac Flow
Untethered magnetic millirobots offer significant potential for minimally invasive cardiac therapies; however, achieving reliable autonomous control in pulsatile cardiac flow remains challenging. This work presents a vision-guided control framework enabling precise autonomous navigation of a magnetic millirobot in an in vitro heart phantom under physiologically relevant flow conditions. The system integrates UNet-based localization, A* path planning, and a sliding mode controller with a disturbance observer (SMC-DOB) designed for multi-coil electromagnetic actuation. Although drag forces are estimated using steady-state CFD simulations, the controller compensates for transient pulsatile disturbances during closed-loop operation. In static fluid, the SMC-DOB achieved sub-millimeter accuracy (root-mean-square error, RMSE = 0.49 mm), outperforming PID and MPC baselines. Under moderate pulsatile flow (7 cm/s peak, 20 cP), it reduced RMSE by 37% and peak error by 2.4$\times$ compared to PID. It further maintained RMSE below 2 mm (0.27 body lengths) under elevated pulsatile flow (10 cm/s peak, 20 cP) and under low-viscosity conditions (4.3 cP, 7 cm/s peak), where baseline controllers exhibited unstable or failed tracking. These results demonstrate robust closed-loop magnetic control under time-varying cardiac flow disturbances and support the feasibility of autonomous millirobot navigation for targeted drug delivery.
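The SMC-DOB idea, a sliding mode law plus a filtered disturbance estimate that is fed back for cancellation, can be illustrated on a one-dimensional toy plant. This scalar sketch with assumed gains is far simpler than the multi-coil electromagnetic system and pulsatile-flow disturbances in the paper:

```python
def simulate_smc_dob(steps=4000, dt=0.005, ref=1.0, d=0.5,
                     k=2.0, phi=0.05, alpha=0.05):
    """Toy plant x' = u + d with an unknown constant disturbance d.
    Control: sliding mode with a boundary layer (sat instead of sign,
    to avoid chattering) plus a low-pass disturbance observer whose
    estimate d_hat is subtracted from the command."""
    sat = lambda v: max(-1.0, min(1.0, v))
    x, d_hat = 0.0, 0.0
    for _ in range(steps):
        s = x - ref                       # sliding surface = tracking error
        u = -k * sat(s / phi) - d_hat     # SMC term + disturbance cancellation
        x_next = x + dt * (u + d)         # plant step (explicit Euler)
        d_obs = (x_next - x) / dt - u     # back out the acting disturbance
        d_hat += alpha * (d_obs - d_hat)  # low-pass filter the estimate
        x = x_next
    return x, d_hat
```

As the observer converges (d_hat approaches d), the residual disturbance inside the boundary layer shrinks, so the steady-state tracking error goes to zero, the same mechanism that lets the paper's controller absorb pulsatile-flow drag it never modeled exactly.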
☆ MorphoGuard: A Morphology-Based Whole-Body Interactive Motion Controller
Whole-body control (WBC) has demonstrated significant advantages in complex interactive movements of high-dimensional robotic systems. However, when a robot is required to handle dynamic multi-contact combinations along a single kinematic chain, such as pushing open a door with its elbow while grasping an object, it faces major obstacles in terms of complex contact representation and joint configuration coupling. To address this, we propose a new control approach that explicitly manages arbitrary contact combinations, aiming to endow robots with whole-body interactive capabilities. We develop a morphology-constrained WBC network (MorphoGuard), trained on a self-constructed dual-arm physical and simulation platform. A series of model recommendation experiments are designed to systematically investigate the impact of backbone architecture, fusion strategy, and model scale on network performance. To evaluate the control performance, we adopt a multi-object interaction task as the benchmark, requiring the model to simultaneously manipulate multiple target objects to specified positions. Experimental results show that the proposed method achieves a contact point management error of approximately 1 cm, demonstrating its effectiveness in whole-body interactive control.
♻ ☆ TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations: some appear to be tailored primarily to particular, limited sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and visual-language-action models. Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we provide numerical rarity scores to quantify how rarely the current state occurs in the dataset.
Olaf: Bringing an Animated Character to Life in the Physical World
Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
♻ ☆ Allometric Scaling Laws for Bipedal Robots
Scaling the design of robots up or down remains a fundamental challenge. While biological systems follow well-established isometric and allometric scaling laws relating mass, stride frequency, velocity, and torque, it is unclear how these relationships translate to robotic systems. In this paper, we generate similar allometric scaling laws for bipedal robots across three orders of magnitude in leg length. First, we conduct a review of legged robots from the literature and extract empirical relationships between leg length (L), body length, mass, and speed. These data show that robot mass scales more closely to L^2, in contrast to the L^3 scaling predicted by isometric scaling. We then perform controlled simulation studies in Drake using three variants of real quasi-passive, hip-actuated walkers with different foot geometries and control strategies. We evaluate the performance of each design scaled with leg length, L. Across all robots, walking velocity follows the expected L^(1/2) trend from dynamic similarity. Minimum required torque scales more closely with m*L than the isometric model of m*L^2. Foot geometry scaled proportionally with L^1. These results provide new insight into how robot designs allometrically scale to different sizes, and how that scaling is different from isometric or biological scaling laws.
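Scaling exponents like the L^(1/2) velocity trend are typically extracted by linear regression in log-log space; here is a minimal sketch, using synthetic dynamic-similarity data rather than the paper's robot measurements:

```python
import math

def fit_exponent(lengths, values):
    """Least-squares slope of log(value) vs log(length), i.e. the
    exponent b in the power law value = a * length**b."""
    xs = [math.log(l) for l in lengths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# dynamic similarity predicts v ~ sqrt(g * L), i.e. exponent 1/2
L = [0.1, 0.3, 1.0, 3.0]
v = [math.sqrt(9.81 * l) for l in L]
b = fit_exponent(L, v)
```

A power law is a straight line on log-log axes, so the fitted slope directly gives the exponent being compared against the isometric or allometric prediction.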
♻ ☆ ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter
Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning to devise grasping strategies in heavily cluttered environments. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it within a few steps and with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.
comment: Accepted at CoRL 2024. Project Website:(https://h-freax.github.io/thinkgrasp_page/)
♻ ☆ MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters CVPR 2026
We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
comment: CVPR 2026
♻ ☆ Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic Manipulation
Robotic manipulation policies are increasingly empowered by large language models (LLMs) and vision-language models (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose TrojanRobot, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of LVLM-as-a-backdoor, which leverages in-context instruction learning (ICIL) to steer large vision-language model (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, permutation, stagnation, and intentional, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of the proposed TrojanRobot.
♻ ☆ Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
comment: Accepted in 8th Annual Learning for Dynamics & Control Conference (L4DC)
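The cost-aware regulator described above can be illustrated with a minimal sketch: the action is shrunk smoothly as the predicted constraint violation grows, rather than being overridden. The `scale_action` helper, the sigmoid gate, and all parameter names are hypothetical illustrations, not the paper's actual regulator.

```python
import numpy as np

def scale_action(action, predicted_cost, cost_limit=1.0, sharpness=5.0):
    """Hypothetical cost-aware regulator: smoothly attenuate the policy's
    action as the predicted constraint violation approaches the limit,
    preserving exploration instead of hard-overriding the action."""
    # Gate in (0, 1): ~1 when the predicted cost is far below the limit,
    # approaching 0 as the prediction exceeds it.
    gate = 1.0 / (1.0 + np.exp(sharpness * (predicted_cost - cost_limit)))
    return gate * np.asarray(action, dtype=float)
```

Because the gate is smooth in the predicted cost, the modulation stays differentiable, which is what lets such a module plug into off-policy learners like SAC or TD3 without breaking gradient flow.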
♻ ☆ How Leg Stiffness Affects Energy Economy in Hopping
In the fields of robotics and biomechanics, the integration of elastic elements such as springs and tendons in legged systems has long been recognized for enabling energy-efficient locomotion. Yet, a significant challenge persists: designing a robotic leg that performs consistently across diverse operating conditions, especially varying average forward speeds. It remains unclear whether, for such a range of operating conditions, the stiffness of the elastic elements needs to be varied or if a similar performance can be obtained by changing the motion and actuation while keeping the stiffness fixed. This work explores the influence of the leg stiffness on the energy efficiency of a monopedal robot through an extensive parametric study of its periodic hopping motion. To this end, we formulate an optimal control problem parameterized by average forward speed and leg stiffness, solving it numerically using direct collocation. Our findings indicate that, compared to the use of a fixed stiffness, employing variable stiffness in legged systems improves energy efficiency by at most 20% and by 6.8% on average across a range of speeds.
♻ ☆ Physical Human-Robot Interaction: A Critical Review of Safety Constraints
This paper aims to provide a clear and rigorous understanding of commonly recognized safety constraints in physical human-robot interaction, particularly regarding ISO/TS 15066. We investigate the derivation of these constraints, critically examine the underlying assumptions, and evaluate their practical implications for system-level safety and performance in industrially relevant scenarios. Key design parameters within safety-critical control architectures are identified, and numerical examples are provided to quantify performance degradation arising from typical approximations and design decisions in manufacturing environments. Within this analysis, the fundamental role of energy in safety assessment is emphasized, providing focused insights into energy-based safety methodologies for collaborative industrial robot systems.
♻ ☆ V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions
Ensuring safety in autonomous systems requires controllers that aim to satisfy state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they struggle to ensure strict state-wise safety. Conversely, Control Barrier Functions (CBFs) offer a principled mechanism to enforce forward invariance, but often rely on expert-designed barrier functions or knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
comment: 28 pages, 9 figures, 11 tables. Paper accepted at TMLR
♻ ☆ GPA-VGGT: Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss
Transformer-based general visual geometry frameworks, exemplified by recent Visual Geometry Grounded Transformer (VGGT) models, have shown great promise in camera pose estimation, 3D reconstruction, and scene understanding. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
♻ ☆ Multi-Staged Framework for Safety Analysis of Offloaded Services in Distributed Intelligent Transportation Systems
The integration of service-oriented architectures (SOA) with function offloading for distributed, intelligent transportation systems (ITS) offers the opportunity for connected autonomous vehicles (CAVs) to extend their locally available services. One major goal of offloading a subset of functions in the processing chain of a CAV to remote devices is to reduce the overall computational complexity on the CAV. The extension of using remote services, however, requires careful safety analysis, since the remotely created data are corrupted more easily, e.g., through an attacker on the remote device or by intercepting the wireless transmission. To tackle this problem, we first analyze the concept of SOA for distributed environments. From this, we derive a safety framework that validates the reliability of remote services and the data received locally. Since it is possible for the autonomous driving task to offload multiple different services, we propose a specific multi-staged framework for safety analysis dependent on the service composition of local and remote services. For efficiency reasons, we directly include the multi-staged framework for safety analysis in our service-oriented function offloading framework (SOFOF) that we have proposed in earlier work. The evaluation compares the performance of the extended framework in terms of computational complexity (energy savings being a major motivation for function offloading) and assesses its capability to detect data from corrupted remote services.
comment: 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC)
♻ ☆ Vi-TacMan: Articulated Object Manipulation via Vision and Touch ICRA 2026
Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but can yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all p<0.0001). Critically, manipulation succeeds without explicit kinematic models -- the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
comment: ICRA 2026
♻ ☆ DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration CVPR 2026
Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: https://ustc3dv.github.io/DualReg/.
comment: Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/DualReg/
♻ ☆ What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that strong task performance (low *average-case regret*) forces world models, belief-like memory and -- under task mixtures -- persistent variables resembling core primitives associated with emotion, along with informational modularity under block-structured tasks. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of predictive state and belief-like memory, addressing an open question in prior world-model recovery work.
comment: 23 pages; added PSR recovery (Theorems 3 & 4), and updated related work
♻ ☆ Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale with compute, as performance quickly saturates when training revisits the same narrow regions of state space. We introduce OmniReset, a simple and scalable framework that enables on-policy reinforcement learning to robustly solve a broad class of dexterous manipulation tasks using a single reward function, fixed algorithm hyperparameters, no curricula, and no human demonstrations. Our key insight is that long-horizon exploration can be dramatically simplified by using simulator resets to systematically expose the RL algorithm to the diverse set of robot-object interactions which underlie dexterous manipulation. OmniReset programmatically generates such resets with minimal human input, converting additional compute directly into broader behavioral coverage and continued performance gains. We show that OmniReset gracefully scales to long-horizon dexterous manipulation tasks beyond the capabilities of existing approaches and is able to learn robust policies over significantly wider ranges of initial conditions than baselines. Finally, we distill OmniReset into visuomotor policies which display robust retrying behavior and substantially higher success rates than baselines when transferred to the real world zero-shot. Project webpage: https://weirdlabuw.github.io/omnireset/
Computer Vision and Pattern Recognition 150
☆ EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors CVPR 2026
We propose EventHub, a novel framework for training deep event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either both proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or proxy annotations alone when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.
comment: CVPR 2026. Project Page: https://bartn8.github.io/eventhub/ Code: https://github.com/bartn8/eventhub
☆ ActionParty: Multi-Subject Action Binding in Generative Video Games
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
comment: Project page: https://action-party.github.io/
☆ Generative World Renderer
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
comment: Project page: https://alaya-studio.github.io/renderer/
☆ Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection CVPR
We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
comment: Accepted at CVPR Findings 2026
☆ Steerable Visual Representations
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
comment: preprint
☆ Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
comment: 20 pages, 18 figures, Project Page: https://catherine-r-he.github.io/RSC/
☆ Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining CVPR2026
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
comment: Accepted in CVPR2026. Website: https://junxuan-li.github.io/lca
☆ Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
comment: 10 pages, 6 figures
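The history-aware planning idea above (penalizing revisits when choosing the next frontier) can be illustrated with a small sketch. The `frontier_score` discount, `pick_frontier` helper, and the specific penalty form are hypothetical illustrations, not MetaNav's actual scoring rule.

```python
def frontier_score(base_utility, visit_count, penalty=0.5):
    """Discount a frontier's utility by how often the agent has already
    visited it; repeated visits make the frontier progressively less
    attractive, discouraging oscillation and redundant revisiting."""
    return base_utility / (1.0 + penalty * visit_count)

def pick_frontier(frontiers):
    """frontiers: list of (name, utility, visit_count) tuples.
    Returns the name with the highest history-penalized score."""
    return max(frontiers, key=lambda f: frontier_score(f[1], f[2]))[0]
```

Under this scheme a slightly less promising but unexplored frontier can outrank a heavily revisited one, which is exactly the behavior a revisit penalty is meant to produce.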
☆ A Simple Baseline for Streaming Video Understanding
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
comment: Project page: https://simple-stream.github.io/
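The sliding-window baseline is simple enough to sketch in full: keep only the most recent N frames and hand them to an off-the-shelf VLM at query time. The `SlidingWindowStream` class and the `vlm` callable interface below are hypothetical stand-ins, not the paper's code.

```python
from collections import deque

class SlidingWindowStream:
    """Minimal sketch of a sliding-window streaming baseline: retain only
    the most recent `window` frames and pass them, unmodified, to an
    off-the-shelf VLM. `vlm` is any callable (frames, question) -> answer."""

    def __init__(self, vlm, window=4):
        self.vlm = vlm
        self.frames = deque(maxlen=window)  # oldest frame dropped automatically

    def push(self, frame):
        """Ingest one new frame from the stream."""
        self.frames.append(frame)

    def query(self, question):
        """Answer a question using only the frames currently in the window."""
        return self.vlm(list(self.frames), question)
```

The entire "memory mechanism" is the `maxlen` argument of `deque`, which is the point of the baseline: any proposed memory, retrieval, or compression module should have to beat this under the same protocol.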
☆ VOID: Video Object and Interaction Deletion
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
☆ AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
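The sliced Wasserstein distance the method minimizes has a standard construction that can be sketched directly: project both point sets onto random unit directions, sort each 1D projection (1D optimal transport reduces to sorting, giving the log-linear cost per slice), and average the squared differences. This is a generic illustration of the distance, assuming equal-size point sets, not the paper's AdamFlow optimizer or implementation.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Squared sliced Wasserstein-2 distance between two point clouds of
    equal size, shapes (n, d). Each random slice is a 1D transport
    problem solved exactly by sorting."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ dirs.T, axis=0)  # (n, n_proj): sorted 1D projections
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean((px - py) ** 2))
```

An optimizer such as the paper's AdamFlow would then update one point set (a mesh treated as a discrete measure) to decrease this quantity; the distance itself is differentiable almost everywhere in the point coordinates.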
☆ Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
☆ Deep Neural Network Based Roadwork Detection for Autonomous Driving
Road construction sites create major challenges for both autonomous vehicles and human drivers due to their highly dynamic and heterogeneous nature. This paper presents a real-time system that detects and localizes roadworks by combining a YOLO neural network with LiDAR data. The system identifies individual roadwork objects while driving, merges them into coherent construction sites and records their outlines in world coordinates. The model training was based on an adapted US dataset and a new dataset collected from test drives with a prototype vehicle in Berlin, Germany. Evaluations on real-world road construction sites showed a localization accuracy below 0.5 m. The system can support traffic authorities with up-to-date roadwork data and could enable autonomous vehicles to navigate construction sites more safely in the future.
comment: 7 pages, 10 figures
☆ Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond the 0.583 baseline, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
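A relevance-guided retention score combining recency, frequency, and semantic alignment, as the abstract describes, could look like the following sketch. The `memory_score` function, its weights, and the exponential-decay recency term are hypothetical choices, not the paper's actual scoring rule.

```python
import math

def memory_score(item, now, query_sim,
                 w_rec=0.4, w_freq=0.3, w_sem=0.3, half_life=3600.0):
    """Hypothetical forgetting score for one memory item.
    item: dict with 'last_used' (timestamp) and 'uses' (access count).
    query_sim: semantic similarity of the item to the current query in [0, 1].
    Higher scores mean the item is kept under the memory budget."""
    recency = math.exp(-(now - item["last_used"]) / half_life)  # decays with age
    frequency = math.log1p(item["uses"])  # diminishing returns on repeat use
    return w_rec * recency + w_freq * frequency + w_sem * query_sim
```

A budgeted forgetter would then rank all stored items by this score and evict everything below the budget's top-k, bounding memory growth while keeping the items most likely to matter for the current conversation.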
☆ Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
☆ SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation CVPR 2026
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
comment: Accepted to CVPR 2026
☆ UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $π_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $π_{0.5}$, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.
☆ SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition
Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
comment: Accepted to ICPR 2026
☆ UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
comment: code has been released at https://github.com/xiaomi-research/unidrivevla
☆ Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention
This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City St George's, University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
☆ CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
comment: 5 pages, 3 figures. Accepted to the IEEE ISBI 2026 CXR-LT Challenge
☆ ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
comment: 7 pages, 4 figures
☆ Reflection Generation for Composite Image Using Diffusion Model
Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject prior information about reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct the first large-scale object reflection dataset, DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
☆ Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation CVPR 2026
Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
comment: CVPR 2026
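The split-level noise the abstract quantifies can be measured by holding the data and predictions fixed and varying only the fold assignment. A minimal sketch, assuming synthetic subjects and a fixed (imperfect) per-sample predictor, so that all F1 spread comes from the partitioning alone:

```python
import random
from statistics import mean, stdev

def f1(tp, fp, fn):
    # Standard F1 from counts; 0 when undefined.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def split_f1(subjects, labels, preds, k, seed):
    # Subject-exclusive k-fold: shuffle subjects, assign each subject
    # to one fold, then score F1 per fold and average.
    rng = random.Random(seed)
    order = sorted(set(subjects))
    rng.shuffle(order)
    fold_of = {s: i % k for i, s in enumerate(order)}
    scores = []
    for fold in range(k):
        tp = fp = fn = 0
        for s, y, p in zip(subjects, labels, preds):
            if fold_of[s] != fold:
                continue
            tp += (y == 1 and p == 1)
            fp += (y == 0 and p == 1)
            fn += (y == 1 and p == 0)
        scores.append(f1(tp, fp, fn))
    return mean(scores)

# Toy data: 30 subjects x 10 samples, low-prevalence label, noisy predictor.
rng = random.Random(0)
subjects = [s for s in range(30) for _ in range(10)]
labels = [int(rng.random() < 0.2) for _ in subjects]
preds = [y if rng.random() < 0.8 else 1 - y for y in labels]

# Repeat the split with different seeds: predictions never change,
# yet average F1 fluctuates -- that spread is the protocol noise floor.
runs = [split_f1(subjects, labels, preds, k=3, seed=s) for s in range(20)]
print(f"average F1: {mean(runs):.3f} +/- {stdev(runs):.3f}")
```

The same loop applied to a real AU detector's cached predictions would estimate the paper's per-dataset noise floor without any retraining.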
☆ CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
☆ DenOiS: Dual-Domain Denoising of Observation and Solution in Ultrasound Image Reconstruction
Medical imaging aims to recover underlying tissue properties, using inexact (simplified/linearized) imaging models and often from inaccurate and incomplete measurements. Analytical reconstruction methods rely on hand-crafted regularization, which is sensitive to noise assumptions and parameter tuning. Among deep learning alternatives, plug-and-play (PnP) approaches learn regularization while incorporating imaging physics during inference, outperforming purely data-driven methods. The performance of all these approaches, however, still strongly depends on measurement quality and imaging model accuracy. In this work, we propose DenOiS, a framework that denoises both the input observations and the resulting solution in their respective domains. It consists of an observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, and a diffusion-based PnP reconstruction approach that remains robust under missing measurements. DenOiS enables generalization to real data from training only in simulations, resulting in high-fidelity image reconstruction with noisy observations and inexact imaging models. We demonstrate this for speed-of-sound imaging as a challenging setting of quantitative ultrasound image reconstruction.
☆ CASHG: Context-Aware Stylized Online Handwriting Generation
Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor–current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
comment: 42 pages, 19 figures
☆ LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
☆ GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding CVPR 2026
Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
comment: Published as a conference paper at CVPR 2026
☆ Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
comment: ISBI 2026 Accepted Paper & Winning Solution for the RIVA Cervical Cytology Challenge
☆ FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
comment: HuggingFace Space: https://huggingface.co/spaces/dominoer/FlowSlider
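Scaling only the steering term amounts to a simple vector decomposition. A minimal sketch, not the paper's implementation: project an update onto a given fidelity direction, keep that projection fixed, and scale the orthogonal remainder by the slider strength:

```python
def decompose_and_scale(update, fidelity_dir, strength):
    # Split `update` into its projection onto `fidelity_dir` (kept fixed)
    # and the orthogonal remainder (the "steering" part, scaled).
    dot = sum(u * f for u, f in zip(update, fidelity_dir))
    norm2 = sum(f * f for f in fidelity_dir)
    fid = [dot / norm2 * f for f in fidelity_dir]          # fidelity component
    steer = [u - p for u, p in zip(update, fid)]           # orthogonal remainder
    return [p + strength * s for p, s in zip(fid, steer)]

# Toy 2D example: fidelity along x, steering along y.
out = decompose_and_scale(update=[1.0, 2.0], fidelity_dir=[1.0, 0.0], strength=0.5)
print(out)  # [1.0, 1.0] -- fidelity part unchanged, steering halved
```

When the two terms are approximately orthogonal, as the paper measures, scaling the remainder changes edit strength without disturbing the fidelity component.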
☆ Country-wide, high-resolution monitoring of forest browning with Sentinel-2
Natural and anthropogenic disturbances are impacting the health of forests worldwide. Monitoring forest disturbances at scale is important to inform conservation efforts. Here, we present a scalable approach for country-wide mapping of forest greenness anomalies at the 10 m resolution of Sentinel-2. Using relevant ecological and topographical context and an established representation of the vegetation cycle, we learn a predictive quantile model of the normalised difference vegetation index (NDVI) derived from Sentinel-2 data. The resulting expected seasonal cycles are used to detect NDVI anomalies across Switzerland between April 2017 and August 2025. Goodness-of-fit evaluations show that the conditional model explains 65% of the observed variations in the median seasonal cycle. The model consistently benefits from the local context information, particularly during the green-up period. The approach produces coherent spatial anomaly patterns and enables country-wide quantification of forest browning. Case studies with independent reference data from known events illustrate that the model reliably detects different types of disturbances.
comment: 9 pages, 7 figures, to be published in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Congress)
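The anomaly test in the abstract above reduces to comparing observed NDVI against a predicted seasonal quantile band. A minimal sketch, assuming Sentinel-2-style NIR and red reflectances and placeholder quantile predictions (the actual model's band choice and quantile levels are not specified here):

```python
def ndvi(nir, red):
    # Normalised Difference Vegetation Index from NIR and red reflectance.
    return (nir - red) / (nir + red)

def anomaly(observed, q_low, q_high):
    # Flag a pixel whose observed NDVI leaves the predicted seasonal
    # quantile band; the sign distinguishes browning from greening.
    if observed < q_low:
        return -1   # browning anomaly
    if observed > q_high:
        return 1    # greening anomaly
    return 0

# Hypothetical summer pixel and a model-predicted quantile band.
obs = ndvi(nir=0.28, red=0.12)                     # 0.40
print(obs, anomaly(obs, q_low=0.55, q_high=0.85))  # below band -> browning
```

Run per pixel and per acquisition date, this yields exactly the kind of browning map the paper produces at 10 m over a whole country.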
☆ PLUME: Latent Reasoning Based Universal Multimodal Embedding
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
☆ Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR 2026
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
comment: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
☆ Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement
Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017–2024) across 89 industry sectors, we demonstrate that graph-theoretic features, which include centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed ($R^2$ falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
comment: Accepted for Poster presentation at the ESCoE Conference on Economic Measurement 2026
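The two graph features named above have simple closed forms. A minimal sketch on a toy payment network (the sector names are illustrative, not the paper's data): degree centrality is the fraction of other nodes a sector connects to, and the local clustering coefficient is the fraction of a node's neighbour pairs that are themselves connected.

```python
def degree_centrality(adj, node):
    # Fraction of the other nodes this node is connected to.
    return len(adj[node]) / (len(adj) - 1)

def clustering(adj, node):
    # Local clustering coefficient: share of neighbour pairs
    # that are themselves linked (triangles through this node).
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

# Toy undirected payment network between four sectors.
adj = {
    "finance":      {"wholesale", "professional", "retail"},
    "wholesale":    {"finance", "professional"},
    "professional": {"finance", "wholesale"},
    "retail":       {"finance"},
}
print(degree_centrality(adj, "finance"))  # connected to all others: 1.0
print(clustering(adj, "finance"))         # 1 of 3 neighbour pairs linked: 1/3
```

Per-sector time series of such features are what the paper feeds into its forecasting models alongside the raw payment volumes.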
☆ CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
comment: Code available at: github.com/Lorenzo-0-0/CompassAD
☆ COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
☆ True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare two skin tone extraction strategies: cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $\Delta E$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
comment: 20 pages, 10 figures
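The two colorimetric measures named in the abstract are standard and easy to state in code. A minimal sketch, assuming the CIE76 form of $\Delta E$ (Euclidean distance in CIELAB) and the usual ITA definition; the paper may use a different $\Delta E$ variant:

```python
import numpy as np

def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two CIELAB colors."""
    return float(np.linalg.norm(np.asarray(lab1, float) - np.asarray(lab2, float)))

def ita_degrees(L, b):
    """Individual Typology Angle: arctan((L* - 50) / b*), in degrees.
    Larger ITA corresponds to lighter skin tone."""
    return float(np.degrees(np.arctan((L - 50.0) / b)))

# Example: a rendered skin patch vs. its photographic reference, as (L*, a*, b*)
ref = (65.0, 18.0, 20.0)
ren = (60.0, 20.0, 15.0)
print(delta_e_cie76(ref, ren))  # ~7.35
print(ita_degrees(65.0, 20.0))  # ~36.87 degrees
```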
☆ Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
comment: 18 pages, 7 figures
☆ Efficient Reasoning via Thought Compression for Language Segmentation
Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of \textit{thinking twice -- once for learning, once for speed}. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly \textbf{5$\times$} -- from 112 to just 23 tokens. Code is available at \href{https://github.com/mrazhou/WISE}{WISE}.
☆ IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's $\kappa$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
comment: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026
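Cohen's $\kappa$, used above to score auto-annotators against human labels, corrects raw agreement for agreement expected by chance. A minimal binary-label sketch:

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary label vectors
    (e.g. auto-annotator decisions vs. human labels)."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                      # observed agreement
    # expected chance agreement from each rater's marginal positive/negative rates
    pe = np.mean(a) * np.mean(b) + np.mean(1 - a) * np.mean(1 - b)
    return float((po - pe) / (1 - pe))

auto  = [1, 1, 0, 1, 0, 0, 1, 0]
human = [1, 1, 0, 0, 0, 0, 1, 1]
print(round(cohens_kappa(auto, human), 3))  # 0.5
```

Raw agreement here is 0.75, but both raters label half the frames positive, so chance agreement is 0.5 and kappa discounts it accordingly.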
☆ Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates while background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders favor these dominant patterns, losing fine-grained detail and producing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations, and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of how data is represented within a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.
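As a rough illustration of the first component, a reconstruction loss can upweight pixels whose intensity is statistically uncommon at that spatial location across the batch. The histogram-based surprisal weighting below is an illustrative stand-in under stated assumptions (intensities in [0, 1], per-location histograms), not the paper's exact self-entropy loss:

```python
import numpy as np

def rarity_weighted_mse(batch, recon, n_bins=16):
    """Reconstruction loss that upweights statistically rare values per spatial
    location. Assumes intensities in [0, 1]; rarity is measured as surprisal
    under a per-location intensity histogram estimated from the batch."""
    bins = np.clip((batch * n_bins).astype(int), 0, n_bins - 1)        # (B, H, W)
    # P(bin | location), estimated across the batch dimension
    onehot = (bins[None] == np.arange(n_bins)[:, None, None, None])    # (n_bins, B, H, W)
    p = onehot.mean(axis=1)                                            # (n_bins, H, W)
    hh, ww = np.indices(batch.shape[1:])
    surprisal = -np.log(np.clip(p[bins, hh, ww], 1e-8, 1.0))           # (B, H, W)
    w = surprisal / (surprisal.mean() + 1e-8)                          # normalize weights
    return float(np.mean(w * (batch - recon) ** 2))

rng = np.random.default_rng(0)
batch = rng.random((8, 4, 4))          # 8 small images
batch[0, 1, 1] = 1.0                   # one statistically rare bright pixel
print(rarity_weighted_mse(batch, np.zeros_like(batch)))
```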
☆ Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
☆ Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
☆ Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
☆ ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Diffusion-Guided Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
☆ MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction
Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
comment: accepted by ICME 2026, to be published
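The linear-attention trick underlying a shared global context matrix can be sketched generically: with a positive feature map, the d-by-d matrix (phi(K)^T V) is computed once and reused for every query, giving O(N d^2) cost instead of the O(N^2 d) of softmax attention. This is the generic mechanism only; MTLSI-Net's task-shared context block is analogous but its exact formulation is not reproduced here.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Linear-complexity attention via the kernel trick: compute the shared
    (d, d_v) context matrix phi(K)^T V once, then apply it to every query,
    avoiding the explicit (N, N) attention matrix."""
    phi = lambda x: np.maximum(x, 0) + 1e-6        # simple positive feature map
    Qp, Kp = phi(Q), phi(K)
    context = Kp.T @ V                             # (d, d_v), shared across all queries
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T    # (N, 1) per-query normalizer
    return (Qp @ context) / norm

rng = np.random.default_rng(0)
N, d = 1024, 32
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 32)
```

Because the feature map is positive, each output row is a convex combination of the value rows, matching the behavior of softmax attention at a fraction of the cost on long sequences.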
☆ Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35\,GB to around 20\,GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
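As an illustration of the frequency-domain half of DMS, spectral consistency between two motion trajectories can be scored by comparing FFT magnitude spectra, which matches oscillation frequency and amplitude without dense per-frame targets. This is a sketch under stated assumptions; the exact loss in Resonance4D may differ.

```python
import numpy as np

def spectral_consistency_loss(pred_traj, ref_traj):
    """Frequency-domain motion supervision (illustrative): compare magnitude
    spectra of per-point trajectories over time, so oscillatory patterns are
    matched without frame-by-frame targets.
    Trajectories have shape (T, N, 3): T frames, N points, 3D positions."""
    Fp = np.abs(np.fft.rfft(pred_traj, axis=0))
    Fr = np.abs(np.fft.rfft(ref_traj, axis=0))
    return float(np.mean(np.abs(Fp - Fr)))

t = np.linspace(0, 1, 64, endpoint=False)
ref  = np.sin(2 * np.pi * 4 * t)[:, None, None] * np.ones((64, 10, 3))  # 4 Hz oscillation
same = ref.copy()
off  = np.sin(2 * np.pi * 8 * t)[:, None, None] * np.ones((64, 10, 3))  # wrong frequency
print(spectral_consistency_loss(same, ref))  # 0.0 (identical spectra)
print(spectral_consistency_loss(off, ref))   # > 0 (frequency mismatch penalized)
```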
☆ Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
☆ Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
☆ Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.
☆ NearID: Identity Representation Learning via Near-identity Distractors
When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
comment: Code at https://github.com/Gorluxor/NearID
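The two-tier hierarchy above (same identity > NearID distractor > random negative) can be written as a pair of margin constraints. Cosine similarity and the margin values here are assumptions for illustration; the paper's contrastive objective may differ in form.

```python
import numpy as np

def two_tier_margin_loss(anchor, pos, near_distractor, rand_neg, m1=0.2, m2=0.2):
    """Illustrative two-tier ranking objective: penalize violations of
    sim(anchor, same-identity) > sim(anchor, NearID distractor)  (by margin m1)
    sim(anchor, NearID distractor) > sim(anchor, random negative) (by margin m2)."""
    sim = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    s_pos  = sim(anchor, pos)
    s_near = sim(anchor, near_distractor)
    s_rand = sim(anchor, rand_neg)
    tier1 = max(0.0, m1 - (s_pos - s_near))    # positive above near-identity distractor
    tier2 = max(0.0, m2 - (s_near - s_rand))   # distractor above random negative
    return tier1 + tier2

anchor = np.array([1.0, 0.0])
print(two_tier_margin_loss(anchor, np.array([1.0, 0.0]),
                           np.array([1.0, 1.0]), np.array([0.0, 1.0])))  # 0.0
```

With embeddings already ordered as positive > distractor > random by at least the margins, both hinge terms vanish and the loss is zero.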
☆ SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
☆ Ego-Grounding for Personalized Question-Answering in Egocentric Videos CVPR'26
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants (open-source vs. proprietary, thinking vs. non-thinking, small vs. large scale) all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
comment: To appear at CVPR'26
☆ Automated Prostate Gland Segmentation in MRI Using nnU-Net
Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
comment: 9 pages, 2 tables, 1 figure
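The Dice score reported above is the standard overlap metric between binary masks, 2|A intersect B| / (|A| + |B|):

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary segmentation masks:
    2 * |intersection| / (|pred| + |gt|), with eps for empty-mask stability."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

# Toy masks: prediction covers 3 of the 4 ground-truth voxels
gt   = np.array([[1, 1], [1, 1]])
pred = np.array([[1, 1], [1, 0]])
print(round(dice_score(pred, gt), 3))  # 0.857
```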
☆ MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
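The routing step of motion-aware sparse interaction can be illustrated by thresholding optical-flow magnitude: pixels flagged dynamic would receive the computationally intensive cross-modal attention, while static pixels take the lightweight path. The magnitude threshold below is an illustrative assumption, not the paper's exact criterion.

```python
import numpy as np

def dynamic_region_mask(flow, thresh=1.0):
    """Partition pixels into dynamic vs. static regions from a dense optical
    flow field of shape (H, W, 2). Returns the boolean dynamic mask and the
    fraction of pixels routed to the expensive cross-modal attention path."""
    mag = np.linalg.norm(flow, axis=-1)        # per-pixel motion magnitude
    mask = mag > thresh
    return mask, float(mask.mean())

flow = np.zeros((480, 640, 2))
flow[100:200, 300:400] = [3.0, 4.0]            # one moving object, |flow| = 5
mask, frac = dynamic_region_mask(flow)
print(frac)  # ~0.033 -> heavy attention spent on only ~3% of pixels
```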
☆ A Self supervised learning framework for imbalanced medical imaging datasets
Two problems often plague medical imaging analysis: 1) non-availability of large quantities of labeled training data, and 2) dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self-supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the robustness of SSL to imbalanced data has rarely been investigated in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on RetinaMNIST, 1.88% on TissueMNIST, and 3.1% on DermaMNIST.
☆ Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
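The reward-conditional switch described for RSRS can be sketched as a simple routing rule: samples whose reward is zero (where the RL advantage collapses) are rerouted to supervised fine-tuning, while the rest stay in the RL stream. This is our own toy illustration, not the paper's code; the batch format and function name are assumed.

```python
# Illustrative sketch, not the paper's implementation: reroute zero-reward
# samples to supervised fine-tuning (SFT), keep the rest for RL updates.
def route_batch(samples):
    """samples: list of (sample_id, reward) pairs -> (rl_batch, sft_batch)."""
    rl_batch, sft_batch = [], []
    for sample_id, reward in samples:
        if reward == 0:                  # zero reward: advantage collapses
            sft_batch.append(sample_id)  # fall back to supervised fine-tuning
        else:
            rl_batch.append(sample_id)   # usable reward signal: keep in RL
    return rl_batch, sft_batch
```

In a real training loop, the two batches would then feed separate loss terms within the same optimization step.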
☆ Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
Accurate target-background separation in infrared small target detection (IRSTD) depends heavily on the discriminability of the extracted representations. However, most existing methods are confined to domain-consistent settings, overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR). Extensive experiments on three IRSTD datasets show that the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
comment: The code will be released at https://github.com/fuyimin96/S2CPNet upon acceptance
☆ Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
☆ Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to the absence of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention mechanism that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
comment: 10 pages, 6 figures
☆ Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
comment: 6 pages, 3 figures, 2 algorithms, ETRA26
☆ Lifting Unlabeled Internet-level Data for 3D Scene Understanding CVPR 2026
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data that facilitates end-to-end 3D scene understanding models alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
comment: CVPR 2026. Project page: https://sv-pp.github.io/
☆ Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.
comment: 16 pages, 8 figures, accepted by JSTARS
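The Gram-polynomial activations described for Light-ResKAN rely, as orthogonal polynomial families generally do, on a three-term recurrence. As a hedged illustration (using the Legendre recurrence as a stand-in, since the Gram-polynomial coefficients differ, and with names of our own choosing), a polynomial-basis learnable activation might look like:

```python
# Sketch of a polynomial-basis learnable activation (our illustration, not the
# paper's code). Legendre recurrence: (n+1)P_{n+1}(x) = (2n+1)xP_n(x) - nP_{n-1}(x).
def legendre_basis(x, degree):
    """Evaluate P_0..P_degree at scalar x via the three-term recurrence (degree >= 1)."""
    basis = [1.0, x]
    for n in range(1, degree):
        basis.append(((2 * n + 1) * x * basis[n] - n * basis[n - 1]) / (n + 1))
    return basis[: degree + 1]

def poly_activation(x, coeffs):
    """Learnable activation: weighted sum of polynomial basis functions; the
    coefficients would be the trainable parameters."""
    return sum(c * p for c, p in zip(coeffs, legendre_basis(x, len(coeffs) - 1)))
```

In a KAN-style layer, each edge (or, under the paper's parameter-sharing scheme, each channel) would carry its own coefficient vector.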
☆ FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
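FTPFusion's high/low-frequency decomposition can be illustrated, at a much-simplified 1-D level, by a moving-average split; this is purely our toy sketch (window size and the averaging scheme are our assumptions), not the paper's decomposition:

```python
# Toy sketch (our illustration): split a 1-D signal into a low-frequency
# component (moving average) and a high-frequency residual, so that
# low + high reconstructs the input exactly.
def freq_split(signal, window=3):
    half = window // 2
    low = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[lo:hi]) / (hi - lo))  # local mean = low frequency
    high = [s - l for s, l in zip(signal, low)]     # residual = high frequency
    return low, high
```

The two branches in the paper then process these components differently: sparse cross-modal interaction on the high-frequency part and temporal perturbation on the low-frequency part.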
☆ SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes
We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.
comment: Accepted at ICPR 2026
☆ ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
☆ Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters CVPR
Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
comment: Text-to-Image version of the Anyone can Jailbreak paper. Accepted in CVPR-W AIMS 2026
☆ GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% of the Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
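The opacity-aware progressive pruning described for GS^2 can be sketched as thresholding Gaussians on opacity with a threshold that tightens over training. The linear schedule, default values, and data layout here are our assumptions, not the paper's:

```python
# Minimal sketch (our illustration) of opacity-aware progressive pruning:
# drop Gaussian points whose opacity falls below a threshold that ramps up
# linearly from 0 to final_threshold over training.
def prune_gaussians(points, step, max_steps, final_threshold=0.05):
    """points: list of (point_id, opacity). Returns the surviving points."""
    threshold = final_threshold * min(1.0, step / max_steps)  # progressive schedule
    return [(pid, a) for pid, a in points if a >= threshold]
```

Early in training nothing is pruned, so densification can still add points; by the end, persistently low-opacity Gaussians are removed.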
☆ A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
☆ HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; and 3) layer-level, where token redundancy is gradually reduced in deeper LLM layers without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
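As a toy analogue of HieraVid's frame-level stage, near-duplicate adjacent frames can be collapsed by comparing feature vectors; the cosine criterion, threshold, and greedy keep-the-last-representative policy here are our illustrative assumptions, not the paper's mechanism:

```python
import math

# Toy sketch (our illustration): within a segment, drop a frame when its
# feature vector is nearly identical to the last kept frame's features.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prune_similar_frames(frames, threshold=0.98):
    """frames: non-empty list of nonzero feature vectors; greedy deduplication."""
    kept = [frames[0]]
    for f in frames[1:]:
        if cosine(kept[-1], f) < threshold:  # sufficiently different: keep it
            kept.append(f)
    return kept
```

Joint pruning within a segment, as described above, would additionally balance which frames survive to preserve diversity rather than deciding greedily.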
☆ GeoAI Agency Primitives
We present ongoing research on agency primitives for GeoAI assistants -- core capabilities that connect Foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of $9$ primitives for such a layer -- including navigation, perception, geo-referenced memory, and dual modeling -- along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
☆ MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Autoregressive (AR) models have demonstrated significant success in text-to-image generation. However, they typically face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models struggle with ambiguous prompts that admit several valid interpretations. To address these issues, we introduce MAR-MAER, a hierarchical autoregressive framework that combines two main components: a metric-aware embedding regularization method and a probabilistic latent model for handling ambiguous semantics. Our method utilizes a lightweight projection head trained with an adaptive kernel regression loss, which aligns the model's internal representations with human-preferred quality metrics such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module that injects controlled randomness into the hierarchical token generation process, allowing the model to produce a diverse array of coherent images from ambiguous or open-ended prompts. We conducted extensive experiments on COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves strong performance in both metric consistency and semantic flexibility, exceeding the baseline Hi-MAR model by +1.6 in CLIPScore and +5.3 in HPSv2. For ambiguous inputs, it produces a notably wider range of outputs. These findings are confirmed by both human evaluation and automated metrics.
comment: Accepted by AMME 2025
☆ Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation CVPR2026
Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
comment: Accepted by CVPR2026 Workshop "AI-driven Skilled Activity Understanding, Assessment & Feedback Generation (SAUAFG)"
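The CDF-based segment-level regularization above can be illustrated by comparing cumulative class distributions of predicted and ground-truth label sequences. This sketch is our own simplification of the idea (the paper matches cumulative distributions over segments, with details we do not reproduce here):

```python
# Sketch (our illustration, with assumed details): compute the cumulative
# class distribution of a label sequence and penalize the mean absolute
# difference between predicted and ground-truth CDFs.
def cdf(labels, num_classes):
    counts = [0] * num_classes
    for l in labels:
        counts[l] += 1
    total, running, out = len(labels), 0, []
    for c in counts:
        running += c
        out.append(running / total)  # cumulative fraction up to this class
    return out

def cdf_loss(pred, gt, num_classes):
    p, g = cdf(pred, num_classes), cdf(gt, num_classes)
    return sum(abs(a - b) for a, b in zip(p, g)) / num_classes
```

A loss of this shape is differentiable when built from soft class probabilities instead of hard labels, which is how it would be used at training time.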
☆ Enhanced Polarization Locking in VCSELs
While optical injection locking (OIL) of vertical-cavity surface-emitting lasers (VCSELs) has been widely studied in the past, the polarization dynamics of OIL have received far less attention. Recent studies suggest that polarization locking via OIL could enable novel computational applications such as polarization-encoded Ising computers. However, the inherent polarization preference and limited polarization switchability of VCSELs hinder their use for such purposes. To address these challenges, we fabricate VCSELs with tailored oxide aperture designs and combine these with bias current tuning to study the overall impact on polarization locking. Experimental results demonstrate that this approach reduces the required injection power (to as low as 3.6 μW) and expands the locking range. To investigate the impact of this approach, the spin-flip model (SFM) is used to analyze the effects of amplitude anisotropy and bias current on polarization locking, showing strong agreement with experimental results.
☆ Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
☆ FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
Gaussian Splatting (GS) has emerged as a dominating technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par with or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than state-of-the-art GS-based CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.
☆ Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
comment: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026
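The matching quantization described for PI-VQ assigns latent vectors to distinct codebook entries via optimal bipartite matching, so that codes cannot collide as they can under nearest-neighbour quantization. A brute-force toy version (feasible only for tiny sizes; a real implementation would use an efficient assignment solver such as the Hungarian algorithm) might look like:

```python
from itertools import permutations

# Toy sketch (our illustration) of quantization via optimal bipartite matching:
# each latent is matched to a distinct codebook entry minimizing total distance.
def matching_quantize(latents, codebook):
    """Return a tuple of codebook indices, one per latent, all distinct."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(
        permutations(range(len(codebook)), len(latents)),
        key=lambda perm: sum(dist(z, codebook[j]) for z, j in zip(latents, perm)),
    )
```

The one-to-one constraint is what increases the effective bottleneck capacity relative to naive nearest-neighbour assignment, where many latents can map to the same code.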
☆ Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers
Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.
☆ Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification
Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.
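The cross-domain ranking idea above, learning rank scores consistent with the ordinal severity labels, can be illustrated with a margin-based pairwise penalty; the exact loss form, margin value, and function names are our assumptions, not the paper's:

```python
# Sketch (our own formulation): for a pair of samples, possibly from different
# domains, the sample with the higher severity grade should get the higher
# rank score, enforced with a margin hinge; equal grades pull scores together.
def pairwise_rank_loss(score_a, grade_a, score_b, grade_b, margin=1.0):
    if grade_a == grade_b:
        return abs(score_a - score_b)             # equal grades: scores should agree
    sign = 1.0 if grade_a > grade_b else -1.0     # higher grade -> higher score
    return max(0.0, margin - sign * (score_a - score_b))
```

Averaging such penalties over pairs drawn across the source and target domains is what would push the two domains' rank-score distributions toward alignment.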
☆ Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
☆ SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers CVPR26
Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Code is available at https://github.com/deng12yx/SafeRoPE.
comment: CVPR26
☆ STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
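The joint normalization the STRIVE abstract describes can be sketched as follows: instead of normalizing rewards within each group of textual generations (where similar correctness yields near-zero variance), rewards are pooled across both visual variants and generations before the GRPO-style advantage is computed. This is a minimal illustration under assumed shapes; `joint_group_advantages` and the reward layout are hypothetical names, not the paper's code.

```python
import numpy as np

def joint_group_advantages(rewards):
    """Normalize rewards jointly over visual variants x text generations.

    rewards: array of shape (n_variants, n_generations), one scalar reward
    per (spatiotemporal variant, sampled answer) pair. Returns advantages
    normalized over the whole pooled group.
    """
    r = np.asarray(rewards, dtype=float)
    mu, sigma = r.mean(), r.std()
    return (r - mu) / (sigma + 1e-8)

# Per-variant normalization would collapse when all answers to one variant
# are equally correct; pooling across variants keeps the variance non-zero.
adv = joint_group_advantages([[1.0, 1.0], [0.0, 1.0]])
```

The first row alone has zero reward variance, so its per-variant advantages would vanish; the pooled group still yields informative, non-degenerate signals.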
☆ A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, the model achieved an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort. On the external validation dataset, the proposed approach achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based alternative to molecular assays with the potential to support clinical decision-making.
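The Monte Carlo dropout uncertainty that feeds the patch-selection objectives above can be sketched generically: run T stochastic forward passes with dropout active, average the softmax outputs, and score each patch by its predictive entropy. The function name and the entropy choice are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mc_dropout_uncertainty(probs):
    """Predictive entropy from T stochastic forward passes.

    probs: (T, n_classes) softmax outputs for the same patch, obtained
    with dropout kept active at inference time.
    """
    p_mean = np.asarray(probs).mean(axis=0)            # average prediction
    entropy = -np.sum(p_mean * np.log(p_mean + 1e-12)) # uncertainty score
    return p_mean, entropy
```

A patch whose stochastic predictions disagree receives high entropy, making it a candidate for down-weighting (or closer inspection) during the multi-objective selection.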
☆ PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency CVPR 2026
Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, and MS2 datasets, as well as our own, demonstrating robust and accurate depth estimation performance.
comment: Accepted at CVPR 2026
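The recursive Bayesian scale update the PTC-Depth abstract mentions can be sketched as a one-dimensional Gaussian filter: each frame's triangulated sparse depth yields a noisy observation of the metric scale, which refines a closed-form posterior used to rescale the foundation model's relative depth. The class name, the constant-scale assumption, and the fixed observation variance are all illustrative, not the paper's exact formulation.

```python
class ScaleFilter:
    """Recursive Gaussian estimate of a metric scale factor.

    Each observation z is a noisy scale measurement (e.g., the ratio of
    triangulated sparse depth to the model's relative depth at matching
    pixels); mean and variance are updated in closed form.
    """

    def __init__(self, mean=1.0, var=1.0):
        self.mean, self.var = mean, var

    def update(self, z, obs_var=0.1):
        k = self.var / (self.var + obs_var)   # Kalman-style gain
        self.mean += k * (z - self.mean)      # pull mean toward observation
        self.var *= (1.0 - k)                 # posterior variance shrinks
        return self.mean

    def rescale(self, relative_depth):
        # apply the current scale estimate to relative depth values
        return [self.mean * d for d in relative_depth]
```

Because the posterior variance shrinks with each frame, a single outlier observation perturbs the scale less and less over time, which is what suppresses frame-to-frame jitter.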
☆ GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents CVPR 2026
Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. Terrain distribution and road generation agents apply water-centric terrain and explorative pathway rules. Because the selection and spatial layout of garden assets must follow aesthetic and cultural constraints, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.
comment: CVPR 2026, Project page: https://monad-cube.github.io/GardenDesigner
☆ FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Yet airborne LiDAR, the reference source for forest structure metrics such as the Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD), remains costly and infrequently acquired. We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km² of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, R² = 0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
comment: Paper in-review
☆ DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
comment: 11 pages, 4 figures; Project Website: https://drivedreamer-policy.github.io/
☆ Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning ICLR 2026
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
comment: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
☆ Cosine-Normalized Attention for Hyperspectral Image Classification
Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
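The cosine-normalized scoring this abstract describes can be sketched directly: queries and keys are projected onto the unit hypersphere, so scores depend only on the angle between embeddings, and the cosine is squared before the softmax. The function name, the temperature `tau`, and the single-head layout are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def cosine_sq_attention(Q, K, V, tau=1.0):
    """Attention with squared-cosine scores instead of raw dot products.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). L2-normalizing Q and K makes
    the scores invariant to feature magnitude, emphasizing the angular
    structure of hyperspectral signatures.
    """
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    cos = Qn @ Kn.T                        # cosine similarity in [-1, 1]
    scores = (cos ** 2) / tau              # magnitude-free squared cosine
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax over keys
    return w @ V
```

Scaling the queries by any positive constant leaves the output unchanged, which is exactly the reduced magnitude sensitivity the abstract argues for.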
☆ Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain substantial entangled information about the style, lighting, and semantics of the scene. This makes them well suited to reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use these features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
comment: project page https://dedoardo.github.io/projects/control-dino/
☆ Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding CVPR 2026
Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
comment: Accepted by CVPR 2026
☆ Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
comment: 15 pages, 10 figures
☆ Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
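One plausible reading of the Nearest Neighbor Exclusive Circle (NNEC) constraint above is that each annotated point receives an exclusive circle whose radius is half the distance to its nearest other annotation, so circles never overlap. The function name and this exact radius rule are assumptions for illustration; the paper's definition may differ.

```python
import numpy as np

def nnec_radii(points):
    """Per-point exclusive-circle radii from point annotations.

    points: (n, 2) head-point coordinates. Returns half the distance to
    each point's nearest neighbor, so the resulting circles are pairwise
    non-overlapping by construction.
    """
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-distances
    return d.min(axis=1) / 2.0
```

Such circles give a SAM-style prompt region a hard spatial bound in dense crowds, where an unconstrained mask would otherwise bleed into neighboring instances.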
☆ Setup-Independent Full Projector Compensation
Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: A carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/
comment: 16 pages, 17 figures
☆ SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
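The Trajectory Interpolation step in the SteerFlow abstract can be sketched as a convex blend of the target-editing and source-reconstruction velocities, optionally gated by a spatial mask so the background follows the reconstruction field. The function name, the scalar blend weight, and the mask semantics are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def blended_velocity(v_edit, v_recon, alpha, mask=None):
    """Blend editing and reconstruction velocities along the flow.

    alpha in [0, 1] sets editing strength; mask (1 inside the edit
    region, 0 outside) confines the edit spatially, as in the abstract's
    Adaptive Masking mechanism.
    """
    v = alpha * v_edit + (1.0 - alpha) * v_recon
    if mask is not None:
        v = mask * v + (1.0 - mask) * v_recon  # background stays anchored
    return v
```

With alpha near 0 (or mask 0) the trajectory reduces to pure source reconstruction, which is what keeps edits anchored to the source image.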
☆ End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement CVPR2026
This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships that accounts for the initial SA heatmaps, and the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.
comment: Accepted to CVPR2026 Workshop (GAZE 2026)
☆ Bias mitigation in graph diffusion models ICLR 2025
Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at https://github.com/kunzhan/spp
comment: Accepted to ICLR 2025!
☆ Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
☆ From Understanding to Erasing: Towards Complete and Stable Video Object Removal
Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
☆ BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
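The two mechanisms the BTS-rPPG abstract names can be sketched compactly: the FFT-style XOR pairing schedule that connects every frame pair in log2(T) stages, and the orthogonal feature transfer that removes the component of the source feature already present in the target before shifting. The function names and the power-of-two frame count are illustrative assumptions.

```python
import math
import numpy as np

def butterfly_schedule(T):
    """FFT-style pairing: at stage s, frame t exchanges features with
    frame t XOR 2**s. T is assumed to be a power of two so every XOR
    partner stays in range; after log2(T) stages all frames are connected.
    """
    return [[(t, t ^ (1 << s)) for t in range(T)]
            for s in range(int(math.log2(T)))]

def orthogonal_transfer(f_src, f_tgt, eps=1e-8):
    """Keep only the source component orthogonal to the target feature,
    so the shifted feature carries complementary, non-redundant content.
    """
    coef = (f_src @ f_tgt) / (f_tgt @ f_tgt + eps)
    return f_src - coef * f_tgt
```

Stage 0 pairs adjacent frames, stage 1 pairs frames two apart, and so on, which is how the temporal receptive field doubles per stage rather than growing by one frame at a time.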
☆ Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce a geometry-aware SDF constraint, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
comment: Project page: https://caiyw2023.github.io/Director/
☆ GPA: Learning GUI Process Automation from Demonstrations
GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability, safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution in finishing long-horizon GUI tasks.
☆ HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
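The low-frequency spectral transfer underlying FDA can be illustrated with a standard amplitude-swap sketch. The band width `beta` and the amplitude/phase split are assumptions; the authors' exact formulation may differ:

```python
import numpy as np

def fda_transfer(src, ref, beta=0.05):
    """Swap the lowest-frequency amplitude band of `src` (H x W array)
    with that of `ref`, transferring domain-dependent appearance while
    keeping the phase, which carries structure. `beta` sets the fraction
    of the spectrum treated as 'low' (illustrative sketch)."""
    fs = np.fft.fftshift(np.fft.fft2(src))
    fr = np.fft.fftshift(np.fft.fft2(ref))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_r = np.abs(fr)
    h, w = src.shape
    bh, bw = max(1, int(beta * h)), max(1, int(beta * w))
    cy, cx = h // 2, w // 2
    # Replace the centered low-frequency amplitude block with ref's.
    amp_s[cy - bh:cy + bh + 1, cx - bw:cx + bw + 1] = \
        amp_r[cy - bh:cy + bh + 1, cx - bw:cx + bw + 1]
    out = np.fft.ifft2(np.fft.ifftshift(amp_s * np.exp(1j * pha_s)))
    return np.real(out)
```

Because the DC amplitude is transferred while the source phase is preserved, the output inherits the reference image's global brightness but keeps the source's spatial structure.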
☆ Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion
Embodied perception systems face severe distribution drift in dynamic environments as they continuously interact in open physical spaces. However, existing domain-incremental perception methods often rely on domain IDs provided in advance at test time, which limits their practicality in unknown interaction scenarios. At the same time, models often overfit to context-specific perceptual noise, leading to insufficient generalization and catastrophic forgetting. To address these limitations, we propose a domain-id-free and exemplar-free incremental learning framework for embodied multimedia systems that aims to achieve robust continuous environment adaptation. The method designs a disentangled representation mechanism to remove non-essential environmental style interference and guide the model to focus on extracting semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in the parameter space, ensuring that the model adapts to the new distribution without storing historical data while maximally retaining its discrimination ability on old environments. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-id-free setting, and its accuracy surpasses existing state-of-the-art methods.
comment: Accepted by ICME2026
♻ ☆ LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/.
♻ ☆ Scaling Video Pretraining for Surgical Foundation Models
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
♻ ☆ HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering CVPR 2026
Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, modestly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.
comment: Accepted to CVPR 2026
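The MRFS metric admits a straightforward reading as a smallest-sufficient-subset search. The sketch below assumes an oracle `answers_correctly` that tests whether a candidate frame subset suffices for a correct answer, which is a simplification of the paper's protocol:

```python
from itertools import combinations

def minimum_required_frame_set(frames, answers_correctly):
    """Return the size of the smallest frame subset a model must fuse
    to answer correctly (MRFS), or None if no subset suffices.
    Exhaustive search sketch, not the paper's exact procedure."""
    for k in range(1, len(frames) + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k
    return None
```

By construction, questions that can be answered from a single frame yield MRFS = 1, while HERBench-style questions require MRFS >= 3.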
♻ ☆ FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution CVPR 2026
Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
comment: CVPR 2026 (camera ready ver.). Please visit our project page at https://cmlab-korea.github.io/FRAMER/
♻ ☆ Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
comment: Updated first authors
OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The code and model are publicly available. Project Page: https://omniweaving.github.io.
comment: 32 pages, 22 figures. Project Page: https://omniweaving.github.io. Github: https://github.com/Tencent-Hunyuan/OmniWeaving. Model: https://huggingface.co/tencent/HY-OmniWeaving
♻ ☆ Tackling Non-IIDness in HAPS-Aided Federated Learning
High-altitude platform stations (HAPS) enable large-scale federated learning (FL) in non-terrestrial networks (NTN) by providing wide-area coverage and predominantly line-of-sight (LoS) connectivity to many ground users. However, practical deployments face heterogeneous and non-independently and identically distributed (non-IID) client data, which degrades accuracy and slows convergence. We propose a weighted attribute-based client selection strategy that leverages server-side indicators: historical traffic behavior, instantaneous channel quality, computational capability, and prior-round learning contribution. At each round, the HAPS computes a composite score and selects the top clients, while adapting attribute weights online based on their correlation with validation-loss improvement. We further provide theoretical justification that traffic-derived uniformity can serve as a proxy for latent data heterogeneity, enabling selection of client subsets with reduced expected non-IIDness. Simulations demonstrate improved test accuracy, faster convergence, and lower training loss compared with random, resource-only, and single-attribute baselines.
comment: Submitted to IEEE for possible publication
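The composite-score client selection and correlation-driven weight adaptation described above might be sketched as follows. The attribute columns (traffic behavior, channel quality, compute capability, past contribution), their normalization to [0, 1], and the update rule are assumptions based on the abstract:

```python
import numpy as np

def select_clients(attributes, weights, k):
    """Rank clients by a weighted composite of server-side indicators
    and pick the top-k. `attributes`: (n_clients, n_attrs) array with
    each column pre-normalized to [0, 1] (sketch of the selection rule)."""
    scores = attributes @ np.asarray(weights)
    return np.argsort(scores)[::-1][:k]

def adapt_weights(weights, attributes, loss_improvement, lr=0.1):
    """Nudge each attribute weight toward its correlation with
    per-client validation-loss improvement, then renormalize
    (illustrative online update, not the paper's exact formula)."""
    corr = np.array([np.corrcoef(attributes[:, j], loss_improvement)[0, 1]
                     for j in range(attributes.shape[1])])
    w = np.asarray(weights) + lr * corr
    return w / w.sum()
```

Each round the HAPS would call `select_clients`, observe the resulting validation-loss change, and call `adapt_weights` so that attributes most predictive of improvement gain influence.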
♻ ☆ PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation CVPR 2026
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we argue that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
comment: Accepted to CVPR 2026 Code: https://github.com/GasaiYU/PAM
♻ ☆ Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
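The z-score-based deviation diagnosis, measured against an athlete's own optimal control range rather than a uniform standard, can be sketched as below. Feature indexing and the threshold value are illustrative assumptions:

```python
import numpy as np

def diagnose_deviations(features, personal_mean, personal_std, z_thresh=2.0):
    """Flag kinematic features whose z-score relative to the athlete's
    own high-quality throws exceeds a threshold. Returns a dict mapping
    feature index to its z-score (sketch; the paper's hierarchical
    diagnosis logic adds further structure on top of this)."""
    z = (np.asarray(features) - np.asarray(personal_mean)) / np.asarray(personal_std)
    return {i: float(z[i]) for i in range(len(z)) if abs(z[i]) > z_thresh}
```

A flagged index would then be mapped to a targeted recommendation, e.g. abnormal elbow displacement or poor trunk stability.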
♻ ☆ Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
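One plausible reading of the MinMax Harmonic Mean is min-max normalization of each metric across candidate configurations, followed by a harmonic mean, which penalizes any single weak metric. The set of lower-is-better metrics and the exact normalization are assumptions, not the paper's definition:

```python
def mmhm(metric_table, lower_is_better=("fid",)):
    """Composite score sketch: min-max normalize each metric across
    candidate configurations (flipping lower-is-better metrics such
    as FID), then combine via the harmonic mean so that one poor
    metric drags the composite down."""
    names = list(next(iter(metric_table.values())).keys())
    lo = {n: min(row[n] for row in metric_table.values()) for n in names}
    hi = {n: max(row[n] for row in metric_table.values()) for n in names}
    eps = 1e-8
    scores = {}
    for cfg, row in metric_table.items():
        norm = []
        for n in names:
            span = (hi[n] - lo[n]) or 1.0
            v = (row[n] - lo[n]) / span
            norm.append(1.0 - v if n in lower_is_better else v)
        scores[cfg] = len(norm) / sum(1.0 / (v + eps) for v in norm)
    return scores
```

Under this reading, a configuration that trades a small FID gain for a large drop in CLIP alignment would score poorly, matching the paper's motivation for a composite proxy.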
♻ ☆ Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Diffusion models are powerful deep generative models, but unlike classical models, they lack an explicit low-dimensional latent space that parameterizes the data manifold. This absence makes it difficult to perform manifold-aware operations, such as geometrically faithful interpolation or conditional guidance that respects the learned manifold. We propose a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The key insight is that the spectral structure of this Jacobian separates tangent and normal directions of the data manifold; our metric leverages this separation to encourage paths to stay tangential to the manifold rather than drift toward high-density regions. To validate that our metric faithfully captures the manifold geometry, we examine it from two complementary angles. First, geodesics under our metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets. Second, the tangent-normal decomposition induced by our metric prevents classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
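The tangent-normal decomposition induced by the score Jacobian could be approximated by projecting onto its leading right singular subspace. This is a heavily simplified sketch: the choice of `k` and the SVD-based proxy are assumptions, and the paper's metric construction is richer:

```python
import numpy as np

def tangent_normal_split(jacobian, v, k):
    """Split direction `v` using the spectral structure of the score
    Jacobian: its top-k right singular vectors serve as a proxy for
    the data manifold's tangent space, the remainder as normal
    directions (illustrative; not the paper's full Riemannian metric)."""
    _, _, vt = np.linalg.svd(jacobian)
    basis = vt[:k]                   # (k, d) approximate tangent basis
    v_tan = basis.T @ (basis @ v)    # projection onto the tangent span
    return v_tan, v - v_tan
```

Guidance updates could then keep only `v_tan`, discouraging drift off the manifold toward high-density regions.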
♻ ☆ DeDelayed: Deleting Remote Inference Delay via On-Device Correction CVPR 2026
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at https://github.com/InterDigitalInc/dedelayed .
comment: CVPR 2026
♻ ☆ SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of $14,116$ dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate the two representative generative paradigms: StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs) under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID ($\approx 65.5$) and KID ($\approx 0.05$), while diffusion models generated higher variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with $8$-$15$\% absolute gains in melanoma F1-score, and ViT-B/16 achieving F1 $\approx 0.88$ and ROC-AUC $\approx 0.98$, representing an improvement of approximately $14\%$ over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
♻ ☆ Towards Faithful Reasoning in Comics for Small MLLMs
Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing CoT methods are poorly matched to this structure: they tend to force interpretation into a single reasoning path before multiple cues have been jointly considered, often degrading performance, especially for small MLLMs. Our key idea is to explicitly preserve multi-cue interpretation during supervision construction, rather than collapsing comic understanding into a single reasoning chain. To this end, we propose a two-stage framework for faithful comic reasoning in small MLLMs. First, we introduce MoCoT, a modular supervision construction framework that preserves multi-cue interpretation and turns it into more faithful supervision. Second, we propose VERA, a structured reward mechanism that turns such supervision into faithful reasoning behavior by aligning optimization with both reasoning faithfulness and answer correctness. Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the $\leq$ 4B regime, surpasses several 7B baselines, improves four small MLLMs by an average of $\mathbf{12.1\%}$ as a plug-in, and consistently enhances reasoning faithfulness while preserving inference efficiency.
♻ ☆ Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference
Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues-similar to human perceptual mechanisms-the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our evaluation highlights two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder-leveraging dense latent representations-outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
♻ ☆ Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
comment: Demo video url: https://jn00767.pages.surrey.ac.uk/catalogue-grounded-multimodal-attribution-for-museum-video/
♻ ☆ Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method
Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses and patch-level comparison with publicly available building height products indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
♻ ☆ Human-Centric Perception for Child Sexual Abuse Imagery
Law enforcement agencies and non-governmental organizations handling reports of Child Sexual Abuse Imagery (CSAI) are overwhelmed by large volumes of data, requiring the aid of automation tools. However, defining sexual abuse in images of children is inherently challenging, encompassing sexually explicit activities and hints of sexuality conveyed by the individual's pose, or their attire. CSAI classification methods often rely on black-box approaches, targeting broad and abstract concepts such as pornography. Thus, our work is an in-depth exploration of tasks from the literature on Human-Centric Perception, across the domains of safe images, adult pornography, and CSAI, focusing on targets that enable more objective and explainable pipelines for CSAI classification in the future. We introduce the Body-Keypoint-Part Dataset (BKPD), gathering images of people from varying age groups and sexual explicitness to approximate the domain of CSAI, along with manually curated hierarchically structured labels for skeletal keypoints and bounding boxes for person and body parts, including head, chest, hip, and hands. We propose two methods, namely BKP-Association and YOLO-BKP, for simultaneous pose estimation and detection, with targets associated per individual for a comprehensive decomposed representation of each person. Our methods are benchmarked on COCO-Keypoints and COCO-HumanParts, as well as our human-centric dataset, achieving competitive results with models that jointly perform all tasks. Cross-domain ablation studies on BKPD and a case study on RCPD highlight the challenges posed by sexually explicit domains. Our study addresses previously unexplored targets in the CSAI domain, paving the way for novel research opportunities.
comment: submitted to IEEE Transactions on Information Forensics and Security (TIFS)
♻ ☆ Seamless High-Resolution Terrain Reconstruction: A Prior-Based Vision Transformer Approach
High-resolution elevation data is essential for hydrological modeling, hazard assessment, and environmental monitoring; however, globally consistent, fine-scale Digital Elevation Models (DEMs) remain unavailable. Very high-resolution single-view imagery enables the extraction of topographic information at the pixel level, allowing the reconstruction of fine terrain details over large spatial extents. In this paper, we present a single-view-based DEM reconstruction approach shown to support practical analysis in GIS environments across multiple sub-national jurisdictions. Specifically, we produce high-resolution DEMs for large-scale basins, representing a substantial improvement over the 30 m resolution of globally available Shuttle Radar Topography Mission (SRTM) data. The DEMs are generated using a prior-based monocular depth estimation (MDE) foundation model, extended in this work to the remote sensing height domain for high-resolution, globally consistent elevation reconstruction. We fine-tune the model by integrating low-resolution SRTM data as a global prior with high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP), producing DEMs with near LiDAR-level accuracy. Our method achieves a 100x resolution enhancement (from 30 m to 30 cm), exceeding existing super-resolution approaches by an order of magnitude. Across two diverse landscapes, the model generalizes robustly, resolving fine-scale terrain features with a mean absolute error of less than 5 m relative to LiDAR and improving upon SRTM by up to 18%. Hydrological analyses at both catchment and hillslope scales confirm the method's utility for hazard assessment and environmental monitoring, demonstrating improved streamflow representation and catchment delineation. Finally, we demonstrate the scalability of the framework by applying it across large geographic regions.
♻ ☆ PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location-sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is available at https://github.com/mabo1215/PPEDCRF.git
comment: We would like to withdraw this paper due to identified issues in the experimental design and insufficient supporting data, which affect the reliability of the reported results. A substantially revised version with corrected experiments and extended evaluations will be prepared and submitted in the future
♻ ☆ SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address these limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity. Code and data are provided in the supplementary material.
comment: 10 pages, 7 figures
♻ ☆ EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer CVPR 2026
Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as "unknown": all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose the EW-DETR framework, which augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
comment: Accepted at CVPR 2026
♻ ☆ OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations
Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach which broadly falls within the positive-unlabelled (PU) learning paradigm and exploits tools from OOD detection techniques. Our framework learns only from sparsely annotated pixels from multiple positive-only classes and does not use any annotation for the background class. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as background in standard segmentation formulations. To the best of our knowledge, this work is the first to formulate multi-class segmentation with sparse positive-only annotations as a pixel-wise PU learning problem and to address it using OOD detection techniques. Here, we forgo the need for background annotation and consider these pixels together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metrics for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.
comment: Accepted in MedIA
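Since the framework can plug any classification-style OOD score in at the pixel level, the idea can be illustrated with the classic maximum-softmax-probability (MSP) baseline applied per pixel. This is a minimal sketch of one score the framework could integrate, with our own function name and threshold; it is not the paper's pipeline.

```python
import numpy as np

def pixelwise_msp_ood(logits, threshold=0.5):
    """Label each pixel with its argmax positive class, but flag pixels whose
    maximum softmax probability falls below `threshold` as OOD (-1), covering
    background and unseen classes. Illustrative MSP baseline, per pixel.

    logits: float array of shape (H, W, C) over the C positive classes.
    """
    # numerically stable softmax over the class axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    msp = probs.max(axis=-1)          # per-pixel confidence score
    pred = probs.argmax(axis=-1)      # per-pixel ID class prediction
    pred[msp < threshold] = -1        # -1 marks OOD / background pixels
    return pred
```

A confident pixel keeps its class label, while a pixel with a near-uniform class posterior is rejected as OOD rather than forced into a positive class.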
♻ ☆ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
comment: 10 pages, 6 figures
♻ ☆ CARE: Confidence-aware Ratio Estimation for Medical Biomarkers
Ratio-based biomarkers (RBBs), such as the proportion of necrotic tissue within a tumor, are widely used in clinical practice to support diagnosis, prognosis, and treatment planning. These biomarkers are typically estimated from segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. Our uncertainty analysis stems from two observations: (1) the probability ratio estimator inherently admits a statistical confidence interval regarding local randomness (bias and variance); (2) the segmentation network is not perfectly calibrated (calibration error). We perform a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of segmentation-derived RBBs in clinical workflows.
comment: 12 pages
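To make the ratio-biomarker setting concrete: a region-wise ratio such as the necrotic fraction is the share of tumor pixels also labelled necrotic, and even a simple pixel bootstrap attaches an interval to it. The sketch below is our own illustration of the estimation problem; the paper derives its intervals analytically from the estimator's bias/variance and the network's calibration error, not by bootstrapping.

```python
import numpy as np

def ratio_biomarker_ci(necrosis_mask, tumor_mask, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and bootstrap (1 - alpha) confidence interval for a
    ratio-based biomarker computed from binary segmentation masks, e.g. the
    necrotic fraction of a tumor. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    tumor = tumor_mask.astype(bool).ravel()
    necrosis = necrosis_mask.astype(bool).ravel() & tumor
    idx = np.flatnonzero(tumor)                 # pixels inside the region
    point = necrosis[idx].mean()                # region-wise ratio estimate
    boots = []
    for _ in range(n_boot):                     # resample pixels with replacement
        sample = rng.choice(idx, size=idx.size, replace=True)
        boots.append(necrosis[sample].mean())
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```

Note this pixel bootstrap captures only sampling variability; the abstract's point is precisely that segmentation miscalibration, which this sketch ignores, dominates the true uncertainty.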
♻ ☆ FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
♻ ☆ MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters CVPR 2026
We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
comment: CVPR 2026
♻ ☆ Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement
Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
♻ ☆ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.
comment: Published in Transactions on Machine Learning Research (03/2026)
♻ ☆ Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data
Contrastive self-supervised learning (CSSL) usually makes use of the multi-view assumption, which states that all relevant information must be shared between all views. The main objective of CSSL is to maximize the mutual information (MI) between representations of different views and at the same time compress irrelevant information in each representation. Recently, as part of future work, Shwartz-Ziv and LeCun pointed out that, when the multi-view assumption is violated, one of the most significant challenges in SSL is in identifying new methods to separate relevant from irrelevant information based on alternative assumptions. Taking a cue from this intuition, we make the following contributions in this paper: 1) We develop a CSSL framework wherein multiple images and multiple views (MIMV) are considered as input, which is different from the traditional multi-view assumption 2) We adopt a novel augmentation strategy that includes both normalized (invertible) and augmented (non-invertible) views so that complete information of one image can be preserved and hard augmentation can be chosen for the other image 3) An information bottleneck (IB) principle is outlined for MIMV to produce optimal representations 4) We introduce a loss function that helps to learn better representations by filtering out extreme features 5) The robustness of our proposed framework is established by applying it to the imbalanced dataset problem wherein we achieve a new state-of-the-art accuracy (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18 and 3% improvement in Imagenet-LT (1k) using Resnet-50).
♻ ☆ Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events CVPR
Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A 2D pixel-wise RGB CRF model is introduced to align the NeRF rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event CRF model is also designed to bridge the gap between physical scene dynamics and event sensor output. The two models are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method achieves state-of-the-art HDR and deblurring novel view synthesis results with single-exposure blurry LDR images and corresponding events.
comment: Accepted by the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026. Project Page: https://icvteam.github.io/See-NeRF.html. Our code and datasets are publicly available at https://github.com/iCVTEAM/See-NeRF
♻ ☆ KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity knowledge and strong generic grounding capabilities, they often fail to effectively utilize such knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions. To address this challenge, we propose a knowledge-aware training paradigm for KVG. Our approach first constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and then introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model's estimated knowledge mastery of different entities. To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities. Extensive experiments show that our approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. The data, codes, and models are released at https://github.com/thunlp/KARL.
♻ ☆ Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
♻ ☆ GPA-VGGT: Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss
Transformer-based general visual geometry frameworks, most recently the Visual Geometry Grounded Transformer (VGGT), have shown promising performance in camera pose estimation, 3D reconstruction, and scene understanding. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
♻ ☆ Learning to Translate Noise for Robust Image Denoising CVPR 2026
Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content, through a noise translation network. The translated noisy images are then processed by an image denoising network pretrained to effectively remove Gaussian noise, enabling robust and consistent denoising performance. We also design well-motivated loss functions and architectures for the noise translation network by leveraging the mathematical properties of Gaussian noise. Experimental results demonstrate that the proposed method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visualized denoising results and the source code are available on our project page.
comment: Project page: https://hij1112.github.io/learning-to-translate-noise/ Accepted to CVPR 2026 Findings
♻ ☆ SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
♻ ☆ ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
♻ ☆ GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the number of proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
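At its core, Self-Consistency Consolidation reduces to a support-count filter over repeated samplings: a proposal survives only if enough independent VLM runs produced it. A minimal sketch, with exact string matching standing in for whatever proposal matching the full pipeline uses:

```python
from collections import Counter

def self_consistency_consolidation(samplings, min_support=2):
    """Keep only anomaly proposals that recur in at least `min_support` of
    the independent VLM samplings; one-off proposals are treated as likely
    hallucinations and dropped. `samplings` is a list of proposal lists,
    one per sampling run."""
    # count each proposal at most once per sampling run
    counts = Counter(p for run in samplings for p in set(run))
    return [p for p, c in counts.items() if c >= min_support]
```

Raising `min_support` trades recall for precision, which matches the controllable precision-recall tradeoff the ablations report for SCC.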
♻ ☆ When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Vision-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
♻ ☆ Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct photorealistic images from novel viewpoints given a set of posed images. However, reconstruction quality degrades sharply under sparse-view conditions due to insufficient geometric cues. Existing methods, including Neural Radiance Fields (NeRF) and more recent 3D Gaussian Splatting (3DGS), often exhibit blurred details and structural artifacts when trained from sparse observations. Recent works have identified rendered depth quality as a key factor in mitigating these artifacts, as it directly affects geometric accuracy and view consistency. However, effectively leveraging depth under sparse views remains challenging. Depth priors can be noisy or misaligned with rendered geometry, and single-scale supervision often fails to capture both global structure and fine details. To address these challenges, we introduce Hierarchical Depth-Guided Splatting (HDGS), a depth supervision framework that progressively refines geometry from coarse to fine levels. Central to HDGS is our novel Cascade Pearson Correlation Loss (CPCL), which enforces consistency between rendered and estimated depth priors across multiple spatial scales. By enforcing multi-scale depth consistency, our method improves structural fidelity in sparse-view reconstruction. Experiments on LLFF and DTU demonstrate state-of-the-art performance under sparse-view settings.
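At its core, the Cascade Pearson Correlation Loss averages a scale- and shift-invariant correlation loss between rendered depth and the depth prior over several resolutions. A minimal NumPy sketch under stated assumptions (naive strided downsampling and the scale set `(1, 2, 4)` are placeholders; the paper's multi-scale construction may differ):

```python
import numpy as np

def pearson_loss(a, b, eps=1e-8):
    """1 - Pearson correlation between two flattened depth maps.
    Invariant to affine rescaling, so it tolerates misaligned priors."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def cascade_depth_loss(rendered, prior, scales=(1, 2, 4)):
    """Average the correlation loss over progressively downsampled maps,
    enforcing coarse-to-fine depth consistency."""
    total = 0.0
    for s in scales:
        total += pearson_loss(rendered[::s, ::s], prior[::s, ::s])
    return total / len(scales)

depth = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
# An affine transform of the same depth correlates perfectly at every scale.
print(cascade_depth_loss(depth, 3.0 * depth + 0.5))  # ~0
```

The affine invariance is the point: a monocular depth prior with unknown scale and offset still supplies a usable supervision signal.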
♻ ☆ PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis CVPR 2026
We introduce PhysGaia, a novel physics-aware benchmark for Dynamic Novel View Synthesis (DyNVS) that encompasses both structured objects and unstructured physical phenomena. While existing datasets primarily focus on photorealistic appearance, PhysGaia is specifically designed to support physics-consistent dynamic reconstruction. Our benchmark features complex scenarios with rich multi-body interactions, where objects realistically collide and exchange forces. Furthermore, it incorporates a diverse range of materials, including liquids, gases, textiles, and rheological substances, moving beyond the rigid-body assumptions prevalent in prior work. To ensure physical fidelity, all scenes in PhysGaia are generated using material-specific physics solvers that strictly adhere to fundamental physical laws. We provide comprehensive ground-truth information, including 3D particle trajectories and physical parameters (e.g., viscosity), enabling the quantitative evaluation of physical modeling. To facilitate research adoption, we also provide integration pipelines for recent 4D Gaussian Splatting models along with our dataset and their results. By addressing the critical shortage of physics-aware benchmarks, PhysGaia can significantly advance research in dynamic view synthesis, physics-based scene understanding, and the integration of deep learning with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.
comment: Accepted at CVPR 2026; Project page: http://cvlab.snu.ac.kr/research/PhysGaia; Dataset: https://huggingface.co/datasets/mijeongkim/PhysGaia/tree/main
♻ ☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 12 pages
♻ ☆ Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Recent advances in generative modeling have enabled perceptual video compression at ultra-low bitrates, yet existing methods predominantly treat the generative model as a refinement or reconstruction module attached to a separately designed codec backbone. We propose Generative Video Codebook Codec (GVCC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with autoregressive GOP chaining, tail latent residual correction, and adaptive atom allocation; Text-to-Video (T2V) operating at near-zero side information as a pure generative prior; and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVCC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
comment: 9 pages, 3 figures
♻ ☆ ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Real-world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high-quality (HQ) images directly tied to the low-quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
♻ ☆ Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
Machine Learning 150
☆ ActionParty: Multi-Subject Action Binding in Generative Video Games
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
comment: Project page: https://action-party.github.io/
☆ Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
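The degenerate-subspace diagnosis can be illustrated numerically: mean initialization places every new token at the identical point (zero inter-token distance), while grounding each new token in the existing tokens it is paired with preserves inter-token structure. A toy sketch with made-up dimensions and random pairings, not the paper's GTI procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.standard_normal((1000, 64))   # stand-in pretrained embedding table

# Mean initialization: every new token starts at the same point,
# so all pairwise distances between new tokens are exactly zero.
mean_init = np.tile(vocab.mean(axis=0), (8, 1))

# Grounded initialization (sketch): map each new token to the mean of
# the existing tokens it is linguistically paired with (random pairs here).
pairs = [rng.choice(1000, size=3, replace=False) for _ in range(8)]
grounded = np.stack([vocab[idx].mean(axis=0) for idx in pairs])
dists = np.linalg.norm(grounded[:, None] - grounded[None, :], axis=-1)
print(dists.max())  # grounded tokens retain non-trivial inter-token distances
```

Fine-tuning must then recover the distinctions that mean initialization erased, which is the bottleneck the abstract identifies.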
☆ Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results show that BCR is practical and that simple structural incentives unlock latent high-density reasoning in LLMs.
comment: 43 pages, 5 figures, 24 tables
☆ Topological Effects in Neural Network Field Theory
Neural network field theory formulates field theory as a statistical ensemble of fields defined by a network architecture and a density on its parameters. We extend the construction to topological settings via the inclusion of discrete parameters that label the topological quantum number. We recover the Berezinskii--Kosterlitz--Thouless transition, including the spin-wave critical line and the proliferation of vortices at high temperatures. We also verify the T-duality of the bosonic string, showing invariance under the exchange of momentum and winding on $S^1$, the transformation of the sigma model couplings according to the Buscher rules on constant toroidal backgrounds, the enhancement of the current algebra at self-dual radius, and non-geometric T-fold transition functions.
comment: 55 pages, 8 figures
☆ go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
comment: 29 pages, 30 figures, 9 tables. Includes supplementary material
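The key fact behind this parameterization is that the elementwise square of any orthogonal matrix is doubly stochastic (an orthostochastic matrix), hence a point in the Birkhoff polytope. A quick numerical check of that core property only; it does not reproduce the paper's $\mathcal{O}(d^3)$ construction or the interpolation hyperparameter $s$:

```python
import numpy as np

def orthostochastic(d, seed=0):
    """Elementwise square of an orthogonal matrix. Since rows and columns
    of an orthogonal O are unit vectors, O**2 has row and column sums of 1
    and non-negative entries: it lies in the Birkhoff polytope."""
    O, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return O**2

B = orthostochastic(5)
print(B.sum(axis=0), B.sum(axis=1))  # all (approximately) ones
```

Not every doubly stochastic matrix arises this way, which is why the paper works with *generalized* orthostochastic matrices to cover the full polytope.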
☆ Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max-centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits, and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for AMD AI Engines that significantly exceeds the speed of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
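The surrogate's shape can be sketched in floating point before any hardware mapping. Here `alpha` and `beta` stand in for the per-head calibration parameters the paper fits offline; their exact functional role here is an assumption, not the paper's formulation:

```python
import numpy as np

def hccs(logits, alpha=1.0, beta=1.0, eps=1e-9):
    """Clipped-linear softmax surrogate (sketch): max-center the logits,
    clip a linear map of them to [0, beta], then normalize. The result is
    non-negative, sums to ~1, and preserves the ordering of the logits."""
    z = logits - logits.max(axis=-1, keepdims=True)   # max-centering
    w = np.clip(alpha * z + beta, 0.0, beta)          # clipped-linear map
    return w / (w.sum(axis=-1, keepdims=True) + eps)

p = hccs(np.array([2.0, 1.0, -5.0]))
print(p)  # a valid, order-preserving attention distribution, no exp() needed
```

Because the map is a shift, a clip, and a multiply, it lands naturally on integer multiply-accumulate units, which is the motivation for targeting int8 hardware.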
☆ Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
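The routing rule and the entropy-aware weighting are simple to state in code. A sketch under stated assumptions: rollout correctness is a boolean, the exponential weighting form is assumed rather than taken from the paper, and the GRPO/SDPO loss computations themselves are elided:

```python
import math

def srpo_route(rollouts):
    """Route rollouts by correctness: correct samples go to GRPO's
    reward-aligned update, failed samples to SDPO's logit-level
    self-distillation. Avoids distilling on already-correct samples."""
    grpo_batch = [r for r in rollouts if r["correct"]]
    sdpo_batch = [r for r in rollouts if not r["correct"]]
    return grpo_batch, sdpo_batch

def distill_weight(token_entropy, tau=1.0):
    """Entropy-aware weighting (assumed exponential form): suppress
    high-entropy, unreliable self-teacher targets."""
    return math.exp(-token_entropy / tau)

grpo, sdpo = srpo_route([{"id": 0, "correct": True},
                         {"id": 1, "correct": False},
                         {"id": 2, "correct": True}])
print([r["id"] for r in grpo], [r["id"] for r in sdpo])  # [0, 2] [1]
```

The split directly encodes the paper's diagnosis: distillation on correct samples is ambiguous, so those samples are handled by the reward-aligned branch instead.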
☆ De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
☆ Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
comment: 39 pages, 13 figures. Code available at: https://github.com/joshrosie/crystalite
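The periodic minimum-image pair geometry that GEM injects into attention rests on the standard minimum-image convention; for a cubic unit cell in fractional coordinates it is one line. A sketch of the convention only, not of the GEM module itself:

```python
import numpy as np

def min_image_deltas(frac_coords):
    """Minimum-image displacement vectors in fractional coordinates for a
    periodic unit cell: wrap each component into [-0.5, 0.5) by shifting
    by whole lattice vectors."""
    d = frac_coords[:, None, :] - frac_coords[None, :, :]
    return d - np.round(d)

# Two atoms near opposite faces of the cell are actually close neighbors.
x = np.array([[0.05, 0.5, 0.5], [0.95, 0.5, 0.5]])
d = min_image_deltas(x)
print(np.linalg.norm(d[0, 1]))  # ~0.1, not 0.9
```

Feeding such periodicity-aware pair distances into attention biases is what distinguishes crystal models from molecule models, where no wrapping is needed.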
☆ SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
☆ Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
comment: 15 pages, 5 figues, 2 tables. This work has been submitted to the IEEE for possible publication
☆ Best-Arm Identification with Noisy Actuation
In this paper, we consider a multi-armed bandit (MAB) instance and study how to identify the best arm when arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC). Depending on the agent capabilities, we provide communication schemes along with their analysis, which interestingly relate to the zero-error capacity of the underlying DMC.
☆ Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
Understanding causal dependencies in observational data is critical for informing decision-making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG-GNN, often face issues with scalability and stability in high-dimensional data, especially when there is a feature-sample imbalance. Here, we show that the denoising score matching objective of diffusion models can smooth the gradients for faster, more stable convergence. We also propose an adaptive k-hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD utilizes the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCD on synthetic benchmarking data. We also show that our methods are practically useful by conducting qualitative analyses on two real-world examples. Code is available at https://github.com/haozhu233/ddcd.
comment: To appear in the Proceedings of the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)
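The idea behind a k-hop acyclicity constraint can be sketched with matrix powers: the trace of the p-th power of a non-negative adjacency matrix counts weighted length-p cycles, so summing traces up to k penalizes short cycles without the matrix exponential or inversion used by NOTEARS-style penalties. This is a truncated illustration of the principle; the paper's adaptive k-hop constraint may differ:

```python
import numpy as np

def khop_acyclicity(W, k=3):
    """Sum of tr((W*W)^p) for p = 1..k. Zero for a DAG (no cycles of any
    length); positive whenever a cycle of length <= k exists."""
    A = W * W                       # non-negative weighted adjacency
    total, P = 0.0, np.eye(len(W))
    for _ in range(k):
        P = P @ A                   # walks one hop longer each iteration
        total += np.trace(P)        # trace counts closed walks (cycles)
    return total

dag = np.array([[0.0, 1.0], [0.0, 0.0]])   # 0 -> 1, acyclic
cyc = np.array([[0.0, 1.0], [1.0, 0.0]])   # 0 <-> 1, a 2-cycle
print(khop_acyclicity(dag), khop_acyclicity(cyc))  # 0.0 2.0
```

Each step is a matrix multiply, so the penalty and its gradient avoid the cubic-cost inversions of exact acyclicity characterizations.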
☆ BVFLMSP : Bayesian Vertical Federated Learning for Multimodal Survival with Privacy
Multimodal time-to-event prediction often requires integrating sensitive data distributed across multiple parties, making centralized model training impractical due to privacy constraints. At the same time, most existing multimodal survival models produce single deterministic predictions without indicating how confident the model is in its estimates, which can limit their reliability in real-world decision making. To address these challenges, we propose BVFLMSP, a Bayesian Vertical Federated Learning (VFL) framework for multimodal time-to-event analysis based on a Split Neural Network architecture. In BVFLMSP, each client independently models a specific data modality using a Bayesian neural network, while a central server aggregates intermediate representations to perform survival risk prediction. To enhance privacy, we integrate differential privacy mechanisms by perturbing client-side representations before transmission, providing formal privacy guarantees against information leakage during federated training. We first evaluate our Bayesian multimodal survival model against widely used single-modality survival baselines and the centralized multimodal baseline MultiSurv. Across multimodal settings, the proposed method shows consistent improvements in discrimination performance, with up to 0.02 higher C-index compared to MultiSurv. We then compare federated and centralized learning under varying privacy budgets across different modality combinations, highlighting the tradeoff between predictive performance and privacy. Experimental results show that BVFLMSP effectively integrates multimodal data, improves survival prediction over existing baselines, and remains robust under strict privacy constraints while providing uncertainty estimates.
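Perturbing a client-side representation before transmission is typically done with a Gaussian mechanism: clip the representation's norm to bound sensitivity, then add calibrated noise. A sketch of that standard mechanism, assuming the textbook (epsilon, delta) calibration rather than the paper's exact scheme:

```python
import numpy as np

def privatize(rep, clip=1.0, epsilon=1.0, delta=1e-5, rng=None):
    """Gaussian mechanism on a client representation: clip the L2 norm to
    `clip` (bounding sensitivity), then add noise with sigma from the
    standard (epsilon, delta) Gaussian-mechanism bound."""
    rng = rng if rng is not None else np.random.default_rng(0)
    rep = rep * min(1.0, clip / (np.linalg.norm(rep) + 1e-12))
    sigma = clip * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return rep + rng.normal(0.0, sigma, size=rep.shape)

z = privatize(np.ones(16) * 5.0)   # noisy, norm-clipped vector for the server
print(z.shape)
```

Smaller `epsilon` (a stricter privacy budget) inflates `sigma`, which is exactly the privacy-utility tradeoff the experiments sweep over.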
☆ (PAC-)Learning state machines from data streams: A generic strategy and an improved heuristic (Extended version)
This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof of PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix into a full section. State machine models simulate the behavior of discrete-event systems; they can represent systems such as software systems, network interactions, and control systems, and have been researched extensively. Most learning algorithms, however, assume that all data are available at the start of the algorithm, and little research has been done on learning state machines from streaming data. In this paper, we aim to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state-merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and present a theoretical improvement to run-time that does not sacrifice the correctness of the algorithm for larger sample sizes.
comment: Extended version of Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco
☆ When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning
Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
comment: In Proceedings of International Joint Conference on Neural Networks (IJCNN)
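The gating rule ASK describes, acting with the trained policy unless MC-Dropout disagreement crosses a threshold and only then querying the LM, can be sketched as below. All names (`mc_dropout_uncertainty`, `select_action`) and the toy stand-in policies are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_uncertainty(q_forward, state, n_samples=20):
    # Run several stochastic forward passes (MC Dropout keeps dropout
    # active at inference) and measure disagreement over the argmax action.
    actions = [int(np.argmax(q_forward(state))) for _ in range(n_samples)]
    return 1.0 - np.bincount(actions).max() / n_samples

def select_action(q_forward, lm_suggest, state, threshold=0.3):
    # Follow the RL policy when it is confident; defer to the LM otherwise.
    if mc_dropout_uncertainty(q_forward, state) > threshold:
        return lm_suggest(state)
    return int(np.argmax(q_forward(state)))

# Toy stand-ins: a confident (deterministic) policy, and a noisy one
# mimicking dropout-induced disagreement in an out-of-distribution state.
confident = lambda s: np.array([0.1, 0.9, 0.0])
noisy = lambda s: rng.random(3)
lm = lambda s: 2  # hypothetical LM suggestion
```

The selective query is the point: the cheap policy runs every step, and the LM is only consulted when the dropout ensemble disagrees.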
☆ Universal Hypernetworks for Arbitrary Models
Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emph{Universal Hypernetwork} (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.
☆ LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
comment: 10 pages, 6 figures
☆ On the Role of Depth in the Expressivity of RNNs
The benefits of depth in feedforward neural networks are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs' memory capacity efficiently with respect to the number of parameters, thus enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We broaden our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.
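The contrast the abstract draws, between an RNN step that stays linear without activations and a 2RNN step whose multiplicative interaction raises polynomial degree with depth, can be illustrated with a minimal NumPy sketch (shapes and names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def linear_rnn_step(W, U, h, x):
    # Plain RNN without a nonlinearity: remains linear at any depth.
    return W @ h + U @ x

def second_order_step(A, h, x):
    # 2RNN: multiplicative interaction h'[i] = sum_{j,k} A[i,j,k] h[j] x[k].
    return np.einsum('ijk,j,k->i', A, h, x)

d_h, d_x = 4, 3
A = rng.normal(size=(d_h, d_h, d_x))
h0 = rng.normal(size=d_h)
x = rng.normal(size=d_x)

# Stacking 2RNN steps raises the polynomial degree in the input:
# one step is degree 1 in x, two steps are degree 2, and so on.
h1 = second_order_step(A, h0, x)
h2 = second_order_step(A, h1, x)
```

Scaling the input by c scales a one-step output by c but a two-step output by c², which is exactly the depth-dependent degree growth the paper formalizes.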
☆ From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards. Current EASA guidelines mandate demonstrating complete coverage of the AI/ML constituent's Operational Design Domain (ODD) -- a requirement that demands proof that no critical gaps exist within defined operational boundaries. However, as systems operate within high-dimensional parameter spaces, existing methods struggle to provide the scalability and formal grounding necessary to satisfy the completeness criterion. Currently, no standardized engineering method exists to bridge the gap between abstract ODD definitions and verifiable evidence. This paper addresses this void by proposing a method that integrates parameter discretization, constraint-based filtering, and criticality-based dimension reduction into a structured, multi-step ODD coverage verification process. Grounded in simulation data gathered in prior research on AI-based mid-air collision avoidance, this work demonstrates a systematic engineering approach to defining and achieving coverage metrics that satisfy EASA's demand for completeness. Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA's standards.
☆ Computing the Exact Pareto Front in Average-Cost Multi-Objective Markov Decision Processes
Many communication and control problems are cast as multi-objective Markov decision processes (MOMDPs). The complete solution to an MOMDP is the Pareto front. Much of the literature approximates this front via scalarization into single-objective MDPs. Recent work has begun to characterize the full front in discounted or simple bi-objective settings by exploiting its geometry. In this work, we characterize the exact front in average-cost MOMDPs. We show that the front is a continuous, piecewise-linear surface lying on the boundary of a convex polytope. Each vertex corresponds to a deterministic policy, and adjacent vertices differ in exactly one state. Each edge is realized as a convex combination of the policies at its endpoints, with the mixing coefficient given in closed form. We apply these results to a remote state estimation problem, where each vertex on the front corresponds to a threshold policy. The exact Pareto front and solutions to certain non-convex MDPs can be obtained without explicitly solving any MDP.
☆ Neural network methods for two-dimensional finite-source reflector design
We address the inverse problem of designing two-dimensional reflectors that transform light from a finite, extended source into a prescribed far-field distribution. We propose a neural network parameterization of the reflector height and develop two differentiable objective functions: (i) a direct change-of-variables loss that pushes the source distribution through the learned inverse mapping, and (ii) a mesh-based loss that maps a target-space grid back to the source, integrates over intersections, and remains continuous even when the source is discontinuous. Gradients are obtained via automatic differentiation and optimized with a robust quasi-Newton method. As a comparison, we formulate a deconvolution baseline built on a simplified finite-source approximation: a 1D monotone mapping is recovered from flux balance, yielding an ordinary differential equation solved in integrating-factor form; this solver is embedded in a modified Van Cittert iteration with nonnegativity clipping and a ray-traced forward operator. Across four benchmarks -- continuous and discontinuous sources, and with/without minimum-height constraints -- we evaluate accuracy by ray-traced normalized mean absolute error (NMAE). Our neural network approach converges faster and achieves consistently lower NMAE than the deconvolution method, and handles height constraints naturally. We discuss how the method may be extended to rotationally symmetric and full three-dimensional settings via iterative correction schemes.
comment: 20 pages, 10 figures, 1 table. Submitted to Machine Learning: Science and Technology
☆ The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
☆ Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
comment: 8 pages
☆ A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter at NERSC, an HPE Cray EX system based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: 1) using only the Slurm accounting logs as training data and 2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU-compute and memory activity metrics exhibit good correlations with average power utilization, and our runtime power usage prediction experiments result in up to 92% prediction accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.
comment: 9 pages, 6 figures
☆ AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
comment: 9 pages, 2 figures
☆ Auction-Based Online Policy Adaptation for Evolving Objectives
We consider multi-objective reinforcement learning problems where objectives come from an identical family -- such as the class of reachability objectives -- and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
comment: 17 pages, 6 figures
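The winner-takes-control auction can be sketched in a few lines; the state fields, bid functions, and policy names below are invented for illustration, not taken from the paper:

```python
def auction_step(policies, state):
    # Each selfish local policy submits a bid reflecting the urgency of
    # the current state; the highest bidder's action is executed.
    winner = max(policies, key=lambda name: policies[name][0](state))
    _, act_fn = policies[winner]
    return winner, act_fn(state)

# Toy objectives: reaching a goal vs. avoiding a nearby hazard.
state = {"dist_goal": 5.0, "dist_hazard": 0.5}
policies = {
    "reach_goal":   (lambda s: s["dist_goal"] / 10.0,          lambda s: "move_to_goal"),
    "avoid_hazard": (lambda s: 1.0 / (s["dist_hazard"] + 0.1), lambda s: "retreat"),
}
winner, action = auction_step(policies, state)
```

Adding or removing an objective is then just adding or removing an entry in `policies`, which is the runtime adaptation mechanism the abstract describes.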
☆ AEGIS: Adversarial Entropy-Guided Immune System -- Thermodynamic State Space Models for Zero-Day Network Evasion Detection
As TLS 1.3 encryption limits traditional Deep Packet Inspection (DPI), the security community has pivoted to Euclidean Transformer-based classifiers (e.g., ET-BERT) for encrypted traffic analysis. However, these models remain vulnerable to byte-level adversarial morphing -- recent pre-padding attacks reduced ET-BERT accuracy to 25.68%, while VLESS Reality bypasses certificate-based detection entirely. We introduce AEGIS: an Adversarial Entropy-Guided Immune System powered by a Thermodynamic Variance-Guided Hyperbolic Liquid State Space Model (TVD-HL-SSM). Rather than competing in the Euclidean payload-reading domain, AEGIS discards payload bytes in favor of 6-dimensional continuous-time flow physics projected into a non-Euclidean Poincaré manifold. Liquid Time-Constants measure microsecond IAT decay, and a Thermodynamic Variance Detector computes sequence-wide Shannon Entropy to expose automated C2 tunnel anomalies. A pure C++ eBPF Harvester with zero-copy IPC bypasses the Python GIL, enabling a linear-time O(N) Mamba-3 core to process 64,000-packet swarms at line-rate. Evaluated on a 400GB, 4-tier adversarial corpus spanning backbone traffic, IoT botnets, zero-days, and proprietary VLESS Reality tunnels, AEGIS achieves an F1-score of 0.9952 and 99.50% True Positive Rate at 262 µs inference latency on an RTX 4090, establishing a new state-of-the-art for physics-based adversarial network defense.
comment: 10 pages, 3 figures, 3 tables
☆ Application of parametric Shallow Recurrent Decoder Network to magnetohydrodynamic flows in liquid metal blankets of fusion reactors
Magnetohydrodynamic (MHD) phenomena play a pivotal role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts employed in reactor blankets) interact with magnetic fields of varying intensity and orientation, influencing the resulting flow dynamics. The numerical solution of MHD models entails the resolution of highly nonlinear, multiphysics systems of equations, which can become computationally demanding, particularly in multi-query, parametric, or real-time contexts. This study investigates a fully data-driven framework for MHD state reconstruction that integrates dimensionality reduction through Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to reconstruct the full spatio-temporal state from sparse time-series measurements of selected observables, including previously unseen parametric configurations. The SHRED methodology is applied to a three-dimensional geometry representative of a portion of a WCLL blanket cell, in which lead-lithium flows around a water-cooled tube. Multiple magnetic field configurations are examined, including constant toroidal fields, combined toroidal-poloidal fields, and time-dependent magnetic fields. Across all considered scenarios, SHRED achieves high reconstruction accuracy, robustness, and generalization to magnetic field intensities, orientations, and temporal evolutions not seen during training. Notably, in the presence of time-varying magnetic fields, the model accurately infers the temporal evolution of the magnetic field itself using temperature measurements alone. Overall, the findings identify SHRED as a computationally efficient, data-driven, and flexible approach for MHD state reconstruction, with significant potential for real-time monitoring, diagnostics and control in fusion reactor systems.
☆ Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
comment: 8 pages, 4 figures, 2 tables
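A minimal sketch of such a hybrid decision rule follows; the thresholds, names, and the replacement of the paper's LSTM forecaster by a given `predicted_load` value are all assumptions for illustration:

```python
import math

def provision_vms(predicted_load, current_load, capacity_per_vm,
                  spike_factor=1.5, headroom=1.2):
    # Predictive path: size the fleet from the ML forecast plus headroom.
    # Reactive path: if the observed load spikes far past the forecast,
    # fall back to a fast heuristic so scaling is not delayed by the model.
    if current_load > spike_factor * predicted_load:
        target = current_load * headroom
    else:
        target = predicted_load * headroom
    return max(1, math.ceil(target / capacity_per_vm))
```

The trade-off the paper targets is visible here: the predictive branch avoids over-provisioning for steady load, while the reactive branch bounds the response time to a spike.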
☆ Gradient estimators for parameter inference in discrete stochastic kinetic models
Stochastic kinetic models are ubiquitous in physics, yet inferring their parameters from experimental data remains challenging. In deterministic models, parameter inference often relies on gradients, as they can be obtained efficiently through automatic differentiation. However, these tools cannot be directly applied to stochastic simulation algorithms (SSA) such as the Gillespie algorithm, since sampling from a discrete set of reactions introduces non-differentiable operations. In this work, we adopt three gradient estimators from machine learning for the Gillespie SSA: the Gumbel-Softmax Straight-Through (GS-ST) estimator, the Score Function estimator, and the Alternative Path estimator. We compare the properties of all estimators in two representative systems exhibiting relaxation or oscillatory dynamics, where the latter requires gradient estimation of time-dependent objective functions. We find that the GS-ST estimator mostly yields well-behaved gradient estimates, but exhibits diverging variance in challenging parameter regimes, resulting in unsuccessful parameter inference. In these cases, the other estimators provide more robust, lower variance gradients. Our results demonstrate that gradient-based parameter inference can be integrated effectively with the Gillespie SSA, with different estimators offering complementary advantages.
comment: 13 pages, 6 figures
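The GS-ST estimator's forward pass can be sketched in NumPy; a real implementation would live in an autodiff framework so the gradient flows through the soft relaxation, which is only indicated in comments here:

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_softmax_st(logits, tau=0.5):
    # Perturb logits with Gumbel(0, 1) noise and relax with a softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y_soft = np.exp((logits + g) / tau)
    y_soft = y_soft / y_soft.sum()
    # Straight-through: the forward pass emits a hard one-hot choice
    # (the SSA needs a discrete reaction); the backward pass would use
    # the gradient of y_soft instead of the non-differentiable argmax.
    y_hard = np.eye(len(logits))[np.argmax(y_soft)]
    return y_hard, y_soft

propensities = np.array([0.1, 2.0, 0.5])   # toy reaction rates
hard, soft = gumbel_softmax_st(np.log(propensities))
```

Lowering `tau` makes the soft sample approach the hard one-hot, which trades gradient bias against the variance issues the paper reports in challenging regimes.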
☆ AA-SVD : Anchored and Adaptive SVD for Large Language Model Compression
We introduce a fast low-rank factorization-based framework for compressing large language models that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on large language models show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.
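As background for the factorization step, the plain truncated-SVD compression of a single weight matrix (the baseline that AA-SVD refines; the output anchoring and input-shift modeling are omitted here) might look like:

```python
import numpy as np

rng = np.random.default_rng(3)

def svd_compress(W, rank):
    # Replace W (out x in) with two factors A (out x rank) and
    # B (rank x in); the compressed layer then computes A @ (B @ x),
    # cutting parameters whenever rank < out*in / (out + in).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

# A weight matrix that is exactly rank 8, so rank-8 truncation is lossless.
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
A, B = svd_compress(W, rank=8)
```

Real weight matrices are only approximately low-rank, which is why the paper's calibration against activation statistics and block-level refinement matter at aggressive compression ratios.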
☆ Cross-Modal Visuo-Tactile Object Perception
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical properties estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust and physically consistent cross-modal integration for robotic multi-sensory perception.
comment: 23 pages, 8 figures, 1 table. Submitted for review to journal
☆ CASHG: Context-Aware Stylized Online Handwriting Generation
Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
comment: 42 pages, 19 figures
☆ Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
comment: Submitted to Interspeech 2026; 6 pages, 4 figures
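The ABX score itself is simple to compute; the sketch below uses cosine distance on fixed-size vectors, whereas the usual speech pipeline aligns variable-length frame sequences (e.g., with DTW) before averaging distances:

```python
import numpy as np

def abx_score(a_reps, b_reps, x_reps):
    # X belongs to the same category as A. Score the fraction of
    # (a, b, x) triples with d(x, a) < d(x, b): 0.5 is chance,
    # 1.0 is perfect discrimination of the contrast.
    def dist(u, v):
        return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    wins = total = 0
    for x in x_reps:
        for a in a_reps:
            for b in b_reps:
                wins += dist(x, a) < dist(x, b)
                total += 1
    return wins / total

# Toy 2-D "representations": category A clusters on one axis, B on the other.
A_cat = np.array([[1.0, 0.1], [1.0, -0.1]])
B_cat = np.array([[0.1, 1.0], [-0.1, 1.0]])
X_cat = np.array([[0.9, 0.05], [1.1, 0.0]])
score = abx_score(A_cat, B_cat, X_cat)
```

Because the score only needs a handful of matched examples per contrast and no classifier training, it transfers directly to the low-resource prosodic setting the paper targets.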
☆ LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
☆ Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR 2026
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
comment: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
☆ Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
comment: 10 pages, 5 tables, 1 figure, 1 algorithm. Code: https://github.com/RightNow-AI/ouroboros
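The input-conditioned diagonal LoRA modulation described above can be sketched roughly in numpy (a toy analogue, not the released implementation; the tanh controller, shapes, and scalar gate are simplifying assumptions, with the gate bias chosen so initial retention is 88% as the abstract states):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 4

# Frozen LoRA bases taken from the SVD of a base weight (sketch of SVD init).
W0 = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
U, S, Vt = np.linalg.svd(W0)
B_frozen = U[:, :rank]            # (d_model, rank)
A_frozen = Vt[:rank, :]           # (rank, d_model)

# Tiny "controller" producing a per-step diagonal modulation vector.
W_ctrl = rng.normal(size=(d_model, rank)) * 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gate bias chosen so the initial retention is exactly 88%.
gate_bias = np.log(0.88 / 0.12)

def recurrent_step(h):
    d = np.tanh(h @ W_ctrl)                      # input-conditioned diagonal
    delta_W = B_frozen @ np.diag(d) @ A_frozen   # modulated low-rank update
    update = h @ (W0 + delta_W).T
    g = sigmoid(gate_bias)                       # scalar gate (sketch)
    return g * h + (1.0 - g) * update            # gated recurrence

h = rng.normal(size=d_model)
for _ in range(4):                               # four recurrence steps
    h = recurrent_step(h)
```

Because the diagonal depends on the current hidden state, each recurrence step applies a different effective weight, which is the property the static-weight recursive block lacks.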
☆ Reinforcement Learning for Speculative Trading under Exploratory Framework
We study a speculative trading problem within the exploratory reinforcement learning (RL) framework of Wang et al. [2020]. The problem is formulated as a sequential optimal stopping problem over entry and exit times under a general utility function and price process. We first consider a relaxed version of the problem in which the stopping times are modeled by the jump times of Cox processes driven by bounded, non-randomized intensity controls. Under the exploratory formulation, the agent's randomized control is characterized via the probability measure over the jump intensities, and the objective function is regularized by Shannon's differential entropy. This yields a system of exploratory HJB equations, with closed-form Gibbs distributions as the optimal policy. Error estimates and convergence of the RL objective to the value function of the original problem are established. Finally, an RL algorithm is designed, and its implementation is showcased in a pairs-trading application.
comment: 37 pages, 14 figures
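The closed-form Gibbs optimum mentioned above can be checked in a discrete toy analogue (the paper works with differential entropy in continuous time; this sketch uses Shannon entropy over four hypothetical intensity levels, where maximizing $E_p[r] + \tau H(p)$ over the simplex gives $p \propto e^{r/\tau}$):

```python
import numpy as np

def gibbs(r, tau):
    """Closed-form maximizer of E_p[r] + tau * H(p) over the simplex."""
    z = np.exp(r / tau)
    return z / z.sum()

def objective(p, r, tau):
    p = np.clip(p, 1e-12, 1.0)
    return p @ r - tau * np.sum(p * np.log(p))   # reward plus entropy bonus

rng = np.random.default_rng(0)
r = rng.normal(size=4)   # toy "rewards" over four candidate intensities
tau = 0.5
p_star = gibbs(r, tau)

# Numerical check: random points on the simplex never beat p_star,
# since the entropy-regularized objective is strictly concave.
best = objective(p_star, r, tau)
for _ in range(1000):
    q = rng.dirichlet(np.ones(4))
    assert objective(q, r, tau) <= best + 1e-9
```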
☆ IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
comment: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026
☆ Systematic Analyses of Reinforcement Learning Controllers in Signalized Urban Corridors
In this work, we extend our systematic capacity-region perspective to multi-junction traffic networks, focussing on the special case of an urban corridor network. In particular, we train and evaluate centralized, fully decentralized, and parameter-sharing decentralized RL controllers, and compare their capacity regions and average travel times (ATTs) against a classical MaxPressure baseline controller. Further, we show how the parameter-sharing controller may be generalised for deployment on a larger network than it was originally trained on. In this setting, we present initial findings suggesting that, even though the junctions are not formally coordinated, traffic may self-organise into 'green waves'.
☆ Feature Weighting Improves Pool-Based Sequential Active Learning for Regression
Pool-based sequential active learning for regression (ALR) optimally selects a small number of samples sequentially from a large pool of unlabeled samples to label, so that a more accurate regression model can be constructed under a given labeling budget. Representativeness and diversity, which involve computing the distances among different samples, are important considerations in ALR. However, previous ALR approaches do not incorporate the importance of different features in inter-sample distance computation, resulting in sub-optimal sample selection. This paper proposes three feature weighted single-task ALR approaches and two feature weighted multi-task ALR approaches, where the ridge regression coefficients trained from a small amount of previously labeled samples are used to weight the corresponding features in inter-sample distance computation. Experiments showed that this easy-to-implement enhancement almost always improves the performance of four existing ALR approaches, in both single-task and multi-task regression problems. The feature weighting strategy may also be easily extended to stream-based ALR and to classification algorithms.
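The feature-weighting idea can be sketched as follows (a minimal illustration assuming a closed-form ridge fit and a stand-in representativeness criterion; the actual ALR selection rules in the paper are more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
# Only the first two features drive the response in this toy problem.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Closed-form ridge regression on the small labeled set.
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Weight each feature by |beta| in inter-sample distance computation,
# so irrelevant features contribute little to the distances.
w = np.abs(beta)

def weighted_dist(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))

pool = rng.normal(size=(200, d))
# Representativeness stand-in: pick the pool sample closest (in weighted
# distance) to the pool centroid.
center = pool.mean(axis=0)
dists = np.array([weighted_dist(p, center, w) for p in pool])
chosen = int(np.argmin(dists))
```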
☆ Demographic Parity Tails for Regression
Demographic parity (DP) is a widely studied fairness criterion in regression, enforcing independence between the predictions and sensitive attributes. However, constraining the entire distribution can degrade predictive accuracy and may be unnecessary for many applications, where fairness concerns are localized to specific regions of the distribution. To overcome this issue, we propose a new framework for regression under DP that focuses on the tails of target distribution across sensitive groups. Our methodology builds on optimal transport theory. By enforcing fairness constraints only over targeted regions of the distribution, our approach enables more nuanced and context-sensitive interventions. Leveraging recent advances, we develop an interpretable and flexible algorithm that leverages the geometric structure of optimal transport. We provide theoretical guarantees, including risk bounds and fairness properties, and validate the method through experiments in regression settings.
☆ Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
comment: 20 pages, 4 tables, 6 figures, appendix included
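One plausible reading of an adaptive domain sampling mechanism is a largest-deficit rule; the sketch below is a hypothetical illustration, not the paper's disclosed recipe (the domain names and target ratios are invented):

```python
import numpy as np

def pick_domain(target_ratios, consumed):
    """Toy adaptive domain sampler: always draw from the domain whose
    realized share lags its target the most, so heterogeneous per-domain
    throughput cannot skew the overall training mix."""
    total = consumed.sum()
    realized = consumed / total if total > 0 else np.zeros_like(consumed)
    deficit = target_ratios - realized
    return int(np.argmax(deficit))

targets = np.array([0.4, 0.3, 0.2, 0.1])   # e.g. math, code, IF, puzzles
consumed = np.zeros(4)
for _ in range(1000):
    dom = pick_domain(targets, consumed)
    consumed[dom] += 1

realized = consumed / consumed.sum()
```

The greedy deficit rule keeps the realized mix within $O(1/n)$ of the targets regardless of how unevenly domains are consumed.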
☆ Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
comment: Project Website: https://world-action-verifier.github.io
☆ Homogenized Transformers
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker--Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
☆ RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules (JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities) from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions, sensitivity (avoiding false negatives) and specificity (avoiding false positives), achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
comment: 11 pages, 10 figures. To be submitted to CAMLIS 2026
☆ Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).
☆ Generalization Bounds and Statistical Guarantees for Multi-Task and Multiple Operator Learning with MNO Networks
Multiple operator learning concerns learning operator families $\{G[α]:U\to V\}_{α\in W}$ indexed by an operator descriptor $α$. Training data are collected hierarchically by sampling operator instances $α$, then input functions $u$ per instance, and finally evaluation points $x$ per input, yielding noisy observations of $G[α][u](x)$. While recent work has developed expressive multi-task and multiple operator learning architectures and approximation-theoretic scaling laws, quantitative statistical generalization guarantees remain limited. We provide a covering-number-based generalization analysis for separable models, focusing on the Multiple Neural Operator (MNO) architecture: we first derive explicit metric-entropy bounds for hypothesis classes given by linear combinations of products of deep ReLU subnetworks, and then combine these complexity bounds with approximation guarantees for MNO to obtain an explicit approximation-estimation tradeoff for the expected test error on new (unseen) triples $(α,u,x)$. The resulting bound makes the dependence on the hierarchical sampling budgets $(n_α,n_u,n_x)$ transparent and yields an explicit learning-rate statement in the operator-sampling budget $n_α$, providing a sample-complexity characterization for generalization across operator instances. The structure and architecture can also be viewed as a general-purpose solver or an example of a "small" PDE foundation model, where the triples are one form of multi-modality.
☆ Learn by Surprise, Commit by Proof
We propose LSCP, a self-gated post-training framework for autonomous knowledge acquisition: learning only what a model does not already know, verified against what it does know, at a strength proportional to conviction, with no external oracle. When a passage produces anomalously high per-token loss, LSCP flags it, generates a Q&A chain that forces the model to articulate its own knowledge and identify gaps, then adjusts AdamW's $β_2$ in proportion to conviction depth k (the number of self-verification steps the passage survives) via $β_2 = 0.999 \cdot r^k$. The entire learning intensity is governed by a single parameter $r$. Beyond new knowledge, this process sharpens weakly encoded existing knowledge, which is a primary source of hallucination. The framework is self-extinguishing: as the model learns, per-token loss on learned passages decreases toward the surprisal threshold and the system progressively converges to standard AdamW. This models biological memory consolidation: temporary information in the context window is selectively consolidated into parametric weights, the model's long-term memory. Experiments on the reference model (Qwen3-14B) and across six models (8B--32B, four families) show that standard fine-tuning produces rote memorization, with a perturbation gap (the ratio of paraphrase to original perplexity) of 11.6 ± 0.2× baseline, while all LSCP conditions learn semantically (2.7-3.0×). The r=1.0 condition (identical optimizer, nearly identical data, only Q&A format differs) confirms that the training data format, not $β_2$ gating, is the primary mechanism preventing memorization; gating instead protects neighboring knowledge from contamination by corrupt content (93 ± 7% accuracy on adjacent questions at r=0.98 vs. 90% baseline).
comment: 24 pages, 3 figures
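The $β_2$ gating rule quoted in the abstract is simple enough to tabulate directly (the schedule below only illustrates the stated formula; the surrounding AdamW training loop is omitted):

```python
def gated_beta2(k, r, base=0.999):
    """beta_2 = 0.999 * r**k: deeper conviction (larger k) lowers beta_2,
    shortening AdamW's second-moment memory and strengthening the update."""
    return base * (r ** k)

# With r = 0.98, conviction depths 0..5 give a gentle geometric schedule;
# with r = 1.0 the optimizer is exactly standard AdamW at every depth.
schedule = [gated_beta2(k, 0.98) for k in range(6)]
identity = [gated_beta2(k, 1.0) for k in range(6)]
```

This also makes the self-extinguishing behavior concrete: once nothing survives verification (k = 0, or r = 1.0), the rule returns the stock $β_2 = 0.999$.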
☆ annbatch unlocks terabyte-scale training of biological data in anndata
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
☆ PAC-Bayesian Reward-Certified Outcome Weighted Learning
Estimating optimal individualized treatment rules (ITRs) via outcome weighted learning (OWL) often relies on observed rewards that are noisy or optimistic proxies for the true latent utility. Ignoring this reward uncertainty leads to the selection of policies with inflated apparent performance, yet existing OWL frameworks lack the finite-sample guarantees required to systematically embed such uncertainty into the learning objective. To address this issue, we propose PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL). Given a one-sided uncertainty certificate, PROWL constructs a conservative reward and a strictly policy-dependent lower bound on the true expected value. Theoretically, we prove an exact certified reduction that transforms robust policy learning into a unified, split-free cost-sensitive classification task. This formulation enables the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs, where we establish that the optimal posterior maximizing this bound is exactly characterized by a general Bayes update. To overcome the learning-rate selection problem inherent in generalized Bayesian inference, we introduce a fully automated, bounds-based calibration procedure, coupled with a Fisher-consistent certified hinge surrogate for efficient optimization. Our experiments demonstrate that PROWL achieves improvements in estimating robust, high-value treatment regimes under severe reward uncertainty compared to standard methods for ITR estimation.
☆ Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $ρ\geq 0.82$ compared to $ρ\geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
comment: 6 pages, 6 figures
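The factored self-attention can be sketched in real-valued numpy (the model itself is complex-valued with a holomorphic linear layer; this single-head toy only shows how attending along each grid axis in turn replaces joint attention over all $T \cdot F$ positions, reducing the cost to $O(TF^2 + FT^2)$):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Single-head dot-product self-attention along one axis of a
    (T, F, d) grid; the other axis is treated as a batch dimension."""
    x = np.moveaxis(x, axis, 1)            # -> (batch, seq, d)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x
    return np.moveaxis(out, 1, axis)

rng = np.random.default_rng(0)
T, F, d = 8, 16, 4
x = rng.normal(size=(T, F, d))             # snapshots x frequency bins

# Factored attention: along frequency (O(T F^2)), then time (O(F T^2)).
y = axis_attention(x, axis=1)   # attend across the F frequency bins
y = axis_attention(y, axis=0)   # attend across the T snapshots
```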
☆ A Novel Theoretical Analysis for Clustering Heteroscedastic Gaussian Data without Knowledge of the Number of Clusters
This paper addresses the problem of clustering measurement vectors that are heteroscedastic in that they can have different covariance matrices. From the assumption that the measurement vectors within a given cluster are Gaussian distributed with possibly different and unknown covariance matrices around the cluster centroid, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed points of a certain function. As such, the approach generalizes the methodology employed to derive the existing Mean-Shift algorithm. But as a main and novel theoretical result compared to Mean-Shift, this paper shows that the only fixed points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X that works by estimating the fixed points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed points to calculate compared to the Mean-Shift algorithm, thus resulting in a clear gain in complexity. Simulation results on synthetic and real data sets show that CENTRE-X has comparable or better performance than standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
comment: 76 pages, submitted to JMLR
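For 2-D measurement vectors the Wald kernel has a closed form, since the chi-square survival function with 2 degrees of freedom is $e^{-w/2}$ (a sketch under the assumption of a known covariance; general dimensions would need the full chi-square CDF):

```python
import math
import numpy as np

def wald_kernel_2d(x, mu, cov):
    """p-value of the Wald test that 2-D vector x was drawn from
    N(mu, cov): the chi-square(2) survival function is exp(-w/2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    w = diff @ np.linalg.solve(cov, diff)   # Wald statistic
    return math.exp(-w / 2.0)

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.0], [0.0, 0.5]])

p_center = wald_kernel_2d(mu, mu, cov)            # at the centroid: p = 1
p_far = wald_kernel_2d([10.0, 10.0], mu, cov)     # far from it: p near 0
```

The p-value decays with the Mahalanobis distance but, unlike a Gaussian density, stays on a fixed [0, 1] scale, which is what makes it usable as a plausibility score across dimensions.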
☆ Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in the Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
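For the single constraint that a group of equally plausible classes share a probability, the Kullback-Leibler projection has a closed form that can be checked numerically: the tied classes each receive the geometric mean of their predicted probabilities, and the vector is renormalized (a toy special case; the full method chains such Bregman projections with Dykstra's algorithm):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_equal_group(q, group):
    """KL projection of q onto {p on the simplex : p_i equal over group}.
    Stationarity of min_p KL(p||q) gives the tied classes the geometric
    mean of their q-values, followed by renormalization."""
    p = q.copy()
    p[group] = np.exp(np.mean(np.log(q[group])))
    return p / p.sum()

q = np.array([0.5, 0.2, 0.2, 0.1])   # model output (toy)
group = [0, 1]                       # classes with equal possibility degree
p_star = project_equal_group(q, group)

# Numerical check: p_star beats random admissible points in KL divergence.
rng = np.random.default_rng(0)
best = kl(p_star, q)
for _ in range(1000):
    t = rng.uniform(0.01, 0.49)
    rest = rng.dirichlet(np.ones(2)) * (1 - 2 * t)
    cand = np.array([t, t, rest[0], rest[1]])
    assert kl(cand, q) >= best - 1e-9
```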
☆ Woosh: A Sound Effects Foundation Model
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
☆ Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
☆ The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning ICLR
Deep reinforcement learning (RL) suffers severely from plasticity loss due to the non-stationarity inherent to the setting, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be resolved remains limited to empirical findings, leaving the theoretical end underexplored. To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in the online RL process, the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the $\Theta(1/k)$ decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycle, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay (SWD) -- a lightweight method to restore gradient magnitude -- as a general remedy for plasticity loss in deep RL methods based on experience replay. In experiments, we evaluate the efficacy of SWD upon TD3, Double DQN, and SAC with the SimBa architecture in MuJoCo, ALE, and DeepMind Control Suite tasks. The results demonstrate that SWD effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD ratios, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
comment: ICLR
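The abstract names Sample Weight Decay but does not spell out the update rule. As a purely hypothetical illustration of the idea of per-sample loss weights in an experience-replay buffer, the toy below decays each sample's weight with its age (the decay rate and the squared-loss model are made-up choices, not values from the paper) and shows how such weights rescale the gradient of a weighted loss:

```python
import math

def sample_weights(ages, decay=0.1):
    """Hypothetical per-sample weights that decay with sample age.

    'age' = number of gradient updates since the sample entered the
    replay buffer. The decay rate is an illustrative hyperparameter,
    not a value taken from the paper.
    """
    return [math.exp(-decay * a) for a in ages]

def weighted_sq_loss_grad(theta, batch, weights):
    """Gradient w.r.t. theta of the weighted squared loss
    sum_i 0.5 * w_i * (theta * x_i - y_i)^2."""
    return sum(w * (theta * x - y) * x for (x, y), w in zip(batch, weights))

# Three replayed (x, y) samples with increasing age in the buffer.
batch = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0)]
ages = [0, 50, 100]
w = sample_weights(ages)          # fresh samples dominate the update
g = weighted_sq_loss_grad(0.0, batch, w)
```

The fresh sample keeps full weight while stale samples contribute less, changing the effective gradient magnitude of the replay batch.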
☆ Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
Artificial intelligence (AI) systems accelerate medical workflows and improve diagnostic accuracy in healthcare, serving as second-opinion systems. However, the unpredictability of AI errors poses a significant challenge, particularly in healthcare contexts, where mistakes can have severe consequences. A widely adopted safeguard is to pair predictions with uncertainty estimation, enabling human experts to focus on high-risk cases while streamlining routine verification. Current uncertainty estimation methods, however, remain limited, particularly in quantifying aleatoric uncertainty, which arises from data ambiguity and noise. To address this, we propose a novel approach that leverages disagreement in expert responses to generate targets for training machine learning models. These targets are used in conjunction with standard data labels to estimate two components of uncertainty separately, as given by the law of total variance, via a two-ensemble approach, as well as its lightweight variant. We validate our method on binary image classification, binary and multi-class image segmentation, and multiple-choice question answering. Our experiments demonstrate that incorporating expert knowledge can enhance uncertainty estimation quality by $9\%$ to $50\%$ depending on the task, making this source of information invaluable for the construction of risk-aware AI systems in healthcare applications.
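The law-of-total-variance split mentioned in the abstract can be sketched for an ensemble: total predictive variance decomposes into the mean of per-member variances (aleatoric, the part expert disagreement would inform) plus the variance of per-member means (epistemic). The member predictions below are illustrative numbers, not outputs of the paper's trained models:

```python
def total_variance(member_means, member_vars):
    """Law of total variance applied to an ensemble:

        Var[y] = E[Var[y|m]]   (aleatoric: average per-member variance)
               + Var[E[y|m]]   (epistemic: spread of the member means)
    """
    n = len(member_means)
    aleatoric = sum(member_vars) / n
    mu = sum(member_means) / n
    epistemic = sum((m - mu) ** 2 for m in member_means) / n
    return aleatoric, epistemic, aleatoric + epistemic

# Three ensemble members, each predicting a mean and a variance.
a, e, t = total_variance([0.8, 0.6, 1.0], [0.04, 0.05, 0.03])
```

A two-ensemble scheme as described would train the variance heads against disagreement-derived targets rather than leaving them unconstrained.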
☆ LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
Electroencephalography (EEG) provides a non-invasive window into brain activity, offering high temporal resolution crucial for understanding and interacting with neural processes through brain-computer interfaces (BCIs). Current dual-stream neural networks for EEG often process temporal and spatial features independently through parallel branches, delaying their integration until a final, late-stage fusion. This design inherently leads to an "information silo" problem, precluding intermediate cross-stream refinement and hindering spatial-temporal decompositions essential for full feature utilization. We propose LI-DSN, a layer-wise interactive dual-stream network that facilitates progressive, cross-stream communication at each layer, thereby overcoming the limitations of late-fusion paradigms. LI-DSN introduces a novel Temporal-Spatial Integration Attention (TSIA) mechanism, which constructs a Spatial Affinity Correlation Matrix (SACM) to capture inter-electrode spatial structural relationships and a Temporal Channel Aggregation Matrix (TCAM) to integrate cosine-gated temporal dynamics under spatial guidance. Furthermore, we employ an adaptive fusion strategy with learnable channel weights to optimize the integration of dual-stream features. Extensive experiments across eight diverse EEG datasets, encompassing motor imagery (MI) classification, emotion recognition, and steady-state visual evoked potentials (SSVEP), consistently demonstrate that LI-DSN significantly outperforms 13 state-of-the-art (SOTA) baseline models, showcasing its superior robustness and decoding performance. The code will be released upon acceptance.
☆ DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
comment: 30 pages, 5 figures. Submitted to Neural Networks (Elsevier)
☆ Robust Graph Representation Learning via Adaptive Spectral Contrast
Spectral graph contrastive learning has emerged as a unified paradigm for handling both homophilic and heterophilic graphs by leveraging high-frequency components. However, we identify a fundamental spectral dilemma: while high-frequency signals are indispensable for encoding heterophily, our theoretical analysis proves they exhibit significantly higher variance under spectrally concentrated perturbations. We derive a regret lower bound showing that existing global (node-agnostic) spectral fusion is provably sub-optimal: on mixed graphs with separated node-wise frequency preferences, any global fusion strategy incurs non-vanishing regret relative to a node-wise oracle. To escape this bound, we propose ASPECT, a framework that resolves this dilemma through a reliability-aware spectral gating mechanism. Formulated as a minimax game, ASPECT employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty. This design forces the encoder to learn representations that are both structurally discriminative and spectrally robust. Empirical results show that ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
☆ Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
In modern process industries, data-driven models are important tools for real-time monitoring when key performance indicators are difficult to measure directly. While accurate predictions are essential, reliable uncertainty quantification (UQ) is equally critical for safety, reliability, and decision-making, but remains a major challenge in current data-driven approaches. In this work, we introduce a diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty via faithful posterior sampling, eliminating the need for post-hoc calibration. In extensive evaluations on synthetic distributions, the Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study, our method achieves practical improvements over existing UQ techniques in both uncertainty calibration and predictive accuracy. These results highlight diffusion samplers as a principled and scalable paradigm for advancing uncertainty-aware modeling in industrial applications.
comment: This manuscript has been accepted for publication in IEEE Transactions on Industrial Informatics. Copyright has been transferred to IEEE. Reuse of this material is subject to IEEE copyright restrictions
☆ CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift AAAI 2026
Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.
comment: AAAI 2026
☆ Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
comment: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026
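The matching-quantization idea above, assigning latents to codes via optimal bipartite matching rather than independent nearest-neighbour lookup, can be sketched in miniature. The brute-force permutation search below is only feasible for toy sizes; a real implementation would presumably use an efficient assignment solver (e.g. the Hungarian algorithm), and the vectors are made-up numbers:

```python
from itertools import permutations

def d2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def matching_quantize(latents, codebook):
    """Assign each latent a *distinct* codebook entry by minimising the
    total squared distance over all one-to-one assignments (brute force)."""
    k = len(latents)
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(codebook)), k):
        cost = sum(d2(latents[i], codebook[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return list(best_assign), best_cost

latents = [[0.0, 0.0], [0.2, 0.0]]
codebook = [[0.1, 0.0], [1.0, 0.0]]

# Naive nearest-neighbour lookup maps BOTH latents onto code 0 ...
naive = [min(range(len(codebook)), key=lambda j: d2(l, codebook[j]))
         for l in latents]
# ... while matching forces distinct codes, using more of the codebook.
assign, cost = matching_quantize(latents, codebook)
```

Forcing distinct codes per slot is what raises the effective capacity of the bottleneck relative to naive quantization.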
☆ Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) transfer inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random-label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
☆ Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids
Topology control for power grid operation is a challenging sequential decision-making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed reinforcement learning framework that combines semi-Markov control with a Gibbs prior over the action space that encodes the system's physics. A decision is taken only when the grid enters a hazardous regime, and a graph neural network surrogate predicts the post-action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately $6\times$ faster on the first benchmark, reaches $94.6\%$ of oracle reward with roughly $200\times$ lower decision time on the second one, and on the most challenging benchmark improves over a PPO baseline by up to $255\%$ in reward and $284\%$ in survived steps while remaining about $2.5\times$ faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.
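The Gibbs prior and logit-reweighting steps described above can be sketched as follows. A Gibbs distribution over actions takes the form p(a) ∝ exp(-β · risk(a)); in the paper the risks come from the GNN surrogate, whereas here they are illustrative numbers, and the temperature β, candidate-set size k, and the exact reweighting rule are assumptions:

```python
import math

def gibbs_prior(risks, beta=5.0):
    """Gibbs distribution over actions: p(a) ∝ exp(-beta * risk(a)).
    Computed with the usual max-shift for numerical stability."""
    logits = [-beta * r for r in risks]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reweight_policy(policy_logits, prior, k=2):
    """Keep the k highest-prior (lowest-risk) actions and add the
    log-prior to their policy logits -- one plausible reading of the
    candidate selection + reweighting step."""
    candidates = sorted(range(len(prior)), key=lambda i: -prior[i])[:k]
    return {i: policy_logits[i] + math.log(prior[i]) for i in candidates}

prior = gibbs_prior([0.1, 0.9, 0.3])   # surrogate-predicted overload risks
scores = reweight_policy([1.0, 2.0, 0.5], prior)
```

High-risk actions get vanishing prior mass, so the learned policy only ever deliberates over a small, physics-vetted candidate set.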
☆ Graph Neural Operator Towards Edge Deployability and Portability for Sparse-to-Dense, Real-Time Virtual Sensing on Irregular Grids
Accurate sensing of spatially distributed physical fields typically requires dense instrumentation, which is often infeasible in real-world systems due to cost, accessibility, and environmental constraints. Physics-based solvers address this through direct numerical integration of governing equations, but their computational latency and power requirements preclude real-time use in resource-constrained monitoring and control systems. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), a graph-based neural operator for sparse-to-dense reconstruction on irregular geometries, and a variable-connectivity algorithm, Variable KNN (V-KNN), for mesh-informed graph construction. Unlike prior neural operators that treat hardware deployability as secondary, VIRSO reframes inference as measurement: the combination of both spectral and spatial analysis provides accurate reconstruction without the high latency and power consumption of previous graph-based methodologies with poor scalability, presenting VIRSO as a potential candidate for edge-constrained, real-time virtual sensing. We evaluate VIRSO on three nuclear thermal-hydraulic benchmarks of increasing geometric and multiphysics complexity, across reconstruction ratios from 47:1 to 156:1. VIRSO achieves mean relative $L_2$ errors below 1%, outperforming other benchmark operators while using fewer parameters. The full 10-layer configuration reduces the energy-delay product (EDP) from ${\approx}206$ J$\cdot$ms for the graph operator baseline to $10.1$ J$\cdot$ms on an NVIDIA H200. Implemented on an NVIDIA Jetson Orin Nano, all configurations of VIRSO provide sub-10 W power consumption and sub-second latency. These results establish the edge-feasibility and hardware-portability of VIRSO and present compute-aware operator learning as a new paradigm for real-time sensing in inaccessible and resource-constrained environments.
comment: 34 pages, 5 figures, 16 tables
☆ Learning in Prophet Inequalities with Noisy Observations ICLR 2026
We study the prophet inequality, a fundamental problem in online decision-making and optimal stopping, in a practical setting where rewards are observed only through noisy realizations and reward distributions are unknown. At each stage, the decision-maker receives a noisy reward whose true value follows a linear model with an unknown latent parameter, and observes a feature vector drawn from a distribution. To address this challenge, we propose algorithms that integrate learning and decision-making via lower-confidence-bound (LCB) thresholding. In the i.i.d.\ setting, we establish that both an Explore-then-Decide strategy and an $\varepsilon$-Greedy variant achieve the sharp competitive ratio of $1 - 1/e$, under a mild condition on the optimal value. For non-identical distributions, we show that a competitive ratio of $1/2$ can be guaranteed against a relaxed benchmark. Moreover, with limited window access to past rewards, the tight ratio of $1/2$ against the optimal benchmark is achieved.
comment: ICLR 2026
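The Explore-then-Decide strategy with LCB thresholding can be illustrated in a simplified scalar form. The paper's algorithm fits a linear model of the latent reward parameter from observed features; the sketch below drops the feature model and just thresholds noisy scalar rewards, with a Hoeffding-style confidence width standing in for the lower confidence bound (explore fraction and δ are made-up choices):

```python
import math

def explore_then_decide(rewards, explore_frac=0.3, delta=0.1):
    """Simplified scalar Explore-then-Decide with an LCB-style threshold.

    Explore phase: observe (and skip) the first rewards, then set the
    threshold to the best observed value minus a confidence width.
    Decide phase: accept the first reward clearing the threshold;
    if none does, we are forced to take the last one.
    """
    n = len(rewards)
    m = max(1, int(explore_frac * n))
    width = math.sqrt(math.log(1 / delta) / (2 * m))  # Hoeffding-style slack
    threshold = max(rewards[:m]) - width
    for i in range(m, n):
        if rewards[i] >= threshold:
            return i, rewards[i]
    return n - 1, rewards[-1]

idx, val = explore_then_decide(
    [0.4, 0.7, 0.2, 0.9, 0.5, 0.95, 0.3, 0.8, 0.6, 0.1])
```

Lowering the threshold by the confidence width is what protects the guarantee when rewards are only observed through noise: a slightly pessimistic bar avoids rejecting good candidates because of estimation error.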
☆ Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
Although demand forecasting is a critical component of supply chain planning, actual retail data can exhibit irreconcilable seasonality, irregular spikes, and noise, rendering precise projections nearly unattainable. This paper proposes a three-stage analytical framework that combines forecasting and operational analytics. The first stage consists of exploratory data analysis, in which delivery-tracked data from 180,519 transactions are partitioned and long-term trends, seasonality, and delivery-related attributes are examined. In the second stage, the forecasting performance of a statistical time-series decomposition benchmark (MSTL) is compared against two recent deep learning architectures, N-BEATS and N-HiTS; both deep models outperform the statistical benchmark by a large margin, and N-BEATS, attaining the lowest forecasting error, is selected as the final model. In the third and final stage, the N-BEATS forecast of 1,918 units over the next four weeks is passed to a deterministic integer linear program that minimizes total delivery time subject to budget, capacity, and service constraints. The resulting allocation yields a feasible and cost-optimal shipping plan. Overall, the study provides a compelling example of the practical impact of precise forecasting combined with simple, highly interpretable optimization in logistics.
comment: 12 pages, 4 figures, 4 tables
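The third-stage optimization can be written as a generic transportation-style integer program. The abstract does not give the exact formulation, so the symbols below (demands $d_i$, per-route delivery times $t_{ij}$ and costs $c_{ij}$, budget $B$, capacities $K_j$) are illustrative assumptions; only the objective, the constraint types, and the 1,918-unit total come from the abstract:

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i \in I}\sum_{j \in J} t_{ij}\, x_{ij}
  && \text{(total delivery time)} \\
\text{s.t.}\quad
  & \sum_{j \in J} x_{ij} = d_i, \quad \forall i \in I
  && \text{(serve each demand; } \textstyle\sum_i d_i = 1918 \text{ units)} \\
  & \sum_{i \in I}\sum_{j \in J} c_{ij}\, x_{ij} \le B
  && \text{(budget bound)} \\
  & \sum_{i \in I} x_{ij} \le K_j, \quad \forall j \in J
  && \text{(shipping capacity)} \\
  & x_{ij} \in \mathbb{Z}_{\ge 0}
  && \text{(integer shipment quantities)}
\end{aligned}
```

Service constraints are captured here through exact demand satisfaction; a service-level variant would instead bound the fraction of demand met on time.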
☆ Dual-Attention Based 3D Channel Estimation
For multi-input and multi-output (MIMO) channels, optimal channel estimation (CE) based on the linear minimum mean square error (LMMSE) criterion requires three-dimensional (3D) filtering. However, the complexity is often prohibitive due to large matrix dimensions. Suboptimal estimators approximate 3DCE by decomposing it into the time, frequency, and spatial domains, which yields noticeable performance degradation under correlated MIMO channels. On the other hand, recent advances in deep learning (DL) can exploit channel correlations across all domains via attention mechanisms. Building on this capability, we propose a dual-attention-based 3DCE network (3DCENet) that achieves accurate channel estimates.
comment: 5 pages, 6 figures
☆ FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
comment: The first two authors contributed equally to this work; listing order is random
☆ LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
comment: Project page: https://livemathematicianbench.github.io/
☆ DDCL: Deep Dual Competitive Learning: A Differentiable End-to-End Framework for Unsupervised Prototype-Based Representation Learning
A persistent structural weakness in deep clustering is the disconnect between feature learning and cluster assignment. Most architectures invoke an external clustering step, typically k-means, to produce pseudo-labels that guide training, preventing the backbone from directly optimising for cluster quality. This paper introduces Deep Dual Competitive Learning (DDCL), the first fully differentiable end-to-end framework for unsupervised prototype-based representation learning. The core contribution is architectural: the external k-means is replaced by an internal Dual Competitive Layer (DCL) that generates prototypes as native differentiable outputs of the network. This single inversion makes the complete pipeline, from backbone feature extraction through prototype generation to soft cluster assignment, trainable by backpropagation through a single unified loss, with no Lloyd iterations, no pseudo-label discretisation, and no external clustering step. To ground the framework theoretically, the paper derives an exact algebraic decomposition of the soft quantisation loss into a simplex-constrained reconstruction error and a non-negative weighted prototype variance term. This identity reveals a self-regulating mechanism built into the loss geometry: the gradient of the variance term acts as an implicit separation force that resists prototype collapse without any auxiliary objective, and leads to a global Lyapunov stability theorem for the reduced frozen-encoder system. Six blocks of controlled experiments validate each structural prediction. The decomposition identity holds with zero violations across more than one hundred thousand training epochs; the negative feedback cycle is confirmed with a Pearson correlation of -0.98; with a jointly trained backbone, DDCL outperforms its non-differentiable ablation by 65% in clustering accuracy and DeepCluster end-to-end by 122%.
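The decomposition of the soft quantisation loss mentioned above is the standard bias-variance identity for simplex-weighted mixtures, and it can be checked numerically. With weights $w$ on the simplex and $\bar p = \sum_j w_j p_j$, one has $\sum_j w_j \|x - p_j\|^2 = \|x - \bar p\|^2 + \sum_j w_j \|p_j - \bar p\|^2$; the vectors below are arbitrary illustrative numbers:

```python
def soft_quantisation_decomposition(x, prototypes, w):
    """Verify: sum_j w_j ||x - p_j||^2
             = ||x - p_bar||^2 + sum_j w_j ||p_j - p_bar||^2,
    with p_bar = sum_j w_j p_j and w on the probability simplex.
    The left side is the soft quantisation loss; the two right-hand
    terms are the reconstruction error and the non-negative weighted
    prototype variance named in the abstract."""
    d = len(x)
    p_bar = [sum(wj * p[k] for wj, p in zip(w, prototypes)) for k in range(d)]
    sq = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    lhs = sum(wj * sq(x, p) for wj, p in zip(w, prototypes))
    recon = sq(x, p_bar)
    variance = sum(wj * sq(p, p_bar) for wj, p in zip(w, prototypes))
    return lhs, recon, variance

lhs, recon, var = soft_quantisation_decomposition(
    [1.0, 2.0], [[0.0, 0.0], [2.0, 2.0], [1.0, 3.0]], [0.5, 0.3, 0.2])
```

Because the variance term is non-negative and vanishes only when all prototypes coincide with $\bar p$, its gradient pushes prototypes apart, which is the separation force the paper describes.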
☆ Koopman-Based Nonlinear Identification and Adaptive Control of a Turbofan Engine
This paper investigates Koopman operator-based approaches for multivariable control of a two-spool turbofan engine. A physics-based component-level model is developed to generate training data and validate the controllers. A meta-heuristic extended dynamic mode decomposition is developed, with a cost function designed to accurately capture both spool-speed dynamics and the engine pressure ratio (EPR), enabling the construction of a single Koopman model suitable for multiple control objectives. Using the identified time-varying Koopman model, two controllers are developed: an adaptive Koopman-based model predictive controller (AKMPC) with a disturbance observer and a Koopman-based feedback linearization controller (K-FBLC), which serves as a benchmark. The controllers are evaluated for two control strategies, namely configurations of spool speeds and EPR, under both sea-level and varying flight conditions. The results demonstrate that the proposed identification approach enables accurate predictions of both spool speeds and EPR, allowing the Koopman model to be reused flexibly across different control formulations. While both control strategies achieve comparable performance in steady conditions, the AKMPC exhibits superior robustness compared with the K-FBLC under varying flight conditions due to its ability to compensate for model mismatch. Moreover, the EPR control strategy improves the thrust response. The study highlights the applicability of Koopman-based control and demonstrates the advantages of the AKMPC-based framework for robust turbofan engine control.
comment: 21 pages, 23 figures
☆ MATA-Former & SIICU: Semantic Aware Temporal Alignment for High-Fidelity ICU Risk Prediction
Forecasting evolving clinical risks relies on intrinsic pathological dependencies rather than mere chronological proximity, yet current methods struggle with coarse binary supervision and physical timestamps. To align predictive modeling with clinical logic, we propose the Medical-semantics Aware Time-ALiBi Transformer (MATA-Former), utilizing event semantics to dynamically parameterize attention weights to prioritize causal validity over time lags. Furthermore, we introduce Plateau-Gaussian Soft Labeling (PSL), reformulating binary classification into continuous multi-horizon regression for full-trajectory risk modeling. Evaluated on SIICU -- a newly constructed dataset featuring over 506k events with rigorous expert-verified, fine-grained annotations -- and the MIMIC-IV dataset, our framework demonstrates superior efficacy and robust generalization in capturing risks from text-intensive, irregular clinical time series.
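Plateau-Gaussian Soft Labeling turns a binary event label into a continuous risk target over time. One plausible instantiation, sketched below, assigns a flat plateau of 1 within a window around the event and a Gaussian tail beyond it; the plateau width and sigma are illustrative hyperparameters, not values from the paper:

```python
import math

def psl_target(t, t_event, plateau=2.0, sigma=3.0):
    """Plateau-Gaussian soft label (hypothetical instantiation):
    the risk target is 1 within `plateau` time steps of the event and
    decays as a Gaussian tail beyond that window, yielding a smooth
    regression target over the full trajectory instead of a 0/1 flag."""
    gap = abs(t - t_event)
    if gap <= plateau:
        return 1.0
    return math.exp(-((gap - plateau) ** 2) / (2 * sigma ** 2))

# Targets for a deterioration event at t = 10, sampled every 5 steps.
labels = [round(psl_target(t, t_event=10.0), 3) for t in range(0, 21, 5)]
```

Training a regressor against such targets supervises every horizon at once, which is the reformulation from binary classification to multi-horizon regression described in the abstract.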
☆ LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8×, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor × which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
☆ Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms existing response-forecasting approaches because it requires no assumption of wind stationarity or of normal structural vibration behavior. Wind-excited dynamic behavior is subject to uncertainty: predictions degrade when environmental or traffic conditions change, which makes it hard to distinguish what constitutes normal vibration behavior. To this end, the framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach accurately captures structural behavior under realistic conditions and under changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.
comment: 21 pages, 22 figures, 9 tables. This version corresponds to the published article in Computers & Structures. https://doi.org/10.1016/j.compstruc.2026.108216
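The compare-forecast-to-measurement step described above reduces, in its simplest form, to thresholding forecast residuals against a baseline residual scale. The sketch below is a minimal version of that early-warning logic under assumed parameters (window length, threshold factor `k`); the paper's actual detector is not specified at this level of detail.

```python
import numpy as np

def deviation_alarm(measured, predicted, window=50, k=4.0):
    """Flag time steps where the forecast residual exceeds k times the
    residual's baseline standard deviation -- a minimal early-warning
    indicator of structural change (illustrative thresholding only)."""
    resid = np.asarray(measured) - np.asarray(predicted)
    # baseline scale from an assumed-healthy initial window
    scale = np.std(resid[:window]) + 1e-12
    return np.abs(resid) > k * scale

rng = np.random.default_rng(0)
pred = np.zeros(200)                      # stand-in for model forecasts
meas = rng.normal(0.0, 1.0, 200)          # stand-in for measured vibrations
meas[150:] += 10.0                        # injected structural shift
alarm = deviation_alarm(meas, pred)
```

With a healthy baseline window, normal scatter stays below the threshold while the injected shift is flagged almost everywhere.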
♻ ☆ LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/.
♻ ☆ Intervening to Learn and Compose Causally Disentangled Representations
In designing generative models, it is commonly believed that in order to learn useful latent structure, we face a fundamental tension between expressivity and structure. In this paper we challenge this view by proposing a new approach to training arbitrarily expressive generative models that simultaneously learn causally disentangled concepts. This is accomplished by adding a simple context module to an arbitrarily complex black-box model, which learns to process concept information by implicitly inverting linear representations from the model's encoder. Inspired by the notion of intervention in a causal model, our module selectively modifies its architecture during training, allowing it to learn a compact joint model over different contexts. We show how adding this module leads to causally disentangled representations that can be composed for out-of-distribution generation on both real and simulated data. The resulting models can be trained end-to-end or fine-tuned from pre-trained models. To further validate our proposed approach, we prove a new identifiability result that extends existing work on identifying structured representations.
comment: 45 pages, 10 figures; accepted to the 5th conference on Causal Learning and Reasoning (CLeaR)
♻ ☆ Transformers Can Solve Non-Linear and Non-Markovian Filtering Problems in Continuous Time For Conditionally Gaussian Signals
The use of attention-based deep learning models in stochastic filtering, e.g. transformers and deep Kalman filters, has recently come into focus; however, the potential for these models to solve stochastic filtering problems remains largely unknown. The paper provides an affirmative answer to this open problem in the theoretical foundations of machine learning by showing that a class of continuous-time transformer models, called \textit{filterformers}, can approximately implement the conditional law of a broad class of non-Markovian and conditionally Gaussian signal processes given noisy continuous-time (possibly non-Gaussian) measurements. Our approximation guarantees hold uniformly over sufficiently regular compact subsets of continuous-time paths, where the worst-case 2-Wasserstein distance between the true optimal filter and our deep learning model quantifies the approximation error. Our construction relies on two new customizations of the standard attention mechanism: The first can losslessly adapt to the characteristics of a broad range of paths since we show that the attention mechanism implements bi-Lipschitz embeddings of sufficiently regular sets of paths into low-dimensional Euclidean spaces; thus, it incurs no ``dimension reduction error''. The latter attention mechanism is tailored to the geometry of Gaussian measures in the $2$-Wasserstein space. Our analysis relies on new stability estimates of robust optimal filters in the conditionally Gaussian setting.
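In the linear-Gaussian special case of the conditionally Gaussian filtering problems studied here, the exact conditional law is given by the classical Kalman filter. A minimal scalar version is sketched below as the baseline a filterformer would have to approximate (this is the textbook recursion, not the paper's transformer construction; the model parameters are arbitrary).

```python
import numpy as np

def kalman_1d(ys, a=0.9, q=0.1, r=0.5, m0=0.0, p0=1.0):
    """Scalar Kalman filter: exact conditional mean for the
    linear-Gaussian state-space model x' = a*x + N(0,q), y = x + N(0,r)."""
    m, p = m0, p0
    means = []
    for y in ys:
        m, p = a * m, a * a * p + q          # predict
        k = p / (p + r)                      # Kalman gain
        m, p = m + k * (y - m), (1 - k) * p  # update
        means.append(m)
    return np.array(means)

rng = np.random.default_rng(1)
x, xs, ys = 0.0, [], []
for _ in range(500):
    x = 0.9 * x + rng.normal(0, np.sqrt(0.1))
    xs.append(x)
    ys.append(x + rng.normal(0, np.sqrt(0.5)))
est = kalman_1d(ys)
```

The filtered mean should track the latent state more closely than the raw noisy observations do.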
♻ ☆ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
comment: Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/
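The word-frequency shifts reported above rest on a simple, interpretable statistic: per-year relative frequency of each token. A toy version with made-up snippets (not arXiv data) looks like this:

```python
from collections import Counter
import re

def word_rates(docs_by_year):
    """Per-year relative frequency of each word -- the kind of simple
    statistic used to track usage shifts such as the rise of 'via'
    (toy example with invented snippets, not real arXiv text)."""
    rates = {}
    for year, docs in docs_by_year.items():
        tokens = [w for d in docs for w in re.findall(r"[a-z]+", d.lower())]
        counts = Counter(tokens)
        total = sum(counts.values())
        rates[year] = {w: c / total for w, c in counts.items()}
    return rates

rates = word_rates({
    2020: ["learning the structure of the data", "a study of the model"],
    2024: ["robust learning via distillation", "alignment via preference tuning"],
})
```

Comparing `rates` across years directly exposes rising words like "via" and declining function words like "the".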
♻ ☆ Hybrid Hidden Markov Model for Modeling Equity Excess Growth Rate Dynamics: A Discrete-State Approach with Jump-Diffusion
Generating synthetic financial time series that preserve the statistical properties of real market data is essential for stress testing, risk model validation, and scenario design. Existing approaches struggle to simultaneously reproduce heavy-tailed distributions, negligible linear autocorrelation, and persistent volatility clustering. We developed a hybrid hidden Markov framework that discretized excess growth rates into Laplace quantile-defined states and augmented regime switching with a Poisson jump-duration mechanism to enforce realistic tail-state dwell times. Parameters were estimated by direct transition counting, bypassing the Baum-Welch EM algorithm and scaling to a 424-asset pipeline. Applied to ten years of daily equity data, the framework achieved high distributional pass rates both in-sample and out-of-sample while partially reproducing the volatility clustering that standard regime-switching models miss. No single model was best at everything: GARCH(1,1) better reproduced volatility clustering but failed distributional tests, while the standard HMM without jumps passed more distributional tests but could not generate volatility clustering. The proposed framework delivered the most balanced performance overall. For multi-asset generation, copula-based dependence models that preserved each asset's marginal HMM distribution substantially outperformed a Single-Index Model factor baseline on both per-asset distributional accuracy and correlation reproduction.
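The direct-transition-counting estimator the abstract contrasts with Baum-Welch is straightforward: with observed discrete states (e.g. Laplace-quantile bins of excess growth rates), transition probabilities are just normalized bigram counts. A minimal sketch, with an arbitrary toy state sequence:

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate a Markov transition matrix by direct transition
    counting over an observed integer state sequence (rows with no
    observed exits fall back to a uniform distribution)."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row, out=np.full_like(counts, 1.0 / n_states),
                     where=row > 0)

seq = np.array([0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 2, 0])
P = transition_matrix(seq, 3)
```

Because estimation is a single pass of counting, it scales trivially to the 424-asset pipeline the paper describes, with no EM iterations.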
♻ ☆ Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational resources, particularly under increasing energy and infrastructure constraints. GPUs have emerged as essential for accelerating such workloads. This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs. Our experiments demonstrate that, on average, GPU training achieves speedups ranging from 11x to 246x depending on model complexity, with lightweight models (Conv6) showing the highest acceleration (246x), mid-sized models (VGG16, ResNet18) achieving 51-116x speedups, and complex generative models (CycleGAN) reaching 11x improvements compared to CPU training. Additionally, in our PyTorch vs. TensorFlow comparison, we observed that TensorFlow's kernel-fusion optimizations reduce inference latency by approximately 15%. We also analyze GPU memory usage trends and project requirements through 2025 using polynomial regression. Our findings highlight that while GPUs are essential for sustaining AI's growth, democratized and shared access to GPU resources is critical for enabling research innovation across institutions with limited computational budgets.
♻ ☆ Moonwalk: Inverse-Forward Differentiation
Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
♻ ☆ NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
♻ ☆ Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas
This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts LFE and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100 year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving R-squared values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.
♻ ☆ BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Interpreting gene clusters from RNA sequencing (RNA-seq) remains challenging, especially in antimicrobial resistance studies where mechanistic insight is important for hypothesis generation. Existing pathway enrichment methods can summarize co-expressed modules, but they often provide limited cluster-specific explanations and weak connections to supporting literature. We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules. BIOGEN combines biomedical retrieval, structured reasoning, and multi-critic verification to generate traceable cluster-level explanations with explicit evidence and confidence labels. On a primary Salmonella enterica dataset, BIOGEN achieved strong biological grounding, including BERTScore 0.689, Semantic Alignment Score 0.715, KEGG Functional Similarity 0.342, and a hallucination rate of 0.000, compared with 0.100 for an LLM-only baseline. Across four additional bacterial RNA-seq datasets, BIOGEN also maintained zero hallucination under the same fixed pipeline. In comparisons with representative open-source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets. These findings suggest that retrieval alone is not enough for reliable biological interpretation, and that evidence-grounded orchestration is important for transparent and source-traceable transcriptomic reasoning.
♻ ☆ Generalized Machine Learning for Fast Calibration of Agent-Based Epidemic Models
Agent-based models (ABMs) are widely used to study infectious disease dynamics, but their calibration is often computationally intensive, limiting their applicability in time-sensitive public health settings. We propose DeepIMC (Deep Inverse Mapping Calibration), a machine learning-based calibration framework that directly learns the inverse mapping from epidemic time series to epidemiological parameters. DeepIMC trains a bidirectional Long Short-Term Memory (BiLSTM) neural network on synthetic epidemic trajectories generated from agent-based models such as the Susceptible-Infected-Recovered (SIR) model, enabling rapid parameter estimation without repeated simulation at inference time. We evaluate DeepIMC through an extensive simulation study comprising 5,000 heterogeneous epidemic scenarios and benchmark its performance against Approximate Bayesian Computation (ABC) using likelihood-free Markov Chain Monte Carlo. The results show that DeepIMC substantially improves parameter recovery accuracy, produces sharp and well-calibrated predictive intervals, and reduces computational time by more than an order of magnitude relative to ABC. Although structural parameter identifiability constraints limit the precise recovery of all model parameters simultaneously, the calibrated models reliably reproduce epidemic trajectories and support accurate forward prediction with their estimated parameters. DeepIMC is implemented in the open-source R package epiworldRCalibrate, facilitating practical adoption for real-time epidemic modeling and policy analysis. Overall, our findings demonstrate that DeepIMC provides a scalable, operationally effective alternative to traditional simulation-based calibration methods for agent-based epidemic models.
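The training data for an inverse-mapping calibrator like DeepIMC consists of (parameters, trajectory) pairs from a forward epidemic model. A deterministic discrete-time SIR simulator is the simplest generator of such pairs; the sketch below uses arbitrary parameter values and omits the BiLSTM entirely (the paper uses stochastic agent-based simulations, not this deterministic approximation).

```python
import numpy as np

def sir_curve(beta, gamma, n=1000, i0=10, steps=100):
    """Discrete-time SIR infected-count curve: one synthetic trajectory
    of the kind an inverse-mapping calibrator would be trained on
    (deterministic Euler approximation, illustrative only)."""
    s, i, r = n - i0, i0, 0
    infected = [i]
    for _ in range(steps):
        new_inf = beta * s * i / n   # new infections this step
        new_rec = gamma * i          # new recoveries this step
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infected.append(i)
    return np.array(infected)

curve = sir_curve(beta=0.3, gamma=0.1)
```

Sampling (beta, gamma) over a grid and collecting the resulting curves yields the supervised dataset; the inverse network then regresses from curve back to parameters.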
♻ ☆ One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on high-quality samples of large volumes. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) Analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) An engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of samples that complements simply increasing data volume.
♻ ☆ Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
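The entropic regularization underlying the causal Sinkhorn discrepancy reduces, without the causal constraint, to the standard Sinkhorn iterations for entropy-regularized optimal transport. A minimal sketch (plain Sinkhorn on a toy 1-D cost; the causal consistency condition and the DRO layer are omitted):

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.1, iters=2000):
    """Entropy-regularized OT via Sinkhorn iterations: alternately
    rescale the Gibbs kernel K = exp(-cost/eps) to match both
    marginals (standard algorithm, not the paper's causal variant)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan

x = np.linspace(0, 1, 5)
cost = (x[:, None] - x[None, :]) ** 2
mu = np.full(5, 0.2)
nu = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
plan = sinkhorn(cost, mu, nu)
```

The regularization makes the optimal plan a smooth positive matrix rather than a sparse matching, which is what "encourages continuous transport plans" in the abstract refers to.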
♻ ☆ Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization
Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver influencing B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problems, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.
♻ ☆ The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery
Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.
♻ ☆ Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
♻ ☆ Zeroth-order Logconcave Sampling
We study the zeroth-order query complexity of sampling from a general logconcave distribution: given access to an evaluation oracle for a convex function $V:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{\infty\}$, output a point from a distribution within $\varepsilon$-distance to the density proportional to $e^{-V}$. A long line of work provides efficient algorithms for this problem in TV distance, assuming a pointwise warm start (i.e., in $\infty$-Rényi divergence), and using annealing to generate such a warm start. Here, we address the natural and more general problem of using a $q$-Rényi divergence warm start to generate a sample that is $\varepsilon$-close in $q$-Rényi divergence. Our first main result is an algorithm with this end-to-end guarantee with state-of-the-art complexity for $q=\widetilde{\Omega}(1)$. Our second result shows how to generate a $q$-Rényi divergence warm start directly via annealing, by maintaining $q$-Rényi divergence throughout, thereby obtaining a streamlined analysis and improved complexity. Such results were previously known only under the stronger assumptions of smoothness and access to first-order oracles. We also show a lower bound for Gaussian annealing by disproving a geometric conjecture about quadratic tilts of isotropic logconcave distributions. Central to our approach, we establish hypercontractivity of the heat adjoint and translate this into improved mixing time guarantees for the Proximal Sampler. The resulting analysis of both sampling and annealing follows a simplified and natural path, directly tying convergence rates to isoperimetric constants of the target distribution.
comment: v2: Fix a bug in the restart mechanism; add a lower bound on Gaussian annealing
Olaf: Bringing an Animated Character to Life in the Physical World
Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
♻ ☆ Multi-Timescale Primal Dual Hybrid Gradient with Application to Distributed Optimization
We propose two variants of the Primal Dual Hybrid Gradient (PDHG) algorithm for saddle point problems with block decomposable duals, hereafter called Multi-Timescale PDHG (MT-PDHG) and its accelerated variant (AMT-PDHG). Through novel mixtures of Bregman divergence and multi-timescale extrapolations, our MT-PDHG and AMT-PDHG converge under arbitrary updating rates for different dual blocks while remaining fully deterministic and robust to extreme delays in dual updates. We further apply our (A)MT-PDHG, augmented with the gradient sliding techniques introduced in Lan et al. (2020), Lan (2016), to distributed optimization. The flexibility in choosing different updating rates for different blocks allows a more refined control over the communication rounds between different pairs of agents, thereby improving the efficiencies in settings with heterogeneity in local objectives and communication costs. Moreover, with careful choices of penalty levels, our algorithms show linear and thus optimal dependency on function similarities, a measure of how similar the gradients of local objectives are. This provides a positive answer to the open question whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).
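The base algorithm the paper builds on, PDHG (Chambolle-Pock), can be sketched on a single-block toy saddle problem, $\min_x \|Kx-b\|_1 + \tfrac{\lambda}{2}\|x\|^2$; the multi-timescale block scheduling and gradient sliding that constitute the paper's contribution are omitted, and the problem instance is arbitrary.

```python
import numpy as np

def pdhg(K, b, lam=1.0, iters=2000):
    """Standard PDHG for min_x ||Kx - b||_1 + (lam/2)||x||^2, written
    as a saddle problem over a dual variable y with ||y||_inf <= 1
    (uniform update rates; not the paper's multi-timescale variant)."""
    m, n = K.shape
    L = np.linalg.norm(K, 2)
    tau = sigma = 0.99 / L            # step sizes: tau*sigma*L^2 < 1
    x = np.zeros(n); x_bar = x.copy(); y = np.zeros(m)
    for _ in range(iters):
        y = np.clip(y + sigma * (K @ x_bar - b), -1.0, 1.0)  # dual prox
        x_new = (x - tau * (K.T @ y)) / (1.0 + tau * lam)    # primal prox
        x_bar = 2 * x_new - x                                # extrapolation
        x = x_new
    return x

rng = np.random.default_rng(2)
K = rng.normal(size=(20, 5))
b = rng.normal(size=20)
x_opt = pdhg(K, b)
```

The multi-timescale variants generalize the single dual update here to per-block updates at arbitrary rates, which is what enables controlling communication frequency per agent pair in the distributed setting.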
♻ ☆ Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, it enables the first successful jailbreak of Gemini Diffusion to our knowledge, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.
♻ ☆ Fragility-aware Classification for Understanding Risk and Improving Generalization
Classification models play a central role in data-driven decision-making applications such as medical diagnosis, recommendation systems, and risk assessment. Traditional performance metrics, such as accuracy and AUC, focus on overall error rates but fail to account for the confidence of incorrect predictions, i.e., the risk of confident misjudgments. This limitation is particularly consequential in safety-critical and cost-sensitive settings, where overconfident errors can lead to severe outcomes. To address this issue, we propose the Fragility Index (FI), a novel performance metric that evaluates classifiers from a risk-averse perspective by capturing the tail risk of confident misjudgments. We formulate FI within a robust satisficing (RS) framework to ensure robustness under distributional uncertainty. Building on this, we develop a tractable training framework that directly targets FI via a surrogate loss, and show that models trained under this framework admit provable bounds on FI. We further derive exact reformulations for a broad class of loss functions, including cross-entropy, hinge-type, and Lipschitz losses, and extend the approach to deep neural networks. Empirical results on real-world medical diagnosis tasks demonstrate that FI complements existing metrics by revealing error tail risk and improving decision quality. FI-based models achieve competitive accuracy and AUC while consistently reducing confident misjudgments and associated operational costs, offering a practical tool for improving robustness and reliability in risk-critical applications.
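To make "tail risk of confident misjudgments" concrete, here is a hypothetical toy summary, not the paper's Fragility Index (whose robust-satisficing formulation is not reproduced in the abstract): the average predicted confidence over the most confident fraction of misclassified examples, a CVaR-style view of overconfident errors.

```python
import numpy as np

def confident_error_tail(probs, labels, alpha=0.1):
    """Toy tail-risk summary (NOT the paper's Fragility Index): the mean
    predicted confidence over the alpha-fraction of the most confident
    misclassifications. probs: (n, k) predicted class probabilities;
    labels: (n,) true class indices."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    wrong = preds != labels
    if not wrong.any():
        return 0.0
    err_conf = np.sort(conf[wrong])[::-1]          # most confident errors first
    k = max(1, int(np.ceil(alpha * len(err_conf))))
    return float(err_conf[:k].mean())
```

A classifier can have high accuracy yet a large tail value, which is the kind of risk that accuracy and AUC alone do not surface.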
♻ ☆ A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks
Nonlinear partial differential equations (PDEs) are pivotal in modeling complex physical systems, yet traditional Physics-Informed Neural Networks (PINNs) often struggle with unresolved residuals in critical spatiotemporal regions and violations of temporal causality. To address these limitations, we propose a novel Residual Guided Training strategy for Physics-Informed Transformer via Generative Adversarial Networks (GAN). Our framework integrates a decoder-only Transformer to inherently capture temporal correlations through autoregressive processing, coupled with a residual-aware GAN that dynamically identifies and prioritizes high-residual regions. By introducing a causal penalty term and an adaptive sampling mechanism, the method enforces temporal causality while refining accuracy in problematic domains. Extensive numerical experiments on the Allen-Cahn, Klein-Gordon, and Navier-Stokes equations demonstrate significant improvements, achieving relative MSE reductions of up to three orders of magnitude compared to baseline methods. This work bridges the gap between deep learning and physics-driven modeling, offering a robust solution for multiscale and time-dependent PDE systems.
♻ ☆ TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology ICML 2025
Understanding the biological mechanisms of disease is crucial for medicine, and in particular, for drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving single-cell foundation models. First, we scaled the pre-training data to a diverse collection of 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on several downstream evaluation tasks, including identifying the underlying disease state of held-out donors not seen during training, distinguishing between diseased and healthy cells for disease conditions and donors not seen during training, and probing the learned representations for known biology. Our models showed substantial improvement over existing works, and scaling experiments showed that performance improved predictably with both data volume and parameter count.
comment: ICML 2025 Generative AI and Biology (GenBio) Workshop
♻ ☆ Multigrade Neural Network Approximation
We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer $\texttt{ReLU}$ models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade $\texttt{ReLU}$ scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.
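The grade-by-grade loop can be sketched in a few lines. The snippet below is a simplified illustration, not the paper's method: each "grade" here is a one-hidden-layer ReLU model with random features fit to the current residual by ridge regression (so it runs deterministically), but the freeze-previous-grades, fit-the-residual structure mirrors MGDL.

```python
import numpy as np

def fit_grade(x, r, width=64, lam=1e-6, rng=None):
    """One 'grade': a one-hidden-layer ReLU model with random hidden
    weights, whose output layer is fit to the current residual r by
    ridge regression. (A simplification: MGDL trains each grade, but
    the hierarchical refinement structure is the same.)"""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((width, 1))
    c = rng.standard_normal(width)
    H = np.maximum(x @ W.T + c, 0.0)               # ReLU features
    a = np.linalg.solve(H.T @ H + lam * np.eye(width), H.T @ r)
    return lambda z: np.maximum(z @ W.T + c, 0.0) @ a

# Multigrade loop: freeze earlier grades, fit each new grade to the residual.
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(4 * x[:, 0])
residual, grades, errs = y.copy(), [], []
for g in range(4):
    grade = fit_grade(x, residual, rng=np.random.default_rng(g))
    grades.append(grade)                 # frozen from now on
    residual = residual - grade(x)       # next grade sees only what remains
    errs.append(np.linalg.norm(residual))
```

In this toy run the residual norm shrinks across grades, the qualitative behavior the paper proves for its fixed-width multigrade ReLU scheme.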
♻ ☆ Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
♻ ☆ Causal K-Means Clustering
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: \emph{Causal k-Means Clustering}, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
♻ ☆ Learning to Play Blackjack: A Curriculum Learning Perspective
Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
comment: Accepted as an oral presentation at the International Conference on Distributed Artificial Intelligence (DAI 2025). 16 pages, 7 figures
♻ ☆ A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations
Scientific machine learning is an emerging field that broadly describes the combination of scientific computing and machine learning to address challenges in science and engineering. Within the context of differential equations, this has produced highly influential methods, such as neural ordinary differential equations (NODEs). Recent works extend this line of research to consider neural differential-algebraic systems of equations (DAEs), where some unknown relationships within the DAE are learned from data. Training neural DAEs, similarly to neural ODEs, is computationally expensive, as it requires the solution of a DAE for every parameter update. Further, the rigorous consideration of algebraic constraints is difficult within common deep learning training algorithms such as stochastic gradient descent. In this work, we apply the simultaneous approach to neural DAE problems, resulting in a fully discretized nonlinear optimization problem, which is solved to local optimality and simultaneously obtains the neural network parameters and the solution to the corresponding DAE. We extend recent work demonstrating the simultaneous approach for neural ODEs, by presenting a general framework to solve neural DAEs, with explicit consideration of hybrid models, where some components of the DAE are known, e.g. physics-informed constraints. Furthermore, we present a general strategy for improving the performance and convergence of the nonlinear programming solver, based on solving an auxiliary problem for initialization and approximating Hessian terms. We achieve promising results in terms of accuracy, model generalizability and computational cost, across different problem settings such as sparse data, unobserved states and multiple trajectories. Lastly, we provide several promising future directions to improve the scalability and robustness of our approach.
♻ ☆ Prognostics for Autonomous Deep-Space Habitat Health Management under Multiple Unknown Failure Modes
Deep-space habitats (DSHs) are safety-critical systems that must operate autonomously for long periods, often beyond the reach of ground-based maintenance or expert intervention. Monitoring system health and anticipating failures are therefore essential. Prognostics based on remaining useful life (RUL) prediction support this goal by estimating how long a subsystem can operate before failure. Critical DSH subsystems, including environmental control and life support, power generation, and thermal control, are monitored by many sensors and can degrade through multiple failure modes. These failure modes are often unknown, and informative sensors may vary across modes, making accurate RUL prediction challenging when historical failure data are unlabeled. We propose an unsupervised prognostics framework for RUL prediction that jointly identifies latent failure modes and selects informative sensors using unlabeled run-to-failure data. The framework consists of two phases: an offline phase, where system failure times are modeled using a mixture of Gaussian regressions and an Expectation-Maximization algorithm to cluster degradation trajectories and select mode-specific sensors, and an online phase for real-time diagnosis and RUL prediction using low-dimensional features and a weighted functional regression model. The approach is validated on simulated DSH telemetry data and the NASA C-MAPSS benchmark, demonstrating improved prediction accuracy and interpretability.
comment: Manuscript under review
♻ ☆ CompressedScaffnew: The First Theoretical Double Acceleration of Communication from Local Training and Compression in Distributed Optimization
In distributed optimization, a large number of machines alternate between local computations and communication with a coordinating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce this burden and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose CompressedScaffnew, the first algorithm for distributed optimization that jointly harnesses these two strategies and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
♻ ☆ Tackling Non-IIDness in HAPS-Aided Federated Learning
High-altitude platform stations (HAPS) enable large-scale federated learning (FL) in non-terrestrial networks (NTN) by providing wide-area coverage and predominantly line-of-sight (LoS) connectivity to many ground users. However, practical deployments face heterogeneous and non-independently and identically distributed (non-IID) client data, which degrades accuracy and slows convergence. We propose a weighted attribute-based client selection strategy that leverages server-side indicators: historical traffic behavior, instantaneous channel quality, computational capability, and prior-round learning contribution. At each round, the HAPS computes a composite score and selects the top clients, while adapting attribute weights online based on their correlation with validation-loss improvement. We further provide theoretical justification that traffic-derived uniformity can serve as a proxy for latent data heterogeneity, enabling selection of client subsets with reduced expected non-IIDness. Simulations demonstrate improved test accuracy, faster convergence, and lower training loss compared with random, resource-only, and single-attribute baselines.
comment: Submitted to IEEE for possible publication
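The composite-score selection step described above can be sketched as follows; the attribute names, normalization to [0, 1], and weight values are our assumptions, and the paper's online weight adaptation via correlation with validation-loss improvement is omitted.

```python
# Hypothetical sketch of weighted attribute-based client selection.
# Attribute names and values are illustrative, not the paper's exact scheme.

def composite_scores(clients, weights):
    """clients: list of dicts with per-attribute values normalized to [0, 1];
    weights: dict mapping the same attribute names to nonnegative weights."""
    return [sum(weights[a] * c[a] for a in weights) for c in clients]

def select_top_k(clients, weights, k):
    """Rank clients by composite score and return the indices of the top k."""
    scores = composite_scores(clients, weights)
    ranked = sorted(range(len(clients)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

clients = [
    {"traffic_uniformity": 0.9, "channel": 0.8, "compute": 0.7, "contribution": 0.6},
    {"traffic_uniformity": 0.2, "channel": 0.9, "compute": 0.9, "contribution": 0.1},
    {"traffic_uniformity": 0.8, "channel": 0.5, "compute": 0.6, "contribution": 0.9},
]
weights = {"traffic_uniformity": 0.4, "channel": 0.2, "compute": 0.2, "contribution": 0.2}
chosen = select_top_k(clients, weights, k=2)
```

Note how the weight on traffic uniformity lets the server prefer clients whose data is likely less non-IID, consistent with the paper's proxy argument.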
♻ ☆ Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
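The z-score-based deviation diagnosis can be sketched as follows; the feature names, the athlete's history values, and the |z| > 2 threshold are illustrative assumptions, not the paper's 18 features or its hierarchical logic.

```python
import statistics

def diagnose(features, history, threshold=2.0):
    """Flag kinematic features whose current value deviates from the
    athlete's OWN historical distribution by more than `threshold`
    standard deviations (z-score). This mirrors the paper's shift from
    'deviation from a uniform standard' to 'deviation from an
    individual's optimal control range'."""
    flags = {}
    for name, value in features.items():
        mu = statistics.mean(history[name])
        sd = statistics.stdev(history[name])
        z = (value - mu) / sd if sd > 0 else 0.0
        if abs(z) > threshold:
            flags[name] = round(z, 2)   # signed z-score for the recommendation
    return flags
```

Flagged features with their signed z-scores would then feed the recommendation layer (e.g., "elbow displacement abnormally high on this throw").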
♻ ☆ Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR's consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR's superior performance over baseline methods, achieving 7.6\%-29.3\% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
comment: In systems with causal effects, the strategic equilibrium solver mistakenly classifies a large majority of individuals as following a particular strategy, so this feature enters the causal inference as an independent variable without specificity. The method may therefore carry an inherent error
♻ ☆ DeDelayed: Deleting Remote Inference Delay via On-Device Correction CVPR 2026
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at https://github.com/InterDigitalInc/dedelayed .
comment: CVPR 2026
♻ ☆ SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of $14,116$ dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate the two representative generative paradigms: StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs) under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID ($\approx 65.5$) and KID ($\approx 0.05$), while diffusion models generated higher variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with $8$-$15$\% absolute gains in melanoma F1-score, and ViT-B/16 achieving F1 $\approx 0.88$ and ROC-AUC $\approx 0.98$, representing an improvement of approximately $14\%$ over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
♻ ☆ Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints
The ROC curve is widely used to assess binary classifiers. Yet for some applications, such as alert systems for monitoring hospitalized patients, conventional ROC analysis cannot meet two key deployment needs: enforcing a constraint on precision to avoid false alarm fatigue and imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet precision and capacity constraints occupy a feasible region in ROC space. We establish the polygon-shaped geometry of this region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible region. Averaging this area over a desired distribution for cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk from vital sign history on several datasets, we show this cost-aware metric can outperform alternatives at ranking classifiers for in-hospital alerts.
comment: In Proceedings of the International Conference of Artificial Intelligence and Statistics (AISTATS), 2026
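The two deployment constraints above translate into a feasibility check on ROC operating points via the standard identities precision $= \pi\,\mathrm{TPR} / (\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR})$ and alert rate $= \pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}$, where $\pi$ is the prevalence. A minimal sketch (variable names and the handling of the zero-alert corner are ours; the paper characterizes the full polygon-shaped region):

```python
def feasible(tpr, fpr, prevalence, min_precision, max_alert_rate):
    """Check whether an ROC operating point (FPR, TPR) satisfies both
    deployment constraints: a floor on precision (to limit alarm fatigue)
    and a cap on the fraction of cases flagged (staff capacity)."""
    pi = prevalence
    alert_rate = pi * tpr + (1 - pi) * fpr
    if alert_rate == 0:
        # No alerts: capacity trivially satisfied; precision is vacuous.
        # (We treat this corner as feasible; a convention, not the paper's.)
        return True
    precision = pi * tpr / alert_rate
    return precision >= min_precision and alert_rate <= max_alert_rate
```

Sweeping this check over ROC space carves out the feasible polygon whose cost-weighted area the partial VOROS then integrates.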
♻ ☆ Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque ``black boxes'', hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality -- favoring the simplest causal explanation -- can endow the latent representations of modern generative models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions, we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading text-to-image diffusion models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.
♻ ☆ Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks
Optimal AP clustering and power allocation are critical in user-centric cell-free massive MIMO systems. Existing deep learning models lack the flexibility to handle dynamic network configurations. Furthermore, many approaches overlook pilot contamination and suffer from high computational complexity. In this paper, we propose a lightweight transformer model that overcomes these limitations by jointly predicting AP clusters and powers solely from the spatial coordinates of user devices and APs. Our model is agnostic to user load, handles both clustering and power allocation without channel estimation overhead, and eliminates pilot contamination by assigning users to APs within a pilot reuse constraint. We also incorporate a customized linear attention mechanism to capture user-AP interactions efficiently and enable linear scalability with respect to the number of users. Numerical results confirm the model's effectiveness in maximizing the minimum spectral efficiency and providing near-optimal performance while ensuring adaptability and scalability in dynamic scenarios.
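The abstract does not detail the customized mechanism, but the generic kernelized linear attention it builds on can be sketched as follows; the feature map phi here is an illustrative choice, and the test verifies that the $O(n d^2)$ reordering matches the naive $O(n^2 d)$ computation.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Generic kernelized linear attention: with a positive feature map
    phi, softmax attention is replaced by
        out_i = phi(q_i) @ (sum_j phi(k_j) v_j^T) / (phi(q_i) @ sum_j phi(k_j)).
    Because the two sums over j are computed once, the cost is O(n d^2)
    rather than O(n^2 d), which is what enables linear scaling in the
    number of users. The paper's customized user-AP variant is not shown."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d_v) summary, shared by all queries
    z = Kp.sum(axis=0)                  # (d,) normalizer summary
    return (Qp @ kv) / (Qp @ z)[:, None]
```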
♻ ☆ HEAS: Hierarchical Evolutionary Agent-Based Simulation Framework for Multi-Objective Policy Search
Metric aggregation divergence is a hidden confound in agent-based model policy search: when optimization, tournament evaluation, and statistical validation independently implement outcome metric extraction, champion selection reflects aggregation artifact rather than policy quality. We propose Hierarchical Evolutionary Agent Simulation (HEAS), a composable framework that eliminates this confound through a runtime-enforceable metric contract - a uniform metrics_episode() callable shared identically by all pipeline stages. Removing the confound yields robust champion selection: in a controlled experiment (n=30), HEAS reduces rank reversals by 50% relative to ad-hoc aggregation; the HEAS champion wins all 32 held-out ecological scenarios - a null-safety result that would be uninterpretable under aggregation divergence. The contract additionally reduces coupling code by 97% (160 to 5 lines) relative to Mesa 3.3.1. Three case studies validate composability across ecological, enterprise, and mean-field ordinary differential equation dynamics.
comment: 12 pages, 1 figure. Python package: https://pypi.org/project/heas/ | Web playground: https://ryzhanghason.github.io/heas/
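The metric contract can be sketched minimally: one metrics_episode() callable (the name comes from the paper; everything else below is illustrative) defined once and invoked identically by the optimization, tournament-evaluation, and validation stages, so no stage can drift into its own aggregation.

```python
# Minimal sketch of the "metric contract" idea. The function body and
# field names are our assumptions; only the single-shared-callable
# contract comes from the paper.

def metrics_episode(episode):
    """Single source of truth for outcome-metric extraction."""
    return {
        "mean_reward": sum(episode["rewards"]) / len(episode["rewards"]),
        "survived": episode["steps"] >= episode["horizon"],
    }

def optimize_fitness(episode):       # evolutionary-search stage
    return metrics_episode(episode)["mean_reward"]

def tournament_score(episode):       # champion-selection stage
    return metrics_episode(episode)["mean_reward"]

def validate(episode):               # statistical-validation stage
    return metrics_episode(episode)

episode = {"rewards": [1.0, 0.0, 2.0], "steps": 100, "horizon": 100}
```

Because every stage routes through the same callable, a champion selected by the tournament is scored by exactly the aggregation the validator will audit, which is what eliminates the rank-reversal confound.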
♻ ☆ GradPower: Powering Gradients for Faster Language Model Pre-Training
We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
comment: 22 pages. A revised version is in preparation
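The transformation itself is given explicitly in the abstract and really is a one-liner; the sketch below (our code, not the paper's) applies it to a plain Python gradient vector before it would be handed to the base optimizer.

```python
import math

def gradpower(grad, p=0.5):
    """Elementwise sign-power transform phi_p(g)_i = sign(g_i) * |g_i|^p,
    applied to the gradient before the base optimizer consumes it.
    With Adam as the base optimizer this yields AdamPower; p = 1 recovers
    the untransformed gradient."""
    return [math.copysign(abs(g) ** p, g) for g in grad]
```

In a framework like PyTorch the same idea would be a single line over the gradient tensor (e.g., `g.sign() * g.abs().pow(p)`), leaving the optimizer's internal logic and hyperparameters untouched, as the paper emphasizes.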
♻ ☆ Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/FreedomIntelligence/MyPhoneBench.
comment: work in progress
♻ ☆ NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
comment: 13 pages, 8 figures, 7 tables
♻ ☆ Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video-language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
comment: Demo video url: https://jn00767.pages.surrey.ac.uk/catalogue-grounded-multimodal-attribution-for-museum-video/
♻ ☆ Coarsening Causal DAG Models
Directed acyclic graphical (DAG) models are a powerful tool for representing causal relationships among jointly distributed random variables, especially concerning data from across different experimental settings. However, it is not always practical or desirable to estimate a causal model at the granularity of given features in a particular dataset. There is a growing body of research on causal abstraction to address such problems. We contribute to this line of research by (i) providing novel graphical identifiability results for practically-relevant interventional settings, (ii) proposing an efficient, provably consistent algorithm for directly learning abstract causal graphs from interventional data with unknown intervention targets, and (iii) uncovering theoretical insights about the lattice structure of the underlying search space, with connections to the field of causal discovery more generally. As proof of concept, we apply our algorithm on synthetic and real datasets with known ground truths, including measurements from a controlled physical system with interacting light intensity and polarization.
comment: 27 pages, 5 figures; accepted to the 5th conference on Causal Learning and Reasoning (CLeaR)
♻ ☆ Resource-Efficient Variational Quantum Classifier
We introduce the unambiguous quantum classifier based on Hamming distance measurements combined with classical post-processing. The proposed approach improves classification performance through a more effective use of ansatz expressivity, while requiring significantly fewer circuit evaluations. Moreover, the method demonstrates enhanced robustness to noise, which is crucial for near-term quantum devices. We evaluate the proposed method on a breast cancer classification dataset. The unambiguous classifier achieves an average accuracy of 90%, corresponding to an improvement of 6.9 percentage points over the baseline, while requiring eight times fewer circuit executions per prediction. In the presence of noise, the improvement is reduced to approximately 3.1 percentage points, with the same reduction in execution cost. We substantiate our experimental results with theoretical evidence supporting the practical performance of the approach.
comment: 13 pages, 7 figures, 1 table; typos corrected, new references added, modification of model M3, new result box plots for all models, theoretical results adjusted, abstract and conclusion modified
♻ ☆ Adaptive Coverage Policies in Conformal Prediction
Traditional conformal prediction methods construct prediction sets such that the true label falls within the set with a user-specified coverage level. However, poorly chosen coverage levels can result in uninformative predictions, either producing overly conservative sets when the coverage level is too high, or empty sets when it is too low. Moreover, the fixed coverage level cannot adapt to the specific characteristics of each individual example, limiting the flexibility and efficiency of these methods. In this work, we leverage recent advances in e-values and post-hoc conformal inference, which allow the use of data-dependent coverage levels while maintaining valid statistical guarantees. We propose to optimize an adaptive coverage policy by training a neural network using a leave-one-out procedure on the calibration set, allowing the coverage level and the resulting prediction set size to vary with the difficulty of each individual example. We support our approach with theoretical coverage guarantees and demonstrate its practical benefits through a series of experiments.
comment: Code at: https://github.com/GauthierE/adaptive-coverage-policies
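For context, the fixed-coverage baseline that this work adapts is standard split conformal prediction: calibrate a nonconformity threshold at a user-specified level, then include every label within it. A minimal sketch with illustrative scores and the usual finite-sample quantile correction (the function names are ours, not the paper's code):

```python
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration
    nonconformity scores, as in standard split conformal prediction."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")


def prediction_set(probs, threshold):
    """Keep every label whose nonconformity score (1 - probability)
    falls within the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1 - p <= threshold]


# 19 toy calibration scores, fixed target coverage of 90%.
cal_scores = np.arange(1, 20) / 40.0
tau = conformal_threshold(cal_scores, alpha=0.1)
print(tau, prediction_set([0.7, 0.2, 0.1], tau))
```

The paper's contribution is to let `alpha` vary per example via a learned policy (using e-values to retain validity), rather than fixing it globally as above.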
♻ ☆ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
comment: 10 pages, 6 figures
♻ ☆ Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
♻ ☆ Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. This framework enhances the node representation by emphasizing cohesive structure in long-range information but removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.
♻ ☆ Fair Representation in Parliamentary Summaries: Measuring and Mitigating Inclusion Bias
The use of large language models (LLMs) to summarise parliamentary proceedings presents a promising means of increasing the accessibility of democratic participation. However, as these systems increasingly mediate access to political information -- filtering and framing content before it reaches users -- there are important fairness considerations to address. In this work, we evaluate 5 LLMs (both proprietary and open-weight) in the summarisation of plenary debates from the European Parliament to investigate the representational biases that emerge in this context. We develop an attribution-aware evaluation framework to measure speaker-level inclusion and mis-representation in debate summaries. Across all models and experiments, we find that speakers are less accurately represented in the final summary on the basis of (i) their speaking-order (speeches in the middle of the debate were systematically excluded), (ii) language spoken (non-English speakers were less faithfully represented), and (iii) political affiliations (better outcomes for left-of-centre parties). We further show how biases in these contexts can be decomposed to distinguish inclusion bias (systematic omission) from hallucination bias (systematic misrepresentation), and explore the effect of different mitigation strategies. Prompting strategies do not affect these biases. Instead, we propose a hierarchical summarisation method that decomposes the task into simpler extraction and aggregation steps, which we show significantly improves the positional/speaking-order bias across all models. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.
comment: Extended journal version of "Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation" (arXiv:2507.14221), which appeared at the AIDEM Workshop, ECML-PKDD 2025. This version extends the original with cross-lingual bias analysis, a two-level hierarchical summarisation method, and human annotation validation of the evaluation framework
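The speaker-level inclusion measurement at the heart of such a framework can be illustrated with a toy attribution check; the function and data below are hypothetical, not the paper's actual evaluation code.

```python
def inclusion_rate(summary_speakers, debate_speakers):
    """Fraction of debate speakers attributed in the summary.
    Systematically lower rates for a subgroup (e.g. mid-debate or
    non-English speakers) indicate inclusion bias."""
    return sum(s in summary_speakers for s in debate_speakers) / len(debate_speakers)


# Hypothetical debate where mid-debate speakers C and D are dropped,
# consistent with the positional bias described above.
debate = ["A", "B", "C", "D", "E"]
summary = {"A", "B", "E"}
print(inclusion_rate(summary, debate))
```

Comparing this rate across speaker subgroups (by position, language, or party) is what separates inclusion bias from hallucination bias, which instead concerns how included speakers are represented.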
♻ ☆ Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
In multi-source transfer learning, a key challenge lies in how to appropriately differentiate and utilize heterogeneous source tasks. However, existing multi-source methods typically focus on optimizing either the source weights or the amount of transferred samples, largely neglecting their joint consideration. In this work, we propose a theoretical framework, Unified Optimization of Weights and Quantities (UOWQ), that jointly determines the optimal source weights and transfer quantities for each source task. Specifically, the framework formulates multi-source transfer learning as a parameter estimation problem based on an asymptotic analysis of a Kullback--Leibler divergence--based generalization error measure, leading to two main theoretical findings: 1) using all available source samples is always optimal when the weights are properly adjusted; 2) the optimal source weights are characterized by a principled optimization problem whose structure explicitly incorporates the Fisher information, parameter discrepancy, parameter dimensionality, and transfer quantities. Building on the theoretical results, we further propose a practical algorithm for multi-source transfer learning, and extend it to multi-task learning settings where each task simultaneously serves as both a source and a target. Extensive experiments on real-world benchmarks, including DomainNet and Office-Home, demonstrate that UOWQ consistently outperforms strong baselines. The results validate both the theoretical predictions and the practical effectiveness of our framework.
♻ ☆ Where You Place the Norm Matters: From Prejudiced to Neutral Initializations
Normalization layers were introduced to stabilize and accelerate training, yet their influence is critical already at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, despite the fact that these decisions can qualitatively alter a model's behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our results offer principled guidance for more controlled and interpretable network design, including clarifying how widely used choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm shape behavior from the outset of training.
comment: Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s)
♻ ☆ Think Anywhere in Code Generation
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as problems' full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process, where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
♻ ☆ Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
comment: Accepted in 8th Annual Learning for Dynamics & Control Conference (L4DC)
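The general mechanism, smoothly attenuating the agent's actions as the predicted constraint violation grows rather than overriding them, can be sketched as follows. The exponential attenuation and parameter names are chosen for illustration; the paper's regulator is learned, not this fixed function.

```python
import math


def scale_action(action, predicted_cost, sensitivity=2.0):
    """Smoothly attenuate an action based on a predicted constraint
    violation: scale -> 1 as cost -> 0 (policy untouched), scale -> 0
    for large cost (action suppressed, but never hard-overridden)."""
    scale = math.exp(-sensitivity * max(predicted_cost, 0.0))
    return [a * scale for a in action]


print(scale_action([1.0, -0.5], predicted_cost=0.0))  # safe: untouched
print(scale_action([1.0, -0.5], predicted_cost=1.0))  # risky: attenuated
```

Because the modulation is smooth and multiplicative, the underlying policy's action direction (and hence its exploration) is preserved, in contrast to safety filters that replace actions outright.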
♻ ☆ Virasoro Symmetry in Neural Network Field Theories
Neural Network Field Theories (NN-FTs) typically describe Generalized Free Fields that lack a local stress-energy tensor in two dimensions, obstructing the realization of Virasoro symmetry. We present the ``Log-Kernel'' (LK) architecture, which enforces local conformal symmetry via a specific rotation-invariant spectral prior $p(k) \propto |k|^{-2}$. We analytically derive the emergence of the Virasoro algebra from the statistics of the neural ensemble. We validate this construction through numerical simulation, computing the central charge $c_{exp} = 0.9958 \pm 0.0196$ (theoretical $c=1$) and confirming the scaling dimensions of vertex operators. Furthermore, we demonstrate that finite-width corrections generate interactions scaling as $1/N$. Finally, we extend the framework to include fermions and boundary conditions, realizing the super-Virasoro algebra. We verify the $\mathcal{N}=1$ super-Virasoro algebra by measuring the supercurrent correlator to $96\%$ accuracy. We further demonstrate conformal boundary conditions on the upper half-plane, achieving 99\% agreement for boundary fermion and boson propagators.
comment: 1+23 pages, 6 figures;
♻ ☆ Output Embedding Centering for Stable LLM Pretraining
Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training is output logit divergence. The most widely used mitigation strategies, z-loss and logit soft-capping, merely address the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify anisotropic embeddings as its source. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and demonstrate that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called $μ$-centering, or a regularization method called $μ$-loss. Our experiments show that both variants outperform z-loss in terms of training stability, while being on par with logit soft-capping. This holds true both in the presence and the absence of weight tying. As a secondary result, we find that $μ$-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
comment: Additional experiments using logit soft-capping & weight tying
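As a standalone operation on an output-embedding matrix, the $μ$-centering variant can be sketched in a few lines. This is a deterministic illustration of the centering step only; the paper's integration into the training loop is not reproduced here.

```python
import numpy as np


def mu_center(W):
    """Subtract the mean output-embedding vector from every row.
    Since this shifts every logit x @ w_k by the same constant x @ mu,
    softmax probabilities are unchanged, while the embedding matrix
    becomes zero-mean (removing the anisotropic mean direction)."""
    return W - W.mean(axis=0, keepdims=True)


W = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 classes, dim 2
Wc = mu_center(W)
print(Wc.mean(axis=0))
```

The softmax invariance is what makes the operation safe to apply: it constrains the geometry that drives logit divergence without altering the model's predictive distribution.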
♻ ☆ Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences
We study adversarial learning when the target distribution factorizes according to a known Bayesian network. For interpolative divergences, including $(f,Γ)$-divergences, we prove a new infimal subadditivity principle showing that, under suitable conditions, a global variational discrepancy is controlled by an average of family-level discrepancies aligned with the graph. In an additive regime, the surrogate is exact. This closes a theoretical gap in the literature; existing subadditivity results justify graph-informed adversarial learning for classical discrepancies, but not for interpolative divergences, where the usual factorization argument breaks down. In turn, we provide a justification for replacing a standard, graph-agnostic GAN with a monolithic discriminator by a graph-informed GAN (GiGAN) with localized family-level discriminators, without requiring the optimizer itself to factorize according to the graph. We also obtain parallel results for integral probability metrics and proximal optimal transport divergences, identify natural discriminator classes for which the theory applies, and present experiments showing improved stability and structural recovery relative to graph-agnostic baselines.
comment: 34 pages, 9 figures
♻ ☆ Reinforcement Learning-based Task Offloading in the Internet of Wearable Things
Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem from the limited battery power and insufficient computation resources available on wearable devices. On the other hand, with the popularity of smart wearables, there is a consistent increase in the development of new computationally intensive and latency-critical applications. In such a context, task offloading allows wearables to leverage the resources available on nearby edge devices to enhance the overall user experience. This paper proposes a framework for Reinforcement Learning (RL)-based task offloading in the IoWT. We formulate the task offloading process considering the tradeoff between energy consumption and task accomplishment time. Moreover, we model the task offloading problem as a Markov Decision Process (MDP) and utilize the Q-learning technique to enable the wearable device to make optimal task offloading decisions without prior knowledge. We evaluate the performance of the proposed framework through extensive simulations for various applications and system configurations conducted in the ns-3 network simulator. We also show how varying the main system parameters of the Q-learning algorithm affects the overall performance in terms of average task accomplishment time, average energy consumption, and percentage of tasks offloaded.
comment: Withdrawn by the authors. A revised version is under preparation
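The tabular Q-learning core of such an offloading MDP can be sketched with hypothetical states and rewards. All names and numbers below are illustrative, not the paper's actual state space or energy/latency model.

```python
def q_learning_step(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])


# Hypothetical states (task size class) and actions (where to run the task).
states, actions = ["small", "large"], ["local", "offload"]
Q = {s: {a: 0.0 for a in actions} for s in states}

# Illustrative energy/latency tradeoff: offloading pays off only for large tasks.
reward = {("small", "local"): 1.0, ("small", "offload"): -0.5,
          ("large", "local"): -1.0, ("large", "offload"): 1.0}

# Deterministic exploration schedule, enough sweeps to converge the table.
for i in range(2000):
    s, a = states[i % 2], actions[(i // 2) % 2]
    q_learning_step(Q, s, a, reward[(s, a)], states[(i // 3) % 2])

policy = {s: max(Q[s], key=Q[s].get) for s in states}
print(policy)
```

The learned greedy policy reflects the reward tradeoff without any prior system model, which is the property the framework relies on for wearables with unknown edge conditions.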
♻ ☆ Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
♻ ☆ LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
♻ ☆ Computationally efficient Gauss-Newton reinforcement learning for model predictive control
Model predictive control (MPC) is widely used in process control due to its interpretability and ability to handle constraints. As a parametric policy in reinforcement learning (RL), MPC offers strong initial performance and low data requirements compared to black-box policies like neural networks. However, most RL methods rely on first-order updates, which scale well to large parameter spaces but converge at most linearly, making them inefficient when each policy update requires solving an optimal control problem, as is the case with MPC. While MPC policies typically have low-dimensional parameterizations and are thus amenable to second-order approaches, existing second-order methods demand second-order policy derivatives, which can be computationally intractable. This work introduces a Gauss-Newton approximation of the deterministic policy Hessian that eliminates the need for second-order policy derivatives, enabling superlinear convergence with minimal computational overhead. To further improve robustness, we propose a momentum-based Hessian averaging scheme for stable training under noisy estimates, coupled with an adaptive trust region. We demonstrate the effectiveness of the approach on a nonlinear continuously stirred tank reactor (CSTR), showing faster convergence and improved data efficiency over state-of-the-art first-order methods and deep RL approaches.
comment: 17 pages, 9 figures, submitted to Elsevier in the special issue "Reinforcement Learning and Its Applications to Process Systems Engineering Problems" in the journal "Computers and Chemical Engineering"
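The key approximation, replacing the Hessian of a least-squares objective $J(\theta) = \tfrac{1}{2}\|r(\theta)\|^2$ by $J_r^\top J_r$ so that no second-order derivatives are needed, can be illustrated on a toy linear fit. This is a generic Gauss-Newton sketch, not the paper's MPC-specific policy-gradient estimator.

```python
import numpy as np


def gauss_newton_step(theta, residual_fn, jac_fn, damping=1e-8):
    """One Gauss-Newton step for J(theta) = 0.5 * ||r(theta)||^2.
    The Hessian is approximated by Jr^T Jr, dropping the second-order
    residual terms, so only first derivatives of r are required."""
    r = residual_fn(theta)
    Jr = jac_fn(theta)
    H = Jr.T @ Jr + damping * np.eye(len(theta))  # damped for invertibility
    g = Jr.T @ r
    return theta - np.linalg.solve(H, g)


# Toy problem: fit y = a*x + b. The residuals are linear in the parameters,
# so Gauss-Newton coincides with Newton and solves it in one step.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
residual = lambda th: th[0] * x + th[1] - y
jacobian = lambda th: np.stack([x, np.ones_like(x)], axis=1)

theta = gauss_newton_step(np.zeros(2), residual, jacobian)
print(theta)  # close to [2., 1.]
```

For nonlinear residuals the dropped curvature terms make the step approximate, but the superlinear local convergence of this scheme is what motivates using it over first-order updates when each policy evaluation is expensive.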
♻ ☆ Efficient Reasoning with Balanced Thinking ICLR 2026
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
comment: Accepted by ICLR 2026
♻ ☆ Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.
♻ ☆ Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection
Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.
comment: The authors have identified limitations in the experimental evaluation that leave it insufficient to fully support the paper's conclusions. The manuscript has been withdrawn pending additional experiments and analysis
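Reconstruction-error anomaly scoring is the mechanism underlying autoencoder-based intrusion detection of the kind evaluated above. As a hedged illustration, the sketch below uses a linear (PCA-equivalent) autoencoder as a stand-in for the hybrid quantum-classical architectures, on synthetic data; none of the names or data correspond to the paper's actual models or datasets:

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a linear autoencoder (equivalent to PCA): the encoder/decoder is
    given by the top-k principal directions of the training data."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                  # mean and k x d encoder matrix

def anomaly_scores(X, mu, W):
    """Reconstruction error: small on the training manifold, large off it."""
    Z = (X - mu) @ W.T                 # encode
    Xh = Z @ W + mu                    # decode
    return np.linalg.norm(X - Xh, axis=1)

rng = np.random.default_rng(4)
# "Benign" traffic concentrated near a 1-D subspace; "attack" traffic is not.
normal = rng.standard_normal((500, 2)) @ np.array([[1.0, 0.5], [0.0, 0.05]])
mu, W = fit_linear_autoencoder(normal, k=1)
benign = anomaly_scores(normal, mu, W)
attack = anomaly_scores(rng.standard_normal((100, 2)) * 3.0, mu, W)
print(np.median(attack) > np.median(benign))   # attacks score higher
```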
♻ ☆ LEXam: Benchmarking Legal Reasoning on 340 Law Exams ICLR 2026
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 7,537 law exam questions in English and German. It includes both long-form, open-ended questions and multiple-choice questions with varying numbers of options. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, models notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/.
comment: Accepted to ICLR 2026
♻ ☆ Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the $(1+(λ,λ))$-GA
Dynamic Algorithm Configuration (DAC) studies the efficient identification of control policies for parameterized optimization algorithms. Numerous studies leverage Reinforcement Learning (RL) to address DAC challenges; however, applying RL often requires extensive domain expertise. In this work, we conduct a comprehensive study of two deep-RL algorithms--Double Deep Q-Networks (DDQN) and Proximal Policy Optimization (PPO)--for controlling the population size of the $(1+(λ,λ))$-GA on OneMax instances. Although OneMax is structurally simple, learning effective control policies for the $(1+(λ,λ))$-GA induces a highly challenging DAC landscape, making it a controlled yet demanding benchmark. Our investigation reveals two fundamental challenges limiting DDQN and PPO: scalability degradation and learning instability, traced to under-exploration and planning horizon coverage. To address under-exploration, we introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN exploration. This eliminates instance-specific hyperparameter tuning and ensures consistent effectiveness across problem scales. To resolve planning horizon coverage, we demonstrate that undiscounted learning succeeds in DDQN, while PPO faces fundamental variance issues necessitating alternative designs. We further show that while hyperparameter optimization enhances PPO's stability, it consistently fails to identify effective policies. Finally, DDQN with adaptive reward shifting achieves performance comparable to theoretically derived policies with vastly improved sample efficiency, outperforming prior DAC approaches by orders of magnitude. Our findings provide insights into the fundamental obstacles faced by standard deep-RL approaches in this challenging DAC setting and highlight the key methodological ingredients required for effective learning.
comment: arXiv admin note: text overlap with arXiv:2502.20265
♻ ☆ StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold NeurIPS 2025
Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $U\!SV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.
comment: NeurIPS 2025 Spotlight
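The Stiefel-manifold constraint above is typically enforced with a project-then-retract pattern. The following numpy sketch, which is illustrative and not the paper's implementation (StelLA converts arbitrary Euclidean optimizers such as Adam to Riemannian ones, whereas this uses plain gradient descent), shows one Riemannian step with a QR retraction:

```python
import numpy as np

def stiefel_retract(X):
    """Retract onto the Stiefel manifold via a QR decomposition; the Q
    factor has orthonormal columns (column signs fixed for uniqueness)."""
    Q, R = np.linalg.qr(X)
    signs = np.sign(np.sign(np.diag(R)) + 0.5)  # maps 0 -> +1
    return Q * signs

def project_tangent(X, G):
    """Project a Euclidean gradient G onto the tangent space of the
    Stiefel manifold at X (embedded metric): G - X sym(X^T G)."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2

def riemannian_step(X, G, lr=0.1):
    """One Riemannian gradient step: project, move along the tangent
    direction, then retract back onto the manifold."""
    return stiefel_retract(X - lr * project_tangent(X, G))

rng = np.random.default_rng(0)
U = stiefel_retract(rng.standard_normal((8, 2)))   # random Stiefel point
G = rng.standard_normal((8, 2))                    # mock Euclidean gradient
U_new = riemannian_step(U, G)
print(np.allclose(U_new.T @ U_new, np.eye(2)))     # orthonormality preserved
```

The same pattern applies to both the $U$ and $V$ factors of the adapter while $S$ remains unconstrained.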
♻ ☆ Partial Feedback Online Learning
We study a new learning protocol, termed partial-feedback online learning, where each instance admits a set of acceptable labels, but the learner observes only one acceptable label per round. We highlight that, while classical version space is widely used for online learnability, it does not directly extend to this setting. We address this obstacle by introducing a collection version space, which maintains sets of hypotheses rather than individual hypotheses. Using this tool, we obtain a tight characterization of learnability in the set-realizable regime. In particular, we define the Partial-Feedback Littlestone dimension (PFLdim) and the Partial-Feedback Measure Shattering dimension (PMSdim), and show that they tightly characterize the minimax regret for deterministic and randomized learners, respectively. We further identify a nested inclusion condition under which deterministic and randomized learnability coincide, resolving an open question of Raman et al. (2024b). Finally, we show that beyond set realizability, the minimax regret can be linear even for a hypothesis space H with |H|=2, highlighting a fundamental barrier outside the set-realizable regime.
comment: 40 pages. Fixed some typos in the proof and improved readability
♻ ☆ A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition
Solving (Stampacchia) variational inequalities (SVIs) is a foundational problem at the heart of optimization, and the formulation is highly expressive. However, this expressivity comes at the cost of computational hardness. As a result, most research has focused on carving out specific subclasses that elude those intractability barriers. A classical property that goes back to the 1960s is the Minty condition, which postulates that the Minty VI (MVI) problem admits a solution. In this paper, we establish the first polynomial-time algorithm -- with complexity growing polynomially in the dimension $d$ and $\log(1/ε)$ -- for solving $ε$-SVIs for Lipschitz continuous mappings under the Minty condition. Prior approaches either incurred an exponentially worse dependence on $1/ε$ (and other natural parameters of the problem) or made more restrictive assumptions, such as monotonicity. To do so, we introduce a new variant of the ellipsoid algorithm whereby separating hyperplanes are obtained after taking a descent step from the center of the ellipsoid. It succeeds even though the set of SVI solutions can be nonconvex and not full-dimensional. Moreover, when our algorithm is applied to an instance with no MVI solution and fails to identify an SVI solution, it produces a succinct certificate of MVI infeasibility. We also show that deciding whether the Minty condition holds is $\mathsf{coNP}$-complete, thereby establishing that the disjunction of those two problems is polynomial-time solvable even though each problem is individually intractable. We provide several extensions and new applications of our main results. Most notably, we obtain the first polynomial-time algorithms for computing Nash equilibria in multi-player harmonic games. Finally, in two-player general-sum concave games, we give the first polynomial-time algorithm that outputs either a Nash equilibrium or a strict coarse correlated equilibrium.
comment: V3 polishes the writing and makes a correction to Theorem 5.2
♻ ☆ MemRerank: Preference Memory for Personalized Product Reranking
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
comment: correct author name in metadata
♻ ☆ WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching, which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and extending to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time. The Python code is available at https://github.com/QiangweiPeng/WFR-FM.
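The joint regression described above, a velocity field for displacement plus a scalar growth rate for mass change, can be illustrated with a deliberately toy example. The sketch below uses constant-parameter models and straight-line conditional paths in 1-D; the paper's method trains neural networks conditioned on time and state under the WFR geometry, which this does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D snapshot pair with total masses w0 -> w1 (mass grows between snapshots).
x0 = rng.normal(0.0, 1.0, size=256)
x1 = rng.normal(2.0, 0.5, size=256)
w0, w1 = 1.0, 1.5

def wfr_fm_loss(theta_v, theta_g):
    """Joint regression loss: a velocity target for displacement and a
    scalar growth-rate target for birth-death, along straight-line paths."""
    t = np.linspace(0.0, 1.0, x0.size)
    xt = (1 - t) * x0 + t * x1          # a real model would condition on (t, xt)
    v_target = x1 - x0                  # conditional velocity along the path
    g_target = np.log(w1 / w0)          # constant mass-growth rate
    v_pred = theta_v * np.ones_like(xt) # toy constant-in-space models
    g_pred = theta_g * np.ones_like(xt)
    return np.mean((v_pred - v_target) ** 2) + np.mean((g_pred - g_target) ** 2)

# The quadratic loss is minimized by the mean targets:
best_v, best_g = np.mean(x1 - x0), np.log(w1 / w0)
print(wfr_fm_loss(best_v, best_g) < wfr_fm_loss(0.0, 0.0))
```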
Optimization and Control 54
☆ A weak transport approach to the Schrödinger-Bass bridge
We study the Schrödinger-Bass problem, a one-parameter family of semimartingale optimal transport problems indexed by $β>0$, whose limiting regimes interpolate between the classical Schrödinger bridge, the Brenier-Strassen problem, and, after rescaling, the martingale Benamou-Brenier (Bass) problem. Our first main result is a static formulation. For each $β>0$, we prove that the dynamic Schrödinger-Bass problem is equivalent to a static weak optimal transport (WOT) problem with explicit cost $C_{\mathrm{SB}}^β$. This yields primal and dual attainment, as well as a structural characterization of the optimal semimartingales, through the general WOT framework. The cost $C_{\mathrm{SB}}^β$ is constructed via an infimal convolution and deconvolution of the Schrödinger cost with the Wasserstein distance. In a broader setting, we show that such infimal convolutions preserve the WOT structure and inherit continuity, coercivity, and stability of both values and optimizers with respect to the marginals. Building on this formulation, we propose a Sinkhorn-type algorithm for numerical computation. We establish monotone improvement of the dual objective and, under suitable integrability assumptions on the marginals, convergence of the iteration to the unique optimizer. We then study the asymptotic regimes $β\uparrow\infty$ and $β\downarrow0$. We prove that the costs $C_{\mathrm{SB}}^β$ converge pointwise to the Schrödinger cost and, after natural rescaling, to the Brenier-Strassen and Bass costs. The associated values and optimal solutions are shown to converge to those of the corresponding limiting problems.
☆ Flexibility allocation in random bipartite matching markets: exact matching rates and dominance regimes
This paper studies how a fixed flexibility budget should be allocated across the two sides of a balanced bipartite matching market. We model compatibilities via a sparse bipartite stochastic block model in which flexible agents are more likely to connect with agents on the opposite side, and derive an exact variational formula for the asymptotic matching rate under any flexibility allocation. The derivation extends the local weak convergence framework of [BLS11] from single-type to multi-type unimodular Galton-Watson trees, reducing the matching rate to an explicit low-dimensional optimization problem. Using this formula, we analytically investigate when the one-sided allocation, which concentrates all flexibility on one side, dominates the two-sided allocation and vice versa, sharpening and extending the comparisons of [FMZ26] which relied on approximate algorithmic bounds rather than an exact characterization of the matching rate.
☆ AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
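The sliced Wasserstein distance above reduces each comparison to one-dimensional sortings, which is the source of the log-linear complexity mentioned in the abstract. A minimal Monte-Carlo version (illustrative only; the paper couples this discrepancy with the AdamFlow optimizer over probability measures, which is not shown here) might look like:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, rng=None):
    """Monte-Carlo sliced 2-Wasserstein distance between equal-size point
    clouds X, Y (n x d). Each 1-D projection reduces optimal transport to
    sorting, so the cost per slice is O(n log n)."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.standard_normal((n_proj, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # unit directions
    total = 0.0
    for theta in thetas:
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)   # 1-D W2^2 via the sorted coupling
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 3))           # a toy point cloud ("mesh vertices")
B = A + np.array([1.0, 0.0, 0.0])           # rigid shift of the same cloud
print(sliced_wasserstein(A, A) < 1e-12)     # identical clouds: zero distance
print(sliced_wasserstein(A, B) > 0.1)       # shifted cloud: positive distance
```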
☆ Analytic Optimal Control for a Class of Driftless x-Flat Systems
This paper studies optimal trajectory-tracking for driftless, x-flat nonlinear systems with three states and two inputs. The tracking problem is formulated in Bolza form with a quadratic cost of the tracking error and its derivative. Applying Pontryagin's maximum principle yields a mixed regular-singular optimal control problem. By exploiting geometric properties and a specific relation between the weighting matrices, a closed-form expression for the costate and an explicit feedback law for both inputs is derived. Thereby, the numerical solution of a two-point boundary-value problem is avoided. The singular input leads to a bang-singular-bang optimal control structure, while on the singular arc, the tracking error dynamics reduces to a linear dynamics of order two. The approach is illustrated for the kinematic model of a steerable axle, demonstrating accurate trajectory-tracking.
☆ Incorporating circular economy policies into product supply chains using bilevel optimization -- A case study on coffee packaging
Transitioning to a Circular Economy requires policies to drive sustainable practices. This study proposes a bilevel optimization framework to evaluate the combined use of carbon taxes and subsidies in promoting circular supply chains under varying budget levels. A case study of the coffee packaging supply chain with an Extended Producer Responsibility scenario is used to demonstrate this approach. The framework captures the hierarchical interaction between a regional government (upper level), which aims to minimize environmental impacts, and coffee companies (lower level), which seek to minimize costs. Two bilevel optimization problems are formulated based on two environmental objectives: (1) minimization of greenhouse gas (GHG) emissions, and (2) maximization of circularity. The model integrates mixed-integer linear programming (MILP) with life cycle assessment (LCA), techno-economic assessment (TEA) and circularity assessment. Results demonstrate that subsidies effectively drive supply chain shifts toward low-emission and high-circularity configurations, while carbon taxes alone have a more limited impact. Sensitivity analyses highlight the influence of key parameters, such as glass washing distance and loss rates, on policy effectiveness. Overall, the study provides a bilevel optimization framework with quantitative insights to support policy design for sustainable circular supply chains.
☆ On the Convexity of the Solution Set of Linear Complementarity Problem over Tensor Spaces
This paper investigates the convexity of the solution set of the linear complementarity problems over tensor spaces (TLCPs). We introduce the notion of a $T$-column sufficient tensor and study its properties and relationships with several structured tensors. An equivalent condition for the convexity of the solution set of the $\mathrm{TLCP}$ is established. In addition, sufficient conditions for uniqueness and for feasibility implying solvability are derived.
comment: 17 pages, comments are welcome
☆ Sensitivity analysis for stopping criteria with application to organ transplantations
We consider a stopping problem and its application to the decision-making process regarding the optimal timing of organ transplantation for individual patients. At each decision period, the patient state is inspected and a decision is made whether to transplant. If the organ is transplanted, the process terminates; otherwise, the process continues until a transplant happens or the patient dies. Under suitable conditions, we show that there exists a control limit optimal policy. We propose a smoothed perturbation analysis (SPA) estimator for the gradient of the total expected discounted reward with respect to the control limit. Moreover, we show that the SPA estimator is asymptotically unbiased.
☆ Random-Subspace Sequential Quadratic Programming for Constrained Zeroth-Order Optimization
We study nonlinear constrained optimization problems in which only function evaluations of the objective and constraints are available. Existing zeroth-order methods rely on noisy gradient and Jacobian surrogates in high dimensions, making it difficult to simultaneously achieve computational efficiency and accurate constraint satisfaction. We propose a zeroth-order random-subspace sequential quadratic programming method (ZO-RS-SQP) that combines two-point directional estimation with low-dimensional SQP updates. At each iteration, the method samples a random low-dimensional subspace, estimates the projected objective gradient and constraint Jacobians using two-point evaluations, and solves a reduced quadratic program to compute the step. As a result, the per-iteration evaluation cost scales with the subspace dimension rather than the ambient dimension, while retaining the structured linearized-constraint treatment of SQP. We also consider an Armijo line-search variant that improves robustness in practice. Under standard smoothness and regularity assumptions, we establish convergence to first-order KKT points with high probability. Numerical experiments illustrate the effectiveness of the proposed approach on nonlinear constrained problems.
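The two-point directional estimation at the core of the method above can be sketched in isolation. The snippet below estimates only the projected gradient in a random subspace (the reduced QP step and constraint Jacobians are omitted, and the quadratic test objective is purely illustrative); note the cost is 2k evaluations per iteration regardless of the ambient dimension n:

```python
import numpy as np

def subspace_grad_estimate(f, x, P, h=1e-4):
    """Two-point (central-difference) directional estimates of the gradient
    of f along the columns of P (an n x k subspace basis): one entry per
    direction, i.e. 2k function evaluations in total."""
    return np.array([(f(x + h * p) - f(x - h * p)) / (2 * h) for p in P.T])

rng = np.random.default_rng(3)
n, k = 50, 5
x = rng.standard_normal(n)
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)                           # symmetric positive definite
f = lambda z: 0.5 * z @ A @ z                     # smooth test objective

P, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal subspace basis
g_hat = subspace_grad_estimate(f, x, P)
g_true = P.T @ (A @ x)                            # exact projected gradient
print(np.allclose(g_hat, g_true, atol=1e-5))      # central diff is exact on quadratics
```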
☆ Explicit Distributed MPC: Reducing Computation and Communication Load by Exploiting Facet Properties
Classical Distributed Model Predictive Control (DiMPC) requires multiple iterations to achieve convergence, leading to high computational and communication burdens. This work improves an iteration-free distributed MPC methodology that minimizes computational effort and communication load. The methodology leverages multiparametric programming to compute explicit control laws offline for each subsystem, enabling real-time control without iterative data exchanges between subsystems. Extending our previous work on iteration-free DiMPC, here we introduce a FAcet-based Critical region Exploration Technique for iteration-free DiMPC (FACET-DiMPC) that further reduces computational complexity by leveraging facet properties to perform targeted critical-region exploration. Simulation results demonstrate that the developed method achieves comparable control performance to centralized methods, while significantly reducing communication overhead and computation time. In particular, the proposed methodology offers substantial efficiency gains, with an average computation-time reduction of 98% compared to classic iterative DiMPC methods and 42% compared to iteration-free DiMPC methods, making it well-suited for real-time control applications with tight latency and computation constraints.
☆ Data-Driven Tube-Based Zonotopic Predictive Control With Nonconvex Layered Terminal Sets
This paper presents a data-driven tube-based zonotopic predictive control (DTZPC) framework with nonconvex layered terminal sets. Existing DTZPC schemes with closed-loop guarantees typically rely on a single ellipsoidal terminal set, which can be conservative and thereby limit feasibility. We propose a layered terminal-set design that decouples stability certification, feasibility enlargement, and motion-region screening into three components with distinct roles. First, an offline-designed feedback gain together with a contractive constrained zonotope provides a terminal ingredient for stability certification, while avoiding probabilistic feedback synthesis in high-dimensional DTZPC. Second, we derive a data-driven characterization of the inverse admissible closed-loop model set, avoiding the conservatism of interval-matrix relaxation and inversion. Combined with exact set multiplication, this yields inner and outer approximations of the maximal robust positively invariant (MRPI) set under fixed closed-loop dynamics. The inner approximation serves as a nonconvex terminal set to enlarge feasibility, whereas the outer approximation provides certified motion-region descriptions for fast screening and monitoring. Numerical examples demonstrate tighter inverse-set enclosures and improved feasibility over existing convex-terminal DTZPC schemes.
☆ Safe Control of Feedback-Interconnected Systems via Singular Perturbations
Control Barrier Functions (CBFs) have emerged as a powerful tool in the design of safety-critical controllers for nonlinear systems. In modern applications, complex systems often involve the feedback interconnection of subsystems evolving at different timescales, e.g., two parts from different physical domains (e.g., the electrical and mechanical parts of robotic systems) or a physical plant and an (optimization or control) algorithm. In these scenarios, safety constraints often involve only a portion of the overall system. Inspired by singular perturbations for stability analysis, we develop a formal procedure to lift a safety certificate designed on a reduced-order model to the overall feedback-interconnected system. Specifically, we show that under a sufficient timescale separation between slow and fast dynamics, a composite CBF can be designed to certify the forward invariance of the safe set for the interconnected system. As a result, the online safety filter only needs to be solved for the lower-dimensional, reduced-order model. We numerically test the proposed approach on: (i) a robotic arm with joint motor dynamics, and (ii) a physical plant driven by an optimization algorithm.
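For intuition, the online safety filter on a reduced-order model admits a closed form in the simplest case. The sketch below is a single-integrator toy with a disk-avoidance barrier, not the feedback-interconnected setting analyzed in the paper; the function names and constants are illustrative:

```python
import numpy as np

def cbf_filter(u_nom, x, x_obs=np.zeros(2), r=1.0, alpha=1.0):
    """Closed-form CBF-QP safety filter for a single integrator x' = u with
    barrier h(x) = ||x - x_obs||^2 - r^2 (stay outside a disk of radius r):
    minimize ||u - u_nom||^2 subject to grad_h(x) . u >= -alpha * h(x)."""
    h = (x - x_obs) @ (x - x_obs) - r ** 2
    grad_h = 2.0 * (x - x_obs)
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0:                                     # nominal input already safe
        return u_nom
    return u_nom - slack * grad_h / (grad_h @ grad_h)  # minimal correction

x = np.array([1.5, 0.0])               # state outside the unit disk
u_nom = np.array([-5.0, 0.0])          # nominal input pushes toward the obstacle
u = cbf_filter(u_nom, x)
h = x @ x - 1.0
print(2.0 * x @ u + h >= -1e-9)        # filtered input satisfies the CBF condition
```

The single active inequality makes the QP a one-constraint projection, which is why the correction is a scalar multiple of the barrier gradient.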
☆ Fixed-time-stable ODE Representation of Lasso
Lasso problems arise in many areas, including signal processing, machine learning, and control, and are closely connected to sparse coding mechanisms observed in neuroscience. A continuous-time ordinary differential equation (ODE) representation of the Lasso problem not only enables its solution on analog computers but also provides a framework for interpreting neurophysiological phenomena. This article proposes a fixed-time-stable ODE representation of the Lasso problem by first transforming it into a smooth nonnegative quadratic program (QP) and then designing a projection-free Newton-based fixed-time-stable ODE system for solving the corresponding Karush-Kuhn-Tucker (KKT) conditions. Moreover, the settling time of the ODE is independent of the problem data and can be arbitrarily prescribed. Numerical experiments verify that the trajectory reaches the optimal solution within the prescribed time.
comment: 6 pages
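The first step above, recasting the nonsmooth Lasso as a smooth nonnegative QP, uses the classical positive/negative split x = u - v with u, v >= 0. The sketch below performs that split and then solves the QP with projected gradient descent as a simple stand-in for the paper's Newton-based fixed-time ODE:

```python
import numpy as np

def lasso_via_nonneg_qp(A, b, lam, iters=5000, lr=None):
    """Solve min 0.5||Ax - b||^2 + lam*||x||_1 via the split x = u - v with
    u, v >= 0, which yields a smooth nonnegative QP, then run projected
    gradient descent (a stand-in for a fixed-time-stable ODE solver)."""
    n = A.shape[1]
    u, v = np.zeros(n), np.zeros(n)
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    lr = lr or 1.0 / L
    for _ in range(iters):
        g = A.T @ (A @ (u - v) - b)        # gradient of the quadratic part
        u = np.maximum(0.0, u - lr * (g + lam))   # project onto the orthant
        v = np.maximum(0.0, v - lr * (-g + lam))
    return u - v

# Sanity check with A = I: the Lasso solution is soft-thresholding of b.
b = np.array([3.0, -0.5, 1.2])
lam = 1.0
x = lasso_via_nonneg_qp(np.eye(3), b, lam)
soft = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
print(np.allclose(x, soft, atol=1e-6))
```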
☆ Faster Symmetric Rendezvous on Four or More Locations
In the symmetric rendezvous problem two players follow the same (randomized) strategy to visit one of $n$ locations in each time step $t=0,1,2,\dots$. Their goal is to minimize the expected time until they visit the same location and thus meet. Anderson and Weber [J. Appl. Prob., 1990] proposed a strategy that operates in rounds of $n-1$ steps: a player either remains in one location for $n-1$ steps or visits the other $n-1$ locations in random order; the choice between these two options is made with a probability that depends only on $n$. The strategy is known to be optimal for $n=2$ and $n=3$, and there is convincing evidence that it is not optimal for $n=4$. We show that it is not optimal for any $n\geq 4$, by constructing a strategy with a smaller expected meeting time.
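The Anderson-Weber strategy described above is straightforward to simulate. The sketch below uses a simplified variant in which each round independently draws a fresh home location, and the mixing probability p is illustrative rather than an optimal value from the literature:

```python
import random

def anderson_weber_round(n, p, rng):
    """One round of n-1 steps: with probability p stay at a random home
    location; otherwise tour the other n-1 locations in random order."""
    home = rng.randrange(n)
    if rng.random() < p:
        return [home] * (n - 1)
    tour = [loc for loc in range(n) if loc != home]
    rng.shuffle(tour)
    return tour

def meeting_time(n, p, rng, max_rounds=1000):
    """Simulate two independent players following the same strategy until
    they occupy the same location at the same time step."""
    for rnd in range(max_rounds):
        a = anderson_weber_round(n, p, rng)
        b = anderson_weber_round(n, p, rng)
        for step, (x, y) in enumerate(zip(a, b)):
            if x == y:
                return rnd * (n - 1) + step + 1
    return max_rounds * (n - 1)

rng = random.Random(0)
times = [meeting_time(3, 1 / 3, rng) for _ in range(2000)]
avg = sum(times) / len(times)
print(1.0 <= avg < 20.0)   # a small finite average meeting time in simulation
```

Such a simulator is also a convenient harness for comparing candidate strategies empirically before attempting an analytical bound.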
☆ Reinforcement Learning for Speculative Trading under Exploratory Framework
We study a speculative trading problem within the exploratory reinforcement learning (RL) framework of Wang et al. [2020]. The problem is formulated as a sequential optimal stopping problem over entry and exit times under a general utility function and price process. We first consider a relaxed version of the problem in which the stopping times are modeled by the jump times of Cox processes driven by bounded, non-randomized intensity controls. Under the exploratory formulation, the agent's randomized control is characterized via the probability measure over the jump intensities, and the objective function is regularized by Shannon's differential entropy. This yields a system of exploratory HJB equations and Gibbs distributions in closed form as the optimal policy. Error estimates and convergence of the RL objective to the value function of the original problem are established. Finally, an RL algorithm is designed, and its implementation is showcased in a pairs-trading application.
comment: 37 pages, 14 figures
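Entropy regularization of the kind used above generically produces Gibbs distributions in closed form. As a minimal illustration over a discrete action set (the paper's Gibbs distribution is over jump intensities of Cox processes, which this toy does not model), the temperature controls the exploration-exploitation trade-off:

```python
import numpy as np

def gibbs_policy(q_values, temp):
    """Closed-form optimizer of max_pi E_pi[Q] + temp * H(pi): a Gibbs
    (softmax) distribution pi(a) proportional to exp(Q(a) / temp)."""
    z = q_values / temp
    z = z - z.max()                  # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
hot = gibbs_policy(q, temp=0.01)     # low temperature: near-greedy
cold = gibbs_policy(q, temp=100.0)   # high temperature: near-uniform
print(hot.argmax() == 1 and hot.max() > 0.99)
print(abs(cold.max() - 1 / 3) < 0.01)
```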
☆ Balancing Morality and Economics: Population Games with Herding and Inertia
The adoption of clean technologies (CTs) plays an important role in reducing carbon dioxide (CO$_2$) emissions. We study CT adoption in a large population of consumers with heterogeneous behavioral tendencies. We model the interaction among the agents as a multi-type mean-field game in which the agents choose between clean and polluting technology based products and may either behave as rationals (trading off price and moral incentives), herding agents (just follow the majority), or lethargic agents exhibiting inertia toward adopting the new technologies. We characterize equilibrium CT adoption levels using the recently introduced notion of $\boldsymbol{α}$-Rational Nash Equilibrium ($\boldsymbol{α}$-RNE) and its multi-type extension. We then identify a stable subset using the limits of a stochastic turn-by-turn behavioral dynamics. Our results highlight the role of population composition in determining CT adoption. In particular, widespread adoption requires either a sufficiently small price disadvantage for CTs or the presence of a sufficiently large herding population that can be influenced through social awareness programs. Surprisingly, we prove that environmental damages do not provide sufficient incentives to increase CT adoption.
☆ Output Corridor Impulsive Control of First-order Continuous System with Non-local Attractivity Analysis
This paper addresses the design of an impulsive controller for a continuous scalar time-invariant linear plant that constitutes the simplest conceivable model of chemical kinetics. The model is ubiquitous in process control as well as pharmacometrics and readily generalizes to systems of Wiener structure. Given the impulsive nature of the feedback, the control problem formulation is particularly suited to discrete dosing applications in engineering and medicine, where both doses and inter-dose intervals are manipulated. Since the feedback controller acts at discrete time instants and employs both amplitude and frequency modulation, whereas the plant is continuous, the closed-loop system exhibits hybrid dynamics featuring complex nonlinear phenomena. The problem of confining the plant output to a predefined corridor of values is considered. The method at the heart of the proposed approach is to design a stable periodic solution, called a 1-cycle, whose one-dimensional orbit coincides with the predefined corridor. Conditions ensuring local and global attractivity of the 1-cycle are established. As a numerical illustration of the proposed approach, the problem of intravenous paracetamol dosing is considered.
☆ A Weak Notion of Symmetry for Dynamical Systems
Many nonlinear dynamical systems exhibit symmetry, affording substantial benefits for control design, observer architecture, and data-driven control. While the classical notion of group invariance enables a cascade decomposition of the system into highly structured subsystems, it demands very rigid structure in the original system. Conversely, much more general notions (e.g., partial symmetry) have been shown to be sufficient for obtaining less-structured decompositions. In this work, we propose a middle ground termed "weak invariance", studying diffeomorphisms (resp., vector fields) that are group invariant up to a diffeomorphism of (resp., vector field on) the symmetry group. Remarkably, we prove that weak invariance implies that this diffeomorphism of (resp., vector field on) the symmetry group must be an automorphism (resp., group linear). Additionally, we demonstrate that a vector field is weakly invariant if and only if its flow is weakly invariant, where the associated group linear vector field generates the associated automorphisms. Finally, we show that weakly invariant systems admit a cascade decomposition in which the dynamics are group affine along the orbits. Weak invariance thus generalizes both classical invariance and the important class of group affine dynamical systems on Lie groups, laying a foundation for new methods of symmetry-informed control and observer design.
comment: 6 pages, 0 figures
☆ Global Geometry of Orthogonal Foliations in the Control Allocation of Signed-Quadratic Systems
This work formalizes the differential topology of redundancy resolution for systems governed by signed-quadratic actuation maps. By analyzing the minimally redundant case, the global topology of the continuous fiber bundle defining the nonlinear actuation null-space is established. The distribution orthogonal to these fibers is proven to be globally integrable and governed by an exact logarithmic potential field. This field foliates the actuator space, inducing a structural stratification of all orthants into transverse layers whose combinatorial sizes follow a strictly binomial progression. Within these layers, adjacent orthants are continuously connected via lower-dimensional strata termed reciprocal hinges, while the layers themselves are separated by boundary hyperplanes, or portals, that act as global sections of the fibers. This partition formally distinguishes extremal and transitional layers, which exhibit fundamentally distinct fiber topologies and foliation properties. Through this geometric framework, classical pseudo-linear static allocation strategies are shown to inevitably intersect singular boundary hyperplanes, triggering infinite-derivative kinetic singularities and fragmenting the task space into an exponential number of singularity-separated sectors. In contrast, allocators derived from the orthogonal manifolds yield continuously differentiable global sections with only a linear number of sectors for transversal layers, or can even form a single global diffeomorphism to the task space in the case of the two extremal layers, thus completely avoiding geometric rank-loss and boundary-crossing singularities. These theoretical results directly apply to the control allocation of propeller-driven architectures, including multirotor UAVs, marine, and underwater vehicles.
comment: Multimedia material attached
☆ When does learning pay off? A study on DRL-based dynamic algorithm configuration for carbon-aware scheduling
Deep reinforcement learning (DRL) has recently emerged as a promising tool for Dynamic Algorithm Configuration (DAC), enabling evolutionary algorithms to adapt their parameters online rather than relying on statically tuned configurations. While DRL can learn effective control policies, training is computationally expensive. This cost may be justified if learned policies generalize, allowing the training effort to transfer across instance types and problem scales. Yet, for real-world optimization problems, it remains unclear whether this promise holds in practice and under which conditions the investment in learning pays off. In this work, we investigate this question in the context of the carbon-aware permutation flow-shop scheduling problem. We develop a DRL-based DAC framework and train it exclusively on small, simple instances. We then deploy the learned policy on both similar and more complex unseen instances and compare its performance against a statically tuned baseline, which provides a fair point of comparison. Our findings show that the proposed method provides a strong dynamic algorithm control policy that can be effectively transferred to different unseen problem instances. Notably, on simple, cheap-to-compute instances similar to those observed during training and tuning, DRL performs comparably to the statically tuned baseline. However, as instance characteristics diverge and computational complexity increases, the DRL-learned policy consistently outperforms static tuning. These results confirm that DRL can acquire robust and generalizable control policies that are effective beyond the training instance distributions. This ability to generalize across instance types makes the initial computational investment worthwhile, particularly in settings where static tuning struggles to adapt to changing problem scenarios.
☆ Scaled Relative Graphs and Dynamic Integral Quadratic Constraints: Connections and Computations for Nonlinear Systems
Scaled relative graphs (SRGs) enable graphical analysis and design of nonlinear systems. In this paper, we present a systematic approach for computing both soft and hard SRGs of nonlinear systems using dynamic integral quadratic constraints (IQCs). These constraints are exploited via application of the S-procedure to compute tractable SRG overbounds. In particular, we show that the multipliers associated with the IQCs define regions in the complex plane. Soft SRG computations are formulated through frequency-domain conditions, while hard SRGs are obtained via hard factorizations of multipliers and linear matrix inequalities. The overbounds are used to derive an SRG-based feedback stability result for Lur'e-type systems, providing a new graphical interpretation of classical IQC stability results with dynamic multipliers.
comment: 6 pages, 1 figure
☆ Bilevel Programming Approach for Image Restoration Problems with Automatically Hyperparameter Selection
In optimization-based image restoration models, the correct selection of hyperparameters is crucial for achieving superior performance. However, current research typically involves manual tuning of these hyperparameters, which is highly time-consuming and often lacks accuracy. In this paper, we concentrate on the automated selection of hyperparameters in the context of image restoration and present a bilevel programming approach that can simultaneously select the optimal hyperparameters and achieve high-quality restoration results. For implementation, we reformulate the bilevel programming problem to incorporate an inequality constraint related to difference-of-convex functions. Following this, we address a sequence of nonsmooth convex programming problems by employing a feasibility penalty function along with a proximal point term. In this context, the nonsmooth convex programming problem uses the solution of the lower-level problem, which is derived through the alternating direction method of multipliers. Theoretically, we prove that the sequence generated by the algorithm converges to a Karush-Kuhn-Tucker stationary point of the inequality-constrained equivalent bilevel programming problem. We conduct a series of tests on both simulated and real images, which demonstrate that the proposed algorithm achieves superior restoration quality while requiring less computing time compared to other hyperparameter selection methods.
☆ Hausdorff compactness and regularity for classes of open sets under geometric constraints
This article introduces innovative classes of open sets in \(\mathbb{R}^{N}\), where \(N=2, 3\), characterized by a geometric property associated with the inward normal. The focus lies on proving compactness results for the Hausdorff topology within these classes. Furthermore, the paper establishes the equivalence of convergences, encompassing Hausdorff, compact, and characteristic functions, for select classes. We also investigate the regularity of the thickness function associated with these domains and analyze how the regularity of the fixed convex set \(C\) influences the boundary regularity of the admissible shapes.
☆ Stability of the Timoshenko Beam Equation with One Weakly Degenerate Local Kelvin-Voigt Damping
We consider the Timoshenko beam equation with locally distributed Kelvin-Voigt damping, which affects either the shear stress or the bending moment. The damping coefficient exhibits a singularity, causing its derivative to be discontinuous. By using the frequency domain method and a multiplier technique, we prove that the associated semigroup is polynomially stable. Specifically, regardless of whether the local Kelvin-Voigt damping acts on the shear stress or the bending moment, the system decays polynomially with rate $t^{-\frac{1}{2}}$.
☆ Improving Operational Feasibility in Large-Scale Power System Planning
Large-scale power system planning mostly uses linearized, active-power-only approximations of the power flow equations, ignores many operational constraints, and tests the operational feasibility of the resulting systems only under strongly simplifying assumptions. We propose an approach to obtain solutions to large instances of the alternating current capacity expansion problem via redispatch and reinforcement of an initial solution. The problem formulation considers simultaneous expansion of generators, reactive compensation devices, storage systems, and transmission. Furthermore, it includes operational constraints via startup procedures and capability curves of power sources, as well as simplified stability limits via constraints on voltage angle differences and voltage magnitudes. To obtain initial solutions, we test several established and partly modified power flow approximations and integrate them into an approach for iterative transmission expansion planning, thereby obtaining convex formulations. We demonstrate the approach on large problem instances covering the islands of Great Britain and Ireland at the transmission level, for which we extend the open data source to model reactive power. We find that including transmission losses to determine the initial solution is most decisive, as it reduces the amount of redispatch and reinforcement necessary to obtain an alternating-current-feasible solution, whereas incorporating reactive power constraints did not lead to further improvements. Our approach ensures an alternating-current-feasible system under weak assumptions, thus guaranteeing steady-state voltage stability and allowing subsequent dynamic grid simulations, which is instrumental for planning stable future inverter-dominated power systems.
☆ State-constrained optimal control of the continuity equation and infinite-dimensional viscosity solutions
We study a finite-horizon optimal control problem for the continuity equation under a weighted integral state constraint on the mass outside a fixed set. The model is cast in a Hilbert framework for densities. On a suitable invariant compact subset, we prove that the value function is Lipschitz continuous and satisfies, by dynamic programming, the associated infinite-dimensional constrained Hamilton-Jacobi-Bellman equation in the viscosity sense (subsolution in the interior, supersolution up to the boundary). We finally prove a comparison principle and uniqueness in the Lipschitz class.
comment: 27 pages, 0 figures
☆ Day-Ahead Offering for Virtual Power Plants: A Stochastic Linear Programming Reformulation and Projected Subgradient Method
Virtual power plants (VPPs) are an emerging paradigm that aggregates distributed energy resources (DERs) for coordinated participation in power systems, including bidding as a single dispatchable entity in the wholesale market. In this paper, we address a critical operational challenge for VPPs: the day-ahead offering problem under highly intermittent and uncertain DER outputs and market prices. The day-ahead offering problem determines the price-quantity pairs submitted by VPPs while balancing profit opportunities against operational uncertainties. First, we formulate the problem as a scenario-based two-stage stochastic adaptive robust optimization problem, where the uncertainty of the locational marginal prices follows a Markov process and DER uncertainty is characterized by static uncertainty sets. Then, motivated by the outer approximation principle of the column-and-constraint generation (CC&G) algorithm, we propose a novel inner approximation-based projected subgradient method. By exploiting the problem structure, we propose two novel approaches to improve computational tractability. First, we show that under mild modeling assumptions, the robust second-stage problem can be equivalently reformulated as a linear program (LP) with a nested resource allocation structure that is amenable to an efficient greedy algorithm. Furthermore, motivated by the computational efficiency of solving the reformulated primal second-stage problem and the isotonic structure of the first-stage feasible region, we propose an efficient projected subgradient algorithm to solve the overall stochastic LP problem. Extensive computational experiments using real-world data demonstrate that the overall projected subgradient descent method achieves about two orders of magnitude speedup over CC&G while maintaining solution quality.
comment: 30 pages, 8 figures. Submitted for publication
☆ Strong Convergence of FISTA for Affinely Constrained Convex Quadratic Minimization
In October 2025, research by Boţ, Fadili, and Nguyen, and by Jang and Ryu, led to the seminal result that Beck and Teboulle's FISTA converges weakly to a minimizer of the sum of two convex functions, resolving a long-standing open problem. The first strong convergence result was obtained in November 2025 by Moursi, Naguib, Pavlovic, and Vavasis for affinely constrained convex minimization, provided certain closedness conditions hold. In this paper, we prove strong convergence in the affine-quadratic case without any closedness assumption. Specializing this to the unconstrained case, we obtain the strong convergence of Nesterov's accelerated gradient method when applied to a convex quadratic objective function.
☆ Random Coordinate Descent on the Wasserstein Space of Probability Measures
Optimization over the space of probability measures endowed with the Wasserstein-2 geometry is central to modern machine learning and mean-field modeling. However, traditional methods relying on full Wasserstein gradients often suffer from high computational overhead in high-dimensional or ill-conditioned settings. We propose a randomized coordinate descent framework specifically designed for the Wasserstein manifold, introducing both Random Wasserstein Coordinate Descent (RWCD) and Random Wasserstein Coordinate Proximal-Gradient (RWCP) for composite objectives. By exploiting coordinate-wise structures, our methods adapt to anisotropic objective landscapes where full-gradient approaches typically struggle. We provide a rigorous convergence analysis across various landscape geometries, establishing guarantees under non-convex, Polyak-Łojasiewicz, and geodesically convex conditions. Our theoretical results mirror the classic convergence properties found in Euclidean space, revealing a compelling symmetry between coordinate descent on vectors and on probability measures. The developed techniques are inherently adaptive to the Wasserstein geometry and offer a robust analytical template that can be extended to other optimization solvers within the space of measures. Numerical experiments on ill-conditioned energies demonstrate that our framework offers significant speedups over conventional full-gradient methods.
☆ Scalable Ground-State Certification of Quantum Spin Systems via Structured Noncommutative Polynomial Optimization
A fundamental challenge in quantum physics is determining the ground-state properties of many-body systems. Whereas standard approaches, such as variational calculations, consist of writing down a wave function ansatz and minimizing over the possible states expressible by this ansatz, one can alternatively formulate the problem as a noncommutative polynomial optimization problem. This optimization problem can then be addressed using a hierarchy of semidefinite programming relaxations. In contrast to variational calculations, the semidefinite program can provide lower bounds for ground state energies and upper and lower bounds on observable expectation values. However, this approach typically suffers from severe scalability issues, limiting its applicability to small-to-medium-scale systems. In this article, we demonstrate that leveraging the inherent structures of the system can significantly mitigate these scalability challenges and thus allows us to compute meaningful bounds for quantum spin systems on up to $16\times16$ square lattices.
comment: 37 pages, 9 figures
☆ A Determinantal Approach to a Sharp $\ell^1-\ell^\infty-\ell^2$ Norm Inequality
We give a short linear-algebraic proof of the inequality \[ \|x\|_1\,\|x\|_\infty \le \frac{1+\sqrt{p}}{2}\,\|x\|_2^2, \] valid for every \(x\in\mathbb{R}^p\). This inequality relates three fundamental norms on finite-dimensional spaces and has applications in optimization and numerical analysis. Our proof exploits the determinantal structure of a parametrized family of quadratic forms, and we show the constant $(1+\sqrt{p})/2$ is optimal.
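As a quick numerical sanity check (my own illustration, not from the paper), one can verify the inequality on arbitrary vectors and observe that the constant $(1+\sqrt{p})/2$ is attained, exactly, at a vector of the form $x = (1, t, \dots, t)$ with $t = 1/(1+\sqrt{p})$:

```python
import math

def ratio(x):
    """Compute ||x||_1 * ||x||_inf / ||x||_2^2 for a nonzero vector x."""
    l1 = sum(abs(v) for v in x)
    linf = max(abs(v) for v in x)
    l2sq = sum(v * v for v in x)
    return l1 * linf / l2sq

p = 4
c = (1 + math.sqrt(p)) / 2  # the sharp constant (1 + sqrt(p)) / 2

# The inequality holds for an arbitrary vector:
assert ratio([3.0, -1.0, 2.0, 0.5]) <= c + 1e-12

# ... and is tight at x = (1, t, ..., t) with t = 1/(1 + sqrt(p)):
t = 1 / (1 + math.sqrt(p))
x_star = [1.0] + [t] * (p - 1)
assert abs(ratio(x_star) - c) < 1e-12
```

For $p=4$ this extremal vector is $(1, 1/3, 1/3, 1/3)$, where the ratio equals $3/2 = (1+\sqrt{4})/2$.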
☆ On Integral Linear Constraints on Convex Cones
In this paper, we consider integral linear constraints and the dissipation inequality with linear supply rates for certain sets of trajectories confined pointwise in time to a convex cone which belongs to a finite-dimensional normed vector space. Such constraints are then shown to be satisfied if and only if a bounded linear functional exists which satisfies a conic inequality. This is analogous to the typical situation in which a quadratic supply rate over the entire space is related to a linear matrix inequality. A connection is subsequently drawn precisely to linear-quadratic control: by proper choice of cone, the main results can be applied to produce a known L1-gain analogue to the bounded real lemma in positive systems theory, as well as a non-strict version of the Kalman-Yakubovich-Popov Lemma in linear-quadratic control.
♻ ☆ Network and Risk Analysis of Surety Bonds
Surety bonds are financial agreements between a contractor (principal) and an obligee (project owner) to complete a project. However, most large-scale projects involve multiple contractors, creating a network in which incomplete obligations can propagate and lead to project failures. Typical risk-assessment models assume that each contractor's failure probability is independent of the others. In contrast, we take a network approach, modeling the contractor network as a directed graph where nodes represent contractors and project owners and edges represent contractual obligations with associated financial records. To understand risk propagation throughout the contractor network, we extend the celebrated Friedkin-Johnsen model and introduce a stochastic process to simulate principal failures across the network. From a theoretical perspective, we show that under natural monotonicity conditions on the contractor network, incorporating network effects increases the average risk for the surety organization. We further use data from a partnering insurance company to validate our findings, estimating an approximately 2% higher exposure when accounting for network effects.
comment: 50 pages, 12 figures
♻ ☆ Optimizing a Worldwide-Scale Shipper Transportation Planning in a Carmaker Inbound Supply Chain
We study the shipper-side design of large-scale inbound transportation networks, motivated by the global supply chain of the carmaker Renault. We formalize the Shipper Transportation Planning Problem (STPP), which integrates discrete flow consolidation via explicit bin-packing, time-expanded routing, and operational regularity constraints. To solve this high-complexity combinatorial problem at an industrial scale, we propose a tailored Iterated Local Search (ILS) metaheuristic. The algorithm combines large-neighborhood search with MILP-based perturbations and leverages bundle-specific decompositions to obtain scalable lower bounds and effective search guidance. Computational experiments on real industrial data involving more than 700,000 commodities and 1.2 million arcs demonstrate that the ILS achieves an average gap of 7.9% to the best available lower bound. The results reveal a 23.2% cost-reduction potential compared to legacy planning benchmarks. Most significantly, the proposed framework is currently deployed in production at Renault, where it supports weekly strategic decisions and generates realized cost savings estimated at approximately 20 million euros per year. Our analysis yields key managerial insights: we demonstrate that explicit 1D bin-packing is a critical step forward for realistic consolidation modeling, that transport regularity offers a robust balance between cost and stability, and that high-volume global networks benefit significantly from in-house strategic planning over third-party outsourcing. To the best of our knowledge, this is the first work to successfully solve a shipper-side transportation design problem at this magnitude.
comment: 40 pages, 27 figures, 9 tables
♻ ☆ Decision-Focused Surrogate Modeling for Mixed-Integer Linear Optimization
Mixed-integer optimization is at the core of many online decision-making systems that demand frequent updates of decisions in real time. However, due to their combinatorial nature, mixed-integer linear programs (MILPs) can be difficult to solve, rendering them often unsuitable for time-critical online applications. To address this challenge, we develop a data-driven approach for constructing surrogate optimization models in the form of linear programs (LPs) that can be solved much more efficiently than the corresponding MILPs. We train these surrogate LPs in a decision-focused manner such that for different model inputs, they achieve the same or close to the same optimal solutions as the original MILPs. One key advantage of the proposed method is that it allows the incorporation of all the original MILP's linear constraints, which significantly increases the likelihood of obtaining feasible predicted solutions. Results from two computational case studies indicate that this decision-focused surrogate modeling approach is highly data-efficient and provides very accurate predictions of the optimal solutions. In these examples, it outperforms more commonly used neural-network-based optimization proxies.
comment: Published in Transactions on Machine Learning Research (2025). OpenReview: https://openreview.net/forum?id=A6tOXkkE4Z
♻ ☆ Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
♻ ☆ Linearizations and optimization problems in diffeological spaces
By generalizing the notion of linearization, a concept originally arising from microlocal analysis and symbolic calculus, to diffeological spaces, we propose a first setting for optimization problems in this category. We show how linearizations allow the construction of smooth paths and variational flows without requiring canonical charts or gradients. With these constructions, we introduce a general optimization algorithm adapted to diffeological spaces under weakened assumptions. The method applies to spaces of mappings with low regularity. Our results show that weak convergence toward minima or critical values can still be achieved under diffeological conditions. The approach extends classical variational methods into a flexible, non-linear infinite-dimensional framework. Preliminary steps toward the search for fixed points of diffeological mappings are discussed.
♻ ☆ Multi-Timescale Primal Dual Hybrid Gradient with Application to Distributed Optimization
We propose two variants of the Primal Dual Hybrid Gradient (PDHG) algorithm for saddle point problems with block decomposable duals, hereafter called Multi-Timescale PDHG (MT-PDHG) and its accelerated variant (AMT-PDHG). Through novel mixtures of Bregman divergence and multi-timescale extrapolations, our MT-PDHG and AMT-PDHG converge under arbitrary updating rates for different dual blocks while remaining fully deterministic and robust to extreme delays in dual updates. We further apply our (A)MT-PDHG, augmented with the gradient sliding techniques introduced in Lan et al. (2020), Lan (2016), to distributed optimization. The flexibility in choosing different updating rates for different blocks allows a more refined control over the communication rounds between different pairs of agents, thereby improving the efficiencies in settings with heterogeneity in local objectives and communication costs. Moreover, with careful choices of penalty levels, our algorithms show linear and thus optimal dependency on function similarities, a measure of how similar the gradients of local objectives are. This provides a positive answer to the open question whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).
♻ ☆ Branch-and-Bound Algorithms as Polynomial-time Approximation Schemes
Branch-and-bound algorithms (B&B) and polynomial-time approximation schemes (PTAS) are two seemingly distant areas of combinatorial optimization. We intend to (partially) bridge the gap between them while expanding the boundary of theoretical knowledge on the B&B framework. Branch-and-bound algorithms typically guarantee that an optimal solution is eventually found. However, we show that the standard implementation of branch-and-bound for certain knapsack and scheduling problems also exhibits PTAS-like behavior, yielding increasingly better solutions within polynomial time. Our findings are supported by computational experiments and comparisons with benchmark methods. This paper is an extended version of a paper accepted at ICALP 2025.
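For concreteness, a textbook best-first B&B for the 0/1 knapsack problem, with a fractional-relaxation upper bound, can be sketched as follows. This is an illustrative sketch of the standard framework the paper studies, not the authors' benchmark implementation:

```python
import heapq

def knapsack_bnb(values, weights, capacity):
    """Best-first branch-and-bound for 0/1 knapsack with an LP (fractional) bound."""
    # Sort items by value density; the fractional bound relies on this order.
    order = sorted(range(len(values)), key=lambda i: -values[i] / weights[i])
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)

    def bound(idx, cap, val):
        # Fractional-knapsack upper bound on the best completion from item idx.
        for i in range(idx, n):
            if w[i] <= cap:
                cap -= w[i]
                val += v[i]
            else:
                return val + v[i] * cap / w[i]
        return val

    best = 0
    # Max-heap on the bound (negated, since heapq is a min-heap).
    heap = [(-bound(0, capacity, 0), 0, capacity, 0)]
    while heap:
        nb, idx, cap, val = heapq.heappop(heap)
        if -nb <= best or idx == n:   # prune: bound cannot beat the incumbent
            continue
        if w[idx] <= cap:             # branch 1: take item idx
            nval = val + v[idx]
            best = max(best, nval)
            heapq.heappush(heap, (-bound(idx + 1, cap - w[idx], nval),
                                  idx + 1, cap - w[idx], nval))
        # Branch 2: skip item idx.
        heapq.heappush(heap, (-bound(idx + 1, cap, val), idx + 1, cap, val))
    return best
```

Running it on the classic instance `knapsack_bnb([60, 100, 120], [10, 20, 30], 50)` returns the optimal value 220; the paper's point is that intermediate incumbents of such a procedure already improve at a PTAS-like rate.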
♻ ☆ Fragility-aware Classification for Understanding Risk and Improving Generalization
Classification models play a central role in data-driven decision-making applications such as medical diagnosis, recommendation systems, and risk assessment. Traditional performance metrics, such as accuracy and AUC, focus on overall error rates but fail to account for the confidence of incorrect predictions, i.e., the risk of confident misjudgments. This limitation is particularly consequential in safety-critical and cost-sensitive settings, where overconfident errors can lead to severe outcomes. To address this issue, we propose the Fragility Index (FI), a novel performance metric that evaluates classifiers from a risk-averse perspective by capturing the tail risk of confident misjudgments. We formulate FI within a robust satisficing (RS) framework to ensure robustness under distributional uncertainty. Building on this, we develop a tractable training framework that directly targets FI via a surrogate loss, and show that models trained under this framework admit provable bounds on FI. We further derive exact reformulations for a broad class of loss functions, including cross-entropy, hinge-type, and Lipschitz losses, and extend the approach to deep neural networks. Empirical results on real-world medical diagnosis tasks demonstrate that FI complements existing metrics by revealing error tail risk and improving decision quality. FI-based models achieve competitive accuracy and AUC while consistently reducing confident misjudgments and associated operational costs, offering a practical tool for improving robustness and reliability in risk-critical applications.
♻ ☆ Substitution for minimizing/maximizing a tropical linear (fractional) programming
Tropical polyhedra seem to play a central role in the static analysis of software. These tropical geometric objects also play a central role in parity games, especially mean-payoff games and energy games, and determining whether an initial state of such a game is winning is known to be equivalent to solving a tropical linear optimization problem. This paper mainly focuses on the tropical linear minimization problem, using a special substitution method on the tropical cone obtained by homogenization of the initial tropical polyhedron. However, owing to a particular case that can occur in the substitution-based minimization process, we must switch to a maximization problem. Nevertheless, forward-backward substitution is known to be strongly polynomial, and the special substitution developed in this paper inherits the strong polynomiality of classical substitution for linear systems. This special substitution must not be confused with the tropical Fourier-Motzkin elimination, whose execution time is exponential. The tropical fractional minimization problem with a linear objective function is also solved, by tropicalizing the Charnes-Cooper transformation, developed in usual linear algebra, of a fractional linear program into a linear program. Let us also remark that no particular assumption is made on the polyhedron of interest. Finally, the substitution method is illustrated on some examples borrowed from the literature.
♻ ☆ Forward-backward splitting in bilaterally bounded Alexandrov spaces
With the goal of solving optimisation problems on non-Riemannian manifolds, such as geometrical surfaces with sharp edges, we develop and prove the convergence of a forward-backward method in Alexandrov spaces with curvature bounded both from above and from below. This bilateral boundedness is crucial for the availability of both the gradient and proximal steps, instead of just one or the other. We numerically demonstrate the behaviour of the proposed method on simple geometrical surfaces in $\mathbb{R}^3$.
♻ ☆ CompressedScaffnew: The First Theoretical Double Acceleration of Communication from Local Training and Compression in Distributed Optimization
In distributed optimization, a large number of machines alternate between local computations and communication with a coordinating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce this burden and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose CompressedScaffnew, the first algorithm for distributed optimization that jointly harnesses these two strategies and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
♻ ☆ Pruning for efficient deterministic global optimization over trained ReLU neural networks
Neural networks are increasingly used as surrogates in optimization problems to replace computationally expensive models. However, embedding ReLU neural networks in mathematical programs introduces significant computational challenges, particularly for deep and wide networks, due to both the formulation of the ReLU disjunction and the resulting large-scale optimization problem. This work investigates how pruning techniques can accelerate the solution of optimization problems with embedded neural networks, focusing on the mechanisms underlying the computational gains. We provide theoretical insights into how both unstructured (weight) and structured (node) pruning affect the ReLU big-M formulation, showing that pruning monotonically tightens preactivation bounds. We conduct comprehensive empirical studies across multiple network architectures using an illustrative test function and a realistic chemical process flowsheet optimization case study. Our results show that pruning achieves speedups of up to three to four orders of magnitude, with computational gains attributed to three key factors: (i) reduction in problem size, (ii) decrease in the number of integer variables, and (iii) tightening of big-M bounds. Weight pruning is particularly effective for deep, narrow networks, while node pruning performs better for shallow, wide or medium-sized networks. In the chemical engineering case study, pruning enabled convergence within seconds for problems that were otherwise intractable. We recommend adopting pruning as standard practice when developing neural network surrogates for optimization, especially for engineering applications requiring repeated optimization solves.
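A toy illustration (my own, not from the paper) of mechanism (iii): when the inputs to a layer are post-ReLU activations confined to boxes of the form $[0, u]$, zeroing out a weight can only shrink the interval-propagated preactivation bounds that determine the big-M constants. The function and instance below are made up for the sketch:

```python
def preactivation_bounds(W, b, x_lo, x_hi):
    """Interval propagation of z = W x + b over the box x_lo <= x <= x_hi."""
    lo, hi = [], []
    for row, bias in zip(W, b):
        lo.append(bias + sum(w * (x_lo[j] if w >= 0 else x_hi[j])
                             for j, w in enumerate(row)))
        hi.append(bias + sum(w * (x_hi[j] if w >= 0 else x_lo[j])
                             for j, w in enumerate(row)))
    return lo, hi

# One neuron with two post-ReLU inputs, each in [0, 1].
W, b = [[2.0, -3.0]], [0.5]
x_lo, x_hi = [0.0, 0.0], [1.0, 1.0]
lo0, hi0 = preactivation_bounds(W, b, x_lo, x_hi)          # [-2.5], [2.5]

# Prune (zero out) the weight -3.0 and repropagate:
W_pruned = [[2.0, 0.0]]
lo1, hi1 = preactivation_bounds(W_pruned, b, x_lo, x_hi)   # [0.5], [2.5]

# The pruned bounds are nested in the originals, so the big-M constants shrink.
assert lo0[0] <= lo1[0] and hi1[0] <= hi0[0]
```

Note the nesting relies on the lower input bounds being zero (as they are for ReLU outputs); for general boxes, zeroing a weight need not tighten both sides.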
♻ ☆ Nonlinear MPC for Feedback-Interconnected Systems: a Suboptimal and Reduced-Order Model Approach
In this paper, we propose a suboptimal and reduced-order Model Predictive Control (MPC) architecture for discrete-time feedback-interconnected systems. The numerical MPC solver: (i) acts suboptimally, performing only a finite number of optimization iterations at each sampling instant, and (ii) relies only on a reduced-order model that neglects part of the system dynamics, either due to unmodeled effects or the presence of a low-level compensator. We prove that the closed-loop system resulting from the interconnection of the suboptimal and reduced-order MPC optimizer with the full-order plant has a globally exponentially stable equilibrium point. Specifically, we employ timescale separation arguments to characterize the interaction between the components of the feedback-interconnected system. The analysis relies on an appropriately tuned timescale parameter accounting for how fast the system dynamics are sampled. The theoretical results are validated through numerical simulations on a mechatronic system consisting of a pendulum actuated by a DC motor.
♻ ☆ GradPower: Powering Gradients for Faster Language Model Pre-Training
We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
comment: 22 pages. A revised version is in preparation
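The transform really is a single-line change; a minimal sketch built only from the definition in the abstract (plain SGD stands in here for the base optimizer — the paper pairs the transform with Adam and Muon):

```python
import math

def gradpower(g, p):
    """Elementwise sign-power transform from the abstract:
    phi_p(g)_i = sign(g_i) * |g_i|**p for a fixed p > 0."""
    return [math.copysign(abs(gi) ** p, gi) for gi in g]

def sgd_step(params, g, lr=0.1):
    """The base optimizer consumes the transformed gradient unchanged;
    plain SGD is an illustrative stand-in for Adam (-> AdamPower)."""
    return [w - lr * gi for w, gi in zip(params, g)]

g = [0.25, -0.04, 0.0]
transformed = gradpower(g, 0.5)          # approx [0.5, -0.2, 0.0]
params = sgd_step([1.0, 1.0, 1.0], transformed)
```

For p < 1 the transform amplifies small gradient entries and compresses large ones while preserving signs; p = 1 recovers the untransformed gradient.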
♻ ☆ Quantitative Uniqueness of Kantorovich Potentials
This paper studies the uniqueness of solutions to the dual optimal transport problem, both qualitatively and quantitatively (bounds on the diameter of the set of optimisers). On the qualitative side, we prove that when one marginal measure's support is rectifiably connected (path-connected by rectifiable paths), the optimal dual potentials are unique up to a constant. This represents the first uniqueness result applicable even when both marginal measures are concentrated on lower-dimensional subsets of the ambient space, and also applies in cases where optimal potentials are nowhere differentiable on the supports of the marginals. On the quantitative side, we control the diameter of the set of optimal dual potentials by the Hausdorff distance between the support of one of the marginal measures and a regular connected set. In this way, we quantify the extent to which optimisers are almost unique when the support of one marginal measure is almost connected. This is a consequence of a novel characterisation of the set of optimal dual potentials as the intersection of an explicit family of half-spaces.
♻ ☆ Sufficiently Regularized Nonnegative Quartic Polynomials are Sum-of-Squares
A polynomial that is nonnegative need not be a sum of squares of polynomials. This classical gap, identified by Hilbert in 1888, lies at the heart of why the global optimization of multivariate quartic polynomials is NP-hard. Yet we show that this gap is closed when using (sufficient) regularization, which fundamentally alters the algebraic structure of the problem. Namely, we investigate a class of quartically-regularized cubic polynomials which arise naturally in polynomial optimization and higher-order tensor methods for nonconvex problems. We show that, under mild assumptions and for sufficiently large Euclidean quartic regularization, the shifted nonnegative polynomial becomes a sum of squares, yielding an exact semidefinite programming (SDP) formulation at the zeroth level of the Lasserre hierarchy. We further derive explicit bounds on the regularization parameter that guarantee this property. Beyond this asymptotic regime, we identify structured subclasses for which SoS exactness holds for all regularization levels, including quadratic-quartic models and a class of low-rank-type cubic tensors. In contrast, we show that separable quartic regularized polynomials -- including classical tensor models proposed by Schnabel (1991) -- do not, in general, induce SoS representations, even under arbitrarily large regularization. Our results reveal a sharp structural boundary between tractable and intractable regimes in polynomial optimization. In particular, they explain why Euclidean quartic regularization plays a significant role: in addition to regularizing the model, it can induce exact SoS certificates and exact SDP representations.
♻ ☆ Optimal Control in Age-Structured Populations: A Comparison of Rate-Control and Effort-Control
This paper investigates the dynamics and optimal harvesting of age-structured populations governed by McKendrick--von Foerster equations, contrasting two distinct harvesting mechanisms: rate-control and effort-control. For the rate-control formulation, where harvesting acts as a direct additive removal term, we derive first-order necessary optimality conditions of Pontryagin type for the associated infinite-horizon optimal control problem, explicitly characterizing the adjoint system, transversality conditions, and control switching laws. In contrast, the effort-control formulation introduces harvesting as a multiplicative mortality intensity dependent on aggregate population size. We demonstrate that this aggregate dependence structurally alters the optimality system, formally generating a nonlocal coupling term in the adjoint equation that links all ages through the total stock. By combining rigorous variational control derivations, and explicit stationary profiles, this work clarifies the profound mathematical and bioeconomic distinctions between additive and multiplicative harvesting strategies.
♻ ☆ Cost-Effective Strategies for Infectious Diseases: A Multi-Objective Framework with an Interactive Dashboard
During an infectious disease outbreak, policymakers must balance medical costs with social and economic burdens and seek interventions that minimize both. To support this decision-making process, we developed a framework that integrates multi-objective optimization, cost-benefit analysis, and an interactive dashboard. This platform enables users to input cost parameters and immediately obtain a cost-optimal intervention strategy. We applied this framework to the early outbreak of COVID-19 in South Korea. The results showed that cost-optimal solutions for costs per infection ranging from 4,410 USD to 361,000 USD exhibited a similar pattern. This indicates that once the cost per infection is specified, our approach generates the corresponding cost-optimal solution without additional calculations. Our framework supports decision-making by accounting for trade-offs between policy and infection costs. It delivers rapid optimization and cost-benefit analysis results, enabling timely and informed decision-making during the critical phases of a pandemic.
comment: 29 pages, 4 figures in main body, 5 figures in appendix, a homepage attached http://moo.infdissim.com, GitHub code provided https://github.com/ljm1729-scholar/MOO_framework
♻ ☆ Difference-of-Convex Elastic Net for Compressed Sensing
This work proposes a novel and unified sparse recovery framework, termed the difference of convex Elastic Net (DCEN). This framework effectively balances strong sparsity promotion with solution stability, and is particularly suitable for high-dimensional variable selection involving highly correlated features. Built upon a difference-of-convex (DC) structure, DCEN employs two continuously tunable parameters to unify classical and state-of-the-art models--including LASSO, Elastic Net, Ridge, and $\ell_1-α\ell_2$--as special cases. Theoretically, sufficient conditions for exact and stable recovery are established under the restricted isometry property (RIP), an oracle inequality and recovery bound are derived for the global solution, and a closed-form expression of the DCEN regularization proximal operator is obtained. Moreover, two efficient optimization algorithms are developed based on the DC algorithm (DCA) and the alternating direction method of multipliers (ADMM). Within the Kurdyka-Łojasiewicz (KŁ) framework, the global convergence of DCA and its linear convergence rate are rigorously established. Furthermore, DCEN is extended to image reconstruction by incorporating total variation (TV) regularization, yielding the DCEN-TV model, which is efficiently solved via the Split Bregman method. Numerical experiments demonstrate that DCEN consistently outperforms state-of-the-art methods in sparse signal recovery, high-dimensional variable selection under strong collinearity, and Magnetic Resonance Imaging (MRI) image reconstruction, achieving superior recovery accuracy and robustness.
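The paper derives a closed-form proximal operator for the DCEN regularizer; that formula is not reproduced here, but the convex special case — the classical Elastic Net prox, soft-thresholding followed by shrinkage — is a known closed form and illustrates the kind of building block the DCA/ADMM subproblems rely on (a hedged sketch, not the paper's solver):

```python
import math

def prox_elastic_net(v, lam1, lam2):
    """Closed-form prox of the convex Elastic Net penalty
    lam1*||x||_1 + (lam2/2)*||x||_2^2: soft-threshold, then shrink.
    This is a special case of DCEN, not the paper's DCEN prox."""
    return [math.copysign(max(abs(vi) - lam1, 0.0), vi) / (1.0 + lam2)
            for vi in v]

# In a DCA iteration, the concave part of the DC objective is linearized and
# a convex subproblem built around a prox like this one is solved per step.
r = prox_elastic_net([2.0, -0.5, 0.1], 0.4, 1.0)   # approx [0.8, -0.05, 0.0]
```

The $\ell_1$ term induces exact zeros while the quadratic term shrinks the surviving coefficients, which is what stabilizes selection among highly correlated features.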
♻ ☆ A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition
Solving (Stampacchia) variational inequalities (SVIs) is a foundational problem at the heart of optimization, expressive enough to capture a broad range of equilibrium and optimization problems. However, this expressivity comes at the cost of computational hardness. As a result, most research has focused on carving out specific subclasses that elude those intractability barriers. A classical property that goes back to the 1960s is the Minty condition, which postulates that the Minty VI (MVI) problem admits a solution. In this paper, we establish the first polynomial-time algorithm -- with complexity growing polynomially in the dimension $d$ and $\log(1/ε)$ -- for solving $ε$-SVIs for Lipschitz continuous mappings under the Minty condition. Prior approaches either incurred an exponentially worse dependence on $1/ε$ (and other natural parameters of the problem) or made more restrictive assumptions, such as monotonicity. To do so, we introduce a new variant of the ellipsoid algorithm whereby separating hyperplanes are obtained after taking a descent step from the center of the ellipsoid. It succeeds even though the set of SVIs can be nonconvex and not fully dimensional. Moreover, when our algorithm is applied to an instance with no MVI solution and fails to identify an SVI solution, it produces a succinct certificate of MVI infeasibility. We also show that deciding whether the Minty condition holds is $\mathsf{coNP}$-complete, thereby establishing that the disjunction of those two problems is polynomial-time solvable even though each problem is individually intractable. We provide several extensions and new applications of our main results. Most notably, we obtain the first polynomial-time algorithms for computing Nash equilibria in multi-player harmonic games. Finally, in two-player general-sum concave games, we give the first polynomial-time algorithm that outputs either a Nash equilibrium or a strict coarse correlated equilibrium.
comment: V3 polishes the writing and makes a correction to Theorem 5.2
♻ ☆ Finite-Time Analysis of Gradient Descent for Shallow Transformers
Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and compare Transformers with recurrent architectures on an autoregressive task.
comment: AISTATS 2026 camera-ready version
♻ ☆ Positivstellensätze for polynomial matrices with universal quantifiers
This paper investigates Positivstellensätze for polynomial matrices subject to universally quantified polynomial matrix inequality constraints. We first establish a matrix-valued Positivstellensatz under the Archimedean condition, incorporating universal quantifiers. For scalar-valued polynomial objectives, we further develop a sparse Positivstellensatz that leverages correlative sparsity patterns within these quantified constraints. Moving beyond the Archimedean framework, we then derive two generalized Positivstellensätze under analogous settings. These results collectively unify and extend foundational theorems in three distinct contexts: classical polynomial Positivstellensätze, their universally quantified counterparts, and matrix polynomial formulations. Applications of the established Positivstellensätze to robust polynomial matrix inequality constrained optimization are also investigated.
comment: 26 pages, 2 figures
♻ ☆ Well-posedness of McKean-Vlasov generalized multivalued BSDEs
This paper investigates McKean-Vlasov backward stochastic variational inequalities (BSVIs) whose generator depends on the joint law of the solution. We first establish the existence and uniqueness of the solution under globally Lipschitz and linear growth conditions. The analysis is then extended to the more general case of locally Lipschitz and non-linear growth conditions.
Robotics 69
☆ Distal-Stable Beam for Continuum Robots
Continuum robots are well suited for constrained environments but suffer from low distal stiffness, resulting in large posture errors under external loads. In this paper, we propose a novel structural primitive, the Distal-Stable Beam, which achieves a strong stiffness gradient through purely geometric design, maintaining compliance in the intermediate section while ensuring high distal rigidity. The structure consists of two parallel rods and one convergent rod constrained by guide disks, introducing geometric coupling that suppresses deformation modes and preserves distal posture. Experiments show that the distal stiffness is 12 times higher than at the center, corresponding to an approximately 100-fold improvement over a conventional cantilever beam. The proposed mechanism enables simultaneous compliance and distal stability without active stiffness modulation, providing a new design approach for continuum robots requiring both safety and precision.
comment: 8 pages, 7 figures
☆ Efficient Equivariant Transformer for Self-Driving Agent Modeling CVPR 2026
Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.
comment: CVPR 2026
☆ Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis
Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.
comment: This work has been submitted to the 2026 IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
☆ Neural Robust Control on Lie Groups Using Contraction Methods (Extended Version)
In this paper, we propose a learning framework for synthesizing a robust controller for dynamical systems evolving on a Lie group. A robust control contraction metric (RCCM) and a neural feedback controller are jointly trained to enforce contraction conditions on the Lie group manifold. Sufficient conditions are derived for the existence of such an RCCM and neural controller, ensuring that the geometric constraints imposed by the manifold structure are respected while establishing a disturbance-dependent tube that bounds the output trajectories. As a case study, a feedback controller for a quadrotor is designed using the proposed framework. Its performance is evaluated using numerical simulations and compared with a geometric controller.
comment: An extended version of the conference paper submitted for publication in the IEEE Conference on Decision and Control
☆ Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation
Vision-based policies have achieved strong performance in robotic manipulation due to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks, where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been proposed to integrate vision and F/T signals, including auxiliary prediction objectives, mixture-of-experts architectures, and contact-aware gating mechanisms, a systematic comparison of these approaches remains lacking. In this work, we present a comparative study of different F/T-vision integration strategies within diffusion-based manipulation policies. In addition, we propose an adaptive integration strategy that ignores F/T signals during non-contact phases while adaptively leveraging both vision and torque information during contact. Experimental results demonstrate that our method outperforms the strongest baseline by 14% in success rate, highlighting the importance of contact-aware multimodal fusion for robotic manipulation.
☆ A soft and lightweight fabric-based pneumatic interface for multimodal fingertip tactile feedback
Wearable fingertip haptic devices are critical for realistic interaction in virtual reality, augmented reality, and teleoperation, yet existing approaches struggle to simultaneously achieve adequate tactile output, low mass, simple fabrication, and untethered portability. Here we show that fabric-based pneumatic actuation can address this gap. Our device comprises four pneumatic chambers fabricated from thermoplastic polyurethane-coated fabric via computer numerical control heat-sealing, yielding a soft, conformable interface weighing 2.1 g that operates untethered with a wrist-mounted control unit. Mechanical and dynamic characterization confirms that the fabric actuators produce sufficient force, displacement, and bandwidth for fingertip tactile rendering. A psychophysical study with 15 participants demonstrates classification accuracy exceeding 90% across three distinct tactile modes -- contact configuration, directional sliding, and vibrotactile frequency. These findings establish fabric-based pneumatic actuation as a viable technology route for lightweight, low-cost, and multimodal fingertip haptic interfaces.
☆ AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
☆ Open-loop POMDP Simplification and Safe Skipping of Replanning with Formal Performance Guarantees
Partially Observable Markov Decision Processes (POMDPs) provide a principled mathematical framework for decision-making under uncertainty. However, the exact solution to POMDPs is computationally intractable. In this paper, we address the computational intractability by introducing a novel framework for adaptive open-loop simplification with formal performance guarantees. Our method adaptively interleaves open-loop and closed-loop planning via a topology-based belief tree, enabling a significant reduction in planning complexity. The key contribution lies in the derivation of efficiently computable bounds which provide formal guarantees and can be used to ensure that our simplification can identify the immediate optimal action of the original POMDP problem. Our framework therefore provides computationally tractable performance guarantees for macro-actions within POMDPs. Furthermore, we propose a novel framework for safely skipping replanning during execution, supported by theoretical guarantees on multi-step open-loop action sequences. To the best of our knowledge, this framework is the first to address skipping replanning with formal performance guarantees. Practical online solvers for our proposed simplification are developed, including a sampling-based solver and an anytime solver. Empirical results demonstrate substantial computational speedups while maintaining provable performance guarantees, advancing the tractability and efficiency of POMDP planning.
comment: 18 pages, 5 figures. Accepted to WAFR 2026
☆ Safety, Security, and Cognitive Risks in World Models RSS
World models -- learned internal simulators of environment dynamics -- are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.
comment: 26 pages, 1 figure (6 panels), 2 tables. Empirical proof-of-concept on GRU/RSSM/DreamerV3 architectures
☆ Functional Force-Aware Retargeting from Virtual Human Demos to Soft Robot Policies
We introduce SoftAct, a framework for teaching soft robot hands to perform human-like manipulation skills by explicitly reasoning about contact forces. Leveraging immersive virtual reality, our system captures rich human demonstrations, including hand kinematics, object motion, dense contact patches, and detailed contact force information. Unlike conventional approaches that retarget human joint trajectories, SoftAct employs a two-stage, force-aware retargeting algorithm. The first stage attributes demonstrated contact forces to individual human fingers and allocates robot fingers proportionally, establishing a force-balanced mapping between human and robot hands. The second stage performs online retargeting by combining baseline end-effector pose tracking with geodesic-weighted contact refinements, using contact geometry and force magnitude to adjust robot fingertip targets in real time. This formulation enables soft robotic hands to reproduce the functional intent of human demonstrations while naturally accommodating extreme embodiment mismatch and nonlinear compliance. We evaluate SoftAct on a suite of contact-rich manipulation tasks using a custom non-anthropomorphic pneumatic soft robot hand. SoftAct's controller reduces fingertip trajectory tracking RMSE by up to 55 percent and reduces tracking variance by up to 69 percent compared to kinematic and learning-based baselines. At the policy level, SoftAct achieves consistently higher success in zero-shot real-world deployment and in simulation. These results demonstrate that explicitly modeling contact geometry and force distribution is essential for effective skill transfer to soft robotic hands, and cannot be recovered through kinematic imitation alone. Project videos and additional details are available at https://soft-act.github.io/.
☆ Collaborative Task and Path Planning for Heterogeneous Robotic Teams using Multi-Agent PPO
Efficient robotic extraterrestrial exploration requires robots with diverse capabilities, ranging from scientific measurement tools to advanced locomotion. A robotic team enables the distribution of tasks over multiple specialized subsystems, each providing specific expertise to complete the mission. The central challenge lies in efficiently coordinating the team to maximize utilization and the extraction of scientific value. Classical planning algorithms scale poorly with problem size, leading to long planning cycles and high inference costs due to the combinatorial growth of possible robot-target allocations and possible trajectories. Learning-based methods are a viable alternative, moving the scaling concern from runtime to training time and marking a critical step towards real-time planning. In this work, we present a collaborative planning strategy based on Multi-Agent Proximal Policy Optimization (MAPPO) to coordinate a team of heterogeneous robots to solve a complex target allocation and scheduling problem. We benchmark our approach against single-objective optimal solutions obtained through exhaustive search and evaluate its ability to perform online replanning in the context of a planetary exploration scenario.
comment: 8 pages, 3 figures, associated code on https://github.com/leggedrobotics/multi_robot_global_planner
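The per-sample clipped surrogate at the core of (MA)PPO is compact enough to sketch (the standard PPO definition, not this paper's implementation; MAPPO applies it per agent, typically with a centralized critic supplying the advantages):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A).
    `ratio` is pi_new(a|s) / pi_old(a|s); clipping removes the incentive to
    move the policy far from the data-collecting policy in a single update."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
# Negative advantage: the penalty is never relaxed below the clipped value.
```

The surrogate is averaged over a batch and maximized; the pessimistic `min` is what makes the update conservative in both advantage signs.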
☆ A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems
Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer-grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper
comment: 5 pages, 1 figure
☆ SMASH: Mastering Scalable Whole-Body Skills for Humanoid Ping-Pong with Egocentric Vision
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole-body coordination for precise task execution. These limitations stem from two core challenges: achieving low-latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task-aligned strike motions for learning precise yet natural whole-body behaviors. In this work, we present SMASH, a modular system for agile humanoid table tennis that unifies scalable whole-body skill learning with onboard egocentric perception, eliminating the need for external cameras during deployment. Our work advances prior humanoid table-tennis systems in three key aspects. First, we achieve agile and precise ball interaction with tightly coordinated whole-body control, rather than relying on decoupled upper- and lower-body behaviors. This enables the system to exhibit diverse strike motions, including explosive whole-body smashes and low crouching shots. Second, by augmenting and diversifying strike motions with a generative model, our framework benefits from scalable motion priors and produces natural, robust striking behaviors across a wide workspace. Third, to the best of our knowledge, we demonstrate the first humanoid table-tennis system capable of consecutive strikes using onboard sensing alone, despite the challenges of low-latency perception, ego-motion-induced instability, and limited field of view. Extensive real-world experiments demonstrate stable and precise ball exchanges under high-speed conditions, validating scalable, perception-driven whole-body skill learning for dynamic humanoid interaction tasks.
☆ Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking
Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking (ES) to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.
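The bounded-ES idea can be illustrated with a minimal generic sketch (function name, gains, and the quadratic toy cost are illustrative assumptions, not the paper's controller): the measured cost only shifts the phase of a sinusoidal dither, so the parameter update rate stays bounded no matter how large the cost is, while on average the parameters descend the cost gradient.

```python
import math

def bounded_es_step(theta, t, cost, dt, alpha=0.5, k=2.0, omegas=(50.0, 75.0)):
    """One Euler step of a bounded extremum-seeking (ES) update.

    Each parameter moves at a rate bounded by sqrt(alpha * omega_i); the
    measured cost only shifts the dither phase, so cost spikes cannot
    destabilize the update.  On average the parameters follow gradient
    descent on the cost with rate k * alpha / 2.
    """
    return [th + dt * math.sqrt(alpha * w) * math.cos(w * t + k * cost)
            for th, w in zip(theta, omegas)]

# Toy usage: steer a quadratic cost J(theta) = theta_1^2 + theta_2^2 toward 0,
# standing in for a deployment-time correction applied alongside an RL action.
theta, dt = [2.0, -1.5], 1e-3
for n in range(10_000):                       # simulate 10 seconds
    J = theta[0] ** 2 + theta[1] ** 2         # cost measured at each step
    theta = bounded_es_step(theta, n * dt, J, dt)
```

After the simulated run the parameters settle near the minimum, up to a residual dither of roughly sqrt(alpha / omega_i) in amplitude.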
☆ VRUD: A Drone Dataset for Complex Vehicle-VRU Interactions within Mixed Traffic
The Operational Design Domain (ODD) of urban-oriented Level 4 (L4) autonomous driving, especially for autonomous robotaxis, confronts formidable challenges in complex urban mixed traffic environments. These challenges stem mainly from the high density of Vulnerable Road Users (VRUs) and their highly uncertain and unpredictable interaction behaviors. However, existing open-source datasets predominantly focus on structured scenarios such as highways or regulated intersections, leaving a critical gap in data representing chaotic, unstructured urban environments. To address this, this paper proposes an efficient, high-precision method for constructing drone-based datasets and establishes the Vehicle-Vulnerable Road User Interaction Dataset (VRUD), as illustrated in Figure 1. Distinct from prior works, VRUD is collected from typical "Urban Villages" in Shenzhen, characterized by loose traffic supervision and extreme occlusion. The dataset comprises 4 hours of 4K/30Hz recording, containing 11,479 VRU trajectories and 1,939 vehicle trajectories. A key characteristic of VRUD is its composition: VRUs account for about 87% of all traffic participants, significantly exceeding the proportions in existing benchmarks. Furthermore, unlike datasets that only provide raw trajectories, we extracted 4,002 multi-agent interaction scenarios based on a novel Vector Time to Collision (VTTC) threshold, supported by standard OpenDRIVE HD maps. This study provides valuable, rare edge-case resources for enhancing the safety performance of automated driving systems (ADS) in complex, unstructured urban environments. To facilitate further research, we have made the VRUD dataset open-source at: https://zzi4.github.io/VRUD/.
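As a rough illustration of TTC-style interaction filtering (not the paper's VTTC definition, which is its own vector-based metric), a constant-velocity closest-approach computation for an agent pair might look like:

```python
import math

def time_to_closest_approach(p_rel, v_rel):
    """Closest-approach time and distance for two agents under a
    constant-velocity assumption; a generic proxy for TTC-style
    thresholds used to extract interaction scenarios.

    p_rel, v_rel: position and velocity of agent B relative to agent A.
    """
    vv = v_rel[0] ** 2 + v_rel[1] ** 2
    if vv < 1e-12:                             # no relative motion
        return 0.0, math.hypot(*p_rel)
    t = max(0.0, -(p_rel[0] * v_rel[0] + p_rel[1] * v_rel[1]) / vv)
    dx, dy = p_rel[0] + t * v_rel[0], p_rel[1] + t * v_rel[1]
    return t, math.hypot(dx, dy)

# Head-on approach: 10 m apart, closing at 2 m/s -> collision in 5 s.
t_hit, d_hit = time_to_closest_approach((10.0, 0.0), (-2.0, 0.0))
# Diverging pair: closest approach is now, at the current 10 m separation.
t_div, d_div = time_to_closest_approach((10.0, 0.0), (2.0, 0.0))
```

Thresholding such a quantity per agent pair is one way to separate genuine interactions from mere co-presence in dense traffic recordings.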
☆ ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction CVPR 2026
3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.
comment: Accepted to CVPR 2026. The source code is publicly available at https://github.com/7uHeng/ProOOD
☆ BAT: Balancing Agility and Stability via Online Policy Switching for Long-Horizon Whole-Body Humanoid Control
Despite recent advances in control, reinforcement learning, and imitation learning, developing a unified framework that can achieve agile, precise, and robust whole-body behaviors, particularly in long-horizon tasks, remains challenging. Existing approaches typically follow two paradigms: coupled whole-body policies for global coordination and decoupled policies for modular precision. However, without a systematic method to integrate both, this trade-off between agility, robustness, and precision remains unresolved. In this work, we propose BAT, an online policy-switching framework that dynamically selects between two complementary whole-body RL controllers to balance agility and stability across different motion contexts. Our framework consists of two complementary modules: a switching policy learned via hierarchical RL with expert guidance from sliding-horizon policy pre-evaluation, and an option-aware VQ-VAE that predicts option preference from discrete motion token sequences for improved generalization. The final decision is obtained via confidence-weighted fusion of the two modules. Extensive simulations and real-world experiments on the Unitree G1 humanoid robot demonstrate that BAT enables versatile long-horizon loco-manipulation and outperforms prior methods across diverse tasks.
☆ Stein Variational Uncertainty-Adaptive Model Predictive Control
We propose a Stein variational distributionally robust controller for nonlinear dynamical systems with latent parametric uncertainty. The method replaces conservative worst-case ambiguity-set optimization with a deterministic particle-based approximation of a task-dependent uncertainty distribution, enabling the controller to concentrate on the parameter sensitivities that most strongly affect closed-loop performance. Our method yields a controller that is robust to latent parameter uncertainty by coupling optimal control with Stein variational inference, avoiding restrictive parametric assumptions on the uncertainty model while preserving computational parallelism. In contrast to classical distributionally robust optimization (DRO), which can sacrifice nominal performance through worst-case design, our approach achieves robustness by shaping the control law around the uncertainties most critical to the task objective. The proposed framework therefore reconciles robust control and variational inference in a single decision-theoretic formulation for broad classes of control systems with parameter uncertainty. We demonstrate our approach on representative control problems that empirically illustrate improved performance-robustness tradeoffs over nominal, ensemble, and classical distributionally robust baselines.
☆ Infinite-Horizon Ergodic Control via Kernel Mean Embeddings
This paper derives an infinite-horizon ergodic controller based on kernel mean embeddings for long-duration coverage tasks on general domains. While existing kernel-based ergodic control methods provide strong coverage guarantees on general coverage domains, their practical use has been limited to sub-ergodic, finite-time horizons due to intractable computational scaling, prohibiting their use for long-duration coverage. We resolve this scaling by deriving an infinite-horizon ergodic controller equipped with an extended kernel mean embedding error visitation state that recursively records state visitation. This extended state decouples past visitation from future control synthesis and expands ergodic control to infinite-time settings. In addition, we present a variation of the controller that operates on a receding-horizon control formulation with the extended error state. We provide a theoretical proof of asymptotic convergence of the derived controller and show preservation of ergodic coverage guarantees for a class of 2D and 3D coverage problems.
comment: 8 pages, 11 figures
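The recursively recorded visitation state can be sketched as a running mean of kernel features evaluated at fixed grid points (the RBF kernel, its bandwidth, and the grid here are illustrative assumptions, not the paper's construction):

```python
import math

def update_visitation_embedding(mu, t_prev, x, grid, sigma=0.25):
    """Recursively update a kernel mean embedding of state visitation,
    evaluated at fixed grid points:
        mu_t = ((t - 1) / t) * mu_{t-1} + (1 / t) * k(x_t, .)
    Carrying this running mean as an extra state decouples past
    visitation from future control synthesis."""
    t = t_prev + 1
    for i, g in enumerate(grid):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, g))
        k_xg = math.exp(-sq_dist / (2.0 * sigma ** 2))     # RBF kernel
        mu[i] = (t - 1) / t * mu[i] + k_xg / t
    return mu, t

# Visit the same state twice on a two-point grid.
grid = [(0.0, 0.0), (1.0, 1.0)]
mu, t = [0.0, 0.0], 0
mu, t = update_visitation_embedding(mu, t, (0.0, 0.0), grid)
mu, t = update_visitation_embedding(mu, t, (0.0, 0.0), grid)
```

Because the update is a running average, each step costs the same regardless of trajectory length, which is the property that makes infinite-horizon operation tractable.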
☆ Focal plane wavefront control with model-based reinforcement learning
The direct imaging of potentially habitable exoplanets is one prime science case for high-contrast imaging (HCI) instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for adaptive optics (AO) to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamic NCPAs, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.
comment: 13 pages, 11 figures; accepted by A&A
☆ An Integrated Soft Robotic System for Measuring Vital Signs in Search and Rescue Environments
Robots are frequently utilized in search-and-rescue operations. In recent years, significant advancements have been made in the field of victim assessment. However, there are still open issues regarding heart rate measurement, and no studies have been found that assess blood pressure in post-disaster scenarios. This work designs a soft gripper and integrates it into a mobile robotic system, thereby creating a device capable of measuring the pulse and blood pressure of victims in post-disaster environments. The gripper is designed to envelop the victim's arm and inflate like a sphygmomanometer, facilitated by a specialized portability system. The utilization of different signal processing algorithms has enabled the attainment of a pulse bias of 4 bpm and a bias of approximately 5 mmHg for systolic and diastolic pressures. The findings, in conjunction with the other statistical data and the validation of homoscedasticity in the error terms, prove the system's capacity to accurately determine heart rate and blood pressure, thereby rendering it suitable for search and rescue operations. Finally, a post-disaster scenario has been employed as a test to validate the functionality of the entire system and to demonstrate its capacity to adapt to various victim positions, its measurement speed, and its safety for victims.
☆ PanoAir: A Panoramic Visual-Inertial SLAM with Cross-Time Real-World UAV Dataset
Accurate pose estimation is fundamental for unmanned aerial vehicle (UAV) applications, where Visual-Inertial SLAM (VI-SLAM) provides a cost-effective solution for localization and mapping. However, existing VI-SLAM methods mainly rely on sensors with limited fields of view (FoV), which can lead to drift and even failure in complex UAV scenarios. Although panoramic cameras provide omnidirectional perception to improve robustness, panoramic VI-SLAM and corresponding real-world datasets for UAVs remain underexplored. To address this limitation, we first construct a real-world panoramic visual-inertial dataset covering diverse flight conditions, including varying illumination, altitudes, trajectory lengths, and motion dynamics. To achieve accurate and robust pose estimation under such challenging UAV scenarios, we propose a panoramic VI-SLAM framework that exploits the omnidirectional FoV via the proposed panoramic feature extraction and panoramic loop closure, enhancing feature constraints and ensuring global consistency. Extensive experiments on both the proposed dataset and public benchmarks demonstrate that our method achieves superior accuracy, robustness, and consistency compared to existing approaches. Moreover, deployment on an embedded platform validates its practical applicability, achieving comparable computational efficiency to PC implementations. The source code and dataset are publicly available at https://drive.google.com/file/d/1lG1Upn6yi-N6tYpEHAt6dfR1uhzNtWbT/view
☆ DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
comment: Code is available at https://github.com/wzzheng/DVGT
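The sliding-window caching pattern behind such streaming inference can be sketched with a bounded history buffer (a generic idiom; the actual cached features and window length in DVGT-2 are not specified here):

```python
from collections import deque

# A bounded-length cache: appending a new frame's features silently evicts
# the oldest entry, so streaming inference attends only to a fixed window
# of history instead of recomputing over the whole sequence.
window = 3                                   # illustrative window length
cache = deque(maxlen=window)
for frame_id in range(5):
    cache.append(f"feat_{frame_id}")         # placeholder per-frame feature
```

The per-frame cost is constant in sequence length, which is what distinguishes a streaming model from batch multi-frame reconstruction.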
☆ Compact Keyframe-Optimized Multi-Agent Gaussian Splatting SLAM
Efficient multi-agent 3D mapping is essential for robotic teams operating in unknown environments, but dense representations hinder real-time exchange over constrained communication links. In multi-agent Simultaneous Localization and Mapping (SLAM), systems typically rely on a centralized server to merge and optimize the local maps produced by individual agents. However, sharing these large map representations, particularly those generated by recent methods such as Gaussian Splatting, becomes a bottleneck in real-world scenarios with limited bandwidth. We present an improved multi-agent RGB-D Gaussian Splatting SLAM framework that reduces communication load while preserving map fidelity. First, we incorporate a compaction step into our SLAM system to remove redundant 3D Gaussians, without degrading the rendering quality. Second, our approach performs centralized loop closure computation without initial guess, operating in two modes: a pure rendered-depth mode that requires no data beyond the 3D Gaussians, and a camera-depth mode that includes lightweight depth images for improved registration accuracy and additional Gaussian pruning. Evaluation on both synthetic and real-world datasets shows up to 85-95% reduction in transmitted data compared to state-of-the-art approaches in both modes, bringing 3D Gaussian multi-agent SLAM closer to practical deployment in real-world scenarios. Code: https://github.com/lemonci/coko-slam
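One common compaction heuristic for Gaussian Splatting maps is to drop near-transparent primitives before transmission; the sketch below uses an illustrative opacity threshold and dictionary layout, not the paper's actual compaction criterion:

```python
def prune_gaussians(gaussians, min_opacity=0.05):
    """Drop 3D Gaussians whose opacity falls below a threshold; such
    near-transparent primitives contribute little to rendered views,
    so removing them shrinks the map sent over the network."""
    return [g for g in gaussians if g["opacity"] >= min_opacity]

# Hypothetical map fragment: one Gaussian is effectively invisible.
gaussians = [
    {"mean": (0.0, 0.0, 0.0), "opacity": 0.90},
    {"mean": (1.0, 0.0, 0.0), "opacity": 0.01},
    {"mean": (0.0, 1.0, 0.0), "opacity": 0.40},
]
compact = prune_gaussians(gaussians)
```

In practice such pruning is validated by checking that rendering metrics (e.g., PSNR against held-out views) do not degrade after compaction.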
☆ Bench2Drive-VL: Benchmarks for Closed-Loop Autonomous Driving with Vision-Language Models
With the rise of vision-language models (VLM), their application for autonomous driving (VLM4AD) has gained significant attention. Meanwhile, in autonomous driving, closed-loop evaluation has become widely recognized as a more reliable validation method than open-loop evaluation, as it can evaluate the performance of the model under cumulative errors and out-of-distribution inputs. However, existing VLM4AD benchmarks evaluate the model's scene understanding ability in an open-loop fashion, i.e., via static question-answer (QA) datasets. This kind of evaluation fails to assess the VLMs' performance under out-of-distribution states that rarely appear in human-collected datasets. To this end, we present Bench2Drive-VL, an extension of Bench2Drive that brings closed-loop evaluation to VLM-based driving, which introduces: (1) DriveCommenter, a closed-loop generator that automatically generates diverse, behavior-grounded question-answer pairs for all driving situations in CARLA, including severe off-route and off-road deviations previously unassessable in simulation. (2) A unified protocol and interface that allows modern VLMs to be directly plugged into the Bench2Drive closed-loop environment to compare with traditional agents. (3) A flexible reasoning and control framework, supporting multi-format visual inputs and configurable graph-based chain-of-thought execution. (4) A complete development ecosystem. Together, these components form a comprehensive closed-loop benchmark for VLM4AD. All codes and annotated datasets are open-sourced.
comment: All codes and annotated datasets are available at https://github.com/Thinklab-SJTU/Bench2Drive-VL and https://huggingface.co/datasets/Telkwevr/Bench2Drive-VL-base
☆ A Dual-Action Fabric-Based Soft Robotic Glove for Ergonomic Hand Rehabilitation
Hand impairment following neurological disorders substantially limits independence in activities of daily living (ADL), motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and flexion-extension actuation constrain their clinical utility. Here, we present a dual-action fabric-based soft robotic glove incorporating customized actuators aligned with individual finger joints. The glove comprises five independently controlled dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby maximizing finger contact area and improving comfort. Systematic characterization confirmed that the actuators generate sufficient joint moment and fingertip force for ADL-relevant tasks, and that the complete glove system produces adequate grasping force for common household objects. A preliminary study with ten healthy subjects demonstrated that active glove assistance significantly reduces forearm muscle activity during object manipulation. A pilot feasibility study with three individuals with cervical spinal cord injury across seven functional tasks indicated that glove assistance promotes more natural grasp patterns and reduces reliance on tenodesis grasp, although at the cost of increased task completion time attributable to the current actuation interface. This customizable, ergonomic design represents a practical step toward personalized hand rehabilitation and assistive robotics.
☆ A wearable haptic device for edge and surface simulation
Object manipulation is fundamental to virtual reality (VR) applications, yet conventional fingertip haptic devices fail to render certain tactile features relevant for immersive and precise interactions, such as the detection of edges. This paper presents a compact, lightweight fingertip haptic device (24.3 g) that delivers distinguishable surface and edge contact feedback through a novel dual-motor mechanism. Pressure distribution characterization using a 6 x 6 flexible sensor array demonstrates distinct contact patterns between the two stimulation modes. A preliminary user study with five participants achieved 93% average classification accuracy across four conditions (edge/surface contact with light/heavy pressure), with mean response times of 2.79 seconds. The results indicate that the proposed device can effectively convey edge and surface tactile cues, potentially enhancing object manipulation fidelity in VR environments.
☆ How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands ICRA
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining for each new sensor due to differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT's potential to make tactile sensing more scalable and practical for real-world robotic applications.
comment: Accepted for publication at the International Conference on Robotics and Automation (ICRA) 2026, Vienna
☆ SoftHand Model-W: A 3D-Printed, Anthropomorphic, Underactuated Robot Hand with Integrated Wrist and Carpal Tunnel ICRA
This paper presents the SoftHand Model-W: a 3D-printed, underactuated, anthropomorphic robot hand based on the Pisa/IIT SoftHand, with an integrated antagonistic tendon mechanism and 2-degree-of-freedom tendon-driven wrist. These four degrees of actuation provide active flexion and extension to the five fingers, and active flexion/extension and radial/ulnar deviation of the palm through the wrist, while preserving the synergistic and self-adaptive features of such SoftHands. A carpal tunnel-inspired tendon routing allows remote motor placement in the forearm, reducing distal inertia and maintaining a compact form factor. The SoftHand-W is mounted on a 6-axis robot arm and tested with two reorientation tasks requiring coordination between the hand and arm's pose: cube stacking and in-plane disc rotation. Results comparing task time, arm joint travel, and configuration changes with and without wrist actuation show that adding the wrist reduces compensatory and reconfiguration movements of the arm for a quicker task-completion time. Moreover, the wrist enables pick-and-place operations that would be impossible otherwise. Overall, the SoftHand Model-W demonstrates how proximal degrees of freedom are key to achieving versatile, human-like manipulation in real-world robotic applications, with a compact design enabling deployment in research and assistive settings.
comment: Accepted for publication at the International Conference on Robotics and Automation (ICRA) 2026, Vienna
☆ LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics
Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of computationally efficient panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims at providing strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.
comment: Submitted to IEEE ICIP 2026. Under review
☆ StretchBot: A Neuro-Symbolic Framework for Adaptive Guidance with Assistive Robots
Assistive robots have growing potential to support physical wellbeing in home and healthcare settings, for example, by guiding users through stretching or rehabilitation routines. However, existing systems remain largely scripted, which limits their ability to adapt to user state, environmental context, and interaction dynamics. In this work, we present StretchBot, a hybrid neuro-symbolic robotic coach for adaptive assistive guidance. The system combines multimodal perception with knowledge-graph-grounded large language model reasoning to support context-aware adjustments during short stretching sessions while maintaining a structured routine. To complement the system description, we report an exploratory pilot comparison between scripted and adaptive guidance with three participants. The pilot findings suggest that the adaptive condition improved perceived adaptability and contextual relevance, while scripted guidance remained competitive in smoothness and predictability. These results provide preliminary evidence that structured actionable knowledge can help ground language-model-based adaptation in embodied assistive interaction, while also highlighting the need for larger, longitudinal studies to evaluate robustness, generalizability, and long-term user experience.
☆ A Physical Imitation Learning Pipeline for Energy-Efficient Quadruped Locomotion Assisted by Parallel Elastic Joint
Due to brain-body co-evolution, animals' intrinsic body dynamics play a crucial role in energy-efficient locomotion, which shares control effort between active muscles and passive body dynamics -- a principle known as Embodied Physical Intelligence. In contrast, robot bodies are often designed with one centralised controller that typically suppresses the intrinsic body dynamics instead of exploiting them. We introduce Physical Imitation Learning (PIL), which distils a Reinforcement Learning (RL) control policy into physically implementable body responses that can be directly offloaded to passive Parallel Elastic Joints (PEJs), enabling the body to imitate part of the controlled behaviour. Meanwhile, the residual policy commands the motors to recover the RL policy's performance. The result is an overall reduction in energy consumption, thanks to outsourcing part of the control policy to the PEJs. Here we show, in simulated quadrupeds, that our PIL approach can offload up to 87% of mechanical power to PEJs on flat terrain and 18% on rough terrain. Because the body design is distilled from -- rather than jointly optimised with -- the control policy, PIL realises brain-body co-design without expanding the search space with body design parameters, providing a computationally efficient route to task-specific Embodied Physical Intelligence applicable to a wide range of joint-based robot morphologies.
☆ Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning
The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.
☆ Simulating Realistic LiDAR Data Under Adverse Weather for Autonomous Vehicles: A Physics-Informed Learning Approach
Accurate LiDAR simulation is crucial for autonomous driving, especially under adverse weather conditions. Existing methods struggle to capture the complex interactions between LiDAR signals and atmospheric phenomena, leading to unrealistic representations. This paper presents a physics-informed learning framework (PICWGAN) for generating realistic LiDAR data under adverse weather conditions. By integrating physics-driven constraints for modeling signal attenuation and geometry-consistent degradations into a physics-informed learning pipeline, the proposed method reduces the sim-to-real gap. Evaluations on real-world datasets (CADC for snow, Boreas for rain) and the VoxelScape dataset show that our approach closely mimics real-world intensity patterns. Quantitative metrics, including MSE, SSIM, KL divergence, and Wasserstein distance, demonstrate statistically consistent intensity distributions. Additionally, models trained on data enhanced by our framework outperform baselines in downstream 3D object detection, achieving performance comparable to models trained on real-world data. These results highlight the effectiveness of the proposed approach in improving the realism of LiDAR data and enabling robust perception under adverse weather conditions.
☆ Bistable Quad-Nets Composed of Four-Bar Linkages
We study mechanical structures composed of spatial four-bar linkages that are bistable, that is, they allow for two distinct configurations. They have an interpretation as quad nets in the Study quadric which can be used to prove existence of arbitrarily large structures of this type. We propose a purely geometric construction of such examples, starting from infinitesimally flexible quad nets in Euclidean space and applying Whiteley de-averaging. This point of view situates the problem within the broader framework of discrete differential geometry and enables the construction of bistable structures from well-known classes of quad nets, such as discrete minimal surfaces. The proposed construction does not rely on numerical optimization and allows control over axis positions and snap angles.
☆ Reachability-Aware Time Scaling for Path Tracking
This paper studies tracking of collision-free waypoint paths produced by an offline planner for a planar double-integrator system with bounded speed and acceleration. Because sampling-based planners must route around obstacles, the resulting waypoint paths can contain sharp turns and high-curvature regions, so one-step reachability under acceleration limits becomes critical even when the path geometry is collision-free. We build on a pure-pursuit-style, reachability-guided quadratic-program (QP) tracker with a one-step acceleration margin. Offline, we evaluate this margin along a spline fitted to the waypoint path and update a scalar speed-scaling profile so that the required one-step acceleration remains below the available bound. Online, the same look-ahead tracking structure is used to track the scaled reference.
comment: 7 pages, 5 figures
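The offline speed-scaling step can be illustrated with a simplified proxy: using the centripetal-acceleration bound a = κv² in place of the paper's one-step reachability margin, the admissible speed at curvature κ is √(a_max/κ), and the scaling factor caps the nominal speed to that value. The function below is an assumption-laden sketch, not the authors' implementation.

```python
import math

def speed_scaling(curvatures, v_nominal, a_max):
    """Per-sample speed-scaling factors along a path.

    Proxy rule: centripetal acceleration a = kappa * v**2 must stay
    below a_max, so the admissible speed at curvature kappa is
    sqrt(a_max / kappa).  (Stand-in for the paper's one-step
    reachability margin under full acceleration limits.)
    """
    factors = []
    for kappa in curvatures:
        if kappa <= 0.0:
            factors.append(1.0)  # straight segment: no slowdown needed
            continue
        v_allowed = math.sqrt(a_max / kappa)
        factors.append(min(1.0, v_allowed / v_nominal))
    return factors

# A sharp turn (high curvature) forces a stronger slowdown than a gentle one.
print(speed_scaling([0.0, 0.5, 4.0], v_nominal=2.0, a_max=2.0))
```

The online tracker then follows the reference at the scaled speed, so the required one-step acceleration stays within the available bound even through high-curvature regions.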
☆ Certificate-Driven Closed-Loop Multi-Agent Path Finding with Inheritable Factorization
Multi-agent coordination in automated warehouses and logistics is commonly modeled as the Multi-Agent Path Finding (MAPF) problem. Closed-loop MAPF algorithms improve scalability by planning only the next movement and replanning online, but this finite-horizon viewpoint can be shortsighted and makes it difficult to preserve global guarantees and exploit compositional structure. This issue is especially visible in Anytime Closed-Loop Conflict-Based Search (ACCBS), which applies Conflict-Based Search (CBS) over dynamically extended finite horizons but, under finite computational budgets, may terminate with short active prefixes in dense instances. We introduce certificate trajectories and their associated fleet budget as a general mechanism for filtering closed-loop updates. A certificate provides a conflict-free fallback plan and a monotone upper bound on the remaining cost; accepting only certificate-improving updates yields completeness. The same budget information induces a budget-limited factorization that enables global, inheritable decomposition across timesteps. Instantiating the framework on ACCBS yields Certificate-Driven Conflict-Based Search (CDCBS). Experiments on benchmark maps show that CDCBS achieves more consistent solution quality than ACCBS, particularly in dense settings, while the proposed factorization reduces effective group size.
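The certificate-filtering mechanism reduces to a simple acceptance rule: each proposed closed-loop update carries a monotone upper bound on remaining cost, and only bound-improving updates are applied. The sketch below illustrates that rule in isolation (hypothetical update names, not the full CDCBS algorithm).

```python
def filter_updates(initial_bound, proposals):
    """Certificate-driven filtering of closed-loop updates.

    Each proposal is a (name, bound) pair, where `bound` is the
    certificate's upper bound on remaining fleet cost.  A proposal is
    accepted only if it improves the incumbent bound, so the sequence
    of accepted bounds is strictly decreasing.
    """
    bound, accepted = initial_bound, []
    for name, b in proposals:
        if b < bound:  # certificate-improving -> accept
            bound = b
            accepted.append(name)
    return accepted, bound

# Only updates that tighten the cost bound are applied.
proposals = [("u1", 90.0), ("u2", 95.0), ("u3", 70.0)]
print(filter_updates(100.0, proposals))  # (['u1', 'u3'], 70.0)
```

Because every accepted update still carries a conflict-free fallback plan, the fleet always has a feasible plan in hand, which is what yields the completeness guarantee.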
☆ Learning Humanoid Navigation from Human Data
We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com
comment: 8 pages, 8 figures
☆ Sampling-based Task and Kinodynamic Motion Planning under Semantic Uncertainty
This paper tackles the problem of integrated task and kinodynamic motion planning in uncertain environments. We consider a robot with nonlinear dynamics tasked with a Linear Temporal Logic over finite traces (LTL$_f$) specification operating in a partially observable environment. Specifically, the uncertainty is in the semantic labels of the environment. We show how the problem can be modeled as a Partially Observable Stochastic Hybrid System that captures the robot dynamics, LTL$_f$ task, and uncertainty in the environment state variables. We propose an anytime algorithm that takes advantage of the structure of the hybrid system, and combines the effectiveness of decision-making techniques and sampling-based motion planning. We prove the soundness and asymptotic optimality of the algorithm. Results show the efficacy of our algorithm in uncertain environments, and that it consistently outperforms baseline methods.
☆ Behavioral Score Diffusion: Model-Free Trajectory Planning via Kernel-Based Score Estimation from Data
Diffusion-based trajectory optimization has emerged as a powerful planning paradigm, but existing methods require either learned score networks trained on large datasets or analytical dynamics models for score computation. We introduce Behavioral Score Diffusion (BSD), a training-free and model-free trajectory planner that computes the diffusion score function directly from a library of trajectory data via kernel-weighted estimation. At each denoising step, BSD retrieves relevant trajectories using a triple-kernel weighting scheme -- diffusion proximity, state context, and goal relevance -- and computes a Nadaraya-Watson estimate of the denoised trajectory. The diffusion noise schedule naturally controls kernel bandwidths, creating a multi-scale nonparametric regression: broad averaging of global behavioral patterns at high noise, fine-grained local interpolation at low noise. This coarse-to-fine structure handles nonlinear dynamics without linearization or parametric assumptions. Safety is preserved by applying shielded rollout on kernel-estimated state trajectories, identical to existing model-based approaches. We evaluate BSD on four robotic systems of increasing complexity (3D-6D state spaces) in a parking scenario. BSD with fixed bandwidth achieves 98.5% of the model-based baseline's average reward across systems while requiring no dynamics model, using only 1,000 pre-collected trajectories. BSD substantially outperforms nearest-neighbor retrieval (18-63% improvement), confirming that the diffusion denoising mechanism is essential for effective data-driven planning.
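The Nadaraya-Watson step is simple enough to sketch. Below, a single Gaussian kernel on Euclidean distance stands in for the paper's triple-kernel (diffusion / state / goal) weighting, and the library trajectories are invented toy data; in BSD the noise schedule would set the bandwidth, broad at high noise and narrow at low noise.

```python
import math

def nadaraya_watson_denoise(x_noisy, library, bandwidth):
    """Kernel-weighted (Nadaraya-Watson) estimate of the denoised trajectory.

    Weighted average of library trajectories, with Gaussian weights on
    the distance from the noisy query.  A single kernel stands in for
    the triple-kernel weighting; `bandwidth` plays the role of the
    noise-schedule-controlled kernel width.
    """
    weights, acc = [], [0.0] * len(x_noisy)
    for traj in library:
        dist2 = sum((a - b) ** 2 for a, b in zip(x_noisy, traj))
        w = math.exp(-dist2 / (2.0 * bandwidth ** 2))
        weights.append(w)
        for i, v in enumerate(traj):
            acc[i] += w * v
    total = sum(weights)
    return [a / total for a in acc]

# With a narrow bandwidth the nearest library trajectory dominates;
# with a wide one the estimate approaches the library mean.
library = [[0.0, 0.0], [1.0, 1.0]]
print(nadaraya_watson_denoise([0.1, 0.1], library, bandwidth=0.2))
```

This bandwidth behavior is exactly the coarse-to-fine structure the abstract describes: broad averaging early in denoising, local interpolation late.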
☆ Implicit Primal-Dual Interior-Point Methods for Quadratic Programming
This paper introduces a new method for solving quadratic programs using primal-dual interior-point methods. Instead of handling complementarity as an explicit equation in the Karush-Kuhn-Tucker (KKT) conditions, we ensure that complementarity is implicitly satisfied by construction. This is achieved by introducing an auxiliary variable and relating it to the duals and slacks via a retraction map. Specifically, we prove that the softplus function has favorable numerical properties compared to the commonly used exponential map. The resulting KKT system is guaranteed to be spectrally bounded, thereby eliminating the most pressing limitation of primal-dual methods: ill-conditioning near the solution. These attributes facilitate the solution of the underlying linear system, either by removing the need to compute factorizations at every iteration, enabling factorization-free approaches like indirect solvers, or allowing the solver to achieve high accuracy in low-precision arithmetic. Consequently, this novel perspective opens new opportunities for interior-point methods, especially for solving large-scale problems to high precision.
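The numerical advantage of the softplus retraction over the exponential map can be seen directly: softplus is positive everywhere, like exp, but grows only linearly for large arguments, and its derivative (the logistic sigmoid) is bounded by 1. The snippet below shows a numerically stable softplus; it illustrates the function's properties, not the paper's solver.

```python
import math

def softplus(v):
    """Numerically stable softplus: log(1 + exp(v)).

    Positive for all v, like exp(v), but asymptotically linear for
    large v, with derivative (the logistic sigmoid) bounded by 1 --
    the boundedness that keeps the retraction's contribution to the
    KKT system well scaled, where exp would blow up.
    """
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

# Stays positive for very negative v, and tracks v for very positive v,
# where a naive log(1 + exp(v)) would overflow.
for v in (-700.0, 0.0, 700.0):
    print(v, softplus(v))
```

Writing it as `max(v, 0) + log1p(exp(-|v|))` avoids both overflow of `exp(v)` for large v and catastrophic loss of precision for very negative v, which matters if the retraction is meant to work in low-precision arithmetic.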
☆ A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking
Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy, referred to as the "Fine-grained Differential Learning Rate Strategy", we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.
comment: 6 pages, 4 figures, technical report
☆ Go Big or Go Home: Simulating Mobbing Behavior with Braitenbergian Robots
We used the Webots robotics simulation platform to simulate dyadic predator-avoidance and mobbing behavior in a group of Braitenbergian robots. Mobbing is an antipredator adaptation used by some animals in which the individuals cooperatively attack or harass a predator to protect themselves. One way of coordinating a mobbing attack is using mobbing calls to summon other individuals of the mobbing species. We imitated this mechanism and simulated Braitenbergian robots that use mobbing calls when they face a light source (representing an inanimate predator) and mob it if they can summon allies; otherwise, they escape from it. We explore the effects of the mobbing call's range (infinite, mid-range, and low-range) and of robot group size (ten robots versus three) on the overall success of mobbing. Our results suggest that both variables have significant impacts. This work has implications for simulations of action selection in artificial life and designing control architectures for autonomous agents.
comment: This work was completed in 2019 as a final project for a graduate course at the University of Waterloo, titled: ECE 750 - Artificial Life: Embodied Intelligence
☆ Real Time Local Wind Inference for Robust Autonomous Navigation
This thesis presents a solution that enables aerial robots to reason about surrounding wind flow fields in real time using onboard sensors and embedded flight hardware. The core novelty of this research is the fusion of range measurements with sparse in-situ wind measurements to predict surrounding flow fields. We aim to address two fundamental questions: first, the sufficiency of topographical data for accurate wind prediction in dense urban environments; and second, the utility of learned wind models for motion planning with an emphasis on energy efficiency and obstacle avoidance. Drawing on tools from deep learning, fluid mechanics, and optimal control, we establish a framework for local wind prediction using navigational LiDAR, and then incorporate local wind model priors into a receding-horizon optimal controller to study how local wind knowledge affects energy use and robustness during autonomous navigation. Through simulated demonstrations in diverse urban wind scenarios, we evaluate the predictive capabilities of the wind predictor and quantify improvements to autonomous urban navigation in terms of crash rates and energy consumption when local wind information is integrated into the motion planning. Sub-scale free flight experiments in an open-air wind tunnel demonstrate that these algorithms can run in real time on an embedded flight computer with sufficient bandwidth for stable control of a small aerial robot. Philosophically, this thesis contributes a new paradigm for localized wind inference and motion planning in unknown windy environments. By enabling robots to rapidly assess local wind conditions without prior environmental knowledge, this research accelerates the introduction of aerial robots into increasingly challenging environments.
comment: PhD Thesis, University of Pennsylvania, 2026. 152 pages
♻ ☆ IA-TIGRIS: An Incremental and Adaptive Sampling-Based Planner for Online Informative Path Planning
Planning paths that maximize information gain for robotic platforms has wide-ranging applications and significant potential impact. To effectively adapt to real-time data collection, informative path planning must be computed online and be responsive to new observations. In this work, we present IA-TIGRIS (Incremental and Adaptive Tree-based Information Gathering Using Informed Sampling), which is an incremental and adaptive sampling-based informative path planner designed for real-time onboard execution. Our approach leverages past planning efforts through incremental refinement while continuously adapting to updated belief maps. We additionally present detailed implementation and optimization insights to facilitate real-world deployment, along with an array of reward functions tailored to specific missions and behaviors. Extensive simulation results demonstrate IA-TIGRIS generates higher-quality paths compared to baseline methods. We validate our planner on two distinct hardware platforms: a hexarotor unmanned aerial vehicle (UAV) and a fixed-wing UAV, each having different motion models and configuration spaces. Our results show up to a 38% improvement in information gain compared to baseline methods, highlighting the planner's potential for deployment in real-world applications. Project website: https://ia-tigris.github.io
comment: Published in IEEE Transactions on Robotics, 19 pages, 19 figures
♻ ☆ D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation
Robotic manipulation remains challenging for reinforcement learning due to contact-rich dynamics, long horizons, and training instability. Although off-policy actor-critic algorithms such as SAC and TD3 perform well in simulation, they often suffer from policy oscillations and performance collapse in realistic settings, partly due to experience replay strategies that ignore the differing data requirements of the actor and the critic. We propose D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay, a replay framework that decouples actor and critic sampling while maintaining a shared replay buffer. The critic leverages prioritized replay for efficient value learning, whereas the actor is updated using low-error transitions to stabilize policy optimization. An adaptive anchor mechanism balances uniform and prioritized sampling based on the coefficient of variation of TD errors, and a Huber-based critic objective further improves robustness under heterogeneous reward scales. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening. Results demonstrate that D-SPEAR consistently outperforms strong off-policy baselines, including SAC, TD3, and DDPG, in both final performance and training stability, with ablation studies confirming the complementary roles of the actor-side and critic-side replay streams.
comment: Accepted at IEEE 11th International Conference on Control and Robotics Engineering (ICCRE 2026)
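The adaptive anchor idea can be sketched with a toy blending rule: the coefficient of variation (CV) of the (absolute) TD errors controls the mix between prioritized and uniform sampling probabilities. The blending formula below is an assumption for illustration; the paper's exact rule may differ.

```python
def adaptive_sampling_probs(td_errors, eps=1e-8):
    """Blend uniform and prioritized replay probabilities via the
    coefficient of variation (CV) of absolute TD errors.

    Heterogeneous errors (high CV) push the mix toward prioritized
    replay; homogeneous errors push it toward uniform sampling.
    The blend weight alpha = CV / (1 + CV) is an illustrative choice.
    """
    n = len(td_errors)
    mean = sum(td_errors) / n
    var = sum((e - mean) ** 2 for e in td_errors) / n
    cv = (var ** 0.5) / (mean + eps)
    alpha = cv / (1.0 + cv)  # weight on the prioritized component
    total = sum(td_errors) + n * eps
    prioritized = [(e + eps) / total for e in td_errors]
    uniform = 1.0 / n
    return [alpha * p + (1.0 - alpha) * uniform for p in prioritized]

# Equal TD errors -> CV = 0 -> pure uniform sampling.
print(adaptive_sampling_probs([1.0, 1.0, 1.0, 1.0]))
```

When one transition's TD error dominates, the CV grows and the mix shifts toward prioritized sampling, which is the behavior the adaptive anchor is meant to automate.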
♻ ☆ When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.
♻ ☆ OMCL: Open-vocabulary Monte Carlo Localization
Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
comment: Accepted to IEEE RA-L
♻ ☆ Pixel Motion Diffusion is What We Need for Robot Control CVPR 2026
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
comment: Accepted to CVPR 2026. Project page: https://eronguyen.github.io/DAWN
♻ ☆ COMPAct: Computational Optimization and Automated Modular design of Planetary Actuators ICRA 2026
The optimal design of robotic actuators is a critical area of research, yet limited attention has been given to optimizing gearbox parameters and automating actuator CAD. This paper introduces COMPAct: Computational Optimization and Automated Modular Design of Planetary Actuators, a framework that systematically identifies optimal gearbox parameters for a given motor across four gearbox types: single-stage planetary gearbox (SSPG), compound planetary gearbox (CPG), Wolfrom planetary gearbox (WPG), and double-stage planetary gearbox (DSPG). The framework minimizes mass and actuator width while maximizing efficiency, and further automates actuator CAD generation to enable direct 3D printing without manual redesign. Using this framework, optimal gearbox designs are explored across a wide range of gear ratios, providing insights into the suitability of different gearbox types while automatically generating CAD models for all four gearbox types with varying gear ratios and motors. Two actuator types are fabricated and experimentally evaluated through power efficiency, no-load backlash, and transmission stiffness tests. Experimental results indicate that the SSPG actuator achieves a mechanical efficiency of 60-80%, a no-load backlash of 0.59 deg, and a transmission stiffness of 242.7 Nm/rad, while the CPG actuator demonstrates 60% efficiency, 2.6 deg backlash, and a stiffness of 201.6 Nm/rad. CODE: https://github.com/singhaman1750/COMPAct.git VIDEO: https://youtu.be/etK6anjXag8?si=jFK7HgAPSBy-GnDR
comment: 8 pages, 9 Figures, 2 tables; first two authors contributed equally; published in 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
♻ ☆ Where-to-Learn: Analytical Policy Gradient Directed Exploration for On-Policy Robotic Reinforcement Learning
On-policy reinforcement learning (RL) algorithms have demonstrated great potential in robotic control, where effective exploration is crucial for efficient and high-quality policy learning. However, how to encourage the agent to efficiently explore better trajectories remains a challenge. Most existing methods incentivize exploration by maximizing the policy entropy or encouraging novel state visitation regardless of the potential state value. We propose a new form of directed exploration that uses analytical policy gradients from a differentiable dynamics model to inject task-aware, physics-guided guidance, thereby steering the agent towards high-reward regions for accelerated and more effective policy learning.
comment: 8 pages, 10 figures
♻ ☆ Robust Geospatial Coordination of Multi-Agent Communications Networks Under Attrition
Coordinating emergency responses in extreme environments, such as wildfires, requires resilient and high-bandwidth communication backbones. While autonomous aerial swarms can establish ad-hoc networks to provide this connectivity, the high risk of individual node attrition in these settings often leads to network fragmentation and mission-critical downtime. To overcome this challenge, we introduce and formalize the problem of Robust Task Networking Under Attrition (RTNUA), which extends connectivity maintenance in multi-robot systems to explicitly address proactive redundancy and attrition recovery. We then introduce Physics-Informed Robust Employment of Multi-Agent Networks (ΦIREMAN), a topological algorithm leveraging physics-inspired potential fields to solve this problem. In our evaluations, ΦIREMAN consistently outperforms baselines, and is able to maintain greater than 99.9% task uptime despite substantial attrition in simulations with up to 100 tasks and 500 drones, demonstrating both effectiveness and scalability.
comment: 8 pages, 4 figures, 4 tables, accepted to IEEE RA-L
♻ ☆ A Player Selection Network for Scalable Game-Theoretic Prediction and Planning
While game-theoretic planning frameworks are effective at modeling multi-agent interactions, they require solving large optimization problems where the number of variables increases with the number of agents, resulting in long computation times that limit their use in large-scale, real-time systems. To address this issue, we propose 1) PSN Game-a learning-based, game-theoretic prediction and planning framework that reduces game size by learning a Player Selection Network (PSN); and 2) a Goal Inference Network (GIN) that makes it possible to use the PSN in incomplete-information games where other agents' intentions are unknown to the ego agent. A PSN outputs a player selection mask that distinguishes influential players from less relevant ones, enabling the ego player to solve a smaller, masked game involving only selected players. By reducing the number of players included in the game, PSN shrinks the corresponding optimization problems, leading to faster solve times. Experiments in both simulated scenarios and real-world pedestrian trajectory datasets show that PSN is competitive with, and often improves upon, the evaluated explicit game-theoretic selection baselines in 1) prediction accuracy and 2) planning safety. Across scenarios, PSN typically selects substantially fewer players than are present in the full game, thereby reducing game size and planning complexity. PSN also generalizes to settings in which agents' objectives are unknown, via the GIN, without test-time fine-tuning. By selecting only the most relevant players for decision-making, PSN Game provides a practical mechanism for reducing planning complexity that can be integrated into existing multi-agent planning frameworks.
♻ ☆ RoboNeuron: A Middle-Layer Infrastructure for Agent-Driven Orchestration in Embodied AI
Vision-language-action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad-hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re-integration. We introduce RoboNeuron, a middleware layer that connects the Model Context Protocol (MCP) for LLM agents with robot middleware such as ROS2. RoboNeuron bridges these ecosystems by deriving agent-callable tools directly from ROS schemas, providing a unified execution abstraction that supports both direct commands and modular composition, and localizing backend, runtime, and acceleration-preset changes within a stable inference boundary. We evaluate RoboNeuron in simulation and on hardware through multi-platform base control, arm motion, and VLA-based grasping tasks, demonstrating that it enables modular system orchestration under a unified interface while supporting backend transitions without system rewiring. The full code implementation of this work is available at github repo: https://github.com/guanweifan/RoboNeuron
♻ ☆ Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL
Despite the significant advances in Deep Reinforcement Learning (RL) observed in the last decade, the amount of training experience necessary to learn effective policies remains one of the primary concerns in both simulated and real environments. To address this issue, previous work has shown that improved efficiency can be achieved by separately modeling the agent and environment, but usually requires a supervisory signal. In contrast to RL, humans can perfect a new skill from a small number of trials and often do so without a supervisory signal, making neuroscientific studies of human development a valuable source of inspiration for RL. In particular, we explore the idea of motor prediction, which states that humans develop an internal model of themselves and of the consequences that their motor commands have on the immediate sensory inputs. Our insight is that the movement of the agent provides a cue that allows the duality between the agent and environment to be learned. To instantiate this idea, we present Ego-Foresight (EF), a self-supervised method for disentangling agent information based on motion and prediction. Our main finding is that, when used as an auxiliary task in feature learning, self-supervised agent awareness improves the sample-efficiency and performance of the underlying RL algorithm. To test our approach, we study the ability of EF to predict agent movement and disentangle agent information. Then, we integrate EF with model-free and model-based RL algorithms to solve simulated control tasks, showing improved sample-efficiency and performance.
comment: 13 pages, 8 figures, conference
♻ ☆ TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation CVPR 2026
Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet runs 150 times faster. The code is open-sourced at https://github.com/Kin-Zhang/TeFlow along with trained model weights.
comment: CVPR 2026; 16 pages, 8 figures
♻ ☆ RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi-policy robustness. Experiments in real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long-horizon tasks and reducing human time investment by 53.7%.
comment: Code available at: https://github.com/RoboClaw-Robotics/RoboClaw
♻ ☆ Geometric Visual Servo Via Optimal Transport
When developing control laws for robotic systems, the principal factor in examining their performance is choosing inputs that allow smooth tracking of a reference input. In the context of robotic manipulation, this involves translating an object or end-effector from an initial pose to a target pose. Robotic manipulation control laws frequently use vision systems as an error generator to track features and produce control inputs. However, current control algorithms do not take into account the probabilistic nature of the extracted features and instead rely on hand-tuned feature extraction methods. Furthermore, the target features can exist in a static pose, allowing a combined pose and feature error to be used for control generation. We present a geometric control law for the visual servoing problem for robotic manipulators. The input from the camera constitutes a probability measure on the 3-dimensional Special Euclidean task-space group, where the Wasserstein distance between the current and desired poses is analogous to the geometric geodesic. From this, we develop a controller that allows for both pose- and image-based visual servoing by combining classical PD control and gravity compensation with error minimization through geodesic flows on the 3-dimensional Special Euclidean group. We present our results on a set of test cases demonstrating the generalisation ability of our approach to a variety of initial positions.
comment: 19 pages, 5 figures. Accepted to Control Engineering Practice
♻ ☆ DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving
We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to one, achieving an 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2 s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing, which reduces sampling complexity via recursive multi-resolution step compression; (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment; and (3) Gaussian vocabulary sampling for GRPO, which constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.
comment: authors update
♻ ☆ The Indirect Method for Generating Libraries of Optimal Periodic Trajectories and Its Application to Economical Bipedal Walking
Trajectory optimization is an essential tool for generating efficient, dynamically consistent gaits in legged locomotion. This paper explores the indirect method of trajectory optimization, emphasizing its application in creating optimal periodic gaits for legged systems and contrasting it with the more common direct method. While the direct method provides flexibility in implementation, it is limited by its need for an input space parameterization. In contrast, the indirect method improves accuracy by computing the control input from states and costates obtained along the optimal trajectory. In this work, we tackle the convergence challenges associated with indirect shooting methods by utilizing numerical continuation methods. This is particularly useful for the systematic development of gait libraries. Our contributions include: (1) the formalization of a general periodic trajectory optimization problem that extends existing first-order necessary conditions to a broader range of cost functions and operating conditions; (2) a methodology for efficiently generating libraries of optimal trajectories (gaits) utilizing a single shooting approach combined with numerical continuation methods; (3) a novel approach for reconstructing Lagrange multipliers and costates from passive gaits; (4) a comparative analysis of the indirect and direct shooting methods using a compass-gait walker as a case study, demonstrating the improved accuracy of the indirect method in generating optimal gaits; and (5) demonstrating applicability to the more complex legged robot RABBIT, with ten dynamic states and four inputs. The findings underscore the potential of the indirect method for generating families of optimal gaits, thereby advancing the field of trajectory optimization in legged robotics.
comment: submitted to the International Journal of Robotics Research (IJRR)
♻ ☆ Precise Time Delay Measurement and Compensation for Tightly Coupled Underwater SINS/piUSBL Navigation
In multisensor systems, time synchronization is particularly challenging for underwater integrated navigation systems (INSs) incorporating acoustic positioning, where time delays can significantly degrade accuracy when measurement and fusion epochs are misaligned. This article introduces a tightly coupled navigation framework that integrates a passive inverted ultrashort baseline (piUSBL) acoustic positioning system, a strapdown inertial navigation system (SINS), and a depth gauge under precise time synchronization. The framework fuses piUSBL azimuth and slant range with depth measurements, avoiding poor vertical-angle observability in planar arrays. By combining synchronized timing with acoustic signal processing, the proposed method transforms delay from an unobservable error into a measurable parameter, enabling explicit quantification of both acoustic propagation and system processing delays. Field experiments demonstrate that the proposed approach reduces position RMSE by 44.02% and maximum error (MAXERR) by 40.79% compared to the uncompensated baseline, while achieving further RMSE reductions of 37.66% and 35.82% in horizontal directions relative to filter-based delay compensation. The results confirm that explicit delay measurement outperforms filter-based estimation, though instantaneous performance remains sensitive to acoustic signal quality, emphasizing the need for robust signal processing alongside accurate time synchronization in latency-sensitive multisensor systems.
comment: Published in IEEE Transactions on Instrumentation and Measurement. This is the author's accepted manuscript
♻ ☆ KnowDiffuser: A Knowledge-Guided Diffusion Planner with LLM Reasoning
Recent advancements in Language Models (LMs) have demonstrated strong semantic reasoning capabilities, enabling their application in high-level decision-making for autonomous driving (AD). However, LMs operate over discrete token spaces and lack the ability to generate continuous, physically feasible trajectories required for motion planning. Meanwhile, diffusion models have proven effective at generating reliable and dynamically consistent trajectories, but often lack semantic interpretability and alignment with scene-level understanding. To address these limitations, we propose KnowDiffuser, a knowledge-guided motion planning framework that tightly integrates the semantic understanding of language models with the generative power of diffusion models. The framework employs a language model to infer context-aware meta-actions from structured scene representations, which are then mapped to prior trajectories that anchor the subsequent denoising process. A two-stage truncated denoising mechanism refines these trajectories efficiently, preserving both semantic alignment and physical feasibility. Experiments on the nuPlan benchmark demonstrate that KnowDiffuser significantly outperforms existing planners in both open-loop and closed-loop evaluations, establishing a robust and interpretable framework that effectively bridges the semantic-to-physical gap in AD systems.
comment: 10 pages, 1 figure
♻ ☆ RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Visual Contextual Adaptation ICRA 2026
Efficient target localization and autonomous navigation in complex environments are fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on ground-truth depth and pose information, which restricts applicability in real-world scenarios; and (2) lack of visual in-context learning (VICL) capability to extract geometric and semantic priors from environmental context, as in a short traversal video. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong VICL capability. By simply observing a short video of the target environment, the system can also significantly improve task efficiency without requiring architectural modifications or task-specific retraining. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior VICL adaptability, with no previous 3D mapping of the environment required.
comment: Accepted at ICRA 2026
♻ ☆ Geometric-Photometric Event-based 3D Gaussian Ray Tracing
Event cameras offer much higher temporal resolution than traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage the fine-grained temporal information of sparse events. This work proposes GPERT, a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray tracing and the image of warped events. Extensive evaluation shows that our method achieves state-of-the-art performance on real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the number of selected events, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. https://github.com/e3ai/gpert
comment: 15 pages, 12 figures, 5 tables
♻ ☆ C-NAV: Towards Self-Evolving Continual Object Navigation in Open World NeurIPS 2025
Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce the continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) A dual-path anti-forgetting mechanism, which comprises feature distillation that aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay that retains temporal features within the action decoder to ensure policy consistency. (2) An adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at https://bigtree765.github.io/C-Nav-project.
comment: Accepted at NeurIPS 2025
♻ ☆ Situationally-Aware Dynamics Learning
Autonomous robots operating in complex, unstructured environments face significant challenges due to latent, unobserved factors that obscure their understanding of both their internal state and the external world. Addressing this challenge would enable robots to develop a more profound grasp of their operational context. To tackle this, we propose a novel framework for online learning of hidden state representations, with which robots can adapt in real time to uncertain and dynamic conditions that would otherwise be ambiguous and result in suboptimal or erroneous behaviors. Our approach is formalized as a Generalized Hidden Parameter Markov Decision Process, which explicitly models the influence of unobserved parameters on both transition dynamics and reward structures. Our core innovation lies in learning online the joint distribution of state transitions, which serves as an expressive representation of latent ego- and environmental factors. This probabilistic approach supports the identification of and adaptation to different operational situations, improving robustness and safety. Through a multivariate extension of Bayesian Online Changepoint Detection, our method segments changes in the underlying data-generating process governing the robot's dynamics. The robot's transition model is then informed with a symbolic representation of the current situation derived from the joint distribution of the latest state transitions, enabling adaptive and context-aware decision-making. To showcase real-world effectiveness, we validate our approach on the challenging task of unstructured terrain navigation, where unmodeled and unmeasured terrain characteristics can significantly impact the robot's motion. Extensive experiments in both simulation and the real world reveal significant improvements in data efficiency, policy performance, and the emergence of safer, adaptive navigation strategies.
CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion
Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting representation with a gated residual fusion block, and performs temporal integration with a Gated Recurrent Unit (GRU) regulated by a highway-style output gate for state-dependent blending of recurrent and feedforward features. To further improve terrain interaction, we introduce a terrain-aware foothold placement reward that extracts supportable foothold candidates from foot-end point-cloud samples and rewards touchdown locations that lie close to the nearest supportable candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and effective zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
♻ ☆ Do World Action Models Generalize Better than VLAs? A Robustness Study
Robot action planning in the real world is challenging, as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representations can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as $π_{0.5}$ can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.
♻ ☆ House of Dextra: Cross-embodied Co-design for Dexterous Hands
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website: https://an-axolotl.github.io/HouseofDextra/ .
Optimization and Control 63
☆ The Newton-Muon Optimizer
The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix $W$ using only three matrices: the gradient $G$, an output-space curvature matrix $H$, and the data matrix $Z$ that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) $W \leftarrow W - η\cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$, where $η$ is the learning rate and $\mathrm{msgn}(X)=UV^\top$ if $X=USV^\top$ is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, on a reproduction of the earliest publicly released Modded-NanoGPT speedrun configuration using Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer iteration steps and reduces wall-clock training time by about 4%.
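The closed-form rule quoted in the abstract is explicit enough to sketch numerically. Below is a minimal NumPy illustration of the msgn operator and one Newton-Muon step, with momentum and weight decay omitted as in the simplified rule above; the ridge term `eps` and the learning rate are illustrative assumptions added for numerical stability, not parameters from the paper.

```python
import numpy as np

def msgn(X):
    """Matrix sign via compact SVD: msgn(X) = U V^T for X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_step(W, G, Z, lr=0.02, eps=1e-6):
    """One Newton-Muon update: W <- W - lr * msgn(G (Z Z^T)^{-1}).

    W : (out, in) weight matrix, G : gradient of the same shape,
    Z : (in, batch) matrix stacking the layer inputs.
    The `eps` ridge on the input second moment is an assumption,
    added here so the inverse is well defined for small batches.
    """
    P = Z @ Z.T + eps * np.eye(Z.shape[0])   # input second moment Z Z^T
    return W - lr * msgn(G @ np.linalg.inv(P))
```

Setting `P` to the identity recovers the standard Muon direction `msgn(G)`, which is exactly the "neglected right preconditioning" the abstract describes.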
☆ Discrete-Time Event-Triggered Extremum Seeking
This paper proposes a discrete-time event-triggered extremum seeking control scheme for real-time optimization of nonlinear systems. Unlike conventional discrete-time implementations relying on periodic updates, the proposed approach updates the control input only when a state-dependent triggering condition is satisfied, reducing unnecessary actuation and communication. The resulting closed-loop system combines extremum seeking with an event-triggering mechanism that adaptively determines the input update instants. Using discrete-time averaging and Lyapunov analysis, we establish practical convergence of the trajectories to a neighborhood of the unknown extremum point and show exponential stability of the associated average dynamics. The proposed method preserves the optimization capability of classical extremum seeking while significantly reducing the number of input updates. Simulation results illustrate the effectiveness of the approach for resource-aware real-time optimization.
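The mechanism described above, updating the applied input only when a triggering condition fires, can be illustrated with a toy discrete-time extremum seeking loop. Everything concrete here is an assumption for illustration: the quadratic map, the sinusoidal dither, the gains, and the simple threshold-on-drift trigger stand in for the paper's state-dependent triggering condition, which the abstract does not specify.

```python
import math

def event_triggered_es(f, theta0=2.0, steps=3000, a=0.2, omega=1.0,
                       gain=1.0, trigger=0.05):
    """Discrete-time extremum seeking where the input actually applied to
    the plant is refreshed only when the internal estimate has drifted past
    a threshold (hypothetical trigger rule, for illustration only)."""
    theta = theta0            # internal parameter estimate
    applied = theta           # last input actually sent to the plant
    updates = 0
    for k in range(steps):
        dither = a * math.sin(omega * k)
        y = f(applied + dither)             # probe the unknown map
        theta -= gain * dither * y          # gradient-like demodulation
        if abs(theta - applied) > trigger:  # event-triggering condition
            applied = theta                 # transmit the new input
            updates += 1
    return applied, updates

# Unknown static map with its minimum at x* = 1 (illustrative only).
x_hat, n_updates = event_triggered_es(lambda x: (x - 1.0) ** 2)
```

Because the trigger suppresses transmissions once the estimate settles, `n_updates` stays well below `steps`, which is the resource-saving behavior the abstract claims.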
☆ Rendezvous Planning from Sparse Observations of Optimally Controlled Targets
We develop a probabilistic framework for rendezvous planning: given sparse, noisy observations of a fast-moving target, plan rendezvous spatiotemporal coordinates for a set of significantly slower seeking agents. The unknown target trajectory is estimated under uncertain dynamics using a filtering approach that combines kernel-based maximum a posteriori estimation with Gaussian process correction, producing a mixture over trajectory hypotheses. This estimate is used to select spatiotemporal rendezvous points that maximize the probability of successful rendezvous. Points are chosen sequentially by greedily minimizing failure probability in the current belief space, which is updated after each step by conditioning on unsuccessful rendezvous attempts. We show that the failure-conditioned update correctly captures the posterior belief for subsequent decisions, ensuring that each step in the greedy sequence is informed by a statistically consistent representation of the remaining search space, and derive the corresponding Bayesian updates incorporating temporal correlations intrinsic to the trajectory model. This result provides a systematic framework for planning under uncertainty in applications of autonomous rendezvous such as unmanned aerial vehicle refueling, spacecraft servicing, autonomous surface vessel operations, search and rescue missions, and missile defense. In each, the motion of the target entity can be modeled using a system of differential equations undergoing optimal control for a chosen objective, in our example case Hamilton-Jacobi-Bellman solutions for minimum arrival time of a Dubins car with uncertain turning radius and destination.
☆ Distributed Variational Quantum Linear Solver
This paper develops a distributed variational quantum algorithm for solving large-scale linear equations. For a linear system of the form $Ax=b$, the large square matrix $A$ is partitioned into smaller square block submatrices, each of which is known only to a single noisy intermediate-scale quantum (NISQ) computer. Each NISQ computer communicates with certain other quantum computers in the same row and column of the block partition, where the communication patterns are described by the row- and column-neighbor graphs, both of which are connected. The proposed algorithm integrates a variant of the variational quantum linear solver at each computer with distributed classical optimization techniques. The derivation of the quantum cost function provides insight into the design of the distributed algorithm. Numerical quantum simulations demonstrate that the proposed distributed quantum algorithm can solve linear systems whose size scales with the number of computers and is therefore not limited by the capacity of a single quantum computer.
☆ On the critical time of observability of the multi-dimensional Baouendi-Grushin equation
We investigate the observability properties of the Baouendi-Grushin equation on a tensorized domain $Ω:= \mathcal{B}_R \times \tilde Ω$, where $\mathcal{B}_R$ is the open ball of radius $R$ in dimension $d \ge 2$, and $\tilde Ω$ is a smooth, bounded, open set of arbitrary dimension. Our main result is a precise calculation of the minimal observability time $T^*$ for tensorized observation sets of the form $ω\times \tilde Ω$, with $ω\subset \mathcal{B}_R$ (internal observation), and $Γ\times \tilde Ω$, with $Γ\subset \partial \mathcal{B}_R$ (boundary observation). The main novelty concerns the sufficient condition, namely observability of the system when $T>T^*$. This is established by combining refined observability inequalities on the annulus, or the entire boundary, using Carleman estimates, together with a Lebeau-Robbiano strategy to localize the observation sets.
☆ Causal Optimal Coupling for Gaussian Input-Output Distributional Data
We study the problem of identifying an optimal coupling between input-output distributional data generated by a causal dynamical system. The coupling is required to satisfy prescribed marginal distributions and a causality constraint reflecting the temporal structure of the system. We formulate this problem as a Schrödinger Bridge, which seeks the coupling closest, in Kullback-Leibler divergence, to a given prior while enforcing both marginal and causality constraints. For the case of Gaussian marginals and general time-dependent quadratic cost functions, we derive a fully tractable characterization of the Sinkhorn iterations that converges to the optimal solution. Beyond its theoretical contribution, the proposed framework provides a principled foundation for applying causal optimal transport methods to system identification from distributional data.
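For background, the Sinkhorn iterations the abstract refers to take the following generic discrete form: alternately rescale an entropy-regularized kernel so that the coupling matches both prescribed marginals. This sketch is the standard textbook version; it does not attempt to reproduce the paper's closed-form Gaussian characterization or its causality constraint.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=1.0, iters=500):
    """Standard Sinkhorn iterations for the entropic coupling
    pi = diag(u) K diag(v), with kernel K = exp(-C/eps), matching
    row marginal mu and column marginal nu. Generic discrete version;
    the paper's Gaussian/causal variant is not reproduced here."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)   # rescale to fit the column marginal
        u = mu / (K @ v)     # rescale to fit the row marginal
    return u[:, None] * K * v[None, :]
```

On discrete marginals the iteration converges geometrically; the paper's contribution is that, for Gaussian marginals with quadratic costs, each iterate admits a closed form.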
☆ Concentration of Stochastic System Trajectories with Time-varying Contraction Conditions
We establish two concentration inequalities for nonlinear stochastic systems under time-varying contraction conditions. The key to our approach is an energy function termed the Averaged Moment Generating Function (AMGF). By combining it with incremental stability analysis, we develop a concentration inequality that bounds the deviation between the stochastic system state and its deterministic counterpart. As this inequality is restricted to a single time instant, we further combine the AMGF with martingale-based methods to derive a concentration inequality that bounds the fluctuation of the entire stochastic trajectory. Additionally, by synthesizing the two results, we significantly improve the trajectory-level concentration inequality for strongly contractive systems. Given the probability level $1-δ$, the derived inequalities ensure an $\mathcal{O}(\sqrt{\log(1/δ)})$ bound on the deviation of stochastic trajectories, which is tight under our assumptions. Our results are exemplified through a case study on stochastic safe control.
☆ Joint Pricing and Innovation Control in Regulated Recycling-Rate Diffusion
We introduce a regulated stochastic diffusion model for the recycling rate and formulate a joint control problem over production and process innovation via the dynamics of recycling investment and product pricing. The resulting stochastic control problem captures the system manager's trade-off between product-price decisions and investment expenditures under an infinite-horizon discounted cost structure. Owing to the recycling-rate specification, we incorporate two regulated state processes, which induce additional policy-driven cost components in the value function consistent with green-economy regulations. We resolve the jointly regulated stochastic production and process-innovation admission control problem by introducing the associated Hamilton-Jacobi-Bellman (HJB) equation and providing rigorous proofs that establish the correspondence between the HJB solution and the value function of the underlying control problem. The HJB equation is analyzed under mild, practically motivated assumptions on the system parameters. We further present numerical experiments and sensitivity analyses to illustrate the tractability of the HJB characterization and to assess the practical relevance of the imposed parameter conditions.
comment: 33 pages, 9 figures
☆ Exponential Stabilization of Moving Shockwave in ARZ Traffic Model via Boundary Control: Explicit Gains and Arbitrary Decay Rate
This paper develops boundary feedback controls to stabilize traffic congestion toward a predefined shock equilibrium in the Aw-Rascle-Zhang (ARZ) traffic flow model. We transform the corresponding moving-boundary $2\times2$ hyperbolic system, covering free and congested flow regimes, respectively, into a shock-free $4\times4$ augmented system on a fixed domain via shock-location-based moving coordinates. By applying the modified Lyapunov functionals concerning shock perturbation, we show that the shock position and the state of the system in $H^2$-norm can be stabilized with an arbitrary exponential decay rate via the given feedback controls. Finally, the stabilization results are demonstrated by numerical simulations.
☆ Residuals-based Offline Reinforcement Learning
Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.
☆ Sterile mosquito release via intelligent proportional controllers
The Sterile Insect Technique (SIT) against insect pests and insect vectors consists of releasing males that have been previously sterilized in order to reduce or eliminate a specific wild population. We study this complex control question via model-free control, ultra-local models, and intelligent proportional controllers that have already proven their effectiveness in various fields. They permit addressing, perhaps for the first time, the essential sampling question. Computer simulations are displayed and discussed.
comment: The 6th International Symposium on Complex Systems -- June 03-05, 2026 -- La Rochelle, France
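In model-free control, the intelligent proportional (iP) controller rests on the ultra-local model dy/dt = F + αu, with the control law u = -(F̂ - ẏ* + K_P e)/α, where e = y - y* and F̂ is estimated online from recent input-output data. The toy simulation below illustrates this mechanism on a hypothetical scalar plant; the plant, gains, and target are illustrative assumptions and not the paper's sterile-mosquito population model.

```python
def ip_control(y, y_star, y_star_dot, F_hat, alpha=1.0, Kp=2.0):
    """Intelligent proportional (iP) controller for the ultra-local model
    dy/dt = F + alpha*u; F_hat is an online estimate of the unknown F."""
    return -(F_hat - y_star_dot + Kp * (y - y_star)) / alpha

def simulate(steps=3000, dt=0.01, alpha=1.0):
    """Drive y toward y* = 0.2 on a hypothetical scalar plant with
    unknown drift (illustrative stand-in, not the paper's SIT model)."""
    y, y_prev, u = 1.0, 1.0, 0.0
    y_star = 0.2
    for _ in range(steps):
        # Estimate F from the latest measurement: F_hat = dy/dt - alpha*u.
        F_hat = (y - y_prev) / dt - alpha * u
        u = ip_control(y, y_star, 0.0, F_hat, alpha)
        y_prev = y
        y += dt * (0.5 - 0.3 * y + alpha * u)   # "unknown" plant dynamics
    return y
```

Because F̂ absorbs the unknown drift, the proportional term alone suffices to remove the steady-state error, which is what makes iP controllers attractive when no plant model is available.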
☆ Output-Feedback Controller Synthesis for Dissipativity and $H_2$ Performance of Autoregressive Systems from Noisy Input-Output Data
In this paper we propose a data-driven output-feedback controller synthesis method for discrete-time linear time-invariant systems in a specific autoregressive form. The synthesis goal is either to achieve dissipativity with respect to a given quadratic supply rate or to achieve a given $H_2$ performance level. It is assumed that the model of the plant is unknown, except for the disturbance term. To compensate for the lack of model knowledge, a recorded trajectory of the controlled input and the output is available for controller design, which can be corrupted by an unknown but bounded disturbance. The derived controller synthesis method is formulated as linear matrix inequalities and is nonconservative within the considered problem setting.
comment: 9 pages, 1 figure
☆ The Non-Linearity Perturbation Threshold: Width Scaling and Landscape Bifurcations in Deep Learning
We study how the optimization landscape of a neural network deforms as a non-linear activation is introduced through a smooth homotopy. Working first in an abstract local setting - a smooth one-parameter family of objective functions together with a critical branch that loses non-degeneracy through a simple Hessian kernel - we show via Lyapunov-Schmidt reduction that the local transition is controlled by the classical codimension-one normal forms (transcritical or pitchfork) and that the associated topology change is governed by Morse-theoretic handle attachment. We then move beyond the abstract framework and verify these assumptions for a concrete two-layer architecture. We prove that bilinear overparameterization creates an (m-1)d-dimensional Hessian kernel at the linear endpoint, which Tikhonov regularization lifts to a floor alpha > 0; the activation homotopy softens this floor, yielding an explicit bifurcation point lambda* approximately equal to alpha/|lambda_1'(0)|. We derive the eigenvalue-softening rate as a functional of activation derivatives and data moments, and prove that the near-pitchfork normal form (|g_aa/g_aaa| much less than 1) is a structural consequence of sigma''(0)=0 for tanh-like activations. The bifurcation point scales as lambda* proportional to alpha m with network width, connecting the framework to the NTK regime: at large m the landscape reorganization is pushed past lambda=1 and the linearized picture prevails. The foundational algebraic theorems have been formally verified in the Lean 4 theorem prover, and theoretical predictions computed for widths m in {3, 5, 10, 20, 50, 100} exhibit quantitative agreement with the abstract framework.
☆ Risk Control of Traffic Flow Through Chance Constraints and Large Deviation Approximation
Existing macroscopic traffic control methods often struggle to strictly regulate rare, safety-critical extreme events under stochastic disturbances. In this paper, we develop a rare-event chance-constrained optimal control framework for autonomous traffic management. To efficiently enforce these probabilistic safety specifications, we exploit a large deviation theory (LDT) based approximation method, which converts the original highly non-convex, sampling-heavy optimization problem into a tractable deterministic nonlinear programming problem. In addition, the proposed LDT-based reformulation exhibits superior computational scalability, as it maintains a constant computational burden regardless of the target violation probability level, effectively bypassing the extreme scaling bottlenecks of traditional sampling-based methods. The effectiveness of the proposed framework in achieving precise near-target probability control and superior computational efficiency over risk-averse baselines is illustrated through extensive numerical simulations across diverse traffic risk measures.
☆ Finite-time stabilization via impulse control of degenerate singular parabolic equations
This paper examines the impulse controllability of degenerate singular parabolic equations through a modern framework focused on finite-time stabilization. Furthermore, we provide an explicit estimate for the exponential decay of the solution. The proof of our main result combines a logarithmic convexity estimate with specific spectral properties. Finally, we establish the existence and uniqueness of the minimal norm impulse control associated with the system.
☆ An Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance Analysis
Designing reliable integrated energy systems for industrial processes requires optimization and verification models across multiple fidelities, from architecture-level sizing to high-fidelity dynamic operation. However, model mismatch across fidelities obscures the sources of performance loss and complicates the quantification of architecture-to-operation performance gaps. We propose an online, machine-learning-accelerated multi-resolution optimization framework that estimates an architecture-specific upper bound on achievable performance while minimizing expensive high-fidelity model evaluations. We demonstrate the approach on a pilot energy system supplying a 1 MW industrial heat load. First, we solve a multi-objective architecture optimization to select the system configuration and component capacities. We then develop a machine learning (ML)-accelerated multi-resolution, receding-horizon optimal control strategy that approaches the achievable-performance bound for the specified architecture, given the additional controls and dynamics not captured by the architectural optimization model. The ML-guided controller adaptively schedules the optimization resolution based on predictive uncertainty and warm-starts high-fidelity solves using elite low-fidelity solutions. Our results on the pilot case study show that the proposed multi-resolution strategy reduces the architecture-to-operation performance gap by up to 42% relative to a rule-based controller, while reducing required high-fidelity model evaluations by 34% relative to the same multi-fidelity approach without ML guidance, enabling faster and more reliable design verification. Together, these gains make high-fidelity verification tractable, providing a practical upper bound on achievable operational performance.
☆ On the mean-variance problem through the lens of multivariate fake stationary affine Volterra dynamics
We investigate the continuous-time Markowitz mean-variance portfolio selection problem within a multivariate class of fake stationary affine Volterra models. In this non-Markovian and non-semimartingale market framework with unbounded random coefficients, the classical stochastic control approach cannot be directly applied to the associated optimization task. Instead, the problem is tackled using a stochastic factor solution to a Riccati backward stochastic differential equation (BSDE). The optimal feedback control is characterized by means of this equation, whose explicit solution is derived in terms of multi-dimensional Riccati-Volterra equations. Specifically, we obtain analytical closed-form expressions for the optimal portfolio policies as well as the mean-variance efficient frontier, both of which depend on the solution to the associated multivariate Riccati-Volterra system. To illustrate our results, numerical experiments based on a two-dimensional fake stationary rough Heston model highlight the impact of rough volatilities and stochastic correlations on the optimal Markowitz strategies.
comment: 35 pages, 8 figures
☆ Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method
We introduce Sven (Singular Value dEsceNt), a new optimization algorithm for neural networks that exploits the natural decomposition of loss functions into a sum over individual data points, rather than reducing the full loss to a single scalar before computing a parameter update. Sven treats each data point's residual as a separate condition to be satisfied simultaneously, using the Moore-Penrose pseudoinverse of the loss Jacobian to find the minimum-norm parameter update that best satisfies all conditions at once. In practice, this pseudoinverse is approximated via a truncated singular value decomposition, retaining only the $k$ most significant directions and incurring a computational overhead of only a factor of $k$ relative to stochastic gradient descent. This is in comparison to traditional natural gradient methods, which scale as the square of the number of parameters. We show that Sven can be understood as a natural gradient method generalized to the over-parametrized regime, recovering natural gradient descent in the under-parametrized limit. On regression tasks, Sven significantly outperforms standard first-order methods including Adam, converging faster and to a lower final loss, while remaining competitive with LBFGS at a fraction of the wall-time cost. We discuss the primary challenge to scaling, namely memory overhead, and propose mitigation strategies. Beyond standard machine learning benchmarks, we anticipate that Sven will find natural application in scientific computing settings where custom loss functions decompose into several conditions.
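The core update the abstract describes can be sketched in a few lines of NumPy: form the Jacobian of per-sample residuals, truncate its SVD to the $k$ most significant directions, and apply the resulting pseudoinverse to the residual vector. This is a minimal sketch of our reading of the method; the function name, learning-rate handling, and truncation interface are our assumptions, not the paper's implementation.

```python
import numpy as np

def sven_update(J, r, k, lr=1.0):
    """Minimum-norm parameter update via a rank-k truncated pseudoinverse.

    J : (n_samples, n_params) Jacobian of the per-sample residuals.
    r : (n_samples,) residual vector, one condition per data point.
    """
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    # Keep only the k most significant directions, as in the abstract.
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]
    # delta = -lr * J^+ r, restricted to the retained subspace.
    return -lr * (Vtk.T @ ((Uk.T @ r) / sk))
```

On an over-parametrized linear toy problem (more parameters than conditions), a single full-rank step satisfies all conditions at once, matching the minimum-norm interpretation.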
☆ Safe learning-based control via function-based uncertainty quantification
Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.
comment: Under review for CDC 2026
☆ Affine Normal Directions via Log-Determinant Geometry: Scalable Computation under Sparse Polynomial Structure
Affine normal directions provide intrinsic affine-invariant descent directions derived from the geometry of level sets. Their practical use, however, has long been hindered by the need to evaluate third-order derivatives and invert tangent Hessians, which becomes computationally prohibitive in high dimensions. In this paper, we show that affine normal computation admits an exact reduction to second-order structure: the classical third-order contraction term is precisely the gradient of the log-determinant of the tangent Hessian. This identity replaces explicit third-order tensor contraction by a matrix-free formulation based on tangent linear solves, Hessian-vector products, and log-determinant gradient evaluation. Building on this reduction, we develop exact and stochastic matrix-free procedures for affine normal evaluation. For sparse polynomial objectives, the algebraic closure of derivatives further yields efficient sparse kernels for gradients, Hessian-vector products, and directional third-order contractions, leading to scalable implementations whose cost is governed by the sparsity structure of the polynomial representation. We establish end-to-end complexity bounds showing near-linear scaling with respect to the relevant sparsity scale under fixed stochastic and Krylov budgets. Numerical experiments confirm that the proposed MF-LogDet formulation reproduces the original autodifferentiation-based affine normal direction to near machine precision, delivers substantial runtime improvements in moderate and high dimensions, and exhibits empirical near-linear scaling in both dimension and sparsity. These results provide a practical computational route for affine normal evaluation and reveal a new connection between affine differential geometry, log-determinant curvature, and large-scale structured optimization.
comment: 33 pages, 9 figures
☆ Spectral Decomposition of Discrete-Time Controllability Gramian and Its Inverse via System Eigenvalues
This paper develops a closed-form spectral decomposition framework for the Gramian matrices of discrete-time linear dynamical systems. The main results provide explicit decompositions of the discrete-time controllability Gramian and its inverse in terms of the eigenvalues of the dynamics matrix, yielding a mode-resolved representation of these matrices. In contrast to the more common use of aggregate Gramian characteristics, such as eigenvalues, singular values, determinants, and trace-based metrics, the proposed approach describes the internal structure of the Gramian itself through contributions associated with individual modes and their pairwise combinations. The framework is extended further to the solution of the discrete-time Lyapunov difference equation, placing the obtained formulas in a broader context relevant to the analysis and computation of time-varying and nonlinear systems. In addition, the decomposition is generalized to systems whose dynamics matrix has multiple eigenvalues, enabling a closed-form estimation of the effects of resonant interactions between eigenmodes. The proposed results provide a structural tool for the analysis of controllability, observability and stability in discrete-time systems and complement existing Gramian-based methods used in model reduction, estimation, actuator and sensor selection, and energy-aware control. Beyond their theoretical interest, the derived decompositions may support the development of improved computational procedures and more informative performance criteria for a range of discrete-time control problems.
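The mode-resolved structure referred to above can be illustrated with the classical spectral identity for the discrete-time controllability Gramian: for a Schur-stable, diagonalizable $A = V \mathrm{diag}(\lambda) V^{-1}$ with simple eigenvalues, $W = \sum_k A^k B B^\top (A^\top)^k$ has entries $G_{ij}/(1 - \lambda_i \bar{\lambda}_j)$ in the eigenbasis, with $G = V^{-1} B B^\top V^{-H}$. The following is a small numerical check of that identity, as our illustration of the kind of decomposition the abstract develops, not the paper's exact formulas.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.5, 0.2],
              [0.0, -0.3]])
B = np.array([[1.0],
              [1.0]])

lam, V = np.linalg.eig(A)
Vi = np.linalg.inv(V)
G = Vi @ B @ B.T @ Vi.conj().T
# Entry (i, j) carries the pairwise interaction of modes i and j.
W_modes = (V @ (G / (1.0 - np.outer(lam, lam.conj()))) @ V.conj().T).real
# Reference: solve the Stein equation W = A W A' + B B' directly.
W_ref = solve_discrete_lyapunov(A, B @ B.T)
```

Each term of the eigenbasis matrix gives the contribution of one mode pair, which is the kind of internal structure aggregate metrics (trace, determinant) discard.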
☆ Stability, Contraction, and Controllers for Affine Systems
Recent developments in data-driven control have revived interest in the behavioral approach to systems theory, where systems are defined as sets of trajectories rather than being described by a specific model or representation. However, most available results remain confined to linear systems, limiting the applicability of recent methods to complex behaviors. Affine systems form a natural intermediate class: they arise from linearization, capture essential nonlinear effects, and retain sufficient structure for analysis and design. This paper derives necessary and sufficient conditions, independent of any particular representation, for three fundamental stability problems for affine behaviors: (i) converse Lyapunov theorems for contraction of input-output systems; (ii) implementability and existence of prescribed contractive references; and (iii) whether these references can be implemented with linear or affine feedback control. For the latter, we show that linear controllers suffice for implementing contractive closed-loop behavior, while affine controllers are needed for equilibrium placement.
☆ Stein Variational Uncertainty-Adaptive Model Predictive Control
We propose a Stein variational distributionally robust controller for nonlinear dynamical systems with latent parametric uncertainty. The method replaces conservative worst-case ambiguity-set optimization with a deterministic particle-based approximation of a task-dependent uncertainty distribution, enabling the controller to concentrate on parameter sensitivities that most strongly affect closed-loop performance. Our method yields a controller that is robust to latent parameter uncertainty by coupling optimal control with Stein variational inference, avoiding restrictive parametric assumptions on the uncertainty model while preserving computational parallelism. In contrast to classical DRO, which can sacrifice nominal performance through worst-case design, our approach achieves robustness by shaping the control law around the uncertainties that are most critical to the task objective. The proposed framework therefore reconciles robust control and variational inference in a single decision-theoretic formulation for broad classes of control systems with parameter uncertainty. We demonstrate our approach on representative control problems that empirically illustrate improved performance-robustness tradeoffs over nominal, ensemble, and classical distributionally robust baselines.
☆ A Bilevel Integer Programming Approach for the Synchronous Attractor Control Problem
Boolean networks are dynamical models of disease development in which the activation levels of genes are represented by binary variables. Given a Boolean network, controls represent mutations or medical treatments that fix the activation levels of selected genes so that all states in every attractor (i.e., long-term recurrent states) satisfy a desired phenotype. Our goal is to enumerate all minimal controls, identifying critical gene subsets in disease development and therapy. This problem has an inherent bilevel integer programming structure and is computationally challenging. We propose an infeasibility-based Benders decomposition, a logic-based Benders framework for bilevel integer programs with multiple subproblems. In our application, each subproblem finds a forbidden attractor of a given length and yields a problem-specific feasibility cut. We also propose an auxiliary IP called subspace separation that finds a Boolean subspace that includes multiple forbidden attractors and thereby strengthens the cut. Numerical experiments show that the resulting algorithms are much more scalable than state-of-the-art methods and that subspace separation substantially improves performance.
comment: 30 pages, 8 figures
☆ Polynomial Parametric Koopman Operators for Stochastic MPC
This paper develops a parametric Koopman operator framework for Stochastic Model Predictive Control (SMPC), where the Koopman operator is parametrized by Polynomial Chaos Expansions (PCEs). The model is learned from data using the Extended Dynamic Mode Decomposition -- Dictionary Learning (EDMD-DL) method, which preserves the convex least-squares structure for the PCE coefficients of the EDMD matrix. Unlike conventional stochastic Galerkin projection approaches, we derive a condensed deterministic reformulation of the SMPC problem whose dimension scales only with the control horizon and input dimension, and is independent of both the lifted state dimension and the number of retained PCE terms. Our framework, therefore, enables efficient nonlinear SMPC problems with expectation and second-order moment constraints with standard convex optimization solvers. Numerical examples demonstrate the efficacy of our framework for uncertainty-aware SMPC of nonlinear systems.
comment: 8 pages, 5 figures, submitted to CDC 2026
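The EDMD building block that the abstract parametrizes with PCEs reduces to a lifted least-squares fit. Below is a minimal sketch with a hand-picked dictionary; this is illustrative only, since the paper learns the dictionary via EDMD-DL and adds the PCE parametrization on top.

```python
import numpy as np

def edmd(X, Y, psi):
    """Fit the Koopman matrix K so that psi(Y) ~ psi(X) @ K (least squares)."""
    PX, PY = psi(X), psi(Y)
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K

# Hand-picked dictionary [x, x^2, 1]; EDMD-DL would learn this instead.
psi = lambda X: np.column_stack([X, X**2, np.ones(len(X))])
x = np.linspace(-1.0, 1.0, 50)
y = 0.5 * x                     # toy linear dynamics x_{t+1} = 0.5 x_t
K = edmd(x, y, psi)
```

Because the dynamics and dictionary are polynomially compatible here, the fit is exact; the convex least-squares structure is what the paper preserves when the entries of K become PCE expansions.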
☆ Chvátal-Gomory Rounding of Eigenvector Inequalities for QCQPs
We introduce and analyze a class of valid inequalities for nonconvex quadratically constrained optimization problems (QCQPs) which we call Eigen-CG inequalities. These inequalities are obtained by applying a Chvátal-Gomory (CG) rounding to the well-known eigenvector inequalities for QCQPs, and transferring binary-valid inequalities to the continuous setting via a result of Burer and Letchford (2009). We define three nested subfamilies and prove that they are strictly contained in one another. However, we show that the convex conic closure of two of these subfamilies is equal and, in fact, coincides with the Boros-Hammer inequalities -- a powerful family of inequalities that include, in particular, the triangle and McCormick inequalities. Using this CG perspective, we also prove that dense Eigen-CG inequalities are ineffective when used with the standard SDP+McCormick relaxation. This provides a complementary perspective on what is observed in practice: that sparse inequalities are impactful. Finally, based on these insights, we develop a computational strategy to find sparse Eigen-CG cuts and verify their effectiveness in nonconvex QCQP instances. Our results confirm that density quickly degrades effectiveness, but that including sparse inequalities beyond triangle inequalities can provide significant improvements in dual bounds.
☆ Soft projections for robust data-driven control
We consider data-based predictive control based on behavioral systems theory. In the linear setting this means that a system is described as a subspace of trajectories, and predictive control can be formulated using a projection onto the intersection of this behavior and a constraint set. Instead of learning the model, or subspace, we focus on determining this projection from data. Motivated by the use of regularization in data-enabled predictive control (DeePC), we introduce the use of soft projections, which approximate the true projector onto the behavior from noisy data. In the simplest case, these are equivalent to known regularized DeePC schemes, but they exhibit a number of benefits. First, we provide a bound on the approximation error consisting of a bias and a variance term that can be traded-off by the regularization weight. The derived bound is independent of the true system order, highlighting the benefit of soft projections compared to low-dimensional subspace estimates. Moreover, soft projections allow for intuitive generalizations, one of which we show has superior performance on a case study. Finally, we provide update formulas for soft projectors enabling the efficient adaptation of the proposed data-driven control methods in the case of streaming data.
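One plausible instantiation of such a soft projection, under our reading and not necessarily the paper's exact definition, is a ridge-regularized projection onto the span of noisy data columns: the weight gamma trades the bias of shrinking true directions against the variance of fitting noise, mirroring the bias-variance trade-off described above.

```python
import numpy as np

def soft_projector(H, gamma):
    """Ridge-regularized ("soft") projector onto the column span of H.

    gamma -> 0 recovers the exact orthogonal projector; gamma > 0 shrinks
    directions with small singular values, damping the influence of noise.
    """
    n = H.shape[1]
    return H @ np.linalg.solve(H.T @ H + gamma * np.eye(n), H.T)
```

Note that no rank or system-order estimate is needed, consistent with the abstract's point that soft projections avoid committing to a low-dimensional subspace estimate.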
☆ Min-Max Grassmannian Optimization for Online Subspace Tracking
This paper discusses robustness guarantees for online tracking of time-varying subspaces from noisy data. Building on recent work in optimization over a Grassmannian manifold, we introduce a new approach for robust subspace tracking by modeling data uncertainty in a Grassmannian ball. The robust subspace tracking problem is cast into a min-max optimization framework, for which we derive a closed-form solution for the worst-case subspace, enabling a geometric robustness adjustment that is both analytically tractable and computationally efficient, unlike iterative convex relaxations. The resulting algorithm, GeRoST (Geometrically Robust Subspace Tracking), is validated on two case studies: tracking a linear time-varying system and online foreground-background separation in video.
comment: Submitted to the 65th IEEE Conference on Decision and Control, December 15-18 2026, Honolulu, Hawaii, USA
☆ Multicentric thrombus segmentation using an attention-based recurrent network with gradual modality dropout
Detecting and delineating tiny targets in 3D brain scans is a central yet under-addressed challenge in medical imaging. In ischemic stroke, for instance, the culprit thrombus is small, low-contrast, and variably expressed across modalities (e.g., susceptibility-weighted T2 blooming, diffusion restriction on DWI/ADC), while real-world multi-center data introduce domain shifts, anisotropy, and frequent missing sequences. We introduce a methodology that couples an attention-based recurrent segmentation network (UpAttLLSTM) with a training schedule that progressively increases the difficulty of hetero-modal learning through gradual modality dropout. UpAttLLSTM aggregates context across slices via recurrent units (2.5D) and uses attention gates to fuse complementary cues across available sequences, making it robust to anisotropy and class imbalance. Gradual modality dropout systematically simulates site heterogeneity, noise, and missing modalities during training, acting as both augmentation and regularization to improve multi-center generalization. On a monocentric cohort, our approach detects thrombi in >90% of cases with a Dice score of 0.65. In a multi-center setting with missing modalities, it achieves ~80% detection with a Dice score around 0.35. Beyond stroke, the proposed methodology directly transfers to other small-lesion tasks in 3D medical imaging where targets are scarce, subtle, and modality-dependent.
☆ Convergence of projected stochastic natural gradient variational inference for various step size and sample or batch size schedules
Stochastic natural gradient variational inference (NGVI) is a popular and efficient algorithm for Bayesian inference. Despite empirical success, the convergence of this method is still not fully understood. In this work, we define and study a projected stochastic NGVI when variational distributions form an exponential family. Stochasticity arises when either gradients are intractable expectations or large sums. We prove new non-asymptotic convergence results for combinations of constant or decreasing step sizes and constant or increasing sample/batch sizes. When all hyperparameters are fixed, NGVI is shown to converge geometrically to a neighborhood of the optimum, while we establish convergence to the optimum with rates of the form $\mathcal{O}\left(\frac{1}{T^{\rho}}\right)$, possibly with $\rho \geq 1$, for all other combinations of step size and sample/batch size schedules. These rates apply when the target posterior distribution is close in some sense to the considered exponential family. Our theoretical results extend existing NGVI and stochastic optimization results and provide more flexibility to adjust, in a principled way, step sizes and sample/batch sizes in order to meet speed, resources, or accuracy constraints.
☆ Distributed partial state estimation for linear state-space systems
This study is concerned with the problem of partial state estimation for linear time-invariant (LTI) distributed state-space systems. A necessary and sufficient condition is established in terms of a simple rank criterion involving the system coefficient matrices, provided the communication graph is either directed, balanced and strongly connected or undirected and connected. The estimator parameter matrices are obtained by simple matrix theory. Finally, a numerical example demonstrates the feasibility and effectiveness of the proposed theoretical results and design algorithm.
☆ Discrete adjoint gradient computation for multiclass traffic flow models on road networks
This paper applies a discrete adjoint gradient computation method to a multi-class traffic flow model on road networks. Vehicle classes are characterized by their specific velocity functions, which depend on the total traffic density, resulting in a coupled hyperbolic system of conservation laws. The system is discretized using a Godunov-type finite volume scheme based on demand and supply functions, extended to handle complex junction coupling conditions -- such as merges and diverges -- and boundary conditions with buffer lengths to account for congestion spillback. The optimization of different travel-related performance metrics, including total travel time and total travel distance, is formulated as a constrained minimization problem and is accomplished through the use of an adjoint gradient approach, allowing for an efficient computation of sensitivities with respect to the chosen time-dependent control variables. Numerical simulations on a sample network demonstrate the efficiency of the proposed framework, particularly as the number of control parameters increases. This approach provides a robust and computationally efficient solution, making it suitable for large-scale traffic network optimization.
comment: 33 pages, 24 figures
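The adjoint mechanics behind this efficiency can be illustrated on a generic discretized dynamics $x_{t+1} = f(x_t, u_t)$ with cost $J = \sum_t g(x_t)$: one forward pass, one backward sweep in the adjoint variable, and gradients for all controls at roughly the cost of one extra simulation. The scalar toy below is our illustration, not the paper's multiclass Godunov scheme; all names are stand-ins.

```python
import numpy as np

A_COEF = 0.9  # f(x, u) = A_COEF * x + u, a stand-in for one scheme step

def forward(x0, u):
    xs = [x0]
    for ut in u:
        xs.append(A_COEF * xs[-1] + ut)
    return xs

def cost(x0, u):
    # J = sum_t g(x_t) with g(x) = x^2, a stand-in for total travel time.
    return sum(x * x for x in forward(x0, u)[1:])

def adjoint_gradient(x0, u):
    """dJ/du via one backward (adjoint) sweep: O(T), independent of dim(u)."""
    xs = forward(x0, u)
    lam = 2.0 * xs[-1]                       # lambda_T = g'(x_T)
    grad = np.zeros(len(u))
    for t in range(len(u) - 1, -1, -1):
        grad[t] = lam                        # dJ/du_t = lambda_{t+1} * df/du
        lam = 2.0 * xs[t] + A_COEF * lam     # lambda_t = g'(x_t) + lambda_{t+1} * df/dx
    return grad
```

The cost of the backward sweep does not grow with the number of control parameters, which is exactly why the adjoint approach pays off as that number increases.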
☆ Theoretical Perspectives on Jabr-Type Convex Relaxations for AC Optimal Power Flow
The alternating current optimal power flow problem is a fundamental yet highly nonconvex optimization problem whose structure reflects both nonlinear power flow physics and the topology of the underlying network. Among convex relaxations, the second-order cone relaxation introduced by Jabr has proven particularly influential, serving as a computationally efficient alternative to semidefinite relaxations and a foundation for numerous strengthening techniques. In recent years, a variety of approaches have been proposed to tighten Jabr-type relaxations, including cycle-based constraints, convex envelopes of multilinear terms, and dual reformulations. However, these developments are often presented independently, concealing their common geometric and graph-theoretic foundations. This paper provides a structured review of strengthening techniques for the Jabr relaxation and develops a unifying perspective based on multilinear equalities. We reinterpret cycle constraints as multilinear consistency conditions, analyze their convexification through classical convex hull theory, and investigate the relationship between primal McCormick relaxations and dual extended formulations. In particular, we identify structural conditions under which these relaxations coincide and clarify the distinction between convexifying the interaction graph and convexifying the feasible set of the ACOPF. The resulting framework connects graph structure, multilinear convexification, and conic relaxations in a unified manner, offering both a conceptual synthesis of existing results and new insights for the design of stronger relaxations.
☆ No quantum advantage implies improved bounds and classical algorithms for the binary paint shop problem
The binary paint shop problem (BPSP) is an APX-hard optimization problem in which, given n car models that occur twice in a sequence of length 2n, the goal is to find a colouring sequence such that the two occurrences of each model are painted differently, while minimizing the number of times the paint is swapped along the sequence. A recent classical heuristic, known as the recursive star greedy (RSG) algorithm, is conjectured to achieve an expected paint swap ratio of 0.361, thereby outperforming the QAOA with circuit depth p = 7. Since the performance of the QAOA with logarithmic circuit depth is instance independent, the average paint swap ratio is upper-bounded by the QAOA, which numerical evidence suggests is approximately between 0.265 and 0.282. To provide hardware-relevant comparisons, we additionally implement the BPSP on a D-Wave Quantum Annealer Advantage 2, obtaining a minimum paint swap ratio of 0.320. Given that the QAOA with logarithmic circuit depth does not exhibit quantum advantage for sparse optimization problems such as the BPSP, this implies the existence of a classical algorithm that surpasses both the RSG algorithm and logarithmic depth QAOA. We provide numerical evidence that the Mean-Field Approximate Optimization Algorithm (MF-AOA) is one such algorithm that beats all known classical heuristics and quantum algorithms to date with a paint swap ratio of approximately 0.2799.
comment: 8 pages, 1 figure. This work has been submitted to the IEEE quantum week (QCE 26) for possible publication
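For concreteness, the problem and its simplest baseline fit in a few lines: along the sequence, a model's second occurrence is forced to the opposite colour of its first, and swaps are counted between adjacent positions. This is the plain greedy heuristic, shown only to make the problem precise; it is neither the RSG algorithm nor the MF-AOA discussed in the abstract.

```python
def greedy_paint(seq):
    """Greedy baseline for the binary paint shop problem.

    seq lists 2n cars; each model id appears exactly twice. Returns the
    0/1 colour sequence and the number of paint swaps.
    """
    first, colors, cur = {}, [], 0
    for car in seq:
        if car not in first:
            first[car] = cur          # first occurrence: keep current paint
        else:
            cur = 1 - first[car]      # second occurrence: forced opposite colour
        colors.append(cur)
    swaps = sum(a != b for a, b in zip(colors, colors[1:]))
    return colors, swaps
```

The paint swap ratio of a heuristic is then swaps divided by n, the quantity the abstract compares across RSG, QAOA, quantum annealing, and MF-AOA.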
☆ An Accelerated Proximal Bundle Method with Momentum
Proximal bundle methods (PBM) are a powerful class of algorithms for convex optimization. Compared to gradient descent, PBM constructs more accurate surrogate models that incorporate gradients and function values from multiple past iterations, which leads to faster and more robust convergence. However, for smooth convex problems, PBM only achieves an $O(1/k)$ convergence rate, which is inferior to the optimal $O(1/k^2)$ rate. To bridge this gap, we propose an accelerated proximal bundle method (APBM) that integrates Nesterov's momentum into PBM. We prove that under standard assumptions, APBM achieves the optimal $O(1/k^2)$ convergence rate. Numerical experiments demonstrate the effectiveness of the proposed APBM.
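The momentum mechanism being grafted onto the bundle method is Nesterov's classical extrapolation scheme, which attains the optimal $O(1/k^2)$ rate for smooth convex problems on its own. A standalone sketch of that scheme (the extrapolation step only, not the paper's bundle machinery):

```python
import numpy as np

def nesterov_agd(grad, x0, L, iters):
    """Nesterov's accelerated gradient for L-smooth convex objectives."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = y - grad(y) / L                           # gradient step at
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))  # the extrapolated point
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x
```

APBM replaces the single gradient step with a step against the bundle's cutting-plane model while keeping this extrapolation, which is how the abstract's $O(1/k^2)$ guarantee arises.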
☆ Scenario theory for multi-criteria data-driven decision making
The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications - including multi-agent decision problems - require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi-criteria data-driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi-criteria data-driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.
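For context, the classical single-criterion certificate of the scenario approach ties the number of scenarios $N$, the decision dimension $d$, the confidence $\beta$, and the violation level $\epsilon$ through a binomial tail; the multi-criteria theory developed here generalizes this kind of bound across datasets. A bisection sketch of the standard single-dataset bound (not the paper's new certificates):

```python
from math import comb

def scenario_epsilon(N, d, beta, tol=1e-10):
    """Smallest eps with sum_{i<d} C(N,i) eps^i (1-eps)^(N-i) <= beta."""
    def tail(eps):
        return sum(comb(N, i) * eps**i * (1 - eps) ** (N - i) for i in range(d))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:          # tail(eps) is decreasing in eps
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if tail(mid) > beta else (lo, mid)
    return hi
```

A naive multi-criteria extension would union-bound such certificates over criteria; the abstract's point is that a collective treatment of the violation risks yields substantially sharper guarantees.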
☆ Polynomial Stability for Weakly Coupled System with Partial Controls
We study the stability of general weakly coupled systems subject to a reduced number of local or boundary controls. We show that, under Kalman's rank condition, the exponential stability of the underlying scalar equation implies polynomial stability of the full coupled system. Moreover, the decay rate remains unchanged regardless of the number of equations in the system. The proof relies on resolvent estimates and a clever exploitation of Kalman's rank condition to ensure effective transmission of damping across the coupled equations. The abstract result is applied to several concrete models, including systems of wave equations with local viscous, local viscoelastic, or boundary damping; systems of plate equations with internal damping; and thermoelastic systems of type III. Moreover, the optimality of the decay rate is established via spectral analysis.
☆ Dynamic Weight Optimization for Double Linear Policy: A Stochastic Model Predictive Control Approach
The Double Linear Policy (DLP) framework guarantees a Robust Positive Expectation (RPE) under optimized constant-weight designs or admissible prespecified time-varying policies. However, the sequential optimization of these time-varying weights remains an open challenge. To address this gap, we propose a Stochastic Model Predictive Control (SMPC) framework. We formulate weight selection as a receding-horizon optimal control problem that explicitly maximizes risk-adjusted returns while enforcing survivability and predicted positive expectation constraints. Notably, an analytical gradient is derived for the non-convex objective function, enabling efficient optimization via the L-BFGS-B algorithm. Empirical results demonstrate that this dynamic, closed-loop approach improves risk-adjusted performance and drawdown control relative to constant-weight and prescribed time-varying DLP baselines.
comment: 8 pages. Submitted for possible publication
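As a stripped-down illustration of optimizing weights against a risk-adjusted objective with an analytical gradient: the paper uses L-BFGS-B on a non-convex receding-horizon objective, while the toy below substitutes a simple mean-variance objective and plain gradient ascent (all numbers and the objective are assumptions for illustration):

```python
def grad_ascent_weights(mu, sigma, lam=1.0, lr=5.0, iters=1000):
    """Maximize J(w) = mu.w - lam * w' Sigma w by gradient ascent,
    using the analytical gradient mu - 2*lam*Sigma*w (a toy stand-in
    for the paper's L-BFGS-B step on its risk-adjusted objective)."""
    n = len(mu)
    w = [0.0] * n
    for _ in range(iters):
        grad = [mu[i] - 2 * lam * sum(sigma[i][j] * w[j] for j in range(n))
                for i in range(n)]
        w = [w[i] + lr * grad[i] for i in range(n)]
    return w

mu = [0.10, 0.05]                       # expected returns (illustrative)
sigma = [[0.04, 0.01], [0.01, 0.02]]    # covariance (illustrative)
w = grad_ascent_weights(mu, sigma)
```

For this quadratic toy the stationarity condition 2*lam*Sigma*w = mu has the closed-form solution w = (15/14, 5/7), which the iteration recovers.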
☆ A Musielak-Orlicz approach for modeling uncertainties in long-memory processes
This paper introduces a novel mathematical approach for modeling uncertainties in long-memory processes. We focus on the superposition of Ornstein-Uhlenbeck processes, the so-called supOU process, as a nominal model for complex phenomena having long memory. Uncertainties in a supOU process appear as distortions of its reversion and Levy measures, and we propose to simultaneously assess uncertainties of these measures by using state-dependent divergence functions on Musielak-Orlicz spaces. Our approach is reduced to solving a pair of optimization problems that give upper- and lower-bounds of the cumulants of supOU processes subject to a prescribed uncertainty set. We show that classical divergence functions, such as the Kullback-Leibler one, fail to well-define these optimization problems and how this drawback can be resolved by using proper Musielak-Orlicz spaces. Moreover, we provide sufficient conditions to solve these optimization problems. We demonstrate a computational application of the proposed framework to water environmental problems where the streamflow discharge is modeled as a supOU process. This paper provides a new viewpoint for studying long-memory processes along with applications.
☆ Revisiting the Constant-Rank Constraint Qualification for Second-Order Cone Programs
The constant rank constraint qualification (CRCQ) for second-order cone programs, introduced by Andreani et al. in [Math. Program. 202 (2023), 473 - 513], shares some desirable properties with its classical nonlinear programming counterpart; specifically, it guarantees strong second-order necessary conditions for optimality, and is independent of the Robinson constraint qualification. However, unlike the classical version, this new CRCQ can fail in the linear case, and it is unclear whether CRCQ implies the metric subregularity constraint qualification (MSCQ). The aim of this paper is to examine the CRCQ for second-order cone programs in the linear setting. First, we show that the facial constant rank property, which is a key requirement for the validity of CRCQ, does not always hold in this context. Then, we derive a necessary and sufficient condition for a feasible point to satisfy this property. After that, we establish an easily verifiable characterization of CRCQ. Finally, utilizing this characterization, we prove that CRCQ and MSCQ are equivalent.
☆ Implicit Primal-Dual Interior-Point Methods for Quadratic Programming
This paper introduces a new method for solving quadratic programs using primal-dual interior-point methods. Instead of handling complementarity as an explicit equation in the Karush-Kuhn-Tucker (KKT) conditions, we ensure that complementarity is implicitly satisfied by construction. This is achieved by introducing an auxiliary variable and relating it to the duals and slacks via a retraction map. Specifically, we prove that the softplus function has favorable numerical properties compared to the commonly used exponential map. The resulting KKT system is guaranteed to be spectrally bounded, thereby eliminating the most pressing limitation of primal-dual methods: ill-conditioning near the solution. These attributes facilitate the solution of the underlying linear system, either by removing the need to compute factorizations at every iteration, enabling factorization-free approaches like indirect solvers, or allowing the solver to achieve high accuracy in low-precision arithmetic. Consequently, this novel perspective opens new opportunities for interior-point methods, especially for solving large-scale problems to high precision.
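A minimal illustration of why softplus is numerically gentler than the exponential map as a positivity retraction (a generic sketch; the paper's exact construction may differ): a stable evaluation never overflows, the output is always positive, and the derivative, the sigmoid, is bounded by 1, whereas exp and its derivative blow up.

```python
import math

def softplus(x):
    # Stable log(1 + exp(x)): never overflows, always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Derivative of softplus, bounded in [0, 1].
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    t = math.exp(x)
    return t / (1.0 + t)

# math.exp(800.0) would overflow; softplus(800.0) is simply ~800.
val = softplus(800.0)
```
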
☆ Extended State Observer for Localized Fault Awareness in RF Accelerating Structures
An observer framework is presented for robust regulation of RF cavity fields and localized identification of disturbances in RF systems. A standard cavity field observer is augmented with additional states to estimate the evolution of cavity detuning and phase drifts induced by the drive and receiver chains. Monte Carlo simulations are performed to assess the performance of the proposed estimator under realistic conditions for the intended high-power linear accelerator operation. Results showcase precise cavity field regulation and the reliability with which the observer assigns deviations to the correct subsystem. The resulting diagnostic capability provides a foundation for improved fault detection, faster troubleshooting during accelerator operation, and more informed maintenance of RF systems in large accelerator facilities.
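The augmented-observer idea can be sketched on a toy first-order plant with an unknown constant disturbance treated as an extra estimated state (the plant, gains, and time step are illustrative assumptions, not the paper's RF cavity model):

```python
# Toy extended-state observer for  x' = -a*x + u + d,
# where the constant disturbance d is estimated as an extra state.
a, dt = 1.0, 0.001
L1, L2 = 40.0, 400.0      # observer gains (assumed; error poles at -16, -25)
x, d = 0.0, 0.7           # true state and hidden disturbance
xh, dh = 0.0, 0.0         # observer estimates
u = 1.0
for _ in range(20000):    # 20 s of simulated time, Euler integration
    y = x                 # measurement at time t
    e = y - xh            # innovation
    x += dt * (-a * x + u + d)                 # plant step
    xh += dt * (-a * xh + u + dh + L1 * e)     # model copy + correction
    dh += dt * (L2 * e)                        # disturbance estimate
```

The estimate dh converges to the hidden disturbance 0.7, mirroring how the paper's augmented states localize detuning and drift contributions.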
♻ ☆ Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk
To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.
♻ ☆ Macroscopic Traffic Flow Network Modeling For Wildfire Evacuation: A Game-Theoretic Junction Optimization Approach with Application to Lahaina Fire
The 2023 Lahaina wildfire killed 102 people on a peninsula served by a single two-lane highway, making exit lane capacity the binding constraint on evacuation time. We model the evacuation as a system of hyperbolic scalar conservation laws on a directed graph with game-theoretic junction conditions that maximize total network flux, an evacuation-calibrated piecewise linear-quadratic flux function, and a loss-driven optimization framework that tunes traffic distribution toward priority corridors. Analytical results on a toy network and numerical simulations of the Lahaina road network reveal a phase transition in exit lane capacity. Additional lanes improve throughput linearly until a computable critical threshold, beyond which no route optimization yields further benefit. For Lahaina, reversing one southbound lane captures nearly all achievable improvement, and a fourth lane can be reserved for emergency vehicles with negligible impact on civilian clearance time. These results provide a rigorous mathematical basis for contraflow recommendations in wildland-urban interface evacuations.
comment: 59 pages, 33 figures, 15 tables
♻ ☆ Online Fair Allocation of Perishable Resources
We consider a practically motivated variant of the canonical online fair allocation problem: a decision-maker has a budget of perishable resources to allocate over a fixed number of rounds. Each round sees a random number of arrivals, and the decision-maker must commit to an allocation for these individuals before moving on to the next round. The goal is to construct a sequence of allocations that is envy-free and efficient. Our work makes two important contributions toward this problem: we first derive strong lower bounds on the optimal envy-efficiency trade-off, demonstrating that a decision-maker is fundamentally limited in what she can hope to achieve relative to the no-perishing setting; we then design an algorithm achieving these lower bounds which takes as input (i) a prediction of the perishing order, and (ii) a desired bound on envy. Given the remaining budget in each period, the algorithm uses forecasts of future demand and perishing to adaptively choose from one of two carefully constructed guardrail quantities. We demonstrate our algorithm's strong numerical performance, and state-of-the-art, perishing-agnostic algorithms' inefficacy, on simulations calibrated to a real-world dataset.
comment: 57 pages, 10 figures
♻ ☆ A Multi-Criterion Approach to Smart EV Charging with CO2 Emissions and Cost Minimization
We study carbon-aware smart charging in a fossil-dominated grid by coupling a simplified hydro-thermal-renewable dispatch model with a tractable linear charging scheduler. The case study is informed by Vietnam's regional data. Thermal units remain dominant, renewables are time-varying, and hydropower is modeled through a single reservoir budget. From the day-ahead dispatch we derive hourly carbon intensity and a corresponding carbon-cost signal; these are combined with a local time-of-use tariff in the EV charging problem. The resulting weighted-sum linear program is multi-objective: by sweeping the trade-off coefficient, we recover the supported Pareto frontier between electricity cost and charging-associated emissions. In a 300-EV public-charging scenario with a 0.8 MW feeder cap, the proposed carbon-aware scheduler preserves the 19.8% bill reduction of a cost-only optimizer while lowering charging-associated emissions by 7.3%; a more carbon-focused tuning still remains 12.6% cheaper and 9.3% cleaner than a FIFO baseline. A hydro-sensitivity study shows that changing the reservoir budget by +/- 20% moves the mean grid carbon intensity from 360 to 466 g/kWh, yet the carbon-aware scheduler remains consistently cheaper and cleaner than FIFO. The dispatch and charging LPs solve in a few milliseconds on a standard desktop computer, showing that the framework is lightweight enough for repeated day-ahead studies.
comment: Paper submitted to the 65th IEEE Conference on Decision and Control in Honolulu, Hawaii
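A hedged stand-in for the weighted-sum scheduling idea: fill the hours with the lowest blended price/carbon score up to the feeder cap until the energy requirement is met. The paper solves a linear program; for this simplified constraint set (one energy requirement, a per-hour cap) the greedy coincides with the LP optimum. All numbers are illustrative:

```python
def schedule_charging(energy_req, cap, price, carbon, w=0.5):
    """Greedy weighted-sum charging: cheapest blended-score hours
    first, each filled up to the feeder cap."""
    hours = sorted(range(len(price)),
                   key=lambda h: w * price[h] + (1 - w) * carbon[h])
    plan = [0.0] * len(price)
    remaining = energy_req
    for h in hours:
        if remaining <= 0:
            break
        plan[h] = min(cap, remaining)
        remaining -= plan[h]
    return plan

price  = [0.30, 0.10, 0.12, 0.25]   # tariff per hour (illustrative)
carbon = [500., 420., 360., 480.]   # gCO2/kWh per hour (illustrative)
plan = schedule_charging(energy_req=1.5, cap=0.8, price=price, carbon=carbon)
```

Sweeping w between 0 and 1 traces the cost/emissions trade-off, the discrete analogue of the supported Pareto frontier recovered in the paper.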
♻ ☆ A Practical GPU-Enhanced Matrix-Free Primal-Dual Method for Large-Scale Conic Programs
In this paper, we introduce a practical GPU-enhanced matrix-free first-order method for solving large-scale conic programming problems, which we refer to as PDCS, standing for the Primal-Dual Conic Programming Solver. Problems that it solves include linear programs, second-order cone programs, convex quadratic programs, and exponential cone programs. The method avoids matrix factorizations and leverages sparse matrix-vector multiplication as its core computational operation, which is both memory-efficient and well-suited for GPU acceleration. The method builds on the restarted primal-dual hybrid gradient method but further incorporates several enhancements, including a bisection-based method to compute projections onto rescaled cones. cuPDCS, the GPU implementation of PDCS, employs customized computational schemes that exploit different levels of the GPU architecture to handle cones of different types and sizes. Numerical experiments demonstrate that cuPDCS is generally more efficient than state-of-the-art commercial solvers and other first-order methods on large-scale conic program applications, including Fisher market equilibrium problems, Lasso regression, and multi-period portfolio optimization. Furthermore, cuPDCS also exhibits better scalability, efficiency, and robustness compared to other first-order methods on the conic program benchmark dataset CBLIB. These advantages are more pronounced in large-scale, lower-accuracy settings.
comment: 37 pages, 8 figures
♻ ☆ Monotone and Conservative Policy Iteration Beyond the Tabular Case
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail-leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.
♻ ☆ A Player Selection Network for Scalable Game-Theoretic Prediction and Planning
While game-theoretic planning frameworks are effective at modeling multi-agent interactions, they require solving large optimization problems where the number of variables increases with the number of agents, resulting in long computation times that limit their use in large-scale, real-time systems. To address this issue, we propose 1) PSN Game, a learning-based, game-theoretic prediction and planning framework that reduces game size by learning a Player Selection Network (PSN); and 2) a Goal Inference Network (GIN) that makes it possible to use the PSN in incomplete-information games where other agents' intentions are unknown to the ego agent. A PSN outputs a player selection mask that distinguishes influential players from less relevant ones, enabling the ego player to solve a smaller, masked game involving only selected players. By reducing the number of players included in the game, PSN shrinks the corresponding optimization problems, leading to faster solve times. Experiments in both simulated scenarios and real-world pedestrian trajectory datasets show that PSN is competitive with, and often improves upon, the evaluated explicit game-theoretic selection baselines in 1) prediction accuracy and 2) planning safety. Across scenarios, PSN typically selects substantially fewer players than are present in the full game, thereby reducing game size and planning complexity. PSN also generalizes to settings in which agents' objectives are unknown, via the GIN, without test-time fine-tuning. By selecting only the most relevant players for decision-making, PSN Game provides a practical mechanism for reducing planning complexity that can be integrated into existing multi-agent planning frameworks.
♻ ☆ Quantum Approximate Optimization of Integer Graph Problems and Surpassing Semidefinite Programming for Max-k-Cut
Quantum algorithms for binary optimization problems have been the subject of extensive study. However, the application of quantum algorithms to integer optimization problems remains comparatively unexplored. In this paper, we study the Quantum Approximate Optimization Algorithm (QAOA) applied to integer problems on graphs, with each integer variable encoded in a qudit. We derive a general iterative formula for depth-$p$ QAOA expectation on high-girth $d$-regular graphs of arbitrary size. The cost of evaluating the formula is exponential in the QAOA depth $p$ but does not depend on the graph size. Evaluating this formula for Max-$k$-Cut problem for $p\leq 4$, we identify parameter regimes ($k=3$ with degree $d \leq 10$ and $k=4$ with $d \leq 40$) in which QAOA outperforms the Frieze-Jerrum semi-definite programming (SDP) algorithm, which provides the best worst-case guarantee on the approximation ratio. To strengthen the classical baseline we introduce a new heuristic algorithm, based on the degree-of-saturation, that empirically outperforms both the Frieze-Jerrum algorithm and shallow-depth QAOA. Nevertheless, we provide numerical evidence that QAOA may overtake this heuristic at depth $p\leq 20$. Our results show that moving beyond binary to integer optimization problems can open up new avenues for quantum advantage.
comment: Suggestions and comments are welcome
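One plausible reading of a degree-of-saturation greedy for Max-k-Cut (the details below are assumptions; the paper's heuristic may differ): color the vertex whose colored neighborhood shows the most distinct colors, picking the color with the fewest same-color neighbors.

```python
def max_k_cut_greedy(n, edges, k):
    """Saturation-guided greedy for Max-k-Cut."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for _ in range(n):
        # most-saturated uncolored vertex (ties broken by degree)
        v = max((u for u in range(n) if color[u] is None),
                key=lambda u: (len({color[w] for w in adj[u]
                                    if color[w] is not None}), len(adj[u])))
        counts = [0] * k
        for w in adj[v]:
            if color[w] is not None:
                counts[color[w]] += 1
        color[v] = counts.index(min(counts))   # fewest conflicts
    cut = sum(1 for u, v in edges if color[u] != color[v])
    return color, cut

# Triangle plus a pendant edge; with k = 3 every edge can be cut.
color, cut = max_k_cut_greedy(4, [(0, 1), (1, 2), (0, 2), (2, 3)], 3)
```
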
♻ ☆ No-Regret Generative Modeling via Parabolic Monge-Ampère PDE
We introduce a novel generative modeling framework based on a discretized parabolic Monge-Ampère PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror gradient descent step. We establish theoretical guarantees for generative modeling through the lens of no-regret analysis, demonstrating that the iterates converge to the optimal Brenier map under a variety of step-size schedules. As a technical contribution, we derive a new Evolution Variational Inequality tailored to the parabolic Monge-Ampère PDE, connecting geometry, transportation cost, and regret. Our framework accommodates non-log-concave target distributions, constructs an optimal sampling process via the Brenier map, and integrates favorable learning techniques from generative adversarial networks and score-based diffusion models. As direct applications, we illustrate how our theory paves new pathways for generative modeling and variational inference.
comment: 30 pages, 7 figures. Journal version accepted for publication in the Annals of Statistics
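The Sinkhorn algorithm whose continuous limit the paper studies has a compact discrete form: alternate rescalings of a Gibbs kernel until both marginals match.

```python
import math

def sinkhorn(mu, nu, cost, eps=0.1, iters=500):
    """Sinkhorn iterations for entropic optimal transport between
    discrete distributions mu and nu under the given cost matrix."""
    n, m = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # alternately rescale to match the row and column marginals
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

mu = [0.5, 0.5]
nu = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(mu, nu, cost)
```

With a small regularization eps the plan concentrates on the zero-cost diagonal, approaching the unregularized optimal transport plan.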
♻ ☆ Stochastic Optimal Control for Systems with Drifts of Bounded Variation: A Maximum Principle Approach
We study a stochastic control problem for nonlinear systems governed by stochastic differential equations with irregular drift. The drift coefficient is assumed to decompose as $b(t,x,a)=b_1(t,x)+b_2(x)b_3(t,a)$, where $b_1$ is bounded and Borel measurable, $b_2$ has bounded variation, and $b_3$ is bounded and smooth. Under these minimal regularity assumptions, we establish a Pontryagin-type stochastic maximum principle. The analysis relies on new results for SDEs with random drift of bounded variation, including existence, uniqueness, and Malliavin-Sobolev differentiability of the state process. A key ingredient is an explicit representation of the first variation process obtained via integration with respect to the space-time local time of bounded variation processes. By combining a suitable approximation scheme with Ekeland's variational principle, and using a Garcia-Rodemich-Rumsey inequality to obtain a uniform control of the first variation, we derive the maximum principle. As an application, we derive an optimal corridor-type capital adjustment policy for an insurance surplus model.
comment: 37 pages
♻ ☆ On analysis of open optimization algorithms
We consider optimization algorithms that are open systems, that is, with external inputs and outputs. Such algorithms arise, for instance, when analyzing the effect of noise or disturbance on an algorithm, or when an algorithm is part of a control loop without timescale separation. Bridging between monotone operator theory and energy-based modeling, we consider analysis results in the form of incremental dissipativity certificates, yielding tests in the form of linear matrix inequalities. To be precise, we consider robustness in terms of incremental small gain, and composition results for optimization algorithms operating in closed loop.
♻ ☆ Natural Hypergradient Descent: Algorithm Design, Convergence Analysis, and Parallel Implementation
In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation--namely, the need to compute or approximate the inverse of the Hessian--we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.
♻ ☆ Projection-Free Algorithms for Minimax Problems
This paper addresses constrained smooth saddle-point problems in settings where projection onto the feasible sets is computationally expensive. We bridge the gap between projection-based and projection-free optimization by introducing a unified dual dynamic smoothing framework that enables the design of efficient single-loop algorithms. Within this framework, we establish convergence results for nonconvex-concave and nonconvex-strongly concave settings. Furthermore, we show that this framework is naturally applicable to convex-concave problems. We propose and analyze three algorithmic variants based on the application of a linear minimization oracle over the minimization variable, the maximization variable, or both. Notably, our analysis yields anytime convergence guarantees without requiring a pre-specified iteration horizon. These results significantly narrow the performance gap between projection-free and projection-based methods for minimax optimization.
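The linear-minimization-oracle idea behind projection-free methods can be sketched with classical Frank-Wolfe over the simplex (an illustration of the oracle mechanism only, not the paper's minimax algorithm): the oracle returns a vertex, so no projection is ever computed.

```python
def lmo_simplex(grad):
    # Linear minimization over the probability simplex: the vertex
    # at the coordinate with the smallest gradient entry.
    i = min(range(len(grad)), key=lambda j: grad[j])
    e = [0.0] * len(grad)
    e[i] = 1.0
    return e

def frank_wolfe(grad_f, x, iters=2000):
    """Projection-free minimization over the simplex via the LMO,
    with the classical step size 2/(t+2)."""
    for t in range(iters):
        s = lmo_simplex(grad_f(x))
        gamma = 2.0 / (t + 2.0)
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

# minimize ||x - c||^2 over the simplex; c is itself feasible
c = [0.2, 0.5, 0.3]
x = frank_wolfe(lambda x: [2 * (xi - ci) for xi, ci in zip(x, c)],
                [1.0, 0.0, 0.0])
```

Iterates stay feasible by construction (convex combinations of vertices), which is the appeal when projections onto the feasible set are expensive.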
♻ ☆ Infinite-Horizon Optimal Control of Jump-Diffusion Models for Pollution-Dependent Disasters
This paper is devoted to developing a unified framework for stochastic growth models with environmental risk, in which rare but catastrophic shocks interact with capital accumulation and pollution. The analysis is based upon a general Poisson point process formulation, leading to non-local Hamilton-Jacobi-Bellman (HJB) equations that admit closed-form candidate solutions and yield a composite state variable capturing exposure to rare shocks. We consider cases where disaster risk is endogenized through a pollution-dependent intensity and, in the more general cases, it also accommodates state-dependent events of varying magnitude. Our formulation captures how environmental degradation amplifies macroeconomic vulnerability and strengthens incentives for abatement. From a technical perspective, it provides tractable jump-diffusion control problems whose HJB equation decomposes naturally into capital and pollution components under a power-type value function.
♻ ☆ Flatness-based control of a Timoshenko beam
The paper presents an approach to flatness-based control design for hyperbolic multi-input systems, building upon the hyperbolic controller form (HCF). The transformation into HCF yields a simplified system representation that considerably facilitates the design of state feedback controllers for trajectory tracking. The proposed concept is demonstrated for a Timoshenko beam and validated through numerical simulations, confirming trajectory tracking and closed-loop stability.
comment: Accepted at European Control Conference (ECC 2026)
♻ ☆ On the Effectiveness of Classical Regression Methods for Optimal Switching Problems
Simple regression methods provide robust, near-optimal solutions for optimal switching problems, including high-dimensional ones (up to 50 dimensions). While the theory requires solving intractable PDE systems, the Longstaff-Schwartz algorithm with classical regression methods achieves excellent switching decisions without extensive hyperparameter tuning. Testing linear models (OLS, Ridge, LASSO), tree-based methods (random forests, gradient boosting), $k$-nearest neighbors, and feedforward neural networks on four benchmark problems, we find that several simple methods maintain stable performance across diverse problem characteristics, outperforming the neural networks we tested against. In our comparison, $k$-NN regression performs consistently well with minimal hyperparameter tuning. We establish concentration bounds for this regressor and show that PCA enables $k$-NN to scale to high dimensions.
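A minimal k-NN regressor of the kind plugged into Longstaff-Schwartz for continuation-value estimates (a toy sketch, not the paper's implementation; the sample data is assumed):

```python
def knn_predict(X, y, x, k=3):
    """Average the targets of the k nearest training points
    (Euclidean distance) -- the nonparametric regressor used in
    place of a parametric basis."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    nearest = order[:k]
    return sum(y[i] for i in nearest) / k

# 1-D toy: samples of the hockey-stick payoff f(s) = max(s - 1, 0)
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5]]
y = [0.0, 0.0, 0.0, 0.5, 1.0, 1.5]
pred = knn_predict(X, y, [2.1], k=3)
```

The only tuning knob is k, which is consistent with the paper's observation that the method needs minimal hyperparameter tuning.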
♻ ☆ High Probability Complexity Bounds of Trust-Region Stochastic Sequential Quadratic Programming with Heavy-Tailed Noise
In this paper, we consider nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Stochastic Sequential Quadratic Programming (TR-SSQP) method and establish its high-probability iteration complexity bounds for identifying first- and second-order $ε$-stationary points. In our algorithm, we assume that exact objective values, gradients, and Hessians are not directly accessible but can be estimated via zeroth-, first-, and second-order probabilistic oracles. Compared to existing complexity studies of SSQP methods that rely on a zeroth-order oracle with sub-exponential tail noise (i.e., light-tailed) and focus mostly on first-order stationarity, our analysis accommodates biased (also referred to as irreducible in the literature) and heavy-tailed noise in the zeroth-order oracle, and significantly extends the analysis to second-order stationarity. We show that under heavy-tailed noise conditions, our SSQP method achieves the same high-probability first-order iteration complexity bounds as in the light-tailed noise setting, while further exhibiting promising second-order iteration complexity bounds. Specifically, the method identifies a first-order $ε$-stationary point in $\mathcal{O}(ε^{-2})$ iterations and a second-order $ε$-stationary point in $\mathcal{O}(ε^{-3})$ iterations with high probability, provided that $ε$ is lower bounded by a constant determined by the bias magnitude (i.e., the irreducible noise) in the estimation. We validate our theoretical findings and evaluate practical performance of our method on CUTEst benchmark test set.
comment: 66 pages, 7 figures
♻ ☆ Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch
Policy gradient methods are one of the most successful approaches for solving challenging reinforcement learning problems. Despite their empirical successes, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to the existence of a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that in the case of tabular parameterizations, the biased gradient induced by the mismatch still yields a valid first-order characterization of global optimality. Then, we extend this analysis to more general parameterizations by deriving explicit bounds on both the state distribution mismatch and the resulting gradient mismatch in episodic and continuing MDPs, which are shown to vanish at least linearly as the discount factor approaches one. Building on these bounds, we further establish guarantees for the biased policy gradient iterates, showing that they approach approximate stationary points with respect to the exact gradient, with asymptotic residuals depending on the discount factor. Our findings offer insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
♻ ☆ Situationally-Aware Dynamics Learning
Autonomous robots operating in complex, unstructured environments face significant challenges due to latent, unobserved factors that obscure their understanding of both their internal state and the external world. Addressing this challenge would enable robots to develop a more profound grasp of their operational context. To tackle this, we propose a novel framework for online learning of hidden state representations, with which the robots can adapt in real-time to uncertain and dynamic conditions that would otherwise be ambiguous and result in suboptimal or erroneous behaviors. Our approach is formalized as a Generalized Hidden Parameter Markov Decision Process, which explicitly models the influence of unobserved parameters on both transition dynamics and reward structures. Our core innovation lies in learning online the joint distribution of state transitions, which serves as an expressive representation of latent ego- and environmental-factors. This probabilistic approach supports the identification and adaptation to different operational situations, improving robustness and safety. Through a multivariate extension of Bayesian Online Changepoint Detection, our method segments changes in the underlying data generating process governing the robot's dynamics. The robot's transition model is then informed with a symbolic representation of the current situation derived from the joint distribution of latest state transitions, enabling adaptive and context-aware decision-making. To showcase the real-world effectiveness, we validate our approach in the challenging task of unstructured terrain navigation, where unmodeled and unmeasured terrain characteristics can significantly impact the robot's motion. Extensive experiments in both simulation and the real world reveal significant improvements in data efficiency, policy performance, and the emergence of safer, adaptive navigation strategies.
♻ ☆ Perturbation Duality for Robust and Distributionally Robust Optimization: Short and General Proofs
Duality is a foundational tool in robust and distributionally robust optimization (RO and DRO), underpinning both analytical insights and tractable reformulations. The prevailing approaches in the literature primarily rely on saddle-point arguments, Lagrangian techniques, and conic duality. In contrast, this paper applies perturbation duality in the sense of Fenchel--Rockafellar convex analysis and demonstrates its effectiveness as a general and unifying methodology for deriving dual formulations in RO and DRO. We first apply perturbation duality to a recently proposed DRO framework that unifies phi-divergence and Wasserstein ambiguity sets through optimal transport with conditional moment constraints. We establish the associated dual representation without imposing compactness assumptions previously conjectured to be necessary, instead introducing alternative conditions motivated by perturbation analysis and leveraging the Interchangeability Principle. We then revisit the concept of robust duality -- commonly described as ``primal-worst equals dual-best'' -- and show that perturbation-based formulations provide a unified and transparent characterization of this principle. In particular, we develop a bifunction-based representation that encompasses existing formulations in the literature and yields concise and general proofs, substantially simplifying recent results. This work positions perturbation duality as a versatile and underutilized framework for RO and DRO, offering both conceptual unification and technical generality across a broad class of models.
comment: 26 pages
♻ ☆ High-probability Convergence Guarantees of Decentralized SGD
Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching, or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the MGF of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.
comment: 49 pages, 2 figures
Robotics 29
☆ Hierarchical Motion Planning and Control under Unknown Nonlinear Dynamics via Predicted Reachability
Autonomous motion planning under unknown nonlinear dynamics requires learning system properties while navigating toward a target. In this work, we develop a hierarchical planning-control framework that enables online motion synthesis with limited prior system knowledge. The state space is partitioned into polytopes, and the unknown nonlinear system is approximated by a piecewise-affine (PWA) model. The local affine models are identified once the agent enters the corresponding polytopes. To reduce computational complexity, we introduce a non-uniform adaptive state space partition strategy that refines the partition only in task-relevant regions. The resulting PWA system is abstracted into a directed weighted graph, whose edge existence is incrementally verified using reach control theory and predictive reachability conditions. Certified edges are weighted using provable time-to-reach bounds, while uncertain edges are assigned information-theoretic weights to guide exploration. The graph is updated online as new data becomes available, and high-level planning is performed by graph search, while low-level affine feedback controllers are synthesized to execute the plan. Furthermore, the conditions of classical reach control theory are often difficult to satisfy in underactuated settings. We therefore introduce relaxed reachability conditions to extend the framework to such systems. Simulations demonstrate effective exploration-exploitation trade-offs with formal reachability guarantees.
☆ Play-Testing REMind: Evaluating an Educational Robot-Mediated Role-Play Game
This paper presents REMind, an innovative educational robot-mediated role-play game designed to support anti-bullying bystander intervention among children. REMind invites players to observe a bullying scenario enacted by social robots, reflect on the perspectives of the characters, and rehearse defending strategies by puppeteering a robotic avatar. We evaluated REMind through a mixed-methods play-testing study with 18 children aged 9--10. The findings suggest that the experience supported key learning goals related to self-efficacy, perspective-taking, understanding outcomes of defending, and intervention strategies. These results highlight the promise of Robot-Mediated Applied Drama (RMAD) as a novel pedagogical framework to support Social-Emotional Learning.
comment: This work has been submitted to the IEEE for possible publication
☆ DreamControl-v2: Simpler and Scalable Autonomous Humanoid Skills via Trainable Guided Diffusion Priors
Developing robust autonomous loco-manipulation skills for humanoids remains an open problem in robotics. While RL has been applied successfully to legged locomotion, applying it to complex, interaction-rich manipulation tasks is harder given the long-horizon planning challenges these tasks pose. A recent approach along these lines is DreamControl, which addresses these issues by leveraging off-the-shelf human motion diffusion models as a generative prior to guide RL policies during training. In this paper, we investigate the impact of DreamControl's motion prior and propose an improved framework that trains a guided diffusion model directly in the humanoid robot's motion space, aggregating diverse human and robot datasets into a unified embodiment space. We demonstrate that our approach captures a wider range of skills due to the larger training data mixture and establishes a more automated pipeline by removing the need for manual filtering interventions. Furthermore, we show that scaling the generation of reference trajectories is important for achieving robust downstream RL policies. We validate our approach through extensive experiments in simulation and on a real Unitree-G1.
☆ Neural-Assisted in-Motion Self-Heading Alignment
Autonomous platforms operating in the oceans require accurate navigation to successfully complete their mission. In this regard, the initial heading estimation accuracy and the time required to achieve it play a critical role. The initial heading is traditionally estimated by model-based approaches employing orientation decomposition. However, methods such as the dual vector decomposition and optimized attitude decomposition achieve satisfactory heading accuracy only after long alignment times. To allow rapid and accurate initial heading estimation, we propose an end-to-end, model-free, neural-assisted framework using the same inputs as the model-based approaches. Our proposed approach was trained and evaluated on a real-world dataset captured by an autonomous surface vehicle. Our approach shows a significant accuracy improvement over the model-based approaches, achieving an average absolute error improvement of 53%. Additionally, our proposed approach was able to reduce the alignment time by up to 67%. Thus, by employing our proposed approach, the reduction in alignment time and improved accuracy allow for a shorter deployment time of an autonomous platform and increased navigation accuracy during the mission.
comment: 12 Pages, 10 Figures, 6 Tables
☆ Long-Horizon Geometry-Aware Navigation among Polytopes via MILP-MPC and Minkowski-Based CBFs
Autonomous navigation in complex, non-convex environments remains challenging when robot dynamics, control limits, and exact robot geometry must all be taken into account. In this paper, we propose a hierarchical planning and control framework that bridges long-horizon guidance and geometry-aware safety guarantees for a polytopic robot navigating among polytopic obstacles. At the high level, Mixed-Integer Linear Programming (MILP) is embedded within a Model Predictive Control (MPC) framework to generate a nominal trajectory around polytopic obstacles while modeling the robot as a point mass for computational tractability. At the low level, we employ a control barrier function (CBF) based on the exact signed distance in the Minkowski-difference space as a safety filter to explicitly enforce the geometric constraints of the robot shape, and further extend its formulation to a high-order CBF (HOCBF). We demonstrate the proposed framework in U-shaped and maze-like environments under single- and double-integrator dynamics. The results show that the proposed architecture mitigates the topology-induced local-minimum behavior of purely reactive CBF-based navigation while enabling safe, real-time, geometry-aware navigation.
comment: 8 pages, 3 figures
☆ HapCompass: A Rotational Haptic Device for Contact-Rich Robotic Teleoperation ICRA
The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass's ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: https://ripl.github.io/HapCompass/.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 5 figures. Project page: https://ripl.github.io/HapCompass/
☆ Beyond Symbolic Control: Societal Consequences of AI-Driven Workforce Displacement and the Imperative for Genuine Human Oversight Architectures
The accelerating displacement of human labor by artificial intelligence (AI) and robotic systems represents a structural transformation whose societal consequences extend far beyond conventional labor market analysis. This paper presents a systematic multi-domain examination of the likely effects on economic structure, psychological well-being, political stability, education, healthcare, and geopolitical order. We identify a critical and underexamined dimension of this transition: the governance gap between nominal human oversight of AI systems -- where humans occupy positions of formal authority over AI decisions -- and genuine human oversight, where those humans possess the cognitive access, technical capability, and institutional authority to meaningfully understand, evaluate, and override AI outputs. We argue that this distinction, largely absent from current governance frameworks including the EU AI Act and NIST AI Risk Management Framework 1.0, represents the primary architectural failure mode in deployed AI governance. The societal consequences of labor displacement intensify this problem by concentrating consequential AI decision-making among an increasingly narrow class of technical and capital actors. We propose five architectural requirements for genuine human oversight systems and characterize the governance window -- estimated at 10-15 years -- before current deployment trajectories risk path-dependent social, economic, and institutional lock-in.
comment: 23 pages, 23 references
☆ Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models
This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high-level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This integration allows robots to understand and carry out complex, human-like instructions while adapting to changing environments in real time. The framework is tested in a PyBullet-based simulation environment using the Franka Emika Panda robotic arm, with various manipulation scenarios as benchmarks. The results show a 33.5% decrease in task completion time and enhancements of 18.1% and 36.4% in accuracy and adaptability, respectively, when compared to systems that use only RL. These results underscore the potential of LLM-enhanced robotic systems for practical applications, making them more efficient, adaptable, and capable of interacting with humans. Future research will aim to explore sim-to-real transfer, scalability, and multi-robot systems to further broaden the framework's applicability.
☆ Passive iFIR filters for data-driven velocity control in robotics
We present a passive, data-driven velocity control method for nonlinear robotic manipulators that achieves better tracking performance than optimized PID with comparable design complexity. Using only three minutes of probing data, a VRFT-based design identifies passive iFIR controllers that (i) preserve closed-loop stability via passivity constraints and (ii) outperform a VRFT-tuned PID baseline on the Franka Research 3 robot in both joint-space and Cartesian-space velocity control, achieving up to a 74.5% reduction in tracking error for the Cartesian velocity tracking experiment with the most demanding reference model. When the robot end-effector dynamics change, the controller can be re-learned from new data, regaining nominal performance. This study bridges learning-based control and stability-guaranteed design: passive iFIR learns from data while retaining passivity-based stability guarantees, unlike many learning-based approaches.
☆ DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.
comment: Project page: https://xpeng-robotics.github.io/dial
☆ Reconfiguration of supernumerary robotic limbs for human augmentation
Wearable robots aim to seamlessly adapt to humans and their environment with personalized interactions. Existing supernumerary robotic limbs (SRLs), which enhance the physical capabilities of humans with additional extremities, have thus far been developed primarily for task-specific applications in structured industrial settings, limiting their adaptability to dynamic and unstructured environments. Here, we introduce a novel reconfigurable SRL framework grounded in a quantitative analysis of human augmentation to guide the development of more adaptable SRLs for diverse scenarios. This framework captures how SRL configuration shapes workspace extension and human-robot collaboration. We define human augmentation ratios to evaluate collaborative, visible extended, and non-visible extended workspaces, enabling systematic selection of SRL placement, morphology, and autonomy for a given task. Using these metrics, we demonstrate how quantitative augmentation analysis can guide the reconfiguration and control of SRLs to better match task requirements. We validate the proposed approach through experiments with a reconfigurable SRL composed of origami-inspired modular elements. Our results suggest that reconfigurable SRLs, informed by quantitative human augmentation analysis, offer a new perspective for providing adaptable human augmentation and assistance in everyday environments.
☆ SafeDMPs: Integrating Formal Safety with DMPs for Adaptive HRI
Robots operating in human-centric environments must be both robust to disturbances and provably safe from collisions. Achieving these properties simultaneously and efficiently remains a central challenge. While Dynamic Movement Primitives (DMPs) offer inherent stability and generalization from single demonstrations, they lack formal safety guarantees. Conversely, formal methods like Control Barrier Functions (CBFs) provide provable safety but often rely on computationally expensive, real-time optimization, hindering their use in high-frequency control. This paper introduces SafeDMPs, a novel framework that resolves this trade-off. We integrate the closed-form efficiency and dynamic robustness of DMPs with a provably safe, non-optimization-based control law derived from Spatio-Temporal Tubes (STTs). This synergy allows us to generate motions that are not only robust to perturbations and adaptable to new goals, but also guaranteed to avoid static and dynamic obstacles. Our approach achieves a closed-form solution for a problem that traditionally requires online optimization. Experimental results on a 7-DOF robot manipulator demonstrate that SafeDMPs is orders of magnitude faster and more accurate than optimization-based baselines, making it an ideal solution for real-time, safe, and collaborative robotics.
comment: 8 pages, 8 figures and 1 table
☆ Design and Aerodynamic Modeling of MetaMorpher: A Hybrid Rotary and Fixed-Wing Morphing UAV
In this paper, we present a generalized, comprehensive nonlinear mathematical model and conceptual design for the MetaMorpher, a metamorphic Unmanned Aerial Vehicle (UAV) designed to bridge the gap between vertical takeoff and landing agility and fixed-wing cruising efficiency. Building on the successful design of the spincopter platform, this work introduces a simplified mechanical architecture using lightweight materials and a novel wing-folding strategy. Unlike traditional rigid-body approximations, we derive a nonlinear flight dynamics model that enables arbitrary force distributions across a segmented wing structure. This modularity allows for testing different airfoils, mass distributions, and chord lengths in a single environment. As part of this work, various flight modes were specifically tested and analyzed in the Simulink environment. The results show that the model behaves predictably under different structural configurations, demonstrating its reliability as a tool for rapid design evaluation.
comment: 8 pages, 12 figures
☆ Semantic Zone-Based Map Management for Stable AI-Integrated Mobile Robots
Recent advances in large AI models (VLMs and LLMs) and joint use of the 3D dense maps, enable mobile robots to provide more powerful and interactive services grounded in rich spatial context. However, deploying both heavy AI models and dense maps on edge robots is challenging under strict memory budgets. When the memory budget is exceeded, required keyframes may not be loaded in time, which can degrade the stability of position estimation and interfere with model performance. We propose a semantic zone-based map management approach to stabilize dense-map utilization under memory constraints. We associate keyframes with semantic indoor regions (e.g., rooms and corridors), and keyframe management at the semantic zone level prioritizes spatially relevant map content while respecting memory constraints. This reduces keyframe loading and unloading frequency and memory usage. We evaluate the proposed approach in large-scale simulated indoor environments and on an NVIDIA Jetson Orin Nano under concurrent SLAM-VLM execution. With Qwen3.5:0.8b, the proposed method improves throughput by 3.3 tokens/s and reduces latency by 21.7% relative to a geometric map-management strategy. Furthermore, while the geometric strategy suffers from out-of-memory failures and stalled execution under memory pressure, the proposed method eliminates both issues, preserving localization stability and enabling robust VLM operation. These results demonstrate that the proposed approach enables efficient dense map utilization for memory-constrained, AI-integrated mobile robots. Code is available at: https://github.com/huichangs/rtabmap/tree/segment
♻ ☆ Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards
Real-world multi-agent systems may require ad hoc teaming, where an agent must coordinate with other previously unseen teammates to solve a task in a zero-shot manner. Prior work often either selects a pretrained policy based on an inferred model of the new teammates or pretrains a single policy that is robust to potential teammates. Instead, we propose to leverage all pretrained policies in a zero-shot transfer setting. We formalize this problem as an ad hoc multi-agent Markov decision process and present a solution that uses two key ideas, generalized policy improvement and difference rewards, for efficient and effective knowledge transfer between different teams. We empirically demonstrate that our algorithm, Generalized Policy improvement for Ad hoc Teaming (GPAT), successfully enables zero-shot transfer to new teams in three simulated environments: cooperative foraging, predator-prey, and Overcooked. We also demonstrate our algorithm in a real-world multi-robot setting.
comment: 10 pages, 8 figures. To appear in proceedings of 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
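The generalized policy improvement step the GPAT abstract builds on has a simple generic form: act greedily with respect to the pointwise maximum over the Q-functions of all pretrained policies. Below is a minimal tabular sketch of that generic idea, not the GPAT algorithm itself; the toy Q-tables and the single-state setup are illustrative assumptions.

```python
import numpy as np

def gpi_action(q_values_per_policy, state):
    """Generalized Policy Improvement: choose the action that is best
    under the pointwise maximum of all pretrained policies' Q-functions."""
    # Each entry of q_values_per_policy has shape (n_states, n_actions).
    stacked = np.stack([q[state] for q in q_values_per_policy])  # (n_policies, n_actions)
    best_per_action = stacked.max(axis=0)  # ceiling over policies, per action
    return int(best_per_action.argmax())

# Toy example: two pretrained policies over one state and three actions,
# each good at a different action (hypothetical values).
q_a = np.array([[5.0, 0.0, 1.0]])
q_b = np.array([[0.0, 2.0, 1.0]])
print(gpi_action([q_a, q_b], state=0))  # -> 0
```

The composed policy is guaranteed to be at least as good as each constituent policy in every state, which is what makes GPI attractive for zero-shot transfer across teams.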
♻ ☆ SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models CVPR 2026
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io
comment: Accepted to CVPR 2026; camera-ready version
♻ ☆ Interactive Force-Impedance Control
Human collaboration with robots requires flexible role adaptation, enabling the robot to switch between an active leader and a passive follower. Effective role switching depends on accurately estimating human intentions, which is typically achieved through external force analysis, nominal robot dynamics, or data-driven approaches. However, these methods are primarily effective in contact-sparse environments. When robots under hybrid or unified force-impedance control physically interact with active humans or non-passive environments, the robotic system may lose passivity and thus compromise safety. To address this challenge, this paper proposes a unified Interactive Force-Impedance Control (IFIC) framework that adapts to interaction power flow, ensuring safe and effortless interaction in contact-rich environments. The proposed control architecture is formulated within a port-Hamiltonian framework, incorporating both interaction and task control ports, thereby guaranteeing autonomous system passivity. Experiments in both rigid and soft contact scenarios demonstrate that IFIC ensures stable collaboration under active human interaction and reduces contact impact forces and interaction force oscillations.
♻ ☆ LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
comment: Accepted for publication in IEEE Access (DOI: 10.1109/ACCESS.2026.3678816). This is the author's version which has not been fully edited and content may change prior to final publication. 20 pages, 15 figures, 18 tables. The maneuver telemetry datasets are available in the GitHub repository under https://github.com/kdjebko/lelar-in-orbit-data
♻ ☆ Bridging the Basilisk Astrodynamics Framework with ROS 2 for Modular Spacecraft Simulation and Hardware Integration
Integrating high-fidelity spacecraft simulators with modular robotics frameworks remains a challenge for autonomy development. This paper presents a lightweight, open-source communication bridge between the Basilisk astrodynamics simulator and the Robot Operating System 2 (ROS 2), enabling real-time, bidirectional data exchange for spacecraft control. The bridge requires no changes to Basilisk's core and integrates seamlessly with ROS 2 nodes. We demonstrate its use in a leader-follower formation flying scenario using nonlinear model predictive control, deployed identically in both simulation and on the ATMOS planar microgravity testbed. This setup supports rapid development, hardware-in-the-loop testing, and seamless transition from simulation to hardware. The bridge offers a flexible and scalable platform for modular spacecraft autonomy and reproducible research workflows.
comment: Presented at the International Conference on Space Robotics (iSpaRo) 2025
♻ ☆ DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
Vision--Language--Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at https://chris1220313648.github.io/DFM-VLA/
♻ ☆ IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning
Although robot-to-robot (R2R) communication improves indoor scene understanding beyond what a single robot can achieve, R2R alone cannot overcome partial observability without substantial exploration overhead or scaling team size. In contrast, many indoor environments already include low-cost Internet of Things (IoT) sensors (e.g., cameras) that provide persistent, building-wide context beyond onboard perception. We therefore introduce IndoorR2X, the first benchmark and simulation framework for Large Language Model (LLM)-driven multi-robot task planning with Robot-to-Everything (R2X) perception and communication in indoor environments. IndoorR2X integrates observations from mobile robots and static IoT devices to construct a global semantic state that supports scalable scene understanding, reduces redundant exploration, and enables high-level coordination through LLM-based planning. IndoorR2X provides configurable simulation environments, sensor layouts, robot teams, and task suites to systematically evaluate high-level semantic coordination strategies. Extensive experiments across diverse settings demonstrate that IoT-augmented world modeling improves multi-robot efficiency and reliability, and we highlight key insights and failure modes for advancing LLM-based collaboration between robot teams and indoor IoT sensors. See our project website: https://fandulu.github.io/IndoorR2X_project_page/.
♻ ☆ "You've got a friend in me": Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
♻ ☆ Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning
Sequential decision making using Markov Decision Processes underpins many real-world applications. Both model-based and model-free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints, often conflicting objectives, that can lead to unstable min/max, adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state-action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet, most reachability-based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, first, we define a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks and a real-world maritime navigation task demonstrate that our method matches or outperforms state-of-the-art baselines while maintaining safety.
comment: Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)
♻ ☆ Real-Time Operator Takeover for Visuomotor Diffusion Policy Training
We present a Real-Time Operator Takeover (RTOT) paradigm that enables operators to seamlessly take control of a live visuomotor diffusion policy, guiding the system back to desirable states or providing targeted corrective demonstrations. Within this framework, the operator can intervene to correct the robot's motion, after which control is smoothly returned to the policy until further intervention is needed. We evaluate the takeover framework on three tasks spanning rigid, deformable, and granular objects, and show that incorporating targeted takeover demonstrations significantly improves policy performance compared with training on an equivalent number of initial demonstrations alone. Additionally, we provide an in-depth analysis of the Mahalanobis distance as a signal for automatically identifying undesirable or out-of-distribution states during execution. Supporting materials, including videos of the initial and takeover demonstrations and all experiments, are available on the project website: https://operator-takeover.github.io/
♻ ☆ MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation
Generative robot policies such as Flow Matching offer flexible, multi-modal policy learning but are sample-inefficient. Although object-centric policies improve sample efficiency, they do not fully resolve this limitation. In this work, we propose Multi-Stream Generative Policy (MSG), an inference-time composition framework that trains multiple object-centric policies and combines them at inference to improve generalization and sample efficiency. MSG is model-agnostic and inference-only, hence widely applicable to various generative policies and training paradigms. We perform extensive experiments both in simulation and on a real robot, demonstrating that our approach learns high-quality generative policies from as few as five demonstrations, resulting in a 95% reduction in demonstrations, and improves policy performance by 89% compared to single-stream approaches. Furthermore, we present comprehensive ablation studies on various composition strategies and provide practical recommendations for deployment. Finally, MSG enables zero-shot object instance transfer. We make our code publicly available at https://msg.cs.uni-freiburg.de.
♻ ☆ UniLGL: Learning Uniform Place Recognition for FOV-limited/Panoramic LiDAR Global Localization
Existing LGL methods typically consider only partial information (e.g., geometric features) from LiDAR observations or are designed for homogeneous LiDAR sensors, overlooking the uniformity in LGL. In this work, a uniform LGL method is proposed, termed UniLGL, which simultaneously achieves spatial and material uniformity, as well as sensor-type uniformity. The key idea of the proposed method is to encode the complete point cloud, which contains both geometric and material information, into a pair of BEV images (i.e., a spatial BEV image and an intensity BEV image). An end-to-end multi-BEV fusion network is designed to extract uniform features, equipping UniLGL with spatial and material uniformity. To ensure robust LGL across heterogeneous LiDAR sensors, a viewpoint invariance hypothesis is introduced, which replaces the conventional translation equivariance assumption commonly used in existing LPR networks and supervises UniLGL to achieve sensor-type uniformity in both global descriptors and local feature representations. Finally, based on the mapping between local features on the 2D BEV image and the point cloud, a robust global pose estimator is derived that determines the global minimum of the global pose on SE(3) without requiring additional registration. To validate the effectiveness of the proposed uniform LGL, extensive benchmarks are conducted in real-world environments, and the results show that the proposed UniLGL is demonstrably competitive compared to other State-of-the-Art LGL methods. Furthermore, UniLGL has been deployed on diverse platforms, including full-size trucks and agile Micro Aerial Vehicles (MAVs), to enable high-precision localization and mapping as well as multi-MAV collaborative exploration in port and forest environments, demonstrating the applicability of UniLGL in industrial and field scenarios.
♻ ☆ Detection of Adversarial Attacks in Robotic Perception
Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.
comment: 9 pages, 6 figures. Accepted and presented at STE 2025, Transilvania University of Brasov, Romania
♻ ☆ Context-Triggered Contingency Games for Strategic Multi-Agent Interaction
We address the challenge of reliable and efficient interaction in autonomous multi-agent systems, where agents must balance long-term strategic objectives with short-term dynamic adaptation. We propose context-triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two-layered architecture leverages strategy templates to guarantee satisfaction of high-level objectives, while a new factor-graph-based solver enables scalable, real-time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi-agent interaction.
♻ ☆ TUGS: Physics-based Compact Representation of Underwater Scenes by Tensorized Gaussian
Underwater 3D scene reconstruction is crucial for multimedia applications in adverse environments, such as underwater robotic perception and navigation. However, the complexity of interactions between light propagation, water medium, and object surfaces poses significant difficulties for existing methods in accurately simulating their interplay. Additionally, expensive training and rendering costs limit their practical application. Therefore, we propose Tensorized Underwater Gaussian Splatting (TUGS), a compact underwater 3D representation based on physical modeling of complex underwater light fields. TUGS includes a physics-based underwater Adaptive Medium Estimation (AME) module, enabling accurate simulation of both light attenuation and backscatter effects in underwater environments, and introduces Tensorized Densification Strategies (TDS) to efficiently refine the tensorized representation during optimization. TUGS is able to render high-quality underwater images with faster rendering speeds and less memory usage. Extensive experiments on real-world underwater datasets have demonstrated that TUGS can efficiently achieve superior reconstruction quality using a limited number of parameters. The code is available at https://liamlian0727.github.io/TUGS
Optimization and Control 56
☆ Quantification of ergodicity for Hamilton--Jacobi equations in a dynamic random environment
We study quantitative large-time averages for Hamilton--Jacobi equations in a dynamic random environment that is stationary ergodic and has unit-range dependence in time. Our motivation comes from stochastic growth models related to the tensionless (inviscid) KPZ equation, which can be formulated as a Hamilton--Jacobi equation with random forcing. Understanding the large-time averaged behavior of solutions is closely connected to fundamental questions about fluctuations and scaling in such growth processes.
comment: 51 pages, 3 figures
☆ Set-Based Value Function Characterization and Neural Approximation of Stabilization Domains for Input-Constrained Discrete-Time Systems
Analyzing nonlinear systems with stabilizable controlled invariant sets (CISs) requires accurate estimation of their domains of stabilization (DOS) together with associated stabilizing controllers. Despite extensive research, estimating DOSs for general nonlinear systems remains challenging due to fundamental theoretical and computational limitations. In this paper, we propose a novel framework for estimating DOSs for controlled input-constrained discrete-time systems. The DOS is characterized via newly introduced value functions defined on metric spaces of compact sets. We establish the fundamental properties of these value functions and derive the associated Bellman-type (Zubov-type) functional equations. Building on this characterization, we develop a physics-informed neural network (NN) framework that learns the value functions by embedding the derived functional equations directly into the training process. The proposed methodology is demonstrated through two numerical examples, illustrating its ability to accurately estimate DOSs and synthesize stabilizing controllers from the learned value functions.
☆ A Gradient Sampling Algorithm for Noisy Nonsmooth Optimization
An algorithm is proposed, analyzed, and tested for minimizing locally Lipschitz objective functions that may be nonconvex and/or nonsmooth. The algorithm, which is built upon the gradient-sampling methodology, is designed specifically for cases when objective function and generalized gradient values might be subject to bounded uncontrollable errors. Similarly to state-of-the-art guarantees for noisy smooth optimization of this kind, it is proved for the algorithm that, with probability one, either the sequence of objective function values will decrease without bound or the algorithm will generate an iterate at which a measure of stationarity is below a threshold that depends proportionally on the error bounds for the objective function and generalized gradient values. The results of numerical experiments are presented, which show that the algorithm can indeed perform approximate optimization robustly despite errors in objective and generalized gradient values.
☆ Learning to Shuffle: Block Reshuffling and Reversal Schemes for Stochastic Optimization
Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random reshuffling is known to improve optimization constants relative to cyclic and shuffle-once schemes. However, existing theory offers limited guidance on how to design new data-ordering schemes that further improve optimization constants or stability beyond random reshuffling. In this paper, we design a pipeline using a large language model (LLM)-guided program evolution framework to discover an effective shuffling rule for without-replacement SGD. Abstracting from this instance, we identify two fundamental structural components: block reshuffling and paired reversal. We analyze these components separately and show that block reshuffling strictly reduces prefix-gradient variance constants within the unified shuffling framework, yielding provable improvements over random reshuffling under mild conditions. Separately, we show that paired reversal symmetrizes the epoch map and cancels the leading order-dependent second-order term, reducing order sensitivity from quadratic to cubic in the step size. Numerical experiments with the discovered algorithm validate the theory and demonstrate consistent gains over standard shuffling schemes across convex and nonconvex benchmarks.
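The two structural components can be illustrated with a small sketch. The exact rules discovered by the LLM-guided search are not reproduced here, so the fixed-size block partitioning and the pairing of consecutive epochs below are illustrative assumptions, not the paper's discovered scheme:

```python
import numpy as np

def block_reshuffle(n, block_size, rng):
    """Randomly permute blocks of consecutive indices, keeping within-block order."""
    blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[b] for b in order])

def paired_reversal_epochs(n, block_size, num_epochs, seed=0):
    """Yield per-epoch index orders; each odd epoch replays the previous order reversed,
    which symmetrizes the two-epoch map."""
    rng = np.random.default_rng(seed)
    perm = None
    for e in range(num_epochs):
        if e % 2 == 0:
            perm = block_reshuffle(n, block_size, rng)
            yield perm
        else:
            yield perm[::-1]
```

Each yielded array is a permutation of the dataset indices that a without-replacement SGD loop would traverse in one epoch.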
☆ Dissipation-assisted stabilization of periodic orbits via actuated exterior impacts in hybrid mechanical systems with symmetry
Impulsive mechanical systems exhibit discontinuous jumps in their state, and when such jumps are triggered by spatial events, the geometry of the impact surface carries information about the controllability of the hybrid dynamics. For mechanical systems defined on principal $G$-bundles, two qualitatively distinct types of impacts arise: interior impacts, associated with events on the shape space, and exterior impacts, associated with events on the fibers. A key distinction is that interior impacts preserve the mechanical connection, whereas exterior impacts generally do not. In this paper, we exploit this distinction by allowing actuation through exterior impacts. We study the pendulum-on-a-cart system, derive controlled reset laws induced by moving-wall impacts, and analyze the resulting periodic motions. Our results show that reset action alone does not provide a convincing stabilizing regime, whereas the addition of dissipation in the continuous flow yields exponentially stable periodic behavior for suitable feedback gains.
☆ Deception in Linear-Quadratic Control
Systems operating in adversarial environments may inadvertently leak sensitive information to adversaries. To address this challenge, we revisit the linear-quadratic control framework and introduce deception to actively mislead adversaries. Specifically, we consider a blue-team agent, observed by a red-team agent, that seeks to minimize a quadratic cost while introducing perturbations to its trajectories over time. These perturbations are designed to corrupt the red team's observations and, consequently, any downstream inferences, while remaining undetected by a red team using sequential hypothesis testing. We implement this idea by augmenting the blue team's quadratic cost with a likelihood ratio statistic. Under this augmented control problem, we derive a semi-explicit solution for the optimal deceptive control law and establish corresponding well-posedness results. In addition, we provide both numerical approximations and analytical bounds for the probability that the red team detects the blue team's deceptive strategies. Numerical results demonstrate the effectiveness of the proposed framework in deceiving the red team while remaining undetected with probability near 1.
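The red team's sequential hypothesis test can be sketched as a standard Wald SPRT on scalar Gaussian observations; the means, variance, and error rates below are hypothetical stand-ins for the trajectory statistics in the paper, not its actual model:

```python
import numpy as np

def sprt_detect(obs, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald's SPRT: H0 obs ~ N(mu0, sigma^2) (nominal) vs H1 ~ N(mu1, sigma^2) (deceptive).
    Returns the decision and the number of samples used."""
    a = np.log(beta / (1 - alpha))       # accept-H0 threshold
    b = np.log((1 - beta) / alpha)       # accept-H1 (detection) threshold
    llr = np.cumsum((obs - mu0)**2 - (obs - mu1)**2) / (2 * sigma**2)
    for t, s in enumerate(llr):
        if s >= b:
            return "H1", t + 1
        if s <= a:
            return "H0", t + 1
    return "undecided", len(obs)
```

The blue team's goal, in this picture, is to shape its perturbations so the cumulative log-likelihood ratio stays between the two thresholds.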
☆ Risk-averse optimization under distributional uncertainty with Rockafellian relaxation
A framework for risk-averse optimization problems is introduced that is resilient to ambiguities in the true form of the underlying probability distribution. The focus is on problems with partial differential equations (PDEs) as constraints, although the formulation is more broadly applicable. The framework is based on combining risk measures with problem relaxation techniques, and it builds off of previous advances for risk-neutral problems. This work advances the existing theory with strengthened $\Gamma$-convergence results, novel existence results and first-order optimality criteria. In particular, the theoretical approach naturally accommodates infinite-dimensional probability spaces; no finite-dimensional noise assumption is needed. The framework blends aspects of both distributionally robust optimization (DRO) and distributionally optimistic optimization (DOO) approaches. The DRO aspect facilitates strong out-of-sample performance, while the DOO aspect takes care of adversarial and outlier data, as illustrated with numerical examples.
☆ Parametric Region Search: A Mixed-Integer Bilevel Optimization Problem Primal Heuristic
Bilevel optimization is a mathematical modeling formulation for hierarchical systems and two-player interactions, with wide-ranging applications in environmental, energy, and control engineering. Despite its utility, the mixed-integer bilevel optimization (MIBO) problem is exceptionally challenging to solve. While numerous exact and metaheuristic methods exist, the development of specialized primal heuristics for MIBO, aimed at quickly identifying high-quality feasible solutions, remains an underexplored area. This paper introduces the Parametric Region Search (PRS), a new primal heuristic for MIBO. The PRS method leverages insights from multi-parametric optimization by iteratively exploring regions defined by the lower-level problem's critical regions. We formally define the MIBO structure and the necessary parametric region formulations, and then detail the proposed heuristic's initialization and iterative search mechanism. Computational results demonstrate that the PRS heuristic consistently locates high-quality primal solutions compared to established derivative-free metaheuristics, including DOMINO-COBYLA and DOMINO-ISRES. Furthermore, we illustrate how the PRS can be effectively integrated with other heuristics like DOMINO-COBYLA to enhance the overall solution discovery process for MIBO.
☆ On optimal dividend and capital injection strategies for Markov additive processes
We study de Finetti's optimal dividend problem, including capital injections, under the assumption that uncontrolled assets behave as the additive component of a Markov additive process. We assume that capital injections can be received at any time. Meanwhile, we primarily address cases in which dividends can only be paid at specific, discrete times, and we establish the necessary and sufficient conditions for the optimality of the strategies. Additionally, we prove the optimality of certain Markov-modulated periodic-classical barrier strategies and cover cases where dividends can be paid at any time via approximation. A key feature of this research is our use of a significantly more general Markov additive process than in prior studies, which leads to different proof approaches.
comment: 43 pages
☆ Sparse Copositive Polynomial Optimization
This paper studies the copositive optimization problem whose objective is a sparse polynomial, with linear constraints over the nonnegative orthant. We propose sparse Moment-SOS relaxations to solve it. Necessary and sufficient conditions are shown for these relaxations to be tight. In particular, we prove they are tight under the cop-SOS convexity assumption. Compared to the traditional dense ones, the sparse Moment-SOS relaxations are more computationally efficient. Numerical experiments are given to show the efficiency.
comment: 21 pages
☆ Stratified adaptive sampling for derivative-free stochastic trust-region optimization
There is emerging evidence that trust-region (TR) algorithms are very effective at solving derivative-free nonconvex stochastic optimization problems in which the objective function is a Monte Carlo (MC) estimate. A recent strand of methodologies adaptively adjusts the sample size of the MC estimates by keeping the estimation error below a measure of stationarity induced from the TR radius. In this work we explore stratified adaptive sampling strategies to equip the TR framework with accurate estimates of the objective function, thus optimizing the required number of MC samples to reach a given $\varepsilon$-accuracy of the solution. We prove a reduced sample complexity, confirm superior efficiency via numerical tests and applications, and explore inexpensive implementations in high dimension.
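The variance benefit of stratification can be seen on a toy Monte Carlo estimate. This is a generic illustration with proportional allocation over equal-width strata, not the paper's adaptive, TR-coupled allocation:

```python
import numpy as np

def plain_mc(f, n, rng):
    """Crude Monte Carlo estimate of E[f(U)], U ~ Uniform(0, 1)."""
    return f(rng.random(n)).mean()

def stratified_mc(f, n, k, rng):
    """Proportional allocation: n/k uniform draws in each of k equal-width strata."""
    m = n // k
    u = (np.arange(k)[:, None] + rng.random((k, m))) / k   # stratum i covers [i/k, (i+1)/k)
    return f(u).mean()

rng = np.random.default_rng(0)
f = lambda u: u**2                                          # true mean is 1/3
est_plain = [plain_mc(f, 100, rng) for _ in range(500)]
est_strat = [stratified_mc(f, 100, 10, rng) for _ in range(500)]
# Same budget of 100 samples per estimate, but the stratified estimator
# eliminates the between-strata component of the variance.
```

Fewer samples then suffice for a target accuracy, which is the mechanism the sample-complexity reduction exploits.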
☆ Contracting Neural Networks: Sharp LMI Conditions with Applications to Integral Control and Deep Learning
This paper studies contractivity of firing-rate and Hopfield recurrent neural networks. We derive sharp LMI conditions on the synaptic matrices that characterize contractivity of both architectures, for activation functions that are either non-expansive or monotone non-expansive, in both continuous and discrete time. We establish structural relationships among these conditions, including connections to Schur diagonal stability and the recovery of optimal contraction rates for symmetric synaptic matrices. We demonstrate the utility of these results through two applications. First, we develop an LMI-based design procedure for low-gain integral controllers enabling reference tracking in contracting firing rate networks. Second, we provide an exact parameterization of weight matrices that guarantee contraction and use it to improve the expressivity of Implicit Neural Networks, achieving competitive performance on image classification benchmarks with fewer parameters.
comment: Submitted to CDC 2026
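A classical sufficient (but generally not sharp) $\ell_2$ test for a firing-rate network $\dot{x} = -x + W\phi(x)$ with 1-Lipschitz activation checks the symmetric part of the synaptic matrix; the sketch below is this textbook condition, not the sharp LMI characterization derived in the paper:

```python
import numpy as np

def is_contracting_l2(W):
    """Sufficient l2 contraction test for x' = -x + W @ phi(x), phi 1-Lipschitz.
    Returns (contracting?, guaranteed contraction rate 1 - mu_2(W))."""
    mu = np.linalg.eigvalsh((W + W.T) / 2).max()   # matrix measure mu_2(W)
    return bool(mu < 1.0), 1.0 - mu

W = np.array([[0.5, 0.2],
              [0.0, 0.3]])
ok, rate = is_contracting_l2(W)   # contracting, rate about 0.459
```

For symmetric $W$ this recovers the optimal rate mentioned in the abstract; for nonsymmetric $W$ the paper's LMI conditions are strictly less conservative.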
☆ Speeding Up Mixed-Integer Programming Solvers with Sparse Learning for Branching
Machine learning is increasingly used to improve decisions within branch-and-bound algorithms for mixed-integer programming. Many existing approaches rely on deep learning, which often requires very large training datasets and substantial computational resources for both training and deployment, typically with GPU parallelization. In this work, we take a different path by developing interpretable models that are simple but effective. We focus on approximating strong branching (SB) scores, a highly effective yet computationally expensive branching rule. Using sparse learning methods, we build models with fewer than 4% of the parameters of a state-of-the-art graph neural network (GNN) while achieving competitive accuracy. Our CPU-only models are faster than both SCIP's default branching rules and the GPU-accelerated GNN-based model. The models are simple to train and deploy, and they remain effective with small training sets, which makes them practical in low-resource settings. Extensive experiments across diverse problem classes demonstrate the efficiency of this approach.
comment: 21 pages, 2 figures
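As a generic illustration of the sparse-learning idea (not the paper's feature set or estimator), a lasso fit via iterative soft-thresholding selects a handful of informative features for a linear score model; the synthetic data and regularization weight below are assumptions:

```python
import numpy as np

def ista(A, b, lam, steps=500):
    """Iterative soft-thresholding for the lasso: min 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2)**2                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = A.T @ (A @ x - b)                   # gradient of the smooth part
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30))              # 30 candidate branching features
true = np.zeros(30); true[[2, 7]] = [1.5, -2.0] # only two are informative
b = A @ true + 0.01 * rng.standard_normal(200)  # noisy SB-score surrogate
x = ista(A, b, lam=50.0)                        # sparse, interpretable score model
```

The fitted vector is zero on almost all features, so the learned branching surrogate stays cheap to evaluate and easy to inspect.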
☆ A McKean-Pontryagin maximum principle for entropic-regularized optimal transport
This note outlines a mean-field approach to dynamic optimal transport problems based on the recently proposed McKean-Pontryagin maximum principle. Key aspects of the proposed methodology include i) avoidance of sampling over stochastic paths, ii) a fully variational approach leading to constrained Hamiltonian equations of motion, and iii) a unified treatment of deterministic and stochastic optimal transport problems. We also discuss connections to well-known dynamic formulations in terms of forward-backward stochastic differential equations and extensions beyond classical entropic-regularized transport problems.
☆ Quantale-Enriched Co-Design: Toward a Framework for Quantitative Heterogeneous System Design
Monotone co-design enables compositional engineering design by modeling components through feasibility relations between required resources and provided functionalities. However, its standard boolean formulation cannot natively represent quantitative criteria such as cost, confidence, or implementation choice. In practice, these quantities are often introduced through ad hoc scalarization or by augmenting the resource space, which obscures system structure and increases computational burden. We address this limitation by developing a quantale-enriched theory of co-design. We model resources and functionalities as quantale-enriched categories and design problems as quantale-enriched profunctors, thereby lifting co-design from boolean feasibility to general quantitative evaluation. We show that the fundamental operations of series, parallel, and feedback composition remain valid over arbitrary commutative quantales. We further introduce heterogeneous composition through change-of-base maps between quantales, enabling different subsystems to be evaluated in different local semantics and then composed in a common framework. The resulting theory unifies feasibility-, cost-, confidence-, and implementation-aware co-design within one compositional formalism. Numerical examples on a target-tracking system and a UAV delivery problem demonstrate the framework and highlight how native quantitative enrichment can avoid the architectural and computational drawbacks of boolean-only formulations.
☆ Analyzing Performance and Scalability of Benders Decomposition for Generation and Transmission Expansion Planning Models
Generation and Transmission Expansion Planning (GTEP) problems co-optimize generation and transmission expansion, enabling them to provide better planning decisions than traditional Generation Expansion Planning or Transmission Expansion Planning problems, but GTEPs can be computationally complex or intractable. Benders Decomposition (BD) has been applied to expansion planning problems, with various methods applied to accelerate convergence. In this work, we test strategies for improving the performance of BD on GTEP models with nodal resolution and DCOPF constraints. We also present an alternative approach for handling the bilinear constraints that can result in these problems. These tests included combinations of using generalized Benders decomposition (GBD), hot-starting via a transport-constrained model, using linear relaxations of the master problem, and using regularization. We test these methods on mixed-integer linear programming GTEP models with up to 146 buses (10 million continuous variables and 400 mixed-integer decisions). With selected accelerated Benders decomposition approaches, the problems can be solved to under a 1% gap in as little as 5 hours where they were otherwise intractable. Results also suggest that applying regularization during the initial hot-starting and relaxation steps and turning it off once they are complete was generally the best combination of strategies.
comment: 11 pages, 7 figures, 1 table. This work has been submitted to the IEEE for possible publication
☆ Noncooperative Game in Multi-controller System under Delayed and Asymmetric Information
We address a noncooperative game problem in a multi-controller system under a delayed and asymmetric information structure. Under these conditions, the classical separation principle fails as estimation and control design become strongly coupled, complicating the derivation of an explicit Nash equilibrium. To resolve this, we employ a common-private information decomposition approach, effectively decoupling control inputs and state estimation to obtain a closed-form Nash equilibrium. By applying a forward iterative method, we establish the convergence of the coupled Riccati and estimation error covariance recursions, yielding both the steady-state Kalman filter and the Nash equilibrium. Furthermore, we quantify the impact of asymmetric information, proving that a richer information set reduces the costs for the corresponding player. Finally, numerical examples are provided to demonstrate the effectiveness of the results.
comment: 10 pages, 6 figures
☆ Generalizing Output-Feedback Covariance Steering to Incorporate Non-Orthogonal Estimation Errors
This paper addresses the problem of steering a state distribution over a finite horizon in discrete time with output feedback. The incorporation of output feedback introduces additional challenges arising from the statistical coupling between the true state distribution and the corresponding filtered state distribution. In particular, this paper extends existing distribution steering formulations to scenarios in which estimation errors are not orthogonal to the state estimates. A general framework is developed to capture this non-orthogonality, and the resulting problem is formulated in a form solvable via sequential convex programming with rank constraints. The proposed approach generalizes existing methods and is validated through numerical examples and Monte Carlo simulations, including cases with non-orthogonal estimation errors that prior techniques cannot address.
☆ Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data
Nonnegative matrix factorization (NMF) approximates a nonnegative matrix, $X$, by the product of two nonnegative factors, $WH$, where $W$ has $r$ columns and $H$ has $r$ rows. In this paper, we consider NMF using the component-wise L1 norm as the error measure (L1-NMF), which is suited for data corrupted by heavy-tailed noise, such as Laplace noise or salt and pepper noise, or in the presence of outliers. Our first contribution is an NP-hardness proof for L1-NMF, even when $r=1$, in contrast to the standard NMF that uses least squares. Our second contribution is to show that L1-NMF strongly enforces sparsity in the factors for sparse input matrices, thereby favoring interpretability. However, if the data is affected by false zeros, too sparse solutions might degrade the model. Our third contribution is a new, more general, L1-NMF model for sparse data, dubbed weighted L1-NMF (wL1-NMF), where the sparsity of the factorization is controlled by adding a penalization parameter to the entries of $WH$ associated with zeros in the data. The fourth contribution is a new coordinate descent (CD) approach for wL1-NMF, denoted as sparse CD (sCD), where each subproblem is solved by a weighted median algorithm. To the best of our knowledge, sCD is the first algorithm for L1-NMF whose complexity scales with the number of nonzero entries in the data, making it efficient in handling large-scale, sparse data. We perform extensive numerical experiments on synthetic and real-world data to show the effectiveness of our new proposed model (wL1-NMF) and algorithm (sCD).
comment: 21 pages before supplementary, code available from https://github.com/giovanniseraghiti/wL1-NMF
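The weighted-median subproblem at the heart of the coordinate updates admits a compact sketch. For a rank-1 factorization with fixed $h$, minimizing $\sum_j |X_{ij} - w_i h_j|$ over $w_i \geq 0$ reduces to a weighted median of the ratios $X_{ij}/h_j$ with weights $h_j$; the code below illustrates that single step, not the full sCD algorithm or its sparse bookkeeping:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest v with cumulative weight >= half the total; minimizes sum_j w_j|v - x_j|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def update_w_rank1(X, h):
    """Exact update of w in min_{w >= 0} sum_ij |X_ij - w_i h_j| for fixed h (rank 1)."""
    mask = h > 0
    r = X[:, mask] / h[mask]                        # candidate ratios per row
    wts = np.broadcast_to(h[mask], r.shape)
    w = np.array([weighted_median(r[i], wts[i]) for i in range(X.shape[0])])
    return np.maximum(w, 0.0)                       # project onto the nonnegative orthant
```

Because the median ignores extreme ratios, a single corrupted entry does not move the update, which is the robustness property that motivates the L1 error measure.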
☆ Distributed Equilibria for $N$-Player Differential Games with Interaction through Controls: Existence, Uniqueness and Large $N$ Limit
We establish the existence and uniqueness of distributed equilibria for possibly nonsymmetric $N$-player differential games with interactions through controls under displacement semimonotonicity assumptions. Surprisingly, the nonseparable framework of the running cost combined with the character of distributed equilibria leads to a set of consistency relations different in nature from the ones for open and closed loop equilibria investigated in a recent work of Jackson and the second author. In the symmetric setting, we establish quantitative convergence results for the $N$-player games toward the corresponding Mean Field Games of Controls (MFGC), when $N\to+\infty$. Our approach applies to both idiosyncratic noise driven models and fully deterministic frameworks. In particular, for deterministic models distributed equilibria correspond to open loop equilibria, and our work seems to be the first in the literature to provide existence and uniqueness of these equilibria and prove the large $N$ convergence in the MFGC setting. The sharpness of the imposed assumptions is discussed via a set of explicitly computable examples in the linear quadratic setting.
☆ On Lipschitzian properties of multifunctions defined implicitly by "split" feasibility problems
In the present paper, a systematic study is made of quantitative semicontinuity (a.k.a. Lipschitzian) properties of certain multifunctions, which are defined as a solution map associated to a family of parameterized "split" feasibility problems. The latter are a particular class of convex feasibility problems with well recognized applications to several areas of engineering and systems biology. As a part of a perturbation analysis of variational systems, this study falls within the framework of a line of research pursued by several authors. It is performed by means of techniques of variational analysis, which lead to establish sufficient conditions for the Lipschitz lower semicontinuity, calmness, isolated calmness, Lipschitz upper semicontinuity and Aubin property of the solution map. Along with each of these properties, a quantitative estimate of the related exact bound is also provided. The key elements emerging on the way to achieving the main results are dual regularity conditions qualifying the problem behaviour, which are expressed in terms of convex analysis constructions involving problem data. The approach here proposed tries to unify the study of the aforementioned properties.
☆ An objective-function-free algorithm for nonconvex stochastic optimization with deterministic equality and inequality constraints
An algorithm is proposed for solving optimization problems with stochastic objective and deterministic equality and inequality constraints. This algorithm is objective-function-free in the sense that it only uses the objective's gradient and never evaluates the function value. It is based on an adaptive selection of function-decreasing and constraint-improving iterations, the first ones using an Adagrad-type stepsize. When applied to problems with full-rank Jacobian, the combined primal-dual optimality measure is shown to decrease at the rate of $O(1/\sqrt{k})$, which is identical to the convergence rate of first-order methods in the unconstrained case.
☆ Adaptive Mitigation of Insider Threats via Off-Policy Learning
An insider is a team member who covertly deviates from the team's optimal collaborative strategy to pursue a private objective while still appearing cooperative. Such an insider may initially behave cooperatively but later switch to selfish or malicious actions, thereby degrading collective performance, threatening mission success, and compromising operational safety. In this paper, we study such insider threats within an insider-aware, game-theoretic formulation, where the insider interacts with a decision maker (DM) under a continuous-time switched system, with each time interval characterized by a distinct insider behavioral pattern or threat level. We develop a periodic off-policy mitigation scheme that enables the DM to learn optimal mitigation policies from online data when encountering different insider threats, without requiring a priori knowledge of insider intentions. By designing appropriate conditions on the inter-learning interval time, we establish convergence guarantees for both the learning process and the closed-loop system, and characterize the corresponding mitigation performance achieved by the DM.
☆ Stochastic Model Predictive Control based on Mixed Random Variables for Economic Energy Management
Optimal scheduling of batteries has significant potential to reduce electricity costs and to enhance grid resilience. However, effective battery scheduling must account for both physical constraints as well as uncertainties in consumption and generation of renewable energy sources. Instead of optimizing fixed battery power setpoints, we propose an approach that optimizes battery power intervals, allowing the optimization to explicitly account for uncertain consumption and generation as well as how the battery system should respond to them within its physical limits. Our method is based on mixed random variables, represented as mixtures of discrete and continuous probability distributions. Building on this representation, we develop an analytical stochastic formulation for minimizing electricity costs in a residential setting with load, photovoltaics, and battery storage. We demonstrate its effectiveness across real-world data from 15 residential buildings over five consecutive months. Compared with deterministic and probabilistic benchmark controllers, the proposed interval-based optimization achieves the lowest costs. These results show that mixed random variables are a practical and promising tool for decision-making under uncertainty.
☆ Randomstrasse101: Open Problems of 2025
Randomstrasse101 is a blog dedicated to Open Problems in Mathematics, with a focus on Probability Theory, Computation, Combinatorics, Statistics, and related topics. This manuscript serves as a stable record of the Open Problems posted in 2025, with the goal of easing academic referencing. The blog can currently be accessed at randomstrasse101.math.ethz.ch
☆ A Mollification Approach to Ramified Transport and Tree Shape Optimization
The paper analyzes a mollification algorithm for the numerical computation of optimal irrigation patterns. This provides a regularization of the standard irrigation cost functional in a Lagrangian framework. Lower semicontinuity and Gamma-convergence results are proved. The technique is then applied to numerical optimization problems related to the optimal shape of tree roots and branches.
comment: 19 pages, 5 figures, submitted
☆ Distributed Predictive Control Barrier Functions: Towards Scalable Safety Certification in Modular Multi-Agent Systems
We consider safety-critical multi-agent systems with distributed control architectures and potentially varying network topologies. While learning-based distributed control enables scalability and high performance, a lack of formal safety guarantees in the face of unforeseen disturbances and unsafe network topology changes may lead to system failure. To address this challenge, we introduce structured control barrier functions (s-CBFs) as a multi-agent safety framework. The s-CBFs are augmented to a distributed predictive control barrier function (D-PCBF), a predictive, optimization-based safety layer that uses model predictions to guarantee recoverable safety at all times. The proposed approach enables a permissive yet formal plug-and-play protocol, allowing agents to join or leave the network while ensuring safety recovery if a change in network topology requires temporarily unsafe behavior. We validate the formulation through simulations and real-time experiments of a miniature race-car platoon.
comment: This work has been submitted to the IEEE for possible publication
☆ Model Predictive Path Integral PID Control for Learning-Based Path Following
Classical proportional--integral--derivative (PID) control is widely employed in industrial applications; however, achieving higher performance often motivates the adoption of model predictive control (MPC). Although gradient-based methods are the standard for real-time optimization, sampling-based approaches have recently gained attention. In particular, model predictive path integral (MPPI) control enables gradient-free optimization and accommodates non-differentiable models and objective functions. However, directly sampling control input sequences may yield discontinuous inputs and increase the optimization dimensionality in proportion to the prediction horizon. This study proposes MPPI--PID control, which applies MPPI to optimize PID gains at each control step, thereby replacing direct high-dimensional input-sequence optimization with low-dimensional gain-space optimization. This formulation enhances sample efficiency and yields smoother inputs via the PID structure. We also provide theoretical insights, including an information-theoretic interpretation that unifies MPPI and MPPI--PID, an analysis of the effect of optimization dimensionality on sample efficiency, and a characterization of input continuity induced by the PID structure. The proposed method is evaluated on the learning-based path following of a mini forklift using a residual-learning dynamics model that integrates a physical model with a neural network. System identification is performed with real driving data. Numerical path-following experiments demonstrate that MPPI--PID improves tracking performance compared with fixed-gain PID and achieves performance comparable to conventional MPPI while significantly reducing input increments. Furthermore, the proposed method maintains favorable performance even with substantially fewer samples, demonstrating its improved sample efficiency.
comment: Submitted to IFAC Journal of Systems and Control
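The core idea above, sampling in the low-dimensional PID-gain space rather than over the full input sequence, can be sketched as follows (a toy 1-D plant, quadratic tracking cost, and all parameter values are my own assumptions; this is not the paper's forklift setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout_cost(gains, x0, ref, dt=0.02, H=50):
    """Simulate the plant x' = -x + u under a PID law and return tracking cost."""
    kp, ki, kd = gains
    x, integ, prev_e, cost = x0, 0.0, ref - x0, 0.0
    for _ in range(H):
        e = ref - x
        integ += e * dt
        u = kp * e + ki * integ + kd * (e - prev_e) / dt   # PID control law
        prev_e = e
        x = x + dt * (-x + u)                              # forward-Euler plant step
        cost += e * e * dt
    return cost

mean = np.array([1.0, 0.0, 0.0])        # current PID-gain estimate (kp, ki, kd)
lam, N = 0.1, 256                       # MPPI temperature, number of samples
for _ in range(20):                     # MPPI updates in 3-D gain space
    samples = mean + rng.normal(0.0, 0.5, (N, 3))
    costs = np.array([rollout_cost(s, 0.0, 1.0) for s in samples])
    w = np.exp(-(costs - costs.min()) / lam)               # path-integral weights
    mean = (w[:, None] * samples).sum(0) / w.sum()
final = rollout_cost(mean, 0.0, 1.0)
```

Note the optimization dimension stays at three regardless of the horizon length, which is the sample-efficiency argument the abstract makes; the PID structure also keeps inputs smooth between steps.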
☆ Stochastic homogenization of nonconvex unbounded integral functionals with generalized Orlicz growth
We consider the homogenization of random integral functionals which are possibly unbounded, that is, the domain of the integrand is not the whole space and may depend on the space variable. In the vectorial case, we develop a complete stochastic homogenization theory for nonconvex unbounded functionals with convex growth of generalized Orlicz type, under a standard set of assumptions in the field, in particular a coercivity condition of order $p^->1$ and an upper bound of order $p^+<\infty$. The limit energy is defined in a possibly anisotropic Musielak-Orlicz space, for which approximation results with smooth functions are provided. The proof is based on the localization method of $\Gamma$-convergence and a careful use of truncation arguments.
☆ Stochastic Block Bregman Projection with Polyak-like Stepsize for Possibly Inconsistent Convex Feasibility Problems
Stochastic projection algorithms for solving convex feasibility problems (CFPs) have attracted considerable attention due to their broad applicability. In this paper, we propose a unified stochastic bilevel reformulation for possibly inconsistent CFPs that combines proximity function minimization and structural regularization, leading to a feasible bilevel model with a unique and stable regularized solution. From the algorithmic perspective, we develop the stochastic block Bregman projection method with Polyak-like and projective stepsizes, which not only subsumes several recent stochastic projection algorithms but also induces new schemes tailored to specific problems. Moreover, we establish ergodic sublinear convergence rates for the expected inner function, as well as linear convergence in expectation to the inner minimizer set under a Bregman distance growth condition. In particular, the proposed Polyak-like stepsize ensures exact convergence in expectation for possibly inconsistent CFPs. Finally, numerical experiments demonstrate the effectiveness of the proposed method and its robustness to noise.
comment: 41 pages, 6 figures
☆ One-parameter Filled Function Method for Non-convex Multi-objective Optimization Problems
In this paper, a new one-parameter filled function approach is developed for nonlinear multi-objective optimization. Inspired by key filled function ideas from single-objective optimization, the proposed method is adapted to the multi-objective setting. It avoids scalarization weights and does not impose any prior preference ordering among objectives. A descent-based procedure is first applied to obtain a local weak efficient solution. An associated filled function is then constructed and used to derive another local weak efficient solution that improves upon the current one. Repeating these phases drives the search toward global weak efficiency and yields an approximation of the global Pareto front, including in non-convex problems where multiple local Pareto fronts may exist. Numerical experiments on a set of test problems demonstrate the effectiveness of the proposed approach relative to existing methods.
comment: 21 Pages, 1 Table, 5 figures (10 subfigures)
☆ Convergence analysis of dynamical systems for optimization by an improved Lyapunov framework
We study the convergence analysis of continuous-time dynamical systems associated with optimization methods for strongly convex functions. Recent works have proposed systematic constructions of Lyapunov functions for such analysis, while also revealing limitations of the Lyapunov analysis. Aujol--Dossal--Rondepierre (2023) have proposed a technique to address this issue by reorganizing Lyapunov functions so as to evaluate a quantity $f(x(t)) - f_* - g(t)\|x(t)-x_*\|^2$ rather than $f(x(t)) - f_*$. By combining this technique with our computer-assisted framework to discover Lyapunov functions, we develop an improved method that reproduces an existing convergence rate or yields better rates than previous studies.
comment: 4 pages
☆ Receding-Horizon Policy Gradient for Polytopic Controller Synthesis
We propose the Polytopic Receding-Horizon Policy Gradient (P-RHPG) algorithm for synthesizing Parallel Distributed Compensation (PDC) controllers via Tensor Product (TP) model transformation. Standard LMI-based PDC synthesis grows increasingly conservative as model fidelity improves; P-RHPG instead solves a finite-horizon integrated cost via backward-stage decomposition. The key result is that each stage subproblem is a strongly convex quadratic in the vertex gains, a consequence of the linear independence of the HOSVD weighting functions, guaranteeing a unique global minimizer and linear convergence of gradient descent from any initialization. With zero terminal cost, the optimal cost increases monotonically to a finite limit and the gain sequence remains bounded; terminal costs satisfying a mild Lyapunov condition yield non-increasing convergence. Experiments on an aeroelastic wing benchmark confirm convergence to a unique infinite-horizon optimum across all tested terminal cost choices and near-optimal performance relative to the pointwise Riccati lower bound.
☆ Adaptive Momentum via Minimal Dual Function for Accelerating Randomized Sparse Kaczmarz
Recently, the randomized sparse Kaczmarz method has been accelerated by designing heavy ball momentum adaptively via a minimal-error principle. In this paper, we develop a new adaptive momentum method based on the minimal dual function principle to go beyond the exact measurement restriction of the minimal-error principle. Moreover, by integrating the new adaptive momentum method with quantile-based sampling, we introduce a general algorithmic framework, called quantile-based randomized sparse Kaczmarz with minimal dual function momentum, which provides a unified approach to exact, noisy, or corrupted linear systems. In addition, we utilize the discrepancy principle and monotone error as stopping rules for the proposed algorithm. Theoretically, we establish linear convergence in expectation of the Bregman distance up to a finite horizon related to the contamination level. Finally, we provide numerical illustrations on simulated and real-world data to demonstrate the effectiveness of our proposed method.
comment: 30 pages, 4 figures
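For orientation, here is the baseline soft-thresholded (Bregman) iteration that randomized sparse Kaczmarz methods build on, on a consistent toy system of my own construction; the paper's adaptive minimal-dual-function momentum and quantile-based sampling are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 50
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[:5] = np.array([1.0, -2.0, 1.5, -1.0, 2.0])     # sparse ground truth
b = A @ x_true                                         # exact (noise-free) data
lam = 1.0                                              # shrinkage parameter
soft = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
z = np.zeros(n)                                        # dual variable
x = soft(z)
for _ in range(50000):
    i = rng.integers(m)                                # uniform row sampling
    z -= (A[i] @ x - b[i]) / (A[i] @ A[i]) * A[i]      # Kaczmarz step on the dual
    x = soft(z)                                        # primal via soft thresholding
```

For this consistent overdetermined system the iterates converge linearly (in expectation) to the sparse solution; the paper's momentum and sampling modifications are aimed at accelerating exactly this iteration and extending it to noisy or corrupted right-hand sides.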
☆ Bilevel MPC for Linear Systems: A Tractable Reduction and Continuous Connection to Hierarchical MPC
Model predictive control (MPC) has been widely used in many fields, often in hierarchical architectures that combine controllers and decision-making layers at different levels. However, when such architectures are cast as bilevel optimization problems, standard KKT-based reformulations often introduce nonconvex and potentially nonsmooth structures that are undesirable for real-time verifiable control. In this paper, we study a bilevel MPC architecture composed of (i) an upper layer that selects the reference sequence and (ii) a lower-level linear MPC that tracks such reference sequence. We propose a smooth single-level reduction that does not degrade performance under a verifiable block-matrix nonsingularity condition. In addition, when the problem is convex, its solution is unique and equivalent to a corresponding centralized MPC, enabling the inheritance of closed-loop properties. We further show that bilevel MPC is a natural extension of standard hierarchical MPC, and introduce an interpolation framework that continuously connects the two via move-blocking. This framework reveals optimal-value ordering among the resulting formulations and provides inexpensive a posteriori degradation certificates, thereby enabling a principled performance-computational efficiency trade-off.
comment: Submitted to CDC 2026. Code: https://github.com/StanfordASL/Reduced_BMPC
☆ Pointwise and dynamic programming control synthesis for finite-level open quantum memory systems
This paper is concerned with finite-level quantum memory systems for retaining initial dynamic variables in the presence of external quantum noise. The system variables have an algebraic structure, similar to that of the Pauli matrices, and their Heisenberg picture evolution is governed by a quasilinear quantum stochastic differential equation. The latter involves a Hamiltonian whose parameters depend affinely on a classical control signal in the form of a deterministic function of time. The memory performance is quantified by a mean-square deviation of quantum system variables of interest from their initial conditions. We relate this functional to a matrix-valued state of an auxiliary classical control-affine dynamical system. This leads to a pointwise control design where the control signal minimises the time-derivative of the mean-square deviation with an additional quadratic penalty on the control. In an alternative finite-horizon setting with a terminal-integral cost functional, we apply dynamic programming and obtain a quadratically nonlinear Hamilton-Jacobi-Bellman equation, for which a solution is outlined in the form of a recursively computed asymptotic expansion.
comment: 11 pages, 1 figure, submitted to CDC 2026
☆ QOCO-GPU: A Quadratic Objective Conic Optimizer with GPU Acceleration
We present a GPU-accelerated backend for QOCO, a C-based solver for quadratic objective second-order cone programs (SOCPs) based on a primal-dual interior point method. Our backend uses NVIDIA's cuDSS library to perform a direct sparse LDL factorization of the KKT system at each iteration. We also develop custom CUDA kernels for cone operations and show that parallelizing these operations is essential for achieving peak performance. Additionally, we refactor QOCO to introduce a modular backend abstraction that decouples solver logic from the underlying linear algebra implementations, allowing the existing CPU and new GPU backend to share a unified codebase. This GPU backend is accessible through a direct Python interface and through CVXPY, allowing for easy use. Numerical experiments on a range of large-scale quadratic programs and SOCPs with tens to hundreds of millions of nonzero elements in the KKT matrix demonstrate speedups of up to 50-70 times over the CPU implementation.
☆ Adaptive Delayed-Update Cyclic Algorithm for Variational Inequalities
Cyclic block coordinate methods are a fundamental class of first-order algorithms, widely used in practice for their simplicity and strong empirical performance. Yet, their theoretical behavior remains challenging to explain, and setting their step sizes -- beyond classical coordinate descent for minimization -- typically requires careful tuning or line-search machinery. In this work, we develop $\texttt{ADUCA}$ (Adaptive Delayed-Update Cyclic Algorithm), a cyclic algorithm addressing a broad class of Minty variational inequalities with monotone Lipschitz operators. $\texttt{ADUCA}$ is parameter-free: it requires no global or block-wise Lipschitz constants and uses no per-epoch line search, except at initialization. A key feature of the algorithm is using operator information delayed by a full cycle, which makes the algorithm compatible with parallel and distributed implementations, and attractive due to weakened synchronization requirements across blocks. We prove that $\texttt{ADUCA}$ attains (near) optimal global oracle complexity as a function of the target error $\varepsilon > 0$, scaling with $1/\varepsilon$ for monotone operators, or with $\log^2(1/\varepsilon)$ for operators that are strongly monotone.
☆ Primal-dual dynamics featuring Hessian-driven damping and variable mass for convex optimization problems
This paper deals with a new Tikhonov regularized primal-dual dynamical system with variable mass and Hessian-driven damping for solving a convex optimization problem with linear equality constraints. The system features several time-dependent parameters: variable mass, slow viscous damping, extrapolation, and temporal scaling. By employing the Lyapunov analysis approach, we obtain the strong convergence of the trajectory generated by the proposed system to the minimal norm solution of the optimization problem, as well as convergence rate results for the primal-dual gap, the objective residual, and the feasibility violation. We also show that the convergence rates of the primal-dual gap, the objective residual, and the feasibility violation can be improved by appropriately adjusting these parameters. Further, we conduct numerical experiments to demonstrate the effectiveness of the theoretical results.
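As background, the un-augmented primal-dual gradient flow that such dynamical systems extend can be sketched in a few lines (my own toy instance with forward-Euler time discretization; the paper's system adds variable mass, Hessian-driven damping, extrapolation, temporal scaling, and Tikhonov regularization on top of this):

```python
import numpy as np

# Problem: min 0.5 * ||x||^2  subject to  A @ x = b.
# KKT point: x = (0.5, 0.5), lam = -0.5.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = np.array([3.0, -2.0])
lam = np.array([0.0])
dt = 0.01
for _ in range(5000):
    dx = -(x + A.T @ lam)       # primal flow: -(grad f(x) + A^T lam)
    dl = A @ x - b              # dual flow driven by the feasibility residual
    x, lam = x + dt * dx, lam + dt * dl
```

The trajectory converges to the unique primal-dual solution; the paper's added time-dependent parameters are precisely what accelerate this convergence and select the minimal-norm solution.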
☆ Sampling-Horizon Neural Operator Predictors for Nonlinear Control under Delayed Inputs
Modern control systems frequently operate under input delays and sampled state measurements. A common delay-compensation strategy is predictor feedback; however, practical implementations require solving an implicit ODE online, resulting in intractable computational cost. Moreover, predictor formulations typically assume continuously available state measurements, whereas in practice measurements may be sampled, irregular, or temporarily missing due to hardware faults. In this work, we develop two neural-operator predictor-feedback designs for nonlinear systems with delayed inputs and sampled measurements. In the first design, we introduce a sampling-horizon prediction operator that maps the current measurement and input history to the predicted state trajectory over the next sampling interval. In the second design, the neural operator approximates only the delay-compensating predictor, which is then composed with the closed-loop flow between measurements. The first approach requires uniform sampling but yields residual bounds that scale directly with the operator approximation error. In contrast, the second accommodates non-uniform, but bounded sampling schedules at the cost of amplified approximation error, revealing a practical tradeoff between sampling flexibility and approximation sensitivity for the control engineer. For both schemes, we establish semi-global practical stability with explicit neural operator error-dependent bounds. Numerical experiments on a 6-link nonlinear robotic manipulator demonstrate accurate tracking and substantial computational speedup of 25$\times$ over a baseline approach.
comment: 6 pages
☆ Predictor-Based Output-Feedback Control of Linear Systems with Time-Varying Input and Measurement Delays via Neural-Approximated Prediction Horizons
Due to simplicity and strong stability guarantees, predictor feedback methods have stood as a popular approach for time delay systems since the 1950s. For time-varying delays, however, implementation requires computing a prediction horizon defined by the inverse of the delay function, which is rarely available in closed form and must be approximated. In this work, we formulate the inverse delay mapping as an operator learning problem and study predictor feedback under approximation of the prediction horizon. We propose two approaches: (i) a numerical method based on time integration of an equivalent ODE, and (ii) a data-driven method using neural operators to learn the inverse mapping. We show that both approaches achieve arbitrary approximation accuracy over compact sets, with complementary trade-offs in computational cost and scalability. Building on these approximations, we then develop an output-feedback predictor design for systems with delays in both the input and the measurement. We prove that the resulting closed-loop system is globally exponentially stable when the prediction horizon is approximated with sufficiently small error. Lastly, numerical experiments validate the proposed methods and illustrate their trade-offs between accuracy and computational efficiency.
comment: 11 Pages. Preprint
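The prediction-horizon computation at the heart of the abstract can be illustrated directly (the example delay function and tolerance are my assumptions; this corresponds to a simple root-finding variant of approach (i), which the neural operator of approach (ii) would be trained to replace): for an input delay $D(t)$, the prediction horizon $\sigma(t)$ solves $\phi(\sigma) = t$ with $\phi(s) = s - D(s)$, which is invertible whenever $D'(s) < 1$.

```python
import numpy as np

D = lambda s: 0.5 + 0.2 * np.sin(s)      # example time-varying delay, |D'| < 1
phi = lambda s: s - D(s)                 # strictly increasing delay map

def sigma(t, lo=0.0, hi=100.0, tol=1e-10):
    """Invert phi by bisection: return s with phi(s) = t."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s3 = sigma(3.0)                          # prediction horizon at t = 3
```

Since $\phi$ is rarely invertible in closed form, this numerical inversion (or a learned surrogate of it) must run at every control step, which motivates the paper's accuracy-versus-cost comparison between the two approaches.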
♻ ☆ High order Tensor-Train-Based Schemes for High-Dimensional Mean Field Games
We introduce a fully discrete scheme to solve a class of high-dimensional Mean Field Games systems. Our approach couples semi-Lagrangian (SL) time discretizations with Tensor-Train (TT) decompositions to tame the curse of dimensionality. By reformulating the classical Hamilton-Jacobi-Bellman and Fokker-Planck equations as a sequence of advection-diffusion-reaction subproblems within a smoothed policy iteration, we construct both first and second order in time SL schemes. The TT format and appropriate quadrature rules reduce storage and computational cost from exponential to polynomial in the dimension. Numerical experiments demonstrate that our TT-accelerated SL methods achieve their theoretical convergence rates, exhibit modest growth in memory usage and runtime with dimension, and significantly outperform grid-based SL in accuracy per CPU second.
♻ ☆ Differential estimates for fast first-order multilevel nonconvex optimisation
With a view on bilevel and PDE-constrained optimisation, we develop iterative estimates $\widetilde{F'}(x^k)$ of $F'(x^k)$ for composite functions $F :=J \circ S$, where $S$ is the solution mapping of the inner optimisation problem or PDE. The idea is to form a single-loop method by interweaving updates of the iterate $x^k$ by an outer optimisation method, with updates of the estimate by single steps of standard optimisation methods and linear system solvers. When the inner methods satisfy simple tracking inequalities, the differential estimates can almost directly be employed in standard convergence proofs for general forward-backward type methods. We adapt those proofs to a general inexact setting in normed spaces, that, besides our differential estimates, also covers mismatched adjoints and unreachable optimality conditions in measure spaces. As a side product of these efforts, we provide improved convergence results for nonconvex Primal-Dual Proximal Splitting (PDPS). We numerically evaluate our methods on Electrical Impedance Tomography (EIT) and minimal surface control.
♻ ☆ Multi-Objective Optimization with Desirability and Morris-Mitchell Criterion
Industrial experimental designs frequently lack optimal space-filling properties, rendering them unrepresentative. This study presents a comprehensive methodology to refine existing designs by enhancing coverage quality while optimizing experimental outcomes. We discuss and analyze variants of the Morris-Mitchell criterion to quantify and improve spatial distributions. Based on potential theory, we analyze monotonicity properties and limitations of the Morris-Mitchell criteria. Practically, we implement a multi-objective optimization framework utilizing the Python packages spotdesirability and spotoptim. This framework uses desirability functions to combine surrogate-model predictions with space-filling enhancements into a unified score. Demonstrated through data from a compressor development case study, this approach optimizes performance objectives alongside design coverage. To facilitate implementation, we introduce novel infill-point diagnostics that visually guide the sequential placement of design points. This integrated methodology successfully bridges spatial theory with engineering application, balancing the crucial exploration and exploitation trade-off.
♻ ☆ Nested Extremum Seeking Converges to Stackelberg Equilibrium
The nested Extremum Seeking (nES) algorithm is a model-free optimization method that has been shown to converge to a neighborhood of a Nash equilibrium. In this work, we demonstrate that the same nES dynamics can instead be made to converge to a neighborhood of a Stackelberg (leader--follower) equilibrium by imposing a different scaling law on the algorithm's design parameters. For the two--level nested case, using Lie--bracket averaging and singular perturbation arguments, we provide a rigorous stability proof showing semi-global practical asymptotic convergence to a Stackelberg equilibrium under appropriate time-scale separation. The results reveal that equilibrium selection, Nash versus Stackelberg, depends not on modifying the closed-loop dynamics, but on the hierarchical scaling of design parameters and the induced time-scale structure. We demonstrate this effect using a simple quadratic example and the canonical Fish War game. The Stackelberg variant of nES provides a model-free framework for hierarchical optimization in multi-time-scale systems, with potential applications in power grids, networked dynamical systems, and tuning of particle accelerators.
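A single extremum-seeking loop, the building block that the nested algorithm stacks at different time scales, can be sketched as follows (toy scalar cost and all dither/gain values are my own; the paper's Nash-versus-Stackelberg result concerns how two such loops are scaled relative to each other):

```python
import numpy as np

# Unknown cost map with minimum at theta = 2; the loop only measures J, never J'.
J = lambda th: (th - 2.0) ** 2
th_hat = 0.0                     # estimate of the minimizer
a, w, k, dt = 0.1, 10.0, 0.5, 0.01   # dither amplitude, frequency, gain, step
for i in range(20000):
    t = i * dt
    th = th_hat + a * np.sin(w * t)          # sinusoidally dithered input
    y = J(th)                                # measured cost
    th_hat -= k * dt * y * np.sin(w * t)     # demodulation ~ gradient descent
```

Averaging the demodulated signal yields a gradient-descent-like drift of `th_hat` toward the minimizer; in the nested scheme, the relative scaling of `a`, `w`, and `k` across the two loops is what selects the equilibrium.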
♻ ☆ Diffusive Stochastic Master Equation (SME) with dispersive qubit/cavity coupling
A detailed analysis of the quantum diffusive Stochastic Master Equation (SME) for qubit/cavity systems with dispersive coupling is provided. This analysis incorporates classical input signals and output signals (measurement outcomes through homodyne/heterodyne detection). The dynamics of the qubit/cavity density operator is shown to converge exponentially towards a slow invariant manifold. This invariant manifold is parameterized by the density operator of a fictitious qubit governed by a quantum SME incorporating the classical input/output signals and preserving complete positivity and trace. The reduced cavity (resp. qubit) operator obtained by partial trace over the qubit (resp. cavity) is then given by a time-varying deterministic quantum channel applied to the density operator of this fictitious qubit. This formulation avoids the non-Markovian descriptions found in several publications, where negative dephasing rates and detection efficiencies temporarily exceeding one appear. Extensions are considered where the qubit is replaced by any qudit dispersively coupled to an arbitrary set of cavity modes with collective input/output classical signals.
comment: Submitted
♻ ☆ One-dimensional optimisation of indefinite-weight principal eigenvalues under inhomogeneous Robin conditions and a Schrödinger-type perturbation
We study the minimisation of the positive principal eigenvalue for an indefinite-weight problem under inhomogeneous Robin boundary conditions. The model is motivated by diffusive logistic equations in spatially heterogeneous environments, where the weight describes allocatable favourable resources and the Robin parameters measure boundary loss. After recalling the variational setting and the bang--bang reduction, we solve the one-dimensional optimisation problem completely: the optimal favourable set is an interval, and we determine whether it attaches to the left endpoint, to the right endpoint, or lies in the interior according to explicit criteria involving the two boundary parameters. The classification is then verified numerically by a transfer--matrix shooting method. We also introduce a Schrödinger-type extension with a fixed nonnegative background potential. In the coercive case we establish the corresponding principal-eigenvalue and bang--bang results, and in one dimension with constant potential we prove a stability result showing that minimisers for small background potential converge to those of the unperturbed problem. This perturbed regime is likewise illustrated numerically.
comment: 24 pages, 10 figures
♻ ☆ Recommend-to-Match with Random Supply Rejections: Formulation, Approximation, and Analysis
Matching demand with supply in crowdsourcing logistics platforms must contend with uncertain worker participation. Motivated by this challenge, we study a two-stage "recommend-to-match" problem under stochastic supplier rejections, where each demand is initially recommended to multiple potential suppliers prior to final matching decisions. We formulate a stochastic optimization model that explicitly captures uncertain supplier acceptance behavior. For the special case with homogeneous and independent acceptance responses, an exact mixed-integer linear program and LP formulations are achievable, but the general problem does not admit an efficient formulation. Particularly, our analysis reveals that deterministic linear approximation methods can perform arbitrarily poorly in such settings. To overcome this limitation, we propose a new approximation approach based on a convex relaxation of the original problem that admits a mixed-integer exponential cone program (MIECP) formulation. We analyze the structural properties of this approximation and establish its parametric performance guarantees. We also characterize conditions under which it can dominate a deterministic approximation. Extensive experiments on synthetic data and real-world freight data validate the effectiveness of our approach. Our MIECP-based solution achieves near-optimal matching performance while reducing computation time by over 90% compared to benchmark methods, which makes it particularly promising for large-scale matching problems.
♻ ☆ Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise
We study stochastic nonconvex optimization under heavy-tailed noise. In this setting, the stochastic gradients only have bounded $p$-th central moment ($p$-BCM) for some $p \in (1,2]$. Building on the foundational work of Arjevani et al. (2022) in stochastic optimization, we establish tight sample complexity lower bounds for all first-order methods under \emph{relaxed} mean-squared smoothness ($q$-WAS) and $\delta$-similarity ($(q, \delta)$-S) assumptions, allowing any exponent $q \in [1,2]$ instead of the standard $q = 2$. These results substantially broaden the scope of existing lower bounds. To complement them, we show that Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR), a known algorithm, matches these bounds in expectation. Beyond expectation guarantees, we introduce a new algorithm, Double-Clipped NSGD-MVR, which allows the derivation of high-probability convergence rates under weaker assumptions than in previous works. Finally, for second-order methods with stochastic Hessians satisfying bounded $q$-th central moment assumptions for some exponent $q \in [1, 2]$ (allowing $q \neq p$), we establish sharper lower bounds than previous works while improving over Sadiev et al. (2025) (where only $p = q$ is considered) and yielding stronger convergence exponents. Together, these results provide a nearly complete complexity characterization of stochastic nonconvex optimization in heavy-tailed regimes.
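A sketch of the NSGD-MVR iteration mentioned above, on a toy problem of my own choosing (quadratic objective with additive Student-$t$ gradient noise, df = 3, so only low-order moments are controlled; stepsize and momentum values are illustrative): the STORM-style estimator reuses the same noise sample at the old and new iterate, and the step direction is normalized, which is what confers robustness to heavy tails.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 10
grad = lambda x, xi: x + xi                  # stochastic gradient of 0.5*||x||^2
x = np.full(dim, 5.0)
d = grad(x, rng.standard_t(3, dim))          # initialize the momentum estimator
eta, beta = 0.05, 0.1                        # stepsize, momentum parameter
for _ in range(3000):
    x_new = x - eta * d / (np.linalg.norm(d) + 1e-12)     # normalized step
    xi = rng.standard_t(3, dim)              # one fresh heavy-tailed sample
    # MVR/STORM correction: same xi at both iterates reduces estimator variance
    d = grad(x_new, xi) + (1 - beta) * (d - grad(x, xi))
    x = x_new
```

The iterate descends from the far-away start into a noise-floor neighborhood of the minimizer despite the heavy-tailed gradients; the paper's Double-Clipped variant additionally clips to obtain high-probability guarantees.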
♻ ☆ Generative AI on Wall Street -- Opportunities and Risk Controls
We give an overview on the emerging applications of GenAI in the financial industry, especially within investment banks. Inherent to these exciting opportunities is a new realm of risks that must be managed properly. By heeding both the Yin and Yang sides of GenAI, we can accelerate its organic growth while safeguarding the entire financial industry during this nascent era of AI.
comment: 30 pages, 8 figures
♻ ☆ Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
We view Markov decision processes (MDPs) as the optimization of an objective function over certain linear operators on general function spaces. A new result is established for the existence of optimal policies in general MDPs, which differs from the existence results derived previously in the literature. Using the well-established perturbation theory of linear operators, a policy difference lemma is established for general MDPs, and the Gâteaux derivative of the objective function, viewed as a function of the policy operator, is derived. By upper bounding the policy difference via the theory of integral probability metrics, a new majorization-minimization type policy gradient algorithm for general MDPs is derived. This generalizes many well-known reinforcement learning algorithms to settings with general state and action spaces. Further, by taking the integral probability metric to be the maximum mean discrepancy, a low-complexity policy gradient algorithm is derived for finite MDPs. The new algorithm, called MM-RKHS, appears to be superior to the PPO algorithm due to its low computational complexity, low sample complexity, and faster convergence.
♻ ☆ Smooth Quasar-Convex Optimization with Constraints
Quasar-convex functions form a broad nonconvex class with applications to linear dynamical systems, generalized linear models, and Riemannian optimization, among others. Current nearly optimal algorithms work only in affine spaces due to the loss of one degree of freedom when working with general convex constraints. Obtaining an accelerated algorithm that makes nearly optimal $\widetilde{O}(1/(\gamma\sqrt{\varepsilon}))$ first-order queries to a $\gamma$-quasar convex smooth function \emph{with constraints} was independently asked as an open problem in Martínez-Rubio (2022); Lezane, Langer, and Koolen (2024). In this work, we solve this question by designing an inexact accelerated proximal point algorithm that we implement using a first-order method achieving the aforementioned rate and, as a consequence, we improve the complexity of the accelerated geodesically Riemannian optimization solution in Martínez-Rubio (2022). We also analyze projected gradient descent and Frank-Wolfe algorithms in this constrained quasar-convex setting. To the best of our knowledge, our work provides the first analyses of first-order methods for quasar-convex smooth functions with general convex constraints.
comment: AISTATS 2026 final version
♻ ☆ Joint Cooperative and Non-Cooperative Localization in WSNs with Distributed Scaled Proximal ADMM Algorithms
The integration of cooperative and non-cooperative localization is fundamentally important, as these two modes frequently coexist in wireless sensor networks, especially when sensor positions are uncertain and targets are unable to communicate with the network. This paper presents a joint modeling approach that formulates cooperative and non-cooperative localization as a single optimization problem. By processing both tasks jointly, the proposed method eliminates the latency inherent in sequential approaches that perform cooperative localization first, followed by non-cooperative localization. However, this joint formulation introduces complex variable coupling, posing challenges in both modeling and optimization. To address this coupling, we introduce auxiliary variables that enable structural decoupling and facilitate distributed computation. Building on this formulation, we develop the Scaled Proximal Alternating Direction Method of Multipliers for Joint Cooperative and Non-Cooperative Localization (SP-ADMM-JCNL). Leveraging the structured design of the problem, we provide theoretical guarantees that the algorithm generates a sequence converging globally to a KKT point of the reformulated problem, and further to a critical point of the original non-convex objective function, with the convergence rate of O(1/T). Experiments demonstrate that SP-ADMM-JCNL achieves accurate and reliable localization performance.
♻ ☆ Quadratic Gradient: A Unified Framework Bridging Gradient Descent and Newton-Type Methods by Synthesizing Hessians and Gradients
Accelerating the convergence of second-order optimization, particularly Newton-type methods, remains a pivotal challenge in algorithmic research. In this paper, we extend previous work on the \textbf{Quadratic Gradient (QG)} and rigorously validate its applicability to general convex numerical optimization problems. We introduce a novel variant of the Quadratic Gradient that departs from the conventional fixed Hessian Newton framework. Although this variant does not satisfy the convergence conditions of the fixed Hessian Newton's method, it maintains a positive-definite Hessian proxy and, in our experiments, achieves convergence rates comparable to, and sometimes better than, the original construction. Furthermore, we demonstrate that both the original and the proposed QG variants can be effectively applied to non-convex optimization landscapes. A key motivation of our work is the limitation of traditional scalar learning rates: we argue that a diagonal matrix can more effectively accelerate gradient elements at heterogeneous rates. Our findings establish the Quadratic Gradient as a versatile and potent framework for modern optimization. We also integrate Hutchinson's estimator to approximate the Hessian diagonal efficiently via Hessian-vector products. Notably, we demonstrate that the proposed Quadratic Gradient variant is highly effective for deep learning architectures, providing a robust second-order alternative to standard adaptive optimizers.
comment: In this work, we propose an enhanced Adam method via the quadratic gradient and apply the quadratic gradient to general numerical optimization problems. The quadratic gradient can indeed be used to build enhanced gradient methods for general optimization problems. There is a good chance that the quadratic gradient can also be applied to quasi-Newton methods, such as the famous BFGS method
♻ ☆ On the importance of smoothness, interface resolution and numerical sensitivities in shape and topological sensitivity analysis
In this paper we investigate the influence of the discretization of PDE constraints on shape and topological derivatives. To this end, we study a tracking-type functional and a two-material Poisson problem in one spatial dimension. We consider the discretization by a standard method and an enriched method. In the standard method we use splines of degree $p$ such that we can control the smoothness of the basis functions easily, but do not take any interface location into consideration. This includes, for $p=1$, the usual hat basis functions. In the enriched method we additionally capture the interface locations in the ansatz space by enrichment functions. For both discretization methods, shape and topological sensitivity analysis is performed. It turns out that the regularity of the shape derivative depends on the regularity of the basis functions. Furthermore, for point-wise convergence of the shape derivative the interface has to be considered in the ansatz space. For the topological derivative we show that only the enriched method converges.
♻ ☆ QOCO: A Quadratic Objective Conic Optimizer with Custom Solver Generation
Second-order cone programs (SOCPs) with quadratic objective functions are common in optimal control and other fields. Most SOCP solvers that use interior-point methods are designed for linear objectives and convert quadratic objectives into linear ones via slack variables and extra constraints, despite the computational advantages of handling quadratic objectives directly. In applications like model-predictive control and online trajectory optimization, these SOCPs have known sparsity structures and require rapid solutions. When solving these problems, most solvers use sparse linear algebra routines, which introduce computational overhead and hinder performance. In contrast, custom linear algebra routines can exploit the known sparsity structure of problem data and be significantly faster. This work makes two key contributions: (1) the development of QOCO, an open-source C-based solver for quadratic objective SOCPs, and (2) the introduction of QOCOGEN, an open-source custom solver generator for quadratic objective SOCPs, which generates a solver written in C that leverages custom linear algebra. Both implement a primal-dual interior-point method with Mehrotra's predictor-corrector. Our benchmarks show that QOCO is faster and more robust than many commonly used solvers, and solvers generated by QOCOGEN are significantly faster than QOCO and are free of dynamic memory allocation, making them an attractive option for real-time optimization on resource-constrained embedded systems.
comment: Published in Mathematical Programming Computation
Optimization and Control 46
☆ Scalable Co-Design via Linear Design Problems: Compositional Theory and Algorithms
Designing complex engineered systems requires managing tightly coupled trade-offs between subsystem capabilities and resource requirements. Monotone co-design provides a compositional language for such problems, but its generality does not by itself reveal which problem classes admit exact and scalable computation. This paper isolates such a class by introducing Linear Design Problems (LDPs): design problems whose feasible functionality--resource relations are polyhedra over Euclidean posets. We show that queries on LDPs reduce exactly to Multi-Objective Linear Programs (MOLPs), thereby connecting monotone co-design semantics with polyhedral multiobjective optimization. We further prove that LDPs are closed under the fundamental co-design interconnections, implying that any interconnection of linear components induces a system-level LDP. To compute the resulting feasible sets, we develop two complementary constructions: a monolithic lifted formulation that preserves block-angular sparsity, and a compositional formulation that incrementally eliminates internal variables through polyhedral projection. Beyond the exact linear setting, we show that convex co-design resource queries admit arbitrarily accurate polyhedral outer approximations, with recession-cone error identically zero for standard nonnegative resource cones. Numerical studies on synthetic series-chain benchmarks, a gripper, and a rover co-design validate the theory.
comment: 17 pages, 7 figures, 4 tables
☆ Stable Walking for Bipedal Locomotion under Foot-Slip via Virtual Nonholonomic Constraints
Foot slip is a major source of instability in bipedal locomotion on low-friction or uncertain terrain. Standard control approaches typically assume no-slip contact and therefore degrade when slip occurs. We propose a control framework that explicitly incorporates slip into the locomotion model through virtual nonholonomic constraints, which regulate the tangential stance-foot velocity while remaining compatible with the virtual holonomic constraints used to generate the walking gait. The resulting closed-loop system is formulated as a hybrid dynamical system with continuous swing dynamics and discrete impact events. A nonlinear feedback law enforces both classes of constraints and yields a slip-compatible hybrid zero dynamics manifold for the reduced-order locomotion dynamics. Stability of periodic walking gaits is characterized through the associated Poincaré map, and numerical results illustrate stabilization under slip conditions.
☆ A user-driven pricing and scheduling framework for public electric vehicle charging
Public electric vehicle (EV) charging infrastructure has expanded rapidly, yet utilization across charging stations remains uneven and often inefficient. Existing operator-determined pricing schemes offer limited flexibility to coordinate heterogeneous user demand within constrained capacity. This study proposes a user-driven pricing and scheduling framework for public EV charging. Users submit advance bids specifying acceptable time slots, bid prices, and quantity bounds. Based on these bids, the charging operator determines prices and charging slot assignments. After observing the outcomes, users decide whether to accept the resulting prices and allocations. The operator's decision problem incorporates profit objectives, user participation requirements, and capacity constraints across charging levels and time slots. The framework captures a three-stage interaction involving user bidding, operator decisions, and user acceptance. Numerical case studies reveal trade-offs among user acceptance, operator revenue, and charging capacity utilization under different charging levels. The findings provide practical guidance and insights into designing flexible pricing schemes that better accommodate heterogeneous user preferences while improving system efficiency.
☆ Transfer Learning in Bayesian Optimization for Aircraft Design
The use of transfer learning within Bayesian optimization addresses the disadvantages of the so-called \textit{cold start} problem by using source data to aid in the optimization of a target problem. We present a method that leverages an ensemble of surrogate models using transfer learning and integrates it in a constrained Bayesian optimization framework. We identify challenges particular to aircraft design optimization related to heterogeneous design variables and constraints. We propose the use of a partial-least-squares dimension reduction algorithm to address design space heterogeneity, and a \textit{meta} data surrogate selection method to address constraint heterogeneity. Numerical benchmark problems and an aircraft conceptual design optimization problem are used to demonstrate the proposed methods. Results show significant improvement in convergence in early optimization iterations compared to standard Bayesian optimization, with improved prediction accuracy for both objective and constraint surrogate models.
☆ Collisionless Multi-Agent Path Planning in the Hamilton-Jacobi Formulation
We present a method for collisionless multi-agent path planning using the Hamilton-Jacobi-Bellman equation. Because the method is rooted in optimal control theory and partial differential equations, it avoids the need for hierarchical planners and is black-box free. Our model can account for heterogeneous agents and realistic, high-dimensional dynamics. We develop a grid-free numerical method based on a variational formulation of the solution of the Hamilton-Jacobi-Bellman equation which can resolve optimal trajectories even in high-dimensional problems, and include some practical implementation notes. In particular, we resolve the solution using a primal-dual hybrid gradient optimization scheme. We demonstrate the method's efficacy on path planning problems involving simple cars and quadcopter drones.
☆ Multi-fidelity approaches for general constrained Bayesian optimization with application to aircraft design
Aircraft design relies heavily on solving challenging and computationally expensive Multidisciplinary Design Optimization (MDO) problems. In this context, there has been growing interest in multi-fidelity models for Bayesian optimization to improve the MDO process by balancing computational cost and accuracy through the combination of high- and low-fidelity simulation models, enabling efficient exploration of the design process at a minimal computational effort. In the existing literature, fidelity selection focuses only on the objective function to decide how to integrate multiple fidelity levels, balancing precision and computational cost using variance reduction criteria. In this work, we propose novel multi-fidelity selection strategies. Specifically, we demonstrate how incorporating information from both the objective and the constraints can further reduce computational costs without compromising the optimality of the solution. We validate the proposed multi-fidelity optimization strategy by applying it to four analytical test cases, showcasing its effectiveness. The proposed method is used to efficiently solve a challenging aircraft wing aero-structural design problem. The proposed setting uses a linear vortex lattice method and a finite element method for the aerodynamic and structural analyses, respectively. We show that employing our proposed multi-fidelity approach leads to $86\%$ to $200\%$ more constraint-compliant solutions given a limited budget compared to the state-of-the-art approach.
☆ Beyond binarity: Semidefinite programming for ternary quadratic problems
We study the ternary quadratic problem (TQP), a quadratic optimization problem with linear constraints where the variables take values in $\{0, \pm 1\}$. While semidefinite programming (SDP) techniques are well established for $\{0,1\}$- and $\{\pm 1\}$-valued quadratic problems, no dedicated integer semidefinite programming framework exists for the ternary case. In this paper, we introduce a ternary SDP formulation for the TQP that forms the basis of an exact solution approach. We derive new theoretical insights in rank-one ternary positive semidefinite matrices, which lead to a basic SDP relaxation that is further strengthened by valid triangle, RLT, split and $k$-gonal inequalities. These are embedded in a tailored branch-and-bound algorithm that iteratively solves strengthened SDPs, separates violated inequalities, applies a ternary branching strategy and computes high-quality feasible solutions. We test our algorithm on TQP variations motivated by practice, including unconstrained, linearly constrained and quadratic ratio problems. Computational results on these instances demonstrate the effectiveness of the proposed algorithm.
☆ A Spectral Preconditioner for the Conjugate Gradient Method with Iteration Budget
We study the solution of large symmetric positive-definite linear systems in a matrix-free setting with a limited iteration budget. We focus on the preconditioned conjugate gradient (PCG) method with spectral preconditioning. Spectral preconditioners map a subset of eigenvalues to a positive cluster via a scaling parameter, and leave the remainder of the spectrum unchanged, in the hope of reducing the number of iterations to convergence. We formulate the design of the spectral preconditioners as a constrained optimization problem. The optimal cluster placement is defined to minimize the error in energy norm at a fixed iteration. This optimality criterion provides new insight into the design of efficient spectral preconditioners when PCG is stopped short of convergence. We propose practical strategies for selecting the scaling parameter, hence the cluster position, that incur negligible computational cost. Numerical experiments highlight the importance of cluster placement and demonstrate significant improvements in terms of error in energy norm, particularly during the initial iterations.
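As an illustration of the clustering idea described in this abstract (not the paper's optimal placement criterion), the sketch below builds a dense spectral preconditioner that maps the k smallest eigenvalues of an SPD matrix onto a single cluster value tau, leaving the rest of the spectrum unchanged, and runs it inside a textbook PCG loop. All function names are illustrative.

```python
import numpy as np

def spectral_preconditioner(A, k, tau):
    """Return M^{-1} as a function: maps the k smallest eigenvalues of SPD A
    onto the cluster value tau and acts as the identity on the rest."""
    lam, V = np.linalg.eigh(A)          # illustrative dense eigensolve
    Vk, lamk = V[:, :k], lam[:k]
    def apply(r):
        # M^{-1} = I + Vk diag(tau/lamk - 1) Vk^T
        return r + Vk @ ((tau / lamk - 1.0) * (Vk.T @ r))
    return apply

def pcg(A, b, Minv, iters=50, tol=1e-10):
    """Textbook preconditioned conjugate gradient."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

With the two smallest eigenvalues relocated into the well-conditioned part of the spectrum, PCG converges in a handful of iterations instead of stalling on the ill-conditioned modes.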
☆ Optimistic Online LQR via Intrinsic Rewards
Optimism in the face of uncertainty is a popular approach to balance exploration and exploitation in reinforcement learning. Here, we consider the online linear quadratic regulator (LQR) problem, i.e., to learn the LQR corresponding to an unknown linear dynamical system by adapting the control policy online based on closed-loop data collected during operation. In this work, we propose Intrinsic Rewards LQR (IR-LQR), an optimistic online LQR algorithm that applies the idea of intrinsic rewards originating from reinforcement learning and the concept of variance regularization to promote uncertainty-driven exploration. IR-LQR retains the structure of a standard LQR synthesis problem by only modifying the cost function, resulting in an intuitively pleasing, simple, computationally cheap, and efficient algorithm. This is in contrast to existing optimistic online LQR formulations that rely on more complicated iterative search algorithms or solve computationally demanding optimization problems. We show that IR-LQR achieves the optimal worst-case regret rate of $\sqrt{T}$, and compare it to various state-of-the-art online LQR algorithms via numerical experiments carried out on an aircraft pitch angle control and an unmanned aerial vehicle example.
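Since IR-LQR retains the structure of standard LQR synthesis and only modifies the cost, the underlying solver can be a plain Riccati iteration. The sketch below is a generic discrete-time LQR routine (illustrative, not the paper's implementation); an intrinsic-reward modification of the kind the abstract describes would enter simply through the matrices Q and R.

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Infinite-horizon discrete-time LQR gain via Riccati value iteration.
    An optimistic variant in the spirit of IR-LQR would only alter Q and R
    with uncertainty-dependent terms; the synthesis step stays the same."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P
```

For the scalar system A = B = Q = R = 1, the fixed point of the Riccati recursion is the golden ratio, giving the well-known gain K = (sqrt(5) - 1)/2.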
☆ Symmetrizing Bregman Divergence on the Cone of Positive Definite Matrices: Which Mean to Use and Why
This work uncovers variational principles behind symmetrizing the Bregman divergences induced by generic mirror maps over the cone of positive definite matrices. We show that computing the canonical means for this symmetrization can be posed as minimizing the desired symmetrized divergences over a set of mean functionals defined axiomatically to satisfy certain properties. For the forward symmetrization, we prove that the arithmetic mean over the primal space is canonical for any mirror map over the positive definite cone. For the reverse symmetrization, we show that the canonical mean is the arithmetic mean over the dual space, pulled back to the primal space. Applying this result to three common mirror maps used in practice, we show that the canonical means for reverse symmetrization, in those cases, turn out to be the arithmetic, log-Euclidean and harmonic means. Our results improve understanding of existing symmetrization practices in the literature, and can be seen as a navigational chart to help decide which mean to use when.
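A small sketch of one instance of the divergences discussed above, assuming the mirror map phi(X) = -log det X (Burg entropy); all names are illustrative. For this map the dual coordinates are -X^{-1}, so the dual-space arithmetic mean pulled back to the primal is the harmonic mean, consistent with the abstract's list of canonical means.

```python
import numpy as np

def burg_div(X, Y):
    """Bregman divergence of phi(X) = -log det X on the PD cone:
    D(X, Y) = tr(Y^{-1} X) - log det(Y^{-1} X) - n."""
    n = X.shape[0]
    M = np.linalg.solve(Y, X)           # Y^{-1} X
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n

def sym_burg(X, Y):
    """Jeffreys-style symmetrization D(X, Y) + D(Y, X)."""
    return burg_div(X, Y) + burg_div(Y, X)

def harmonic_mean(mats):
    """Arithmetic mean in the dual coordinates -X^{-1}, pulled back to the
    primal space: the harmonic mean of PD matrices."""
    return np.linalg.inv(sum(np.linalg.inv(X) for X in mats) / len(mats))
```

The divergence vanishes only on the diagonal and the symmetrization is, by construction, invariant to swapping its arguments.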
☆ H Infinity Minimal Destabilizing Feedback for Vulnerability Analysis and Attack Design of Nonlinear Systems
The robust stability problem involves designing a controlled system which remains stable in the presence of modeling uncertainty. In this context, results known as small gain theorems are used to quantify the maximum amount of uncertainty for which stability is guaranteed. These notions inform the design of numerous control systems, including critical infrastructure components such as power grids, gas pipelines, and water systems. However, these same concepts can be used by an adversary to design a malicious feedback attack, of minimal size, to drive the closed-loop system to instability. In this paper, we first present a detailed review of the results in robust control which allow for the construction of minimal destabilizers. These minimally sized attacks merely push the system to the stability boundary and, as we demonstrate, do not necessarily destabilize nonlinear systems even when the linearization is destabilized. Our main result leverages linear perturbation theory to explicitly prove, in the state space context, that internal destabilization is guaranteed for a broad class of nonlinear systems when the gain of these attacks is slightly increased.
comment: Submitted to LCSS-CDC 2026
☆ Constrained Optimization on Matrix Lie Groups via Interior-Point Method
This paper proposes an interior-point framework for constrained optimization problems whose decision variables evolve on matrix Lie groups. The proposed method, termed the Matrix Lie Group Interior-Point Method (MLG-IPM), operates directly on the group structure using a minimal Lie algebra parametrization, avoiding redundant matrix representations and eliminating explicit dependence on Riemannian metrics. A primal-dual formulation is developed in which the Newton system is constructed through sensitivity and curvature matrices. Also, multiplicative updates are performed via the exponential map, ensuring intrinsic feasibility with respect to the group structure while maintaining strict positivity of slack and dual variables through a barrier strategy. A local analysis establishes quadratic convergence under standard regularity assumptions and characterizes the behavior under inexact Newton steps. Statistical comparisons against Riemannian Interior-Point Methods, specifically for optimization problems defined over the Special Orthogonal Group SO(n) and Special Linear Group SL(n), demonstrate that the proposed approach achieves higher success rates, fewer iterations, and superior numerical accuracy. Furthermore, its robustness under perturbations suggests that this method serves as a consistent and reliable alternative for structured manifold optimization.
comment: This is a preprint submitted to IEEE Control Systems Letters
☆ On the Convergence of Proximal Algorithms for Weakly-convex Min-max Optimization
We study alternating first-order algorithms with no inner loops for solving nonconvex-strongly-concave min-max problems. We show the convergence of the alternating gradient descent--ascent algorithm method by proposing a substantially simplified proof compared to previous ones. It allows us to enlarge the set of admissible step-sizes. Building on this general reformulation, we also prove the convergence of a doubly proximal algorithm in the weakly convex-strongly concave setting. Finally, we show how this new result opens the way to new applications of min-max optimization algorithms for solving regularized imaging inverse problems with neural networks in a plug-and-play manner.
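A minimal sketch of the alternating gradient descent-ascent scheme with no inner loops, run on a toy quadratic saddle problem (this is the basic alternating variant, not the paper's doubly proximal algorithm; names are illustrative).

```python
import numpy as np

def alt_gda(grad_x, grad_y, x, y, tau, sigma, iters):
    """Alternating GDA: descend in x, then ascend in y using the freshly
    updated x, with no inner loops."""
    for _ in range(iters):
        x = x - tau * grad_x(x, y)
        y = y + sigma * grad_y(x, y)
    return x, y

# Toy objective f(x, y) = 0.5*x**2 + x*y - 0.5*y**2,
# strongly concave in y, with saddle point at (0, 0).
gx = lambda x, y: x + y      # df/dx
gy = lambda x, y: x - y      # df/dy
```

For small step sizes the iterates spiral into the saddle point, which is the behavior the convergence analysis above formalizes for the general weakly convex-strongly concave setting.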
☆ Tracking controllability on moving targets for parabolic equations
In this paper, we study the tracking controllability of a 1D parabolic type equation. Notably, with controls acting on the boundary, we seek to approximately control the solution of the equation at specific points of the domain. We prove that acting on one boundary point, we control the solution at one target point, whereas acting on two boundary points, we can control the solution at up to two target points. To do so, when the target is fixed, we study controllability by solving the corresponding minimization problem using duality results. Afterwards, we study controllability at moving points by applying a transformation that takes the problem to a fixed target. Lastly, we also solve some of these control problems numerically and compute approximations of the solutions and the desired targets, which validates our theoretical methodology.
comment: 28 pages, 14 figures
☆ Yau's Affine Normal Descent: Algorithmic Framework and Convergence Analysis
We propose Yau's Affine Normal Descent (YAND), a geometric framework for smooth unconstrained optimization in which search directions are defined by the equi-affine normal of level-set hypersurfaces. The resulting directions are invariant under volume-preserving affine transformations and intrinsically adapt to anisotropic curvature. Using the analytic representation of the affine normal from affine differential geometry, we establish its equivalence with the classical slice-centroid construction under convexity. For strictly convex quadratic objectives, affine-normal directions are collinear with Newton directions, implying one-step convergence under exact line search. For general smooth (possibly nonconvex) objectives, we characterize precisely when affine-normal directions yield strict descent and develop a line-search-based YAND. We establish global convergence under standard smoothness assumptions, linear convergence under strong convexity and Polyak-Lojasiewicz conditions, and quadratic local convergence near nondegenerate minimizers. We further show that affine-normal directions are robust under affine scalings, remaining insensitive to arbitrarily ill-conditioned transformations. Numerical experiments illustrate the geometric behavior of the method and its robustness under strong anisotropic scaling.
comment: 55 pages, 25 figures
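The quadratic-case result above (affine-normal directions are collinear with Newton directions, hence one-step convergence under exact line search) can be checked numerically. The sketch below uses the Newton direction as a stand-in for the affine-normal direction in the quadratic case; names are illustrative.

```python
import numpy as np

def newton_dir(A, b, x):
    """For f(x) = 0.5 x^T A x - b^T x with SPD A, the Newton direction
    -A^{-1} grad f(x); per the collinearity result, the affine-normal
    direction of the level set through x points the same way."""
    return -np.linalg.solve(A, A @ x - b)

def exact_linesearch_step(A, b, x):
    """One step along the direction with exact line search; for quadratics
    the optimal step is t = 1 and the minimizer is reached immediately."""
    d = newton_dir(A, b, x)
    g = A @ x - b
    t = -(g @ d) / (d @ (A @ d))    # exact minimizer of f(x + t*d)
    return x + t * d
```

A single step from any starting point lands on the minimizer A^{-1} b, illustrating the one-step convergence claim.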
☆ Nonlinear Trajectory Optimization Models for Energy-Sharing UAV-UGV Systems with Multiple Task Locations
Energy-sharing UAV-UGV systems extend the endurance of Uncrewed Aerial Vehicles (UAVs) by leveraging Uncrewed Ground Vehicles (UGVs) as mobile charging stations, enabling persistent autonomy in infrastructure-sparse environments. Trajectory optimization for these systems is often challenging due to UGVs' terrain access constraints and the discrete nature of task scheduling. We propose a smooth nonlinear programming model for the joint trajectory optimization of these systems. Unlike existing models, the proposed model allows smooth parameterization of UGVs' terrain access constraints and supports partial UAV recharging. Further, it introduces a smooth approximation of disjunctive constraints that eliminates the need for computationally expensive integer programming and enables efficient solutions via nonlinear programming algorithms. We demonstrate the proposed model on a one-UAV-one-UGV system with multiple task locations. Compared with mixed-integer nonlinear programs, this model reduces the computation time by orders of magnitude.
☆ Optimal control with the shifted proper orthogonal decomposition via a first-reduce-then-optimize framework
Solving optimal control problems for transport-dominated partial differential equations (PDEs) can become computationally expensive, especially when dealing with high-dimensional systems. To overcome this challenge, we focus on developing and deriving reduced-order models that can replace the full PDE system in solving the optimal control problem. Specifically, we explore the use of the shifted proper orthogonal decomposition (POD) as a reduced-order model, which is particularly effective for capturing low-dimensional representations of high-fidelity transport-dominated phenomena. In this work, a reduced-order model is constructed first, followed by the optimization of the reduced system. We consider a 1D linear advection equation problem and prove existence and uniqueness of solutions for the reduced-order model as well as the existence of an optimal control. Moreover, we compare the computational performance of the shifted POD method against the standard POD.
☆ Bundle EXTRA for Decentralized Optimization
Decentralized primal-dual methods are widely used for solving decentralized optimization problems, but their updates often rely on the potentially crude first-order Taylor approximations of the objective functions, which can limit convergence speed. To overcome this, we replace the first-order Taylor approximation in the primal update of EXTRA, which can be interpreted as a primal-dual method, with a more accurate multi-cut bundle model, resulting in a fully decentralized bundle EXTRA method. The bundle model incorporates historical information to improve the approximation accuracy, potentially leading to faster convergence. Under mild assumptions, we show that a KKT residual converges to zero. Numerical experiments on decentralized least-squares problems demonstrate that, compared to EXTRA, the bundle EXTRA method converges faster and is more robust to step-size choices.
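For context, the sketch below implements the vanilla EXTRA iteration that the bundle variant builds on; the paper's method would replace the first-order model implicit in the gradient step of the primal update with a multi-cut bundle model. Names are illustrative, and this is not the paper's code.

```python
import numpy as np

def extra(W, grads, x0, alpha, iters):
    """Vanilla EXTRA for min sum_i f_i(x) over a network.
    W: doubly stochastic mixing matrix (n x n); grads[i]: local gradient of
    f_i; x0: (n, d) array of local iterates, one row per agent."""
    n, d = x0.shape
    I = np.eye(n)
    Wt = 0.5 * (I + W)                       # the standard choice W~ = (I+W)/2
    g = lambda X: np.stack([grads[i](X[i]) for i in range(n)])
    g_prev = g(x0)
    x_prev, x = x0, W @ x0 - alpha * g_prev  # first EXTRA step
    for _ in range(iters):
        g_cur = g(x)
        x_next = (I + W) @ x - Wt @ x_prev - alpha * (g_cur - g_prev)
        x_prev, x, g_prev = x, x_next, g_cur
    return x
```

On a toy consensus least-squares problem with f_i(x) = 0.5||x - b_i||^2, every agent's iterate converges to the average of the b_i, the exact global minimizer.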
☆ Efficient modeling of chemotherapy regimens using mixed-integer linear programming
In this article, we focus on determining a minimum-cost treatment program aimed at maintaining the size of a cancerous tumor at a level that allows the patient to live comfortably. At each predetermined point in a treatment horizon, the patient either receives drug treatment or does not. In the first case, the tumor shrinks and its size is multiplied by a constant factor lower than 1; in the second, it grows following an exponential or Gompertz growth law. We first demonstrate that a simple heuristic solution provides an optimal treatment program. We then show that the Gompertz function can be described, like the exponential function, by a simple recurrence relation that does not explicitly depend on time. Thanks to the characteristics of the logarithmic function, this property allows us to formulate the problem as a mixed-integer linear program. This result makes it possible to solve the problem very efficiently using one of the many solvers available to handle this type of program, and above all to consider and solve several extensions to the problem. In particular, we show how to determine which of the optimal equivalent solutions to the initial problem are the most relevant. We also show how to measure the effect of a marginal increase in the treatment budget on patient quality of life. Numerous computational experiments are presented to illustrate these issues.
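The time-independent recurrence underlying the MILP formulation can be sketched directly: in log-size y = log N, the Gompertz law dN/dt = a N log(K/N) becomes linear, and a treatment step adds the constant log(rho) < 0. The discretization and names below are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def gompertz_log_step(y, a, logK, dt):
    """One growth period in y = log(size): dy/dt = a*(logK - y) integrates to
    y' = logK + (y - logK)*exp(-a*dt), a recurrence linear in y."""
    return logK + (y - logK) * np.exp(-a * dt)

def simulate(y0, schedule, a, logK, dt, log_rho):
    """schedule[t] = 1: drug given at period t (size multiplied by rho < 1,
    i.e. log-size shifted by log_rho); schedule[t] = 0: Gompertz growth."""
    y = y0
    traj = [y]
    for u in schedule:
        y = y + log_rho if u else gompertz_log_step(y, a, logK, dt)
        traj.append(y)
    return traj
```

Because both branches are affine in y, a binary decision per period plus these updates is exactly the kind of recursion a mixed-integer linear program can encode.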
☆ Critical phase transitions in minimum-energy configurations for the exponential kernel family $e^{-|x-y|^q}$ on the unit interval
We study the optimal placement of $k$ ordered points on the unit interval for the bounded pair potential \[ K_q(d)=e^{-d^q}, \qquad q>0. \] The family interpolates between strongly cusp-like kernels for $0<q<1$ and smooth Gaussian-type kernels for $q>1$. Our emphasis is on the transition from collision-free minimizers to endpoint-collapsed minimizers. We reformulate the problem in gap variables, record convexity, symmetry, and the Karush-Kuhn-Tucker conditions, and give a short proof that collisions are impossible for $0<q\le 1$. For $q>1$ we identify critical exponents $q_k$ beyond which interior points are no longer optimal. For odd $k$ we derive the exact universal value \[ q_{2m+1} = \frac{\log(1/(-\log((1+e^{-1})/2)))}{\log 2} \approx 1.396363475, \] and for even $k=4,6,\dots,20$ we compute the numerical transition values \[ \begin{aligned} &q_4\approx 1.062682507,\quad q_6\approx 1.155601329,\quad q_8\approx 1.206132611,\quad q_{10}\approx 1.238523533,\\ &q_{12}\approx 1.261308114,\quad q_{14}\approx 1.278305167,\quad q_{16}\approx 1.291510874,\quad q_{18}\approx 1.302082885,\\ &q_{20}\approx 1.310744185. \end{aligned} \] We also include comparison tables and diagrams for the kernels $e^{-\sqrt d}$, $e^{-d}$, and $e^{-d^2}$, briefly relate the bounded family to the singular Riesz kernel $d^{-s}$, and identify the $q\to 0^+$ limit with the Fekete/Chebyshev--Lobatto configuration on $[0,1]$.
comment: 13 pages, 4 figures
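For $k=3$ the stated universal exponent can be verified directly: summing $K_q$ over the three point pairs, the symmetric configuration $\{0,\tfrac12,1\}$ and the collapsed configuration $\{0,0,1\}$ have equal energy exactly at $q_3$. A quick stdlib check:

```python
import math

def energy_mid(q):           # pair energy of {0, 1/2, 1}
    return 2 * math.exp(-0.5 ** q) + math.exp(-1.0)

def energy_collapsed():      # pair energy of {0, 0, 1}: K(0) + 2*K(1)
    return 1.0 + 2 * math.exp(-1.0)

# Universal odd-k critical exponent from the closed form in the abstract
q_odd = math.log(1.0 / (-math.log((1 + math.exp(-1)) / 2))) / math.log(2)
print(round(q_odd, 9))

assert abs(energy_mid(q_odd) - energy_collapsed()) < 1e-12
assert energy_mid(q_odd - 0.1) < energy_collapsed()   # interior point wins below q_3
assert energy_mid(q_odd + 0.1) > energy_collapsed()   # collapse wins above q_3
```

Since $q\mapsto e^{-2^{-q}}$ is increasing, the midpoint energy crosses the collapsed energy exactly once, at the stated value $\approx 1.396363475$.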
☆ Structure and symmetry of the Gross-Pitaevskii ground-state manifold
The structure and degeneracy of ground states of the Gross-Pitaevskii energy functional play a central role in both analysis and computation, yet a characterization of the ground-state manifold in the presence of symmetries remains a fundamental challenge. In this paper, we establish sharp results describing the geometric structure of local minimizers and its implications for optimization algorithms. We show that when local minimizers are non-unique, the Morse-Bott condition provides a natural and sufficient criterion under which the ground-state set partitions into finitely many embedded submanifolds, each coinciding with an orbit generated by the intrinsic symmetries of the energy functional, namely phase shifts and spatial rotations. This yields a structural characterization of the ground-state manifold purely in terms of these natural symmetries. Building on this geometric insight, we analyze the local convergence behavior of the preconditioned Riemannian gradient method (P-RG). Under the Morse-Bott condition, we derive the optimal local $Q$-linear convergence rate and prove that the condition holds if and only if the energy sequence generated by P-RG converges locally $Q$-linearly. In particular, on the ground-state set, the Morse-Bott condition is satisfied if and only if the minimizers decompose into finitely many symmetry orbits and the P-RG exhibits local linear convergence in a neighborhood of this set. When the condition fails, we establish a local sublinear convergence rate. Taken together, these results provide a precise picture: for the Gross-Pitaevskii minimization problem, the Morse-Bott condition acts as the exact threshold separating linear from sublinear convergence, while simultaneously determining the symmetry-induced structure of the ground-state manifold. Our analysis thus connects geometric structure, symmetry, and algorithmic performance in a unified framework.
☆ Topology Optimization of Cooling Channels Using Dual-Type Moving Morphable Components
Efficient thermal management in high-power electronic devices requires cooling channel designs that provide high heat removal while satisfying strict spatial and manufacturing constraints. This study presents a two-stage hierarchical topology optimization framework for cooling channels based on the Moving Morphable Components (MMC) method. The optimization is performed sequentially: in the first stage, only wall components are optimized to establish the global flow network and insignificant components are removed; in the second stage, the global structure is fixed and fin components are optimized to improve local thermal performance. The method is coupled with a two-layer thermofluid model using the Brinkman approximation and solved with the adjoint sensitivity approach. Across multiple inlet pressure conditions, the proposed framework consistently generates designs with clear functional separation. The results demonstrate that exploring such clearly separated structures through a two-stage optimization strategy leads to a further reduction in the objective function. Compared with simultaneous MMC optimization and conventional density-based topology optimization, the proposed method produces geometries that are more interpretable, controllable, and suitable for manufacturing.
comment: 23 pages, 16 figures. Submitted as a LaTeX source file with figures
☆ Quantum-inspired Tensor Network for QUBO, QUDO and Tensor QUDO Problems with k-neighbors
This work presents a novel tensor network algorithm for solving Quadratic Unconstrained Binary Optimization (QUBO) problems, Quadratic Unconstrained Discrete Optimization (QUDO) problems, and Tensor Quadratic Unconstrained Discrete Optimization (T-QUDO) problems. The proposed algorithm is based on the MeLoCoToN methodology, which solves combinatorial optimization problems by employing superposition, imaginary time evolution, and projective measurements. Additionally, two different approaches are presented to solve QUBO and QUDO problems with k-neighbor interactions in a linear chain, one based on order-4 tensor contraction and the other based on matrix-vector multiplication, including sparse computation and a new technique called "Waterfall". Furthermore, both implementations are compared with a quadratic optimization solver, showing advantages in several problem instances.
comment: 14 pages, 16 figures
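The paper's contraction-based solvers are not reproduced here, but for the special case of a nearest-neighbor (k = 1) chain QUBO, contracting the chain in the zero-temperature limit reduces to a transfer-matrix / dynamic-programming sweep, which can be checked against brute force. This is a hedged sketch of that special case only, with random problem data:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 8
h = rng.standard_normal(n)          # linear terms
J = rng.standard_normal(n - 1)      # nearest-neighbor couplings

def cost(x):
    return float(h @ x + sum(J[i] * x[i] * x[i + 1] for i in range(n - 1)))

# Zero-temperature transfer-matrix sweep along the chain:
# best[v] = minimal cost of a prefix whose last variable equals v
best = [h[0] * v for v in (0, 1)]
for i in range(1, n):
    best = [min(best[u] + h[i] * v + J[i - 1] * u * v for u in (0, 1))
            for v in (0, 1)]
dp_min = min(best)

brute_min = min(cost(x) for x in itertools.product((0, 1), repeat=n))
assert abs(dp_min - brute_min) < 1e-9
print(dp_min)
```

For longer-range k-neighbor interactions the same idea carries a window of k previous variables in the state, which is where higher-order tensor contractions enter.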
☆ Input-to-state stabilization of linear systems under data-rate constraints
We study feedback stabilization of continuous-time linear systems under finite data-rate constraints in the presence of unknown disturbances. A communication and control strategy based on sampled and quantized state measurements is proposed, where the quantization range is dynamically adjusted using reachable-set propagation and disturbance estimates derived from quantization parameters. The strategy alternates between stabilizing and searching stages to handle escapes from the quantization range and employs an additional quantization symbol to ensure robustness near the equilibrium. It guarantees input-to-state stability (ISS), improving upon existing results that yield only practical ISS or lack explicit data-rate conditions. Simulation results illustrate the effectiveness of the strategy.
☆ A Framework for Exploring Social Interactions in Multiagent Decision-Making for Two-Queue Systems
We introduce a new framework for multiagent decision-making in queueing systems that leverages the agility and robustness of nonlinear opinion dynamics to break indecision during queue selection and to capture the influence of social interactions on collective behavior. Queueing models are central to understanding multiagent behavior in service settings. Many prior models assume that each agent's decision-making process is optimization-based and governed by rational responses to changes in the queueing system. Instead, we introduce an internal opinion state, driven by nonlinear opinion dynamics, that represents the evolving strength of the agent's preference between two available queues. The opinion state is influenced by social interactions, which can modify purely rational responses. We propose a new subclass of queueing models in which each agent's behavioral decisions (e.g., joining or switching queues) are determined by this evolving opinion state. We prove a sufficient parameter condition that guarantees the Markov chain describing the evolving opinion and queueing system states reaches the Nash equilibrium of an underlying congestion game in finite expected time. We then explore the richness of the new framework through numerical simulations that illustrate the role of social interactions and an individual's access to system information in shaping collective behavior.
comment: Submitted to the 2026 IEEE Conference on Decision and Control
☆ Tightening CVaR Approximations via Scenario-Wise Scaling for Chance-Constrained Programming
Chance-constrained programs (CCPs) provide a powerful modeling framework for decision-making under uncertainty, but their nonconvex feasible regions make them computationally challenging. A widely used convex inner approximation replaces chance constraints with Conditional Value-at-Risk (CVaR) constraints; however, the resulting solutions can be overly conservative and suboptimal. We propose a scenario-wise scaling approach that strengthens CVaR approximations for CCPs with finitely supported uncertainty. The method introduces scaling factors that reweight individual scenarios within the CVaR constraint, yielding a family of potentially tighter inner approximations. We establish sufficient conditions under which, for a suitable choice of scaling factors, the scaled CVaR approximation attains the same optimal value as the original CCP and admits a (near-)optimal solution of the CCP. We show that these conditions are tight and further relax them in the convex setting. We also show that optimizing over scenario-wise scaling factors is NP-hard. To address this computational challenge, we develop efficient heuristic and sequential convex approximation algorithms that iteratively update the scaling factors and generate improved feasible solutions. Numerical experiments demonstrate that the proposed methods consistently improve upon standard CVaR and state-of-the-art convex approximations, often reducing conservativeness while maintaining tractability.
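For context, the standard CVaR inner approximation that the paper tightens can be sketched with the Rockafellar-Uryasev formula: for an empirical distribution, $\mathrm{CVaR}_\varepsilon$ is the minimum over thresholds $t$ of $t + \mathbb{E}[(L-t)_+]/\varepsilon$, attained at a sample atom, and it always dominates the VaR — which is what makes $\mathrm{CVaR}\le 0$ a conservative surrogate for the chance constraint. (The scenario-wise scaling itself is not implemented in this sketch.)

```python
import numpy as np

rng = np.random.default_rng(2)
losses = rng.standard_normal(1000)   # sampled scenario losses
eps = 0.1

# Rockafellar-Uryasev: CVaR_eps(L) = min_t  t + E[(L - t)_+] / eps;
# for a discrete distribution the minimizer is one of the atoms.
def cvar(samples, eps):
    ts = np.sort(samples)
    vals = ts + np.maximum(samples[None, :] - ts[:, None], 0).mean(axis=1) / eps
    return vals.min()

var = np.quantile(losses, 1 - eps)   # empirical Value-at-Risk
c = cvar(losses, eps)
assert c >= var                      # CVaR dominates VaR
print(var, c)
```

Scenario-wise scaling reweights individual scenarios inside this constraint, shrinking the gap between the CVaR surrogate and the true chance constraint.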
♻ ☆ Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference
This letter investigates the optimal allocation of large language model (LLM) inference workloads across heterogeneous edge data centers (DCs) over time. Each DC features on-site renewable generation and faces dynamic electricity prices and spatiotemporal variability in renewable availability. The central question is: how can inference workloads be optimally distributed to the DCs to minimize energy consumption, carbon emissions, and water usage while enhancing user experience? This letter proposes a novel optimization model for LLM service providers to reduce operational costs and environmental impacts. Numerical results validate the efficacy of the proposed approach.
comment: 5 pages, 11 figures
♻ ☆ Tightening convex relaxations of trained neural networks: a unified approach for convex and S-shaped activations
The non-convex nature of trained neural networks has created significant obstacles in their incorporation into optimization models. In this context, Anderson et al. (2020) provided a framework to obtain the convex hull of the graph of a piecewise linear convex activation function composed with an affine function; this effectively convexifies activations such as the ReLU together with the affine transformation that precedes it. In this article, we contribute to this line of work by developing a recursive formula that yields a tight convexification for the composition of an activation with an affine function for a wide scope of activation functions, namely, convex or ``S-shaped". Our approach can be used to efficiently compute separating hyperplanes or determine that none exists in various settings, including non-polyhedral cases.
♻ ☆ Equitable Congestion Pricing under the Markovian Traffic Model: An Application to Bogota
Congestion pricing is used to raise revenues and reduce traffic and pollution. However, people have heterogeneous spatial demand patterns and willingness (or ability) to pay tolls, and so pricing may have substantial equity implications. We develop a data-driven approach to design congestion pricing given policymakers' equity and efficiency objectives. First, algorithmically, we extend the Markovian traffic equilibrium setting introduced by Baillon & Cominetti (2008) to model heterogeneous populations and incorporate prices and outside options such as public transit. In this setting, we show that a unique equilibrium exists. Second, via a detailed case study, we empirically evaluate various pricing schemes using data collected by an industry partner in the city of Bogota, one of the most congested cities in the world. We find that pricing personalized to each economic stratum can be substantially more efficient and equitable than uniform pricing; however, non-personalized but area-based pricing can recover much of the gap.
♻ ☆ Single-Asset Adaptive Leveraged Volatility Control
This paper introduces a methodology for constructing a market index composed of a liquid risky asset and a liquid risk-free asset that achieves a fixed target volatility. Existing volatility-targeting strategies typically scale portfolio exposure inversely with a variance forecast, but such open-loop approaches suffer from high turnover, leverage spikes, and sensitivity to estimation error -- issues that limit practical adoption in index construction. We propose a proportional-control approach for setting the index weights that explicitly corrects tracking error through feedback. The method requires only a few interpretable parameters, making it transparent and practical for index construction. We demonstrate in simulation that this approach is more effective at consistently achieving the target volatility than the open-loop alternative.
comment: 16 pages, 7 figures
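A minimal simulation of the proportional (closed-loop) idea: exposure is nudged toward the target whenever a realized-volatility tracker deviates from it. The gain, EWMA weight, leverage caps, and market parameters below are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

rng = np.random.default_rng(3)
target, k_gain = 0.10, 20.0          # annualized vol target, feedback gain (assumed)
n_days = 2000
sigma_true = 0.20 / np.sqrt(252)     # daily vol of the risky asset
returns = rng.standard_normal(n_days) * sigma_true

lev = 1.0                            # start deliberately above the steady state
ewma_var = sigma_true ** 2
port = []
for r in returns:
    port.append(lev * r)
    ewma_var = 0.97 * ewma_var + 0.03 * (lev * r) ** 2   # realized-vol tracker
    realized = np.sqrt(252 * ewma_var)
    lev += k_gain * (target - realized) / 252            # proportional correction
    lev = min(max(lev, 0.0), 3.0)                        # leverage limits

ann_vol = np.sqrt(252) * np.std(port)
print(ann_vol)   # hovers near the 10% target
```

The feedback term explicitly corrects tracking error, in contrast to an open-loop rule that sets exposure as target over forecast each day.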
♻ ☆ Optimal threshold resetting in collective diffusive search
Stochastic resetting has attracted significant attention in recent years due to its wide-ranging applications across physics, biology, and search processes. In most existing studies, however, resetting events are governed by an external timer and remain decoupled from the system's intrinsic dynamics. In a recent Letter by Biswas et al., we introduced threshold resetting (TR) as an alternative, event-driven optimization strategy for target search problems. Under TR, the entire process is reset whenever any searcher reaches a prescribed threshold, thereby coupling the resetting mechanism directly to the internal dynamics. In this work, we study TR-enabled search by $N$ non-interacting diffusive searchers in a one-dimensional box $[0,L]$, with the target at the origin and the threshold at $L$. By optimally tuning the scaled threshold distance $u = x_0/L$, the mean first-passage time can be significantly reduced for $N \geq 2$. We identify a critical population size $N_c(u)$ below which TR outperforms reset-free dynamics. Furthermore, for fixed $u$, the mean first-passage time depends non-monotonically on $N$, attaining a minimum at $N_{\mathrm{opt}}(u)$. We also quantify the achievable speed-up and analyze the operational cost of TR, revealing a nontrivial optimization landscape. These findings highlight threshold resetting as an efficient and realistic optimization mechanism for complex stochastic search processes.
comment: 19 pages, 6 figures
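A small Monte Carlo sketch of the setup: $N$ diffusers in $[0,L]$ with the target at $0$; under TR the whole team resets to $x_0$ whenever any searcher reaches $L$, while the reset-free baseline reflects at $L$. The single-searcher reset-free run is checked against the exact reflecting-box MFPT $(2Lx_0-x_0^2)/2D$. Parameters are illustrative, and whether TR wins depends on $u$ and $N$, as the paper analyzes:

```python
import numpy as np

rng = np.random.default_rng(4)
L, D, dt, trials = 1.0, 0.5, 1e-3, 300
x0 = 0.3                              # scaled threshold distance u = x0/L (assumed)
std = np.sqrt(2 * D * dt)

def mfpt(n_searchers, threshold_reset):
    times = np.empty(trials)
    for t in range(trials):
        x = np.full(n_searchers, x0)
        steps = 0
        while True:
            x += std * rng.standard_normal(n_searchers)
            steps += 1
            if np.any(x <= 0.0):              # target found
                break
            if threshold_reset and np.any(x >= L):
                x[:] = x0                      # reset the whole team
            else:
                x = np.minimum(x, 2 * L - x)   # reflecting wall at L
            if steps > 200000:                 # safety cap (rarely reached)
                break
        times[t] = steps * dt
    return times.mean()

t1 = mfpt(1, False)
exact = (2 * L * x0 - x0 ** 2) / (2 * D)   # reflecting-box MFPT, one searcher
print(t1, exact)                            # Monte Carlo vs closed form
print(mfpt(2, True), mfpt(2, False))        # TR vs reset-free for N = 2
```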
♻ ☆ From oblique-wave forcing to streak reinforcement: A perturbation-based frequency-response framework
We develop a perturbation-based frequency-response framework for analyzing amplification mechanisms that are central to subcritical routes to transition in wall-bounded shear flows. By systematically expanding the input-output dynamics of fluctuations about the laminar base flow with respect to forcing amplitude, we establish a rigorous correspondence between linear resolvent analysis and higher-order nonlinear interactions. At second order, quadratic interactions of unsteady oblique waves generate steady streamwise streaks via the lift-up mechanism. We demonstrate that the spatial structure of these streaks is captured by the second output singular function of the streamwise-constant resolvent operator. At higher orders, nonlinear coupling between oblique waves and induced streaks acts as structured forcing of the laminar linearized dynamics, yielding additional streak components whose relative phase governs reinforcement or attenuation of the leading-order streak response. Our analysis identifies a critical forcing amplitude marking the breakdown of the weakly nonlinear regime, beyond which direct numerical simulations exhibit sustained unsteadiness. We show that this breakdown coincides with the onset of secondary instability, revealing that the nonlinear interactions responsible for streak formation also drive the modal growth central to classical transition theory. The resulting framework provides a mechanistically transparent and computationally efficient description of transition that unifies non-modal amplification, streak formation, and modal instability within a single formulation derived directly from the Navier-Stokes equations.
comment: 39 pages, 29 figures
♻ ☆ On the Controllability of a Fully Nonlocal Coupled Stochastic Reaction--Convection--Diffusion System
In this paper, we study the null and approximate controllability of a class of fully nonlocal coupled stochastic reaction--convection--diffusion systems. The system consists of two forward stochastic parabolic equations driven by general second-order differential operators and incorporates four nonlocal zero-order integral terms. The nonlocality arises from integral kernel terms present in both equations, defined over a bounded domain $G \subset \mathbb{R}^N$ ($N \geq 1$). Since the coefficients depend on time, space, and random variables, we introduce three controls: a spatially localized control acting on the drift term of the first equation, and two additional controls acting on the diffusion terms of both equations. These additional controls are necessary to overcome difficulties due to the stochastic nature of the associated adjoint backward system. Using a standard duality argument, the controllability problem for the forward system is reduced to an observability problem for the corresponding adjoint nonlocal backward system. To establish this observability, we derive a new global Carleman estimate for the adjoint system, in which the drift terms belong to a negative Sobolev space and the equations include nonlocal integral terms. Our results are obtained under suitable cascade structure conditions on the coupling zero-order, nonlocal, and first-order terms of the system.
♻ ☆ A new fuzzy fractional differential variational inequality with integral boundary conditions
This paper considers a new fuzzy fractional differential variational inequality with integral boundary conditions comprising a fuzzy fractional differential inclusion with integral boundary conditions and a variational inequality in Euclidean spaces. Such a model captures the desired features of both fuzzy fractional differential inclusions with integral boundary conditions and fractional differential variational inequalities within the same framework. The existence of solutions for such a novel system is obtained under different conditions. Moreover, a numerical example is provided to illustrate our abstract results.
♻ ☆ A Solution Concept for Convex Vector Optimization Problems based on a User-defined Region of Interest
This work addresses arbitrary convex vector optimization problems, which constitute a general framework for multi-criteria decision-making in diverse real-world applications. Due to their complexity, such problems are typically tackled using polyhedral approximation. Existing solution concepts rely on additional assumptions, such as boundedness, polyhedrality of the ordering cone, or existence of interior points in the ordering cone, and typically focus on absolute error measures. We introduce a solution concept based on the homogenization of the upper image that employs relative error measures and avoids additional structural assumptions. Although minimality is not explicitly required, a form of approximate minimality is implicitly ensured. The concept is straightforward, requiring only a single precision parameter and, owing to relative errors, remains robust under scaling of the target functions. Homogenization also eliminates the need for the binary distinction between points far from the origin and directions which can lead to numerical difficulties. Furthermore, in practice decision-makers often identify a region where preferred solutions are expected. Our concept supports both a global overview of the upper image and a refined local perspective within such a user-defined region of interest (RoI). We present a decision-making procedure enabling iterative refinement of this region and the associated preferences.
comment: 22 pages, 13 figures
♻ ☆ An adaptive proximal safeguarded augmented Lagrangian method for nonsmooth DC problems with convex constraints
A proximal safeguarded augmented Lagrangian method for minimizing the difference of convex (DC) functions over a nonempty, closed and convex set with additional linear equality as well as convex inequality constraints is presented. Thereby, all functions involved may be nonsmooth. Iterates (of the primal variable) are obtained by solving convex optimization problems as the concave part of the objective function gets approximated by an affine linearization. Under the assumption of a modified Slater constraint qualification, both convergence of the primal and dual variables to a generalized Karush-Kuhn-Tucker (KKT) point is proven, at least on a subsequence. Numerical experiments and comparison with existing solution methods are presented using some classes of constrained and nonsmooth DC problems.
comment: 38 pages, 4 figures; final version: proof of Lemma 3.3 has been restructured
♻ ☆ Atomic Gradient Flows: Gradient Flows on Sparse Representations
One of the most popular approaches for solving total variation-regularized optimization problems in the space of measures are Particle Gradient Flows (PGFs). These restrict the problem to linear combinations of Dirac deltas and then perform a Euclidean gradient flow in the weights and positions, significantly reducing the computational cost while still decreasing the energy. In this work, we generalize PGFs to convex optimization problems in arbitrary Banach spaces, which we call Atomic Gradient Flows (AGFs). To this end, the crucial ingredient turns out to be the right notion of particles, chosen here as the extremal points of the unit ball of the regularizer. This choice is motivated by the Krein-Milman theorem, which ensures that minimizers can be approximated by linear combinations of extremal points. We investigate metric gradient flows of the optimization problem when restricted to such sparse representations, for which we define a suitable discretized functional that we show to be consistent with the original problem by means of $Γ$-convergence. We prove that the resulting evolution of the latter is well-defined using a minimizing movement scheme, and we establish conditions ensuring $λ$-convexity and uniqueness of the flow. Then, using Choquet's theorem, we lift the problem into the Wasserstein space on weights and extremal points, and consider Wasserstein gradient flows in this lifted setting. Our main result is that the lifting of the AGF evolution is again a metric gradient flow in the Wasserstein space, verifying the consistency of the approach with respect to a Wasserstein-type dynamic. Finally, we illustrate the applicability of AGFs to several relevant infinite-dimensional problems, including optimization of functions of bounded variation and curves of measures regularized by Optimal Transport-type penalties.
comment: A section of Acknowledgements has been inserted in the new manuscript, 53 pages
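The particle-gradient-flow idea that AGFs generalize can be sketched in a few lines: restrict a measure-fitting problem to a sum of Diracs and run Euclidean gradient descent on weights and positions. The Gaussian forward operator, particle count, and step sizes below are illustrative assumptions, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0, 1, 50)
s = 0.05                               # width of the Gaussian measurement kernel

def feat(x):                           # measurement of a Dirac at position x
    return np.exp(-((grid - x) ** 2) / (2 * s ** 2))

# Ground truth: two spikes
y = 1.0 * feat(0.3) + 0.7 * feat(0.7)

# Particle gradient flow: Euclidean descent on weights and positions
w = np.full(4, 0.25)
x = np.array([0.2, 0.4, 0.6, 0.8])
lr_w, lr_x = 0.05, 0.002

def loss(w, x):
    model = sum(wi * feat(xi) for wi, xi in zip(w, x))
    return 0.5 * np.sum((model - y) ** 2)

loss0 = loss(w, x)
for _ in range(3000):
    r = sum(wi * feat(xi) for wi, xi in zip(w, x)) - y        # residual
    gw = np.array([feat(xi) @ r for xi in x])                 # d loss / d w_i
    dphi = [(grid - xi) / s ** 2 * feat(xi) for xi in x]      # d feat / d x_i
    gx = np.array([wi * (d @ r) for wi, d in zip(w, dphi)])
    w -= lr_w * gw
    x -= lr_x * gx
print(loss0, loss(w, x))   # the energy decreases along the flow
```

In the AGF framework the Diracs are replaced by the extremal points of the regularizer's unit ball, but the descent-on-sparse-representations structure is the same.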
♻ ☆ Optimality Deviation using the Koopman Operator
This paper investigates the impact of approximation error on data-driven optimal control of nonlinear systems using the Koopman operator. While the Koopman operator enables a simplified representation of nonlinear dynamics through a lifted state space, the presence of approximation error inevitably leads to deviations in the computed optimal controller and the resulting value function. We derive explicit upper bounds for these optimality deviations, which characterize the worst-case effect of approximation error. Supported by numerical examples, these theoretical findings provide a quantitative foundation for improving the robustness of data-driven optimal controller design.
♻ ☆ Stackelberg stopping games
We study a Stackelberg variant of the classical Dynkin game in discrete time, where the two players are no longer on equal footing. Player 1 (the leader) announces her stopping strategy first, and Player 2 (the follower) responds optimally. This Stackelberg stopping game can be viewed as an optimal control problem for the leader. Our primary focus is on the time-inconsistency that arises from the leader-follower game structure. We begin by using a finite-horizon example to clarify key concepts, including precommitment and equilibrium strategies in the Stackelberg setting, as well as the Nash equilibrium in the standard Dynkin game. We then turn to the infinite-horizon case and study randomized precommitment and equilibrium strategies. We provide a characterization for the leader's value induced by precommitment strategies and show that it may fail to attain the supremum. Moreover, we construct a counterexample to demonstrate that a randomized equilibrium strategy may not exist. Then we introduce an entropy-regularized Stackelberg stopping game, in which the follower's optimization is regularized with an entropy term. This modification yields a continuous best response and ensures the existence of a regular randomized equilibrium strategy, which can be viewed as an approximation of the exact equilibrium.
♻ ☆ Robust Incentive Stackelberg Mean Field Stochastic Linear-Quadratic Differential Game with Model Uncertainty
This paper investigates a robust incentive Stackelberg stochastic differential game problem for a linear-quadratic mean field system, where the model uncertainty appears in the drift term of the leader's state equation. Moreover, both the state average and control averages enter into the leader's dynamics and cost functional. Based on the zero-sum game approach, mean field approximation and duality theory, firstly the representation of the leader's limiting cost functional and the closed-loop representation of decentralized open-loop saddle points are given, via decoupling methods. Then by convex analysis and the variational method, the decentralized strategies of the followers' auxiliary limiting problems and the corresponding consistency condition system are derived. Finally, applying decoupling technique, the leader's approximate incentive strategy set is obtained, under which the asymptotical robust incentive optimality of the decentralized mean field strategy is verified. A numerical example is given to illustrate the theoretical results.
comment: 53 pages (some revisions), 9 figures
♻ ☆ Constrained zero-sum LQ differential games for jump-diffusion systems with regime switching and random coefficients
This paper investigates a cone-constrained two-player zero-sum stochastic linear-quadratic (SLQ) differential game for stochastic differential equations (SDEs) with regime switching and random coefficients driven by a jump-diffusion process. Under the uniform convexity-concavity (UCC) condition, we establish the open-loop solvability of the game and characterize the open-loop saddle point via the forward-backward stochastic differential equations (FBSDEs). However, since both controls are constrained, the classical four-step scheme fails to provide an explicit expression for the saddle point. To overcome this, by employing Meyer's Itô formula together with the method of completing the square, we derive a closed-loop representation for the open-loop saddle point based on solutions to a new kind of multidimensional indefinite extended stochastic Riccati equations with jumps (IESREJs). Furthermore, for a special case, we prove the existence of solutions to IESREJs.
♻ ☆ The Multi-AMR Buffer Storage, Retrieval, and Reshuffling Problem: Exact and Heuristic Approaches
Buffer zones are essential in production systems to decouple sequential processes. In dense floor storage environments, such as space-constrained brownfield facilities, manual operation is increasingly challenged by severe labor shortages and rising operational costs. Automating these zones requires solving the Buffer Storage, Retrieval, and Reshuffling Problem (BSRRP). While previous work has addressed scenarios where the focus is limited to reshuffling and retrieving a fixed set of items, real-world manufacturing necessitates an adaptive approach that also incorporates arriving unit loads. This paper introduces the Multi-AMR BSRRP, coordinating a robot fleet to manage concurrent reshuffling, alongside time-windowed storage and retrieval tasks, within a shared floor area. We formulate a Binary Integer Programming (IP) model to obtain exact solutions for benchmarking purposes. As the problem is NP-hard, rendering exact methods computationally intractable for industrial scales, we propose a hierarchical heuristic. This approach decomposes the problem into an A* search for task-level sequence planning of unit load placements, and a Constraint Programming (CP) approach for multi-robot coordination and scheduling. Experiments demonstrate orders-of-magnitude computation time reductions compared to the exact formulation. These results confirm the heuristic's viability as responsive control logic for high-density production environments.
comment: 52 pages, 15 figures and tables
♻ ☆ On Information Controls
In this paper we study an optimization problem in which the control is information; more precisely, the control is a $σ$-algebra or a filtration. In a dynamic setting, we establish the dynamic programming principle and the law invariance of the value function. The latter requires a condition slightly stronger than the (H)-hypothesis for the admissible filtration, and enables us to define the value function on $\mathcal P_2(\mathcal P_2(\mathbb R^d))$, the space of laws of random probability measures. By using a new Itô's formula for smooth functions on $\mathcal P_2(\mathcal P_2(\mathbb R^d))$, we characterize the value function of the information control problem by a Hamilton-Jacobi-Bellman equation on this space.
♻ ☆ An interacting particle consensus method for constrained global optimization
This paper presents a particle-based optimization method designed for addressing minimization problems with equality constraints, particularly in cases where the loss function exhibits non-differentiability or non-convexity. The proposed method combines components of the consensus-based optimization algorithm with a newly introduced forcing term directed at the constraint set. A rigorous mean-field limit of the particle system is derived, and the convergence of the mean-field limit to the constrained minimizer is established. Additionally, we introduce a stable discretized algorithm and conduct various numerical experiments to demonstrate the performance of the proposed method.
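A hedged sketch of consensus-based optimization with a forcing term toward a constraint set — here the unit circle, with objective $\|x-(2,0)\|^2$, so the constrained minimizer is $(1,0)$. The drift, noise, and forcing coefficients are illustrative choices, not the paper's scheme:

```python
import numpy as np

rng = np.random.default_rng(6)
a = np.array([2.0, 0.0])
f = lambda X: np.sum((X - a) ** 2, axis=1)   # loss; constrained min at (1, 0)

n, beta, lam, mu, sigma, dt = 200, 30.0, 1.0, 5.0, 0.3, 0.05
X = rng.uniform(-1.5, 1.5, size=(n, 2))      # initial particle ensemble

for _ in range(400):
    w = np.exp(-beta * (f(X) - f(X).min()))  # stabilized Gibbs weights
    m = (w[:, None] * X).sum(0) / w.sum()    # weighted consensus point
    proj = X / np.linalg.norm(X, axis=1, keepdims=True)   # nearest point on circle
    noise = rng.standard_normal(X.shape)
    X += (-lam * (X - m) - mu * (X - proj)) * dt \
         + sigma * np.sqrt(dt) * np.linalg.norm(X - m, axis=1, keepdims=True) * noise

x_hat = X.mean(0)
print(x_hat)   # concentrates near the constrained minimizer (1, 0)
```

The noise is scaled by the distance to the consensus point, so it self-extinguishes as the ensemble collapses, while the forcing term keeps the collapse on (or near) the constraint set.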
♻ ☆ Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits
We consider the infinite-horizon, average-reward restless bandit problem in discrete time. We propose a new class of policies that are designed to drive a progressively larger subset of arms toward the optimal distribution. We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem, assuming only a unichain and aperiodicity assumption. Our approach departs from most existing work that focuses on index or priority policies, which rely on the Global Attractor Property (GAP) to guarantee convergence to the optimum, or a recently developed simulation-based policy, which requires a Synchronization Assumption (SA).
comment: 68 pages, 17 figures
♻ ☆ Gromov-Wasserstein Barycenters: The Analysis Problem
This paper considers the problem of estimating a matrix that encodes pairwise distances in a finite metric space (or, more generally, the edge weight matrix of a network) under the barycentric coding model (BCM) with respect to the Gromov-Wasserstein (GW) distance function. We frame this task as estimating the unknown barycentric coordinates with respect to the GW distance, assuming that the target matrix (or kernel) belongs to the set of GW barycenters of a finite collection of known templates. In the language of harmonic analysis, if computing GW barycenters can be viewed as a synthesis problem, this paper aims to solve the corresponding analysis problem. We propose two methods: one utilizing fixed-point iteration for computing GW barycenters, and another employing a differentiation-based approach to the GW structure using a blow-up technique. Finally, we demonstrate the application of the proposed GW analysis approach in a series of numerical experiments and applications to machine learning.
comment: Accepted for publication in SIAM Journal on Mathematics of Data Science (SIMODS). March 2026
Optimization and Control 21
☆ Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards
We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
☆ Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version
Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
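As a loose illustration of the idea, not the paper's exact algorithm, the snippet below runs a tabular TD(0)-style update (one-hot features, a special case of linear function approximation) in which a baseline value taken under the empirical state distribution and weighted by a coefficient kappa is subtracted from each update. The chain, rewards, and step-size schedule are illustrative; with kappa = 0 the update reduces to standard TD(0), which can be checked against the exact value function.

```python
import numpy as np

def relative_td(P, r, gamma, kappa, steps=100_000, seed=0):
    """Tabular TD(0) with a kappa-weighted baseline subtracted from each update.
    The baseline is the average value under the empirical state distribution;
    kappa = 0 recovers standard TD(0). Step sizes are illustrative."""
    rng = np.random.default_rng(seed)
    n = len(r)
    theta = np.zeros(n)
    counts = np.zeros(n)                 # empirical state distribution
    s = 0
    for t in range(1, steps + 1):
        counts[s] += 1
        s_next = rng.choice(n, p=P[s])
        baseline = kappa * (counts / counts.sum()) @ theta
        delta = r[s] + gamma * theta[s_next] - theta[s] - baseline
        theta[s] += (20.0 / (t + 200)) * delta   # decaying step size
        s = s_next
    return theta

P = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 0.0])
V_exact = np.linalg.solve(np.eye(3) - 0.9 * P, r)   # true value function
V_td = relative_td(P, r, gamma=0.9, kappa=0.0)      # kappa = 0: plain TD(0)
```

Running the same routine with kappa > 0 keeps the iterates bounded, consistent with the stability claim for empirical-distribution baselines.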
comment: Extended version for submission to the 2026 IEEE CDC
☆ Surrogate-based categorical neighborhoods for mixed-variable blackbox optimization
In simulation-based engineering, design choices are often obtained following the optimization of complex blackbox models. These models frequently involve mixed-variable domains with quantitative and categorical variables. Unlike quantitative variables, categorical variables lack an inherent structure, which makes them difficult to handle, especially in the presence of constraints. This work proposes a systematic approach to structure and model categorical variables in constrained mixed-variable blackbox optimization. Surrogate models of the objective and constraint functions are used to induce problem-specific categorical distances. From these distances, surrogate-based neighborhoods are constructed using notions of dominance from bi-objective optimization, jointly accounting for information from both the objective and the constraint functions. This study addresses the lack of automatic and constraint-aware categorical neighborhood construction in mixed-variable blackbox optimization. As a proof of concept, these neighborhoods are employed within CatMADS, an extension of the MADS algorithm for categorical variables. The surrogate models are Gaussian processes, and the resulting method is called CatMADS-GP. The method is benchmarked on the Cat-Suite collection of 60 mixed-variable optimization problems and compared against state-of-the-art solvers. Data profiles indicate that CatMADS-GP achieves superior performance for both unconstrained and constrained problems.
comment: 26 pages (without the references), 10 figures, and code available at https://github.com/bbopt/surrogate_based_neighborhoods
☆ Optimal Switching in Networked Control Systems: Finite Horizon
In this work, we first prove that the separation principle holds for switched LQR problems under i.i.d. zero-mean disturbances with a symmetric distribution. We then solve the dynamic programming problem and show that the optimal switching policy is a symmetric threshold rule on the accumulated disturbance since the most recent update, while the optimal controller is a discounted linear feedback law independent of the switching policy.
☆ LP-Based Algorithms for Scheduling in a Quantum Switch
We consider scheduling in a quantum switch with stochastic entanglement generation, finite quantum memories, and decoherence. The objective is to design a scheduling algorithm with polynomial-time computational complexity that stabilizes a nontrivial fraction of the capacity region. Scheduling in such a switch corresponds to finding a matching in a graph subject to additional constraints. We propose an LP-based policy that finds a point in the matching polytope and implements it via a randomized decomposition into matchings. The main challenge is that service over an edge is feasible only when entanglement is simultaneously available at both endpoint memories, so the effective service rates depend on the steady-state availability induced by the scheduling rule. To address this, we introduce a single-node reference Markov chain and derive lower bounds on achievable service rates in terms of the steady-state nonemptiness probabilities. We then use a Lyapunov drift argument to show that, whenever the request arrival rates lie within the resulting throughput region, the proposed algorithm stabilizes the request queues. We further analyze how the achievable throughput depends on entanglement generation rates, decoherence probabilities, and buffer sizes, and show that the throughput lower bound converges exponentially fast to its infinite-buffer limit as the memory size increases. Numerical results illustrate that the guaranteed throughput fraction is substantial for parameter regimes relevant to near-term quantum networking systems.
☆ Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach
We provide a distributed online algorithm for multi-agent submodular maximization under communication delays. We are motivated by future distributed information-gathering tasks in unknown and dynamic environments, where utility functions naturally exhibit the diminishing-returns property, i.e., submodularity. Existing approaches for online submodular maximization either rely on sequential multi-hop communication, resulting in prohibitive delays and restrictive connectivity assumptions, or restrict each agent's coordination to its one-hop neighborhood only, thereby limiting coordination performance. To address these issues, we propose the Distributed Online Greedy (DOG) algorithm, which integrates tools from adversarial bandit learning with delayed feedback to enable simultaneous decision-making across arbitrary network topologies. We characterize the approximation performance of DOG against an optimal solution, capturing the suboptimality cost due to decentralization as a function of the network structure. Our analyses further reveal a trade-off between coordination performance and convergence time, determined by the magnitude of communication delays. Through this trade-off, DOG spans the spectrum between the state-of-the-art fully centralized online coordination approach [1] and the fully decentralized one-hop coordination approach [2].
comment: Accepted to ACC 2026
☆ Cut loci and diameters of the Berger lens spaces
In this paper, we study Riemannian metrics on the three-dimensional lens spaces that are deformations of the standard Riemannian metric along the fibers of the Hopf fibration. In other words, these metrics are axisymmetric; they form a one-parameter family that tends to an axisymmetric sub-Riemannian metric. We find the cut loci and the cut times using methods from geometric control theory. It turns out that the cut loci and the cut times converge to the cut locus and the cut time of the sub-Riemannian structure, which has already been studied. Moreover, we obtain lower bounds for the diameter of these Riemannian metrics. These bounds coincide with the exact values of the diameters for the lens spaces L(p;1).
comment: 14 pages, 6 figures
☆ Bridging Schrödinger and Bass: A Semimartingale Optimal Transport Problem with Diffusion Control
We study a semimartingale optimal transport problem interpolating between the Schrödinger bridge and the stretched Brownian motion associated with the Bass solution of the Skorokhod embedding problem. The cost combines an entropy term on the drift with a quadratic penalization of the diffusion coefficient, leading to a stochastic control problem over drift and volatility. We establish a complete duality theory for this problem, despite the lack of coercivity in the diffusion component. In particular, we prove strong duality and dual attainment, and derive an equivalent reduced dual formulation in terms of a variational problem over terminal potentials. Optimal solutions are characterized by a coupled Schrödinger-Bass bridge system, involving a backward heat potential and a transport map given by the gradient of a $β$-convex function. This system interpolates between the classical Schrödinger system and the Bass martingale transport. Our results furnish a unified framework encompassing entropic and martingale optimal transport, and yield a variational foundation for data-driven diffusion models.
☆ Optimal resource allocation for maintaining system solvency
We study an optimal allocation problem for a system of independent Brownian agents whose states evolve under a limited shared control. At each time, a unit of resource can be divided and allocated across components to increase their drifts, with the objective of maximizing either (i) the probability that all components avoid ruin, or (ii) the expected number of components that avoid ruin. We derive the associated Hamilton-Jacobi-Bellman equations on the positive orthant with mixed boundary conditions at the absorbing boundary and at infinity, and we identify drift thresholds separating trivial and nontrivial regimes. For the all-survive criterion, we establish existence, uniqueness, and smoothness of a bounded classical solution and a verification theorem linking the PDE to the stochastic-control value function. We then investigate the conjectured optimality of the push-the-laggard allocation rule: in the nonnegative-drift regime, we prove that it is optimal for the all-survive value function, while it is not optimal for the count-survivors criterion, by exhibiting a two-dimensional counterexample and then lifting it to all dimensions.
☆ Stability of a Korteweg--de Vries equation close to critical lengths
In this paper, we investigate the quantitative exponential stability of the Korteweg-de Vries equation on a finite interval with its length close to the critical set. Sharp decay estimates are obtained via a constructive PDE control framework. We first introduce a novel transition-stabilization approach, combining the Lebeau--Robbiano strategy with the moment method, to establish constructive null controllability for the KdV equation. This approach is then coupled with precise spectral analysis and invariant manifold theory to characterize the asymptotic behavior of the decay rate as the length of the interval approaches the set of critical lengths. Building on our classification of the critical lengths, we show that the KdV equation exhibits distinct asymptotic behaviors in neighborhoods of different types of critical lengths.
☆ Huber-based Robust System Identification with Near-Optimal Guarantees Across Independent and Adversarial Regimes
Depending on the scenario being modeled, dynamical systems can face one of two extreme types of disturbances: persistent zero-mean independent noise or sparse nonzero-mean adversarial attacks. While mean-based estimators like least-squares are well-suited for the former, a median-based approach such as the $\ell_1$-norm estimator is required for the latter. In this paper, we propose a Huber-based estimator, characterized by a threshold constant $μ$, to identify the governing matrix of a linearly parameterized nonlinear system from a single trajectory of length $T$. This formulation bridges the gap between mean- and median-based estimation, achieving provably robust error in both extreme disturbance scenarios under mild assumptions. In particular, for persistent zero-mean noise with a positive probability density around zero, the proposed estimator achieves an $\mathcal{O}(1/\sqrt{T})$ error rate if the disturbance is symmetric or the basis functions are linear. For arbitrary nonzero-mean attacks that occur at each time with probability smaller than 0.5, the error is bounded by $\mathcal{O}(μ)$. We validate our theoretical results with experiments illustrating that integrating our approach into frameworks like SINDy yields robust identification of discrete-time systems.
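A minimal sketch of the comparison the abstract draws, for a linear system (a special case of the linearly parameterized setting): least squares versus a Huber-loss fit under sparse nonzero-mean attacks. The system, attack model, threshold, and the IRLS solver are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def huber_sysid(X, mu=0.5, iters=50):
    """Fit A in x_{t+1} = A x_t + d_t by minimizing the Huber loss of the
    residuals via iteratively reweighted least squares (IRLS). Residuals
    larger than the threshold mu receive weight mu/|residual|."""
    Phi, Y = X[:-1], X[1:]
    A = np.linalg.lstsq(Phi, Y, rcond=None)[0].T        # least-squares init
    for _ in range(iters):
        R = Y - Phi @ A.T
        W = np.minimum(1.0, mu / np.maximum(np.abs(R), 1e-12))
        for i in range(Y.shape[1]):                     # weighted LS per row
            G = Phi.T @ (W[:, i, None] * Phi)
            A[i] = np.linalg.solve(G, Phi.T @ (W[:, i] * Y[:, i]))
    return A

rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.1], [0.1, 0.5]])
T = 500
X = np.zeros((T + 1, 2))
for t in range(T):
    d = 0.05 * rng.standard_normal(2)                   # persistent small noise
    d += 4.0 * (rng.random(2) < 0.1)                    # sparse nonzero-mean attacks
    X[t + 1] = A_true @ X[t] + d

A_ls = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T  # plain least squares
A_hub = huber_sysid(X)                                  # Huber-based estimate
```

Because the nonzero-mean attacks shift the state trajectory, least squares picks up a systematic bias, while the Huber fit caps each attacked sample's influence at the threshold.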
comment: 10 pages, 3 figures
☆ A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise
Cartoon-texture image decomposition is a fundamental yet challenging problem in image processing. A significant hurdle in achieving accurate decomposition is the pervasive presence of noise in the observed images, which severely impedes robust results. To address cartoon-texture decomposition in the presence of heavy-tailed noise, this paper proposes a robust low-rank prior model. Our approach departs from conventional models by adopting the Huber loss function as the data-fidelity term, rather than the traditional $\ell_2$-norm, while retaining the total variation norm and nuclear norm to characterize the cartoon and texture components, respectively. Given this inherent structure, we employ two implementable operator splitting algorithms, tailored to different degradation operators. Extensive numerical experiments, particularly on image restoration tasks under high-intensity heavy-tailed noise, demonstrate the superior performance of our model.
comment: This paper introduces a robust model for cartoon-texture image decomposition with heavy-tailed noise. It has 11 figures and 4 tables
☆ Control Forward-Backward Consistency: Quantifying the Accuracy of Koopman Control Family Models
This paper extends the forward-backward consistency index, originally introduced in Koopman modeling of systems without input, to the setting of control systems, providing a closed-form computable measure of accuracy for data-driven models associated with the Koopman Control Family (KCF). Building on a forward-backward regression perspective, we introduce the control forward-backward consistency matrix and demonstrate that it possesses several favorable properties. Our main result establishes that the relative root-mean-square error of KCF function predictors is strictly bounded by the square root of the control consistency index, defined as the maximum eigenvalue of the consistency matrix. This provides a sharp, closed-form computable error bound for finite-dimensional KCF models. We further specialize this bound to the widely used lifted linear and bilinear models. We also discuss how the control consistency index can be incorporated into optimization-based modeling and illustrate the methodology via simulations.
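The paper's control consistency matrix for KCF models is not reproduced here, but the underlying forward-backward idea from the input-free setting can be illustrated: fit a one-step model forward in time and another backward in time; on data generated by an exactly linear system their composition is the identity, and deviation from the identity signals model mismatch. The matrix `C` below is an illustrative stand-in, not the paper's definition.

```python
import numpy as np

def fb_consistency_index(X, Y):
    """Illustrative forward-backward consistency check: fit Y ~ X A_f and
    X ~ Y A_b by least squares and measure how far A_f A_b is from identity."""
    A_f = np.linalg.lstsq(X, Y, rcond=None)[0]   # forward one-step model
    A_b = np.linalg.lstsq(Y, X, rcond=None)[0]   # backward one-step model
    C = np.eye(X.shape[1]) - A_f @ A_b           # consistency matrix (stand-in)
    return np.abs(np.linalg.eigvals(C)).max()

rng = np.random.default_rng(1)
M = np.array([[0.9, 0.2], [0.0, 0.8]])
X = rng.standard_normal((200, 2))
idx_exact = fb_consistency_index(X, X @ M.T)                  # exactly linear data
idx_mismatch = fb_consistency_index(X, X @ M.T + 0.1 * X**3)  # unmodeled nonlinearity
```

On exactly linear data the index vanishes (up to floating point), while the unmodeled cubic term leaves a residual that both regressions absorb inconsistently, making the index strictly positive.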
☆ A Class of Degenerate Hyperbolic Equations with Neumann Boundary Conditions and Its Application to Observability
We establish a mixed observability inequality for a class of degenerate hyperbolic equations on the cylindrical domain $Ω= \mathbb{T} \times (0,1)$ with mixed Neumann--Dirichlet boundary conditions. The degeneracy acts only in the radial variable, whereas the periodic angular variable allows propagation with a strong tangential component, making a direct top boundary observation delicate. For $α\in [1,2)$, we prove that the solution can be controlled by a boundary observation on the top boundary together with an interior observation on a narrow strip. The proof combines a weighted functional framework, improved regularity, a cutoff decomposition in the angular variable, a multiplier argument for the localized component, and an energy estimate for the remainder.
♻ ☆ Equilibria in Network Constrained Markets with System Operator
We study a networked economic system composed of $n$ producers supplying a single homogeneous good to a number of geographically separated markets and of a centralized authority, called the market maker. Producers compete à la Cournot, by choosing the quantities of good to supply to each market they have access to in order to maximize their profit. Every market is characterized by its inverse demand function, which returns the unit price of the considered good as a function of the total available quantity. Markets are interconnected by a dispatch network through which quantities of the considered good can flow within finite capacity constraints and possibly satisfying additional linear physical constraints. Such flows are determined by the action of a system operator, who aims at maximizing a designated welfare function. We model such competition as a strategic game with $n+1$ players: the producers and the system operator. For this game, we first establish the existence of pure-strategy Nash equilibria under standard concavity assumptions. We then identify sufficient conditions for the game to be an exact potential game with an essentially unique Nash equilibrium. Next, we present a general result that connects the optimal action of the system operator with the capacity constraints imposed on the network. For the commonly used Walrasian welfare, our results establish a connection between capacity bottlenecks in the market network and the emergence of price differences between markets separated by saturated lines. This phenomenon is frequently observed in real-world scenarios, for instance in power networks. Finally, we validate the model with data from the Italian day-ahead electricity market.
comment: 16 pages, 8 figures
♻ ☆ A Correspondence-Driven Approach for Bilevel Decision-making with Nonconvex Lower-Level Problems
We consider bilevel optimization problems with general nonconvex lower-level objectives and show that the classical hyperfunction-based formulation is unsettled, since the global minimizer of the lower-level problem is generally unattainable. To address this issue, we propose a correspondence-driven hyperfunction $φ^{\text{cd}}$. In this formulation, the follower is modeled not as a rational agent always attaining a global minimizer, but as an algorithm-based bounded rational agent whose decisions are produced by a fixed algorithm with initialization and step size. Since $φ^{\text{cd}}$ is generally discontinuous, we apply Gaussian smoothing to obtain a smooth approximation $φ^{\text{cd}}_ξ$, then show that its value and gradient converge to those of $φ^{\text{cd}}$. In the nonconvex setting, we identify that bifurcation phenomena, which arise when $g(x,\cdot)$ has a degenerate stationary point, pose a key challenge for hyperfunction-based methods. This is especially the case when $φ^{\text{cd}}_ξ$ is solved using gradient methods. To overcome this challenge, we analyze the geometric structure of the bifurcation set under some weak assumptions. Building on these results, we design a biased projected SGD-based algorithm SCiNBiO to solve $φ^{\text{cd}}_ξ$ with a cubic-regularized Newton lower-level solver. We also provide convergence guarantees and oracle complexity bounds for the upper level. Finally, we connect bifurcation theory from dynamical systems to the bilevel setting and define the notion of fold bifurcation points in this setting. Under the assumption that all degenerate stationary points are fold bifurcation points, we establish the oracle complexity of SCiNBiO for the lower-level problem.
♻ ☆ A model of strategic sustainable investment
We study a problem of optimal irreversible investment and emission reduction formulated as a nonzero-sum dynamic game between an investor with environmental preferences and a firm. The game is set in continuous time on an infinite-time horizon. The firm generates profits with a stochastic dynamics and may spend part of its revenues towards emission reduction (e.g., renovating the infrastructure). The firm's objective is to maximize the discounted expectation of a function of its profits. The investor participates in the profits, may decide to invest to support the firm's production capacity and uses a profit function which accounts for both financial and environmental factors. Nash equilibria of the game are obtained via a system of variational inequalities. We formulate a general verification theorem for this system in a diffusive setup and construct an explicit solution in the zero-noise limit. Our explicit results and numerical approximations show that both the investor's and the firm's optimal actions are triggered by moving boundaries that increase with the total amount of emission abatement.
comment: 49 pages; 12 figures; we added uniqueness of the equilibrium in the zero noise limit and expanded sensitivity analysis
♻ ☆ Non-Convex Global Optimization as an Optimal Stabilization Problem: Convergence Rates
We develop a rigorous framework for global non-convex optimization by reformulating the minimization problem as a discounted infinite-horizon optimal control problem. For non-convex, continuous, and possibly non-smooth objective functions with multiple global minimizers, where classical gradient-based methods lack global convergence guarantees, we establish explicit exponential convergence rates with computable constants. Our analysis proves (i) variational convergence of the value function of the optimal control problem, (ii) convergence in the objective function for the original problem, as well as (iii) pathwise convergence of optimal trajectories to the minimizer set under minimal structural assumptions that require neither convexity, differentiability, nor Łojasiewicz-type conditions on the objective. These quantitative results significantly strengthen the asymptotic theory developed in our previous work (arXiv:2511.10815). Numerical experiments demonstrate the practical effectiveness of the approach on challenging non-convex problems.
♻ ☆ Bayesian distributionally robust variational inequalities: regularization and quantification
We propose a Bayesian distributionally robust variational inequality (DRVI) framework that models the data-generating distribution through a finite mixture family, which allows us to study the DRVI on a tractable finite-dimensional parametric ambiguity set. To address distributional uncertainty, we construct a data-driven ambiguity set with posterior coverage guarantees via Bayesian inference. We also employ a regularization approach to ensure numerical stability. We prove the existence of solutions to the Bayesian DRVI and the asymptotic convergence to a solution as sample size grows to infinity and the regularization parameter goes to zero. Moreover, we derive quantitative stability bounds and finite-sample guarantees under data scarcity and contamination. Numerical experiments on a distributionally robust multi-portfolio Nash equilibrium problem validate our theoretical results and demonstrate the robustness and reliability of Bayesian DRVI solutions in practice.
♻ ☆ American Options with Last Exit Times: A Free-Boundary Approach
We study the valuation of an American put option with a random time horizon given by the last exit time of the underlying asset from a fixed level. Since this random time is not a stopping time, the problem falls outside the classical optimal stopping framework. Using enlargement of filtrations and the associated Azéma supermartingale, we transform the problem into an equivalent optimal stopping problem with a semi-continuous, time-dependent gain function whose partial derivatives exhibit singular behaviour. The resulting formulation introduces significant analytical challenges, including the loss of smoothness of the optimal stopping boundary. We develop new arguments to characterise the continuation and stopping regions, establishing monotonicity of the free boundary under suitable conditions, and analyse the regularity of the value function. In particular, we derive nonlinear integral equations that uniquely characterise both the free boundary and the value function. Our results extend the classical theory of American options to a class of problems with random horizons and provide a framework for incorporating default-type features modelled by last exit times.
♻ ☆ An Operator Splitting Method for Large-Scale CVaR-Constrained Quadratic Programs
We introduce a fast and scalable method for solving quadratic programs with conditional value-at-risk (CVaR) constraints. While these problems can be formulated as standard quadratic programs, the number of variables and constraints grows linearly with the number of scenarios, making general-purpose solvers impractical for large-scale problems. Our method combines operator splitting with a specialized $O(m\log m)$ algorithm for projecting onto CVaR constraints, where $m$ is the number of scenarios. The method alternates between solving a linear system and performing parallel projections: onto CVaR constraints using our specialized algorithm, and onto box constraints by simple clipping. Numerical examples from several application domains demonstrate that our method outperforms general-purpose solvers by several orders of magnitude on problems with up to millions of scenarios. Our method is implemented in an open-source package called CVQP.
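The scenario-based CVaR used in such formulations has a simple sort-based evaluation, which is where the $O(m\log m)$ cost comes from, and an equivalent variational (Rockafellar-Uryasev) form that underlies quadratic-programming reformulations. A minimal sketch, assuming $(1-α)m$ is an integer (the projection algorithm itself is not reproduced here):

```python
import numpy as np

def cvar(losses, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of scenario
    losses. The sort is the O(m log m) step. Assumes (1 - alpha) * m is integral."""
    m = len(losses)
    k = round((1 - alpha) * m)           # number of tail scenarios
    return np.sort(losses)[-k:].mean()

losses = np.arange(1.0, 101.0)           # 100 scenario losses: 1, 2, ..., 100
c_sort = cvar(losses, alpha=0.95)        # mean of the worst 5 losses -> 98

# Rockafellar-Uryasev form: CVaR = min_t  t + E[(loss - t)_+] / (1 - alpha)
ts = np.linspace(90.0, 100.0, 201)
c_ru = min(t + np.maximum(losses - t, 0.0).mean() / 0.05 for t in ts)
```

The variational form is what makes CVaR constraints expressible with auxiliary variables in a QP; the sort-based form is what makes scenario-wise evaluation cheap.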
Optimization and Control 25
☆ Dynamic Constrained Stabilization on the $n$-sphere
We consider the constrained stabilization problem of second-order systems evolving on the $n$-sphere. We propose a control strategy with a constraint proximity-based dynamic damping mechanism that ensures safe and almost global asymptotic stabilization of the target point in the presence of star-shaped constraints on the $n$-sphere. It is also shown that the proposed approach can be used to handle constrained rigid-body attitude stabilization. The effectiveness of the proposed approach is demonstrated through simulation results on the 2-sphere in the presence of star-shaped constraint sets.
comment: 10 pages, 1 figure
☆ Multidimensional Gradient-MUSIC: A Global Nonconvex Optimization Framework for Optimal Resolution
We develop a multidimensional version of Gradient-MUSIC for estimating the frequencies of a nonharmonic signal from noisy samples. The guiding principle is that frequency recovery should be based only on the signal subspace determined by the data. From this viewpoint, the MUSIC functional is an economical nonconvex objective encoding the relevant information, and the problem becomes one of understanding the geometry of its perturbed landscape. Our main contribution is a general structural theory showing that, under explicit conditions on the measurement kernel and the perturbation of the signal subspace, the perturbed MUSIC function is an admissible optimization landscape: suitable initial points can be found efficiently by coarse thresholding, gradient descent converges to the relevant local minima, and these minima obey quantitative error bounds. Thus the theory is not merely existential; it provides a constructive global optimization framework for multidimensional optimal resolution. We verify the abstract conditions in detail for two canonical sampling geometries: discrete samples on a cube and continuous samples on a ball. In both cases we obtain uniform, nonasymptotic recovery guarantees under deterministic as well as stochastic noise. In particular, for lattice samples in a cube of side length $4m$, if the true frequencies are separated by at least $β_d/m$ and the noise has $\ell^\infty$ norm at most $\varepsilon$, then Gradient-MUSIC recovers the frequencies with error at most \[ C_d \frac{\varepsilon}{m}, \] where $C_d, β_d>0$ depend only on the dimension. This scaling is minimax optimal in $m$ and $\varepsilon$. Under stationary Gaussian noise, the error improves to \[ C_d\frac{σ\sqrt{\log(m)}}{m^{1+d/2}}. \] This is the noisy super-resolution scaling: (see paper for rest of abstract)
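The guiding principle, that frequency recovery depends on the data only through the signal subspace, can be illustrated in one dimension. The sketch below evaluates the MUSIC functional on a grid rather than minimizing it by the paper's gradient descent, and the signal, Hankel window, and grid resolution are illustrative choices.

```python
import numpy as np

def music_1d(x, r, grid):
    """1D MUSIC sketch: estimate r frequencies of x_n = sum_j a_j e^{2 pi i f_j n}
    from the noise subspace of a Hankel matrix built from the samples."""
    m = len(x) // 2
    H = np.array([x[i:i + m] for i in range(len(x) - m + 1)]).T  # columns = windows
    U = np.linalg.svd(H)[0]
    noise = U[:, r:]                                    # noise-subspace basis
    steer = np.exp(2j * np.pi * np.outer(np.arange(m), grid)) / np.sqrt(m)
    q = np.linalg.norm(noise.conj().T @ steer, axis=0)  # MUSIC functional on grid
    est = []                                            # r smallest separated minima
    for i in np.argsort(q):
        if all(abs(grid[i] - e) > 0.05 for e in est):
            est.append(grid[i])
        if len(est) == r:
            break
    return sorted(est)

rng = np.random.default_rng(0)
n = np.arange(64)
x = np.exp(2j * np.pi * 0.13 * n) + 0.8 * np.exp(2j * np.pi * 0.37 * n)
x += 1e-3 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
freqs = music_1d(x, r=2, grid=np.linspace(0.0, 1.0, 2001))
```

The functional `q` is small exactly where a steering vector lies in the signal subspace; the paper's contribution is showing that, in the multidimensional setting, its perturbed landscape admits efficient initialization and gradient descent instead of a grid search.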
comment: 63 pages, 4 figures
☆ The Risk Quadrangle in Optimization: An Overview with Recent Results and Extensions
This paper revisits and extends the 2013 development by Rockafellar and Uryasev of the Risk Quadrangle (RQ) as a unified scheme for integrating risk management, optimization, and statistical estimation. The RQ features four stochastics-oriented functionals -- risk, deviation, regret, and error, along with an associated statistic, and articulates their revealing and in some ways surprising interrelationships and dualizations. Additions to the RQ framework that have come to light since 2013 are reviewed in a synthesis focused on both theoretical advancements and practical applications. New quadrangles -- superquantile, superquantile norm, expectile, biased mean, quantile symmetric average union, and $\varphi$-divergence-based quadrangles -- offer novel approaches to risk-sensitive decision-making across various fields such as machine learning, statistics, finance, and PDE-constrained optimization. The theoretical contribution comes in axioms for ``subregularity'' relaxing ``regularity'' of the quadrangle functionals, which is too restrictive for some applications. The main RQ theorems and connections are revisited and rigorously extended to this more ample framework. Examples are provided in portfolio optimization, regression, and classification, demonstrating the advantages and the role played by duality, especially in ties to robust optimization and generalized stochastic divergences.
☆ Size-Selective Threshold Harvesting under Nonlocal Crowding and Exogenous Recruitment
In this paper, we formulate and analyze an original infinite-horizon bioeconomic optimal control problem for a nonlinear, size-structured fish population. Departing from standard endogenous reproduction frameworks, we model population dynamics using a McKendrick--von Foerster partial differential equation characterized by strictly exogenous lower-boundary recruitment and a nonlocal crowding index. This nonlocal environment variable governs density-dependent individual growth and natural mortality, accurately reflecting the ecological pressures of enhancement fisheries or heavily subsidized stocks. We first establish the existence and uniqueness of the no-harvest stationary profile and introduce a novel intrinsic replacement index tailored to exogenously forced systems, which serves as a vital biological diagnostic rather than a classical persistence threshold. To maximize discounted economic revenue, we derive formal first-order necessary conditions via a Pontryagin-type maximum principle. By introducing a weak-coupling approximation to the adjoint system and applying a single-crossing assumption, we mathematically prove that the optimal size-selective harvesting strategy is a rigorous bang-bang threshold policy. A numerical case study calibrated to an Atlantic cod (\textit{Gadus morhua}) fishery bridges our theoretical framework with applied management. The simulations confirm that the economically optimal minimum harvest size threshold ($66.45$ cm) successfully maintains the intrinsic replacement index above unity, demonstrating that precisely targeted, size-structured harvesting can seamlessly align economic maximization with long-run biological viability.
☆ Budgeted Robust Intervention Design for Financial Networks with Common Asset Exposures
In the context of containing default contagion in financial networks, we study a regulator that allocates pre-shock capital or liquidity buffers across banks connected by interbank liabilities and common external asset exposures. The regulator chooses a nonnegative buffer vector under a linear budget before asset-price shocks are realized. Shocks are modeled as belonging to either an $\ell_{\infty}$ or an $\ell_{1}$ uncertainty set, and the design objective is either to enlarge the certified no-default/no-insolvency region or to minimize worst-case clearing losses at a prescribed stress radius. Four exact synthesis results are derived. The buffer that maximizes the default resilience margin is obtained from a linear program and admits a closed-form minimal-budget certificate for any target margin. The buffer that maximizes the insolvency resilience margin is computed by a single linear program. At a fixed radius, minimizing the worst-case systemic loss is again a linear program under $\ell_{\infty}$ uncertainty and a linear program with one scenario block per asset under $\ell_{1}$ uncertainty. Crucially, under $\ell_{1}$ uncertainty, exact robustness adds only one LP block per asset, ensuring that the computational complexity grows linearly with the number of assets. A corollary identifies the exact budget at which the optimized worst-case loss becomes zero. Numerical experiments on the 8-bank benchmark of \cite{Calafiore2025}, on a synthetic core-periphery network, and on a data-backed 107-bank calibration built from the 2025 EBA transparency exercise show large gains over uniform and exposure-proportional allocations. The empirical results also indicate that resilience-maximizing and loss-minimizing interventions nearly coincide under diffuse $\ell_\infty$ shocks, but diverge under concentrated $\ell_1$ shocks.
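The first synthesis result, maximizing the minimal post-buffer margin under a budget, has an especially transparent structure once network propagation is switched off. As a hedged, purely illustrative sketch (this is not the paper's clearing model; the capital levels, worst-case losses, and margin definition below are invented for the example), the problem reduces to water-filling: raise the lowest margins first until the budget is exhausted.

```python
def max_min_margin(c, w, budget, tol=1e-9):
    """Toy buffer allocation: bank i has capital c[i] and worst-case loss w[i].
    Find the largest t such that raising every margin c[i] - w[i] up to t
    costs at most `budget`; returns (t, buffers)."""
    base = [ci - wi for ci, wi in zip(c, w)]
    lo, hi = min(base), max(base) + budget
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # budget needed to lift every margin below mid up to mid
        need = sum(max(0.0, mid - m) for m in base)
        if need <= budget:
            lo = mid
        else:
            hi = mid
    t = lo
    buffers = [max(0.0, t - m) for m in base]
    return t, buffers
```

Bisection on the target margin exploits that the budget needed to reach a margin is nondecreasing in it; the paper's full model, with clearing and uncertainty sets, instead requires the linear programs described in the abstract.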
☆ Linear-quadratic mixed Stackelberg-zero-sum game for mean-field regime switching system
Motivated by a product pricing problem, a linear-quadratic Stackelberg differential game for a regime switching system involving one leader and two followers is studied. The two followers engage in a zero-sum differential game, and both the state system and the cost functional incorporate a conditional mean-field term. Applying the continuation and induction methods, we first establish the existence and uniqueness of a conditional mean-field forward-backward stochastic differential equation with Markovian switching. Building on this result, we prove the unique solvability of the Hamiltonian systems for the two followers and the leader. Moreover, utilizing the stochastic maximum principle, a decoupling approach, and an optimal filtering technique, we obtain the optimal feedback strategies of the two followers and the leader. Employing these theoretical results, we solve a product pricing problem and present numerical simulations.
☆ On the Convergence Rate of the One-Hop Transfer Algorithm
The transfer algorithm~\cite{jiang} solves the on-chain one-hop swap routing problem. In \cite{jiang}, convergence is proved but the convergence rate is left open. We prove that the algorithm terminates in at most $\mathcal{O}(N\kappa\log\frac{1}{\varepsilon})$ rounds, where $N$ is the number of AMMs, $\kappa$ is a liquidity heterogeneity parameter, and $\varepsilon$ is a tolerance parameter.
☆ Time Window-Based Netload Range Cost Curves for Coordinated Transmission and Distribution Planning Under Uncertainty
Mechanisms to coordinate transmission and distribution planning should be regulatory compliant and keep the spheres of DSO and TSO decisions separate, without requiring disclosure of proprietary data or unrealistic, computationally expensive T&D co-simulations. The concept of Netload Range Cost Curves (NRCC) has recently been proposed as a simple, non-invasive form of coordinating T&D investments under distribution netload uncertainty. This paper extends the NRCC concept to accommodate the temporal dimension of the T&D planning process. We propose to compute a hierarchy of certified temporal interface products that represent the different levels of flexibility that distribution networks can provide to transmission grids at the planning stage. The first product (P1) maps distribution investment into scenario-robust, per-window service envelopes within which any TSO service call (to modify load within specified bounds) is guaranteed distribution-network-feasible. The second product (P2) adds lexicographic rebound minimization, preserving P1-optimal service capacity while certifying post-service recovery under three governance variants with qualitatively distinct rebound-budget responses. In our numerical results, based on a real distribution feeder, we compare the performance of our proposed time-window-based flexibility products to an atemporal product (P0) that offers a static bound on the aggregate distribution grid netload across all time periods. Our results demonstrate the superiority of our proposed products in properly valuing the benefits of incremental investments in storage to allow for temporal flexibility.
☆ Online Learning of Kalman Filtering: From Output to State Estimation
In this paper, we study the problem of learning Kalman filtering with an unknown system model in partially observed linear dynamical systems. We propose a unified algorithmic framework based on online optimization that can be used to solve both the output estimation and state estimation scenarios. By exploring the properties of the estimation error cost functions, such as conditional strong convexity, we show that our algorithm achieves a $\log T$-regret in the horizon length $T$ for the output estimation scenario. More importantly, we tackle the more challenging scenario of learning Kalman filtering for state estimation, which is an open problem in the literature. We first characterize a fundamental limitation of the problem, demonstrating the impossibility of any algorithm achieving sublinear regret in $T$. By further introducing a random query scheme into our algorithm, we show that a $\sqrt{T}$-regret is achievable when the algorithm is granted limited query access to more informative measurements of the system state. Our algorithm and regret bound readily capture the trade-off between the number of queries and the achieved regret, and shed light on online learning problems with limited observations. We validate the performance of our algorithms using numerical examples.
☆ Meta-Contrastive Learning for Vision-Language Models via Task-Adaptive CLIP Training
We propose Domain-Conditioned Meta-Contrastive Learning, a framework for improving the cross-domain generalization of vision-language models. While contrastive models such as CLIP achieve strong performance through large-scale training, they rely on a global objective that does not explicitly account for domain shift. To address this limitation, we formulate multimodal learning as a bilevel meta-learning problem over domain-conditioned tasks. Specifically, we introduce domain embeddings that modulate image and text representations, and optimize the model for rapid adaptation to domain-specific distributions via gradient-based inner-loop updates. In addition, we incorporate a cross-domain alignment regularization to encourage domain-invariant representations. Our approach is compatible with standard contrastive training pipelines and can be applied to heterogeneous datasets spanning natural and medical domains. We expect improved robustness under domain shift and enhanced few-shot adaptation performance, highlighting a promising direction for scalable multimodal learning.
☆ Dynamic resource matching in manufacturing using deep reinforcement learning
Matching plays an important role in the logical allocation of resources across a wide range of industries. The benefits of matching have been increasingly recognized in manufacturing industries. In particular, capacity sharing has received much attention recently. In this paper, we consider the problem of dynamically matching demand-capacity types of manufacturing resources. We formulate the multi-period, many-to-many manufacturing resource-matching problem as a sequential decision process. The formulated problem involves large state and action spaces, and it is not practical to accurately model the joint distribution of various types of demands. To address the curse of dimensionality and the difficulty of explicitly modeling the transition dynamics, we use a model-free deep reinforcement learning approach to find optimal matching policies. Moreover, to tackle the issue of infeasible actions and slow convergence due to initial biased estimates caused by the maximum operator in Q-learning, we introduce two penalties into the traditional Q-learning algorithm: a domain knowledge-based penalty based on a prior policy and an infeasibility penalty that conforms to the demand-supply constraints. We establish theoretical results on the convergence of our domain knowledge-informed Q-learning, providing a performance guarantee for small-size problems. For large-size problems, we further inject our modified approach into the deep deterministic policy gradient (DDPG) algorithm, which we refer to as domain knowledge-informed DDPG (DKDDPG). In our computational study, including small- and large-scale experiments, DKDDPG consistently outperformed traditional DDPG and other RL algorithms, yielding higher rewards and demonstrating greater efficiency in time and episodes.
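The infeasibility-penalty idea can be shown in miniature with tabular Q-learning. The following is a hedged toy, not the paper's DKDDPG: a single capacity pool with invented dynamics, where matching more units than are available is infeasible and draws a penalty inside the Q-update instead of being masked out.

```python
import random

def train(episodes=2000, alpha=0.2, gamma=0.9, penalty=5.0, seed=0):
    """Toy tabular Q-learning with an infeasibility penalty.
    State = units of capacity on hand (0..2); action = units to match (0..2).
    Matching more than the available capacity is infeasible and penalized."""
    rng = random.Random(seed)
    Q = [[0.0] * 3 for _ in range(3)]
    for _ in range(episodes):
        s = rng.randrange(3)
        for _ in range(10):                  # short finite horizon
            a = rng.randrange(3)             # pure exploration for the demo
            if a > s:                        # infeasible: penalize, state unchanged
                r, s2 = -penalty, s
            else:                            # reward each matched unit, restock one
                r, s2 = float(a), min(s - a + 1, 2)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy avoids infeasible actions without any hard masking, which is the steering effect an infeasibility penalty is meant to achieve.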
comment: 29 pages, 6 figures, 3 tables; Published in European Journal of Operational Research, Vol. 318(2), 2024
♻ ☆ Direct-search methods for decentralized blackbox optimization
Derivative-free optimization algorithms are particularly useful for tackling blackbox optimization problems where the objective function arises from complex and expensive procedures that preclude the use of classical gradient-based methods. In contemporary decentralized environments, such functions are defined locally on different computational nodes due to technical or privacy constraints, introducing additional challenges within the optimization process. In this paper, we adapt direct-search methods, a classical technique in derivative-free optimization, to the decentralized setting. In contrast with zeroth-order algorithms, our algorithms rely on positive spanning sets to define suitable search directions while still possessing global convergence guarantees, thanks to carefully chosen stepsizes. Numerical experiments highlight the advantages of direct-search techniques over gradient-approximation-based strategies.
♻ ☆ Marginal flows of non-entropic weak Schrödinger bridges
This paper introduces a dynamic formulation of divergence-regularized optimal transport with weak targets on the path space. In our formulation, the classical relative entropy penalty is replaced by a general convex divergence, and terminal constraints are imposed in a weak sense. We establish well-posedness and a convex dual formulation, together with a dual existence result and explicit structural characterizations of primal and dual optimizers. Specifically, the optimal path measure admits an explicit density relative to a reference diffusion, generalizing the classical Schrödinger system. In the case of zero transport cost, which corresponds to a non-entropic dynamic Schrödinger problem, we further characterize the flow of time marginals of the optimal bridge, recovering known results in the entropic setting and providing new descriptions for non-entropic divergences, including the $\chi^2$-divergence.
comment: New dual existence result, Theorem 3.6
♻ ☆ Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes NeurIPS 2024
Recently, cutting-plane methods such as GCP-CROWN have been explored to enhance neural network verifiers and have made significant advances. However, GCP-CROWN currently relies on generic cutting planes (cuts) generated from external mixed integer programming (MIP) solvers. Due to the poor scalability of MIP solvers, large neural networks cannot benefit from these cutting planes. In this paper, we exploit the structure of the neural network verification problem to generate efficient and scalable cutting planes specific to this problem setting. We propose a novel approach, Branch-and-bound Inferred Cuts with COnstraint Strengthening (BICCOS), which leverages the logical relationships of neurons within verified subproblems in the branch-and-bound search tree, and we introduce cuts that preclude these relationships in other subproblems. We develop a mechanism that assigns influence scores to neurons in each path to allow the strengthening of these cuts. Furthermore, we design a multi-tree search technique to identify more cuts, effectively narrowing the search space and accelerating the BaB algorithm. Our results demonstrate that BICCOS can generate hundreds of useful cuts during the branch-and-bound process and consistently increase the number of verifiable instances compared to other state-of-the-art neural network verifiers on a wide range of benchmarks, including large networks that previous cutting plane methods could not scale to. BICCOS is part of the $\alpha,\beta$-CROWN verifier, the VNN-COMP 2024 winner. The code is available at http://github.com/Lemutisme/BICCOS.
comment: Accepted by NeurIPS 2024. BICCOS is part of the alpha-beta-CROWN verifier, the VNN-COMP 2024 winner; fixed Theorem 3.2 and clarified experimental results
♻ ☆ Learning to Relax Nonconvex Quadratically Constrained Quadratic Programs
Quadratically constrained quadratic programs (QCQPs) are ubiquitous in optimization: Such problems arise in applications from operations research, power systems, signal processing, chemical engineering, and portfolio theory, among others. Despite their flexibility in modeling real-life situations and the recent effort to understand their properties, nonconvex QCQPs are hard to solve in practice. Most of the approaches in the literature are based on either Linear Programming (LP) or Semidefinite Programming (SDP) relaxations, each of which works very well for some problem subclasses but performs poorly on others. In this paper, we develop a relaxation selection procedure for nonconvex QCQPs that can adaptively decide whether an LP- or SDP-based approach is expected to be more beneficial by considering the instance structure. The proposed methodology relies on utilizing machine learning methods that involve features derived from spectral properties and sparsity patterns of data matrices, and once trained appropriately, the prediction model applies to any instance with an arbitrary number of variables and constraints. We develop classification and regression models under different feature-design setups, including a dimension-independent representation, and evaluate them on both synthetically generated instances and benchmark instances from MINLPLib. Our computational results demonstrate the effectiveness of the proposed approach for predicting the more favorable relaxation across diverse QCQP families.
♻ ☆ Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning
We study distributed reinforcement learning (RL) with policy gradient methods under asynchronous and parallel computations and communications. While non-distributed methods are well understood theoretically and have achieved remarkable empirical success, their distributed counterparts remain less explored, particularly in the presence of heterogeneous asynchronous computations and communication bottlenecks. We introduce two new algorithms, Rennala NIGT and Malenia NIGT, which implement asynchronous policy gradient aggregation and achieve state-of-the-art efficiency. In the homogeneous setting, Rennala NIGT provably improves the total computational and communication complexity while supporting the AllReduce operation. In the heterogeneous setting, Malenia NIGT simultaneously handles asynchronous computations and heterogeneous environments with strictly better theoretical guarantees. Our results are further corroborated by experiments, showing that our methods significantly outperform prior approaches.
♻ ☆ Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$, and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $\tau_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
♻ ☆ Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same "iteration rate" of $O\left(\frac{(R + 1) L \Delta}{\varepsilon} + \frac{\sigma^2 L \Delta}{\varepsilon^2}\right)$, where $R$ is the maximum "tree distance" along the main branch of a tree; and (ii) different methods exhibit different trade-offs: for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
♻ ☆ Bifurcations of the Hénon map with additive bounded noise
We numerically study bifurcations of attractors of the Hénon map with additive bounded noise with spherical reach. The bifurcations are analysed using a finite-dimensional boundary map. We distinguish between two types of bifurcations: topological bifurcations and boundary bifurcations. Topological bifurcations describe discontinuous changes of attractors and boundary bifurcations occur when singularities of an attractor's boundary are created or destroyed. We identify correspondences between topological and boundary bifurcations of attractors and local and global bifurcations of the boundary map.
comment: 26 pages, 17 figures
♻ ☆ Minimal Perimeter Triangle in Nonconvex Quadrangle: Generalized Fagnano Problem
In 1775, Fagnano introduced the following geometric optimization problem: inscribe a triangle of minimal perimeter in a given acute-angled triangle. A widely accessible solution was provided by the Hungarian mathematician L. Fejér in 1900. This paper presents a specific generalization of the classical Fagnano problem: given a nonconvex quadrangle (with one reflex angle, the other angles acute), find a triangle of minimal perimeter with exactly one vertex on each of the sides that do not form the reflex angle and the third vertex on either of the sides forming the reflex angle. We provide a geometric solution. Additionally, we establish an upper bound for the classical Fagnano problem, demonstrating that the minimal perimeter of a triangle inscribed in a given acute-angled triangle cannot exceed twice the length of any of its sides.
comment: To appear in KOG, 13 pages
♻ ☆ Spatial-temporal risk field-based coupled dynamic-static driving risk assessment and trajectory planning in weaving segments
In this paper, we first propose a spatial-temporal coupled risk assessment paradigm by constructing a three-dimensional spatial-temporal risk field (STRF). Specifically, we introduce spatial-temporal distances to quantify the impact of future trajectories of dynamic obstacles. We also incorporate a geometrically configured specialized field for the weaving segment to constrain vehicle movement directionally. To enhance the STRF's accuracy, we further develop a parameter calibration method using real-world aerial video data, leveraging YOLO-based machine vision and dynamic risk balance theory. A comparative analysis with the traditional risk field demonstrates the STRF's superior situational awareness of anticipatory risk. Building on these results, we finally design an STRF-based CAV trajectory planning method for weaving segments. We integrate spatial-temporal risk occupancy maps, dynamic iterative sampling, and quadratic programming to enhance safety, comfort, and efficiency. By incorporating both dynamic and static risk factors during the sampling phase, our method ensures robust safety performance. Additionally, the proposed method simultaneously optimizes path and speed using a parallel computing approach, reducing computation time. Real-world cases show that, compared with dynamic planning + quadratic programming schemes and real human driving trajectories, our method significantly improves safety, reduces lane-change completion time, and minimizes speed fluctuations.
♻ ☆ Urgent Samples in Clinical Laboratories: Stochastic Batching to Minimize Patient Turnaround Time
This paper addresses the problem of batching laboratory samples in hospital laboratories where samples of different priorities are received continuously with uncertain transportation times. The focus is on optimizing the control strategy for loading a centrifuge to minimize patient turnaround time (TAT). Focusing on samples from patients in life-threatening situations (i.e., vital samples), we propose several online and offline methods, including a stochastic mixed-integer quadratic programming model integrated within a discrete-event system simulation. This paper aims to enhance patient care by providing timely laboratory results through improved batching strategies. The case study, which uses real data from a university hospital, demonstrates that incorporating distributional knowledge of transport times into our decision policy can reduce the median patient TAT of vital samples by 4.9 minutes and the 0.95 quantile by 9.7 minutes, but has no significant effect on low-priority samples. In addition, we show that this is essentially an optimal result by comparison with the upper bound obtained by a perfect-knowledge offline algorithm.
♻ ☆ Energy-Gain Control of Time-Varying Systems: Receding Horizon Approximation
Standard formulations of prescribed worst-case disturbance energy-gain control policies for linear time-varying systems depend on all forward model data. In discrete time, this dependence arises through a backward Riccati recursion. This article is about the infinite-horizon $\ell_2$ gain performance of state feedback policies with only finite receding-horizon preview of the model parameters. The proposed synthesis of controllers subject to such a constraint leverages the strict contraction of lifted Riccati operators under uniform controllability and observability. The main approximation result is a sufficient number of preview steps for the incurred performance loss to remain below any set tolerance, relative to the baseline gain bound of the associated infinite-preview controller. Aspects of the result are explored in a numerical example.
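The contraction that drives the approximation result can be seen already in the scalar, time-invariant discrete Riccati recursion. A hedged illustration (scalar system with invented numbers; the paper works with lifted Riccati operators for time-varying systems): starting the backward recursion from zero and running it for $N$ preview steps approaches the stationary solution geometrically, so the loss from finite preview shrinks with $N$.

```python
def riccati_step(p, a=1.2, b=1.0, q=1.0, r=1.0):
    """One backward step of the discrete-time Riccati recursion (scalar case):
    p_k = q + a^2 p_{k+1} - (a b p_{k+1})^2 / (r + b^2 p_{k+1})."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)

def preview_gap(steps, **kw):
    """Distance from the N-step recursion (started at zero) to the fixed point."""
    # approximate the stationary solution by iterating to numerical convergence
    p_inf = 0.0
    for _ in range(500):
        p_inf = riccati_step(p_inf, **kw)
    p = 0.0
    for _ in range(steps):
        p = riccati_step(p, **kw)
    return abs(p - p_inf)
```

Because the iteration started at zero is monotone and contractive near the stabilizing fixed point (the system is controllable and observable), the gap decays geometrically in the number of preview steps, mirroring the sufficient-preview bound in the abstract.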
comment: Accepted to appear in IEEE TAC
♻ ☆ Convergence Analysis of a Relative-type Inexact Preconditioned Proximal ALM for Convex Nonlinear Programming
This article investigates the convergence properties of a relative-type inexact preconditioned proximal augmented Lagrangian method (rip$^2$ALM) for convex nonlinear programming, a fundamental class of optimization problems with broad applications in science and engineering. Inexact proximal augmented Lagrangian methods have proven to be highly effective for solving such problems, owing to their attractive theoretical properties and strong practical performance. However, the convergence behavior of the relative-type inexact preconditioned variant remains insufficiently understood. This work aims to reduce this gap by rigorously establishing the global convergence of the sequence generated by rip$^2$ALM and proving its asymptotic (super)linear convergence rate under standard assumptions. In addition, we derive the global ergodic convergence rate with respect to both the primal feasibility violation and the primal objective residual, thereby offering a more comprehensive understanding of the overall performance of rip$^2$ALM. These results deepen our theoretical understanding of the family of proximal augmented Lagrangian methods and motivate their development for practical, large-scale structured application problems.
♻ ☆ Mean--Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study
We study continuous-time mean--variance portfolio selection in markets where stock prices are diffusion processes driven by observable factors that are also diffusion processes, yet the coefficients of these processes are unknown. Based on the recently developed reinforcement learning (RL) theory for diffusion processes, we present a general data-driven RL approach that learns the pre-committed investment strategy directly without attempting to learn or estimate the market coefficients. For multi-stock Black--Scholes markets without factors, we further devise an algorithm and prove its performance guarantee by deriving a sublinear regret bound in terms of the Sharpe ratio. We then carry out an extensive empirical study implementing this algorithm to compare its performance and trading characteristics, evaluated under a host of common metrics, with a large number of widely employed portfolio allocation strategies on S\&P 500 constituents. The results demonstrate that the proposed continuous-time RL strategy is consistently among the best, especially in a volatile bear market, and decisively outperforms the model-based continuous-time counterparts by significant margins.
comment: 94 pages, 8 figures, 18 tables
Optimization and Control 35
☆ On the Reliability Limits of LLM-Based Multi-Agent Planning
This technical note studies the reliability limits of LLM-based multi-agent planning as a delegated decision problem. We model the LLM-based multi-agent architecture as a finite acyclic decision network in which multiple stages process shared model-context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common-evidence regime, this implies that optimizing over multi-agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget-constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.
comment: Technical note
☆ Biased Mean Quadrangle and Applications
This paper introduces \emph{biased mean regression}, estimating the \emph{biased mean}, i.e., $\mathbb{E}[Y] + x$, where $x \in \mathbb{R}$. The approach addresses a fundamental statistical problem that covers numerous applications. For instance, it can be used to estimate factors driving portfolio loss exceeding the expected loss by a specified amount (e.g., $x = \$10$ billion) or to estimate factors impacting a specific excess release of radiation in the environment, where nuclear safety regulations specify different severity levels. The estimation is performed by minimizing the so-called \emph{superexpectation error}. We establish two equivalence results that connect the method to popular paradigms: (i) biased mean regression is equivalent to quantile regression for an appropriate parameterization and is equivalent to ordinary least squares (OLS) when $x=0$; (ii) in portfolio optimization, minimizing \emph{superexpectation risk}, associated with the superexpectation error, is equivalent to CVaR optimization. The approach is computationally attractive, as minimizing the superexpectation error reduces to linear programming (LP), thereby offering algorithmic and modeling advantages. It is also a good alternative to OLS regression. The approach is based on the \emph{Risk Quadrangle} (RQ) framework, which links four stochastic functionals -- error, regret, risk, and deviation -- through a statistic. For the biased mean quadrangle, the statistic is the biased mean. We study properties of the new quadrangle, such as \emph{subregularity}, and establish its relationship to the quantile quadrangle. Numerical experiments confirm the theoretical statements and illustrate the practical implications.
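Equivalence (ii) points back to the classical Rockafellar--Uryasev representation of CVaR, which is also what makes the LP reduction familiar. A hedged sketch of that classical formula only (not of the paper's superexpectation functional; the sample data are invented, and the scan over sample points assumes $(1-\alpha)n$ is an integer):

```python
def cvar_ru(losses, alpha):
    """Classical Rockafellar--Uryasev formula:
    CVaR_alpha(Y) = min_c  c + E[max(Y - c, 0)] / (1 - alpha).
    For a finite sample the minimum is attained at a sample point,
    so scanning the sample suffices (a solver would use the LP form)."""
    n = len(losses)
    def objective(c):
        return c + sum(max(y - c, 0.0) for y in losses) / ((1 - alpha) * n)
    return min(objective(c) for c in losses)

def cvar_tail(losses, alpha):
    """Direct tail average: mean of the worst (1 - alpha) fraction."""
    k = int(round((1 - alpha) * len(losses)))
    return sum(sorted(losses)[-k:]) / k
```

On a sample of losses $1,\dots,10$ with $\alpha = 0.8$, both routes return the average of the two worst losses, $9.5$, illustrating why the minimization form converts directly into linear constraints.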
☆ Optimal Parlay Wagering and Whitrow Asymptotics: A State-Price and Implicit-Cash Treatment
For independent multi-outcome events under multiplicative parlay pricing, we give a short exact proof of the optimal Kelly strategy using the implicit-cash viewpoint. The proof is entirely eventwise. One first solves each event in isolation. The full simultaneous optimizer over the entire menu of singles, doubles, triples, and higher parlays is then obtained by taking the outer product of the one-event Kelly strategies. Equivalently, the optimal terminal wealth factorizes across events. This yields an immediate active-leg criterion: a parlay is active if and only if each of its legs is active in the corresponding one-event problem. The result recovers, in a more transparent state-price form, the log-utility equivalence between simultaneous multibetting and sequential Kelly betting. We then study what is lost when one forbids parlays and allows only singles. In a low-edge regime and on a fixed active support, the exact parlay optimizer supplies the natural reference point. The singles-only problem is a first-order truncation of the factorized wealth formula. A perturbative expansion shows that the growth-rate loss from forbidding parlays is $\mathcal{O}(\varepsilon^4)$, while the optimal singles stakes deviate from the isolated one-event Kelly stakes only at cubic order. This yields a clean explanation of Whitrow's empirical near-proportionality phenomenon: the simultaneous singles-only optimizer is obtained from the isolated eventwise optimizer by an event-specific cubic shrinkage, so the portfolios agree through second order and differ only by a small blockwise drag.
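The outer-product structure is easy to check numerically in a stylized complete-market instance. A hedged sketch (probabilities and state prices invented for the example): in a complete one-event market the Kelly allocation is simply the outcome probabilities, so the factorization result predicts that the optimal parlay stakes are $p_i q_j$.

```python
from math import log

def exp_log_growth(f, p, q, a, b):
    """E[log W] when fraction f[i][j] of wealth is staked on the parlay
    (outcome i of event A, outcome j of event B), paying 1/(a_i * b_j)
    per unit staked; a, b are the per-event state prices."""
    return sum(p[i] * q[j] * log(f[i][j] / (a[i] * b[j]))
               for i in range(len(p)) for j in range(len(q)))

def outer(p, q):
    """Outer product of the one-event Kelly strategies."""
    return [[pi * qj for qj in q] for pi in p]
```

Strict concavity of the objective over the stake simplex makes the outer product the unique maximizer, so any feasible perturbation of it lowers the expected log growth.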
comment: 10 pages, 0 figures
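A numeric check of the factorization claim for two independent binary events. The stake parameterization below (singles and parlay stakes built from the outer product of one-event Kelly fractions) is our illustrative reading of the abstract's construction, for binary events at decimal odds with the parlay priced multiplicatively.

```python
# Check that outer-product stakes make terminal wealth factor across events.
# Two independent binary events with decimal odds dA, dB; the parlay pays
# dA*dB. The parameterization of stakes is an illustrative assumption.

def kelly_fraction(p, d):
    """One-event Kelly fraction for win probability p at decimal odds d."""
    return p - (1 - p) / (d - 1)

pA, dA = 0.60, 2.0
pB, dB = 0.55, 2.2
fA, fB = kelly_fraction(pA, dA), kelly_fraction(pB, dB)

# Outer-product stakes: two singles plus the parlay.
sA, sB, sAB = fA * (1 - fB), fB * (1 - fA), fA * fB
stake_total = sA + sB + sAB

for winA in (0, 1):
    for winB in (0, 1):
        wealth = (1 - stake_total
                  + winA * sA * dA + winB * sB * dB
                  + winA * winB * sAB * dA * dB)
        # One-event Kelly terminal wealth factors.
        wA = 1 + fA * (dA - 1) if winA else 1 - fA
        wB = 1 + fB * (dB - 1) if winB else 1 - fB
        assert abs(wealth - wA * wB) < 1e-12
```

In all four joint outcomes the portfolio wealth equals the product of the two one-event Kelly wealths, which is the factorization the abstract describes.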
☆ A Canceling Heuristic for the Directed Traveling Salesman Problem
The Traveling Salesman Problem (TSP) is one of the classic and hard problems in combinatorial optimization. We develop a new heuristic that uses a connection between Minimum Cost Flow Problems and the TSP to improve on a given suboptimal tour, such as a local optimum found using a classic heuristic. Minimum Cost Flow Problems can be solved efficiently through linear programming or combinatorial algorithms based on cycle canceling. We investigate the potential of flow-canceling in the context of the TSP. By restricting the search space to cycles and circulations that alternate between arcs inside and outside of the tour, our computational results show that only a small number of subtours is created, and that a lightweight patching step suffices to achieve a high success rate and to close much of the gap to an optimum.
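The cycle-canceling connection rests on a classical primitive: finding a negative-cost directed cycle and canceling it to reduce total cost. The sketch below shows that primitive via Bellman-Ford on a toy digraph; it is a generic illustration, not the paper's restricted search over cycles alternating between tour and non-tour arcs.

```python
# Core primitive behind cycle-canceling: detect a negative-cost directed
# cycle with Bellman-Ford. Generic sketch, not the paper's alternating-cycle
# restriction.

def find_negative_cycle(nodes, edges):
    """edges: dict {(u, v): cost}. Returns a negative cycle as a node list, or None."""
    dist = {v: 0.0 for v in nodes}          # implicit virtual source
    pred = {v: None for v in nodes}
    x = None
    for _ in range(len(nodes)):
        x = None
        for (u, v), c in edges.items():
            if dist[u] + c < dist[v] - 1e-12:
                dist[v] = dist[u] + c
                pred[v] = u
                x = v
    if x is None:
        return None                         # no negative cycle
    for _ in range(len(nodes)):             # walk predecessors into the cycle
        x = pred[x]
    cycle, v = [x], pred[x]
    while v != x:
        cycle.append(v)
        v = pred[v]
    cycle.reverse()
    return cycle

nodes = ["a", "b", "c", "d"]
edges = {("a", "b"): 1.0, ("b", "c"): -2.0, ("c", "a"): -1.0,
         ("a", "d"): 2.0, ("d", "b"): 3.0}
cyc = find_negative_cycle(nodes, edges)
cost = sum(edges[(cyc[i], cyc[(i + 1) % len(cyc)])] for i in range(len(cyc)))
assert cost < 0   # canceling this cycle strictly reduces total cost
```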
☆ Incomplete pairwise comparison matrices and their applications
Incomplete pairwise comparison matrices are increasingly employed to save resources and reduce cognitive load by collecting only a subset of all possible pairwise comparisons. We present their graph representation and some completion algorithms, including the incomplete eigenvector and incomplete logarithmic least squares methods, as well as a lexicographical minimisation of triad inconsistencies. The issue of ordinal violations is discussed for matrices generated by directed acyclic graphs and the best--worst method. We also show a reasonable approach to generalise the inconsistency threshold based on the dominant eigenvalue to the incomplete case, and state recent results on the optimal order of obtaining pairwise comparisons. The benefits of using incomplete pairwise comparisons are highlighted by several applications.
comment: 21 pages, 6 figures, 2 tables
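As a concrete instance of one completion method mentioned above, the sketch below implements the incomplete logarithmic least squares method (LLSM): minimize the squared log-residuals over the known comparisons, which leads to a graph-Laplacian linear system. The specific matrix and missing pairs are illustrative.

```python
# Incomplete logarithmic least squares (LLSM) sketch: minimize
#   sum over known (i, j) of (log a_ij - (x_i - x_j))^2.
# Normal equations form a graph-Laplacian system L x = r; we pin x[n-1] = 0.
import math

def incomplete_llsm(n, comparisons):
    """comparisons: dict {(i, j): a_ij} for the known pairs (i < j)."""
    L = [[0.0] * n for _ in range(n)]
    r = [0.0] * n
    for (i, j), a in comparisons.items():
        L[i][i] += 1; L[j][j] += 1
        L[i][j] -= 1; L[j][i] -= 1
        r[i] += math.log(a); r[j] -= math.log(a)
    # Fix the gauge x[n-1] = 0; solve the reduced system by elimination.
    m = n - 1
    A = [L[i][:m] + [r[i]] for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        for k in range(col + 1, m):
            f = A[k][col] / A[col][col]
            for c in range(col, m + 1):
                A[k][c] -= f * A[col][c]
    x = [0.0] * n
    for i in range(m - 1, -1, -1):
        x[i] = (A[i][m] - sum(A[i][c] * x[c] for c in range(i + 1, m))) / A[i][i]
    return [math.exp(v) for v in x]          # priority weights, up to scaling

# Consistent data w = (1, 2, 4, 8), with pairs (0,2) and (1,4) flavors missing:
comps = {(0, 1): 1 / 2, (1, 2): 1 / 2, (2, 3): 1 / 2, (0, 3): 1 / 8}
w = incomplete_llsm(4, comps)
assert abs(w[0] / w[3] - 1 / 8) < 1e-9      # exact recovery on consistent data
```

Because the comparison graph here is connected, the reduced Laplacian is nonsingular and the weights are determined up to a common scaling.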
☆ The adjoint state method for parametric definable optimization without smoothness or uniqueness
Definable parametric optimization problems with possibly nonsmooth objectives, inequality constraints, and non-unique primal and dual solutions admit an adjoint state formula under a mere qualification condition. The adjoint construction yields a selection of a conservative field for the value function, providing a computable first-order object without requiring differentiation of the solution mapping. Through examples, we show that even in smooth problems, the formal adjoint construction fails without conservativity or definability, illustrating the relevance of these concepts to grasp theoretical aspects of the method. This work provides a tool which can be directly combined with existing primal-dual solvers for a wide range of parametric optimization problems.
comment: 24 pages, 1 figure
☆ Towards European Hydrogen Market Design: Perspectives from Transmission System Operators
Despite hydrogen being central to Europe's decarbonisation strategy, only a small share of renewable hydrogen projects reached final investment decision. A key barrier is uncertainty about how future hydrogen markets will be designed and operated, particularly under Renewable Fuels of Non-Biological Origin requirements. This study investigates the extent to which future hydrogen market design can be adapted from existing natural gas markets, and the challenges it must address. The analysis was based on a survey targeting European gas transmission system operators, structured around five components: market design principles, trading frameworks, capacity allocation, tariffs, and balancing. The survey produced two outputs: an assessment of mechanism transferability and an identification of challenges for early hydrogen market development. Core market design principles and trading frameworks are broadly transferable from natural gas markets, such as entry-exit systems and virtual trading points. Capacity allocation requires targeted adaptation to improve coupling with electricity markets. Tariffs require adaptation through intertemporal cost allocation, distributing infrastructure costs over time to protect early adopters. Balancing regimes should be revisited to reflect hydrogen's physical characteristics and different uses of linepack flexibility. Key challenges for early hydrogen markets include: temporal mismatches between variable renewable supply and expected relatively stable industrial demand, limited operational flexibility due to scarce storage and reduced pipeline linepack, fragmented regional hydrogen clusters, and regulatory uncertainty affecting long-term investment decisions. These findings provide empirical input to the hydrogen network code led by the European Network of Hydrogen Network Operators and offer guidance to policymakers designing hydrogen market frameworks.
☆ Adjoint-Compatible Surrogates of the Expected Information Gain for Optimal Experimental Design in Controlled Dynamical Systems
We consider optimal experimental design for parameter estimation in dynamical systems governed by controlled ordinary differential equations. In such problems, Fisher-based criteria are attractive because they lead to time-additive objectives compatible with adjoint-based optimal control, but they remain intrinsically local and may perform poorly under strong nonlinearities or non-Gaussian prior uncertainty. By contrast, the expected information gain (EIG) provides a principled Bayesian objective, yet it is typically too costly to evaluate and does not naturally admit an adjoint-compatible formulation. In this work, we introduce adjoint-compatible surrogates of the EIG based on an exact chain-rule decomposition and tractable approximations of the posterior distribution of the unknown parameter. This leads to two surrogate criteria: an instantaneous surrogate, obtained by replacing the posterior with the prior, and a Gaussian tilting surrogate, obtained by reweighting the prior through a design-driven quadratic information factor. We also propose a multi-center tilting surrogate to improve robustness for complex or multimodal priors. We establish theoretical properties of these surrogates, including exactness of the Gaussian tilting surrogate in the linear-Gaussian setting, and illustrate their behavior on benchmark controlled dynamical systems. The results show that the proposed surrogates remain competitive in nearly Gaussian regimes and provide clearer benefits over Fisher-based designs when prior uncertainty is non-Gaussian or multimodal.
☆ Random Walks with Traversal Costs: Variance-Aware Performance Analysis and Network Optimization
We introduce weighted Markovian graphs, a random walk model that decouples the transition dynamics of a Markov chain from (random) edge weights representing the cost of traversing each edge. This decoupling allows us to study the accumulated weight along a path independently of the routing behavior. Crucially, we derive closed-form expressions for the mean and variance of weighted first passage times and weighted Kemeny constants, together with their partial derivatives with respect to both the weight and transition matrices. These results hold for both deterministic and stochastic weights with no distributional assumptions. We demonstrate the framework through two applications, highlighting the dual role of variance. In surveillance networks, we introduce the surprise index, a coefficient-of-variation metric quantifying patrol unpredictability, and show how maximizing it yields policies that are both efficient and hard to anticipate. In traffic networks subject to cascading edge failures, we develop a minimal-intervention framework that adjusts speed limits to preserve connectivity under three increasingly flexible regulatory policies.
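The decoupling described above can be made concrete for the mean weighted first passage time: with transition matrix $P$, edge weights $w$, and a target state, the means solve a small linear system. The example chain below is illustrative; the paper's closed-form variance expressions and derivatives are not reproduced here.

```python
# Mean weighted first passage time sketch: the means m_i solve
#   m_i = sum_j P[i][j] * (w[i][j] + m_j),  with m_target = 0.
# Variances and sensitivities from the paper are not reproduced here.

def mean_weighted_fpt(P, w, target):
    n = len(P)
    rest = [i for i in range(n) if i != target]
    A = []
    for i in rest:                       # build (I - Q) m = c on non-target states
        row = [(1.0 if i == j else 0.0) - P[i][j] for j in rest]
        row.append(sum(P[i][j] * w[i][j] for j in range(n)))
        A.append(row)
    sz = len(rest)
    for col in range(sz):                # Gaussian elimination with pivoting
        piv = max(range(col, sz), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        for k in range(col + 1, sz):
            f = A[k][col] / A[col][col]
            for c in range(col, sz + 1):
                A[k][c] -= f * A[col][c]
    x = [0.0] * sz
    for i in range(sz - 1, -1, -1):
        x[i] = (A[i][sz] - sum(A[i][c] * x[c] for c in range(i + 1, sz))) / A[i][i]
    return dict(zip(rest, x))

# Toy chain: 0 <-> 1 -> 2 (absorbing target), with traversal costs on edges.
P = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]
w = [[1.0, 2.0, 0.0], [2.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
m = mean_weighted_fpt(P, w, target=2)
assert abs(m[0] - 11.0) < 1e-9 and abs(m[1] - 8.0) < 1e-9
```

Here the expected accumulated cost from state 0 is 11 and from state 1 is 8, matching the hand-solved system.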
☆ Multi-scale Metabolic Modeling and Simulation
Biological systems are governed by coupled interactions between intracellular metabolism and bioreactor operation that span multiple time scales. Constraint-based metabolic models are widely used to describe intracellular metabolism, but repeatedly solving the optimization problem at each time step in dynamic models introduces numerical challenges related to infeasibility and computational efficiency. This work presents a multi-scale modeling framework that integrates genome-scale, constraint-based metabolic models with dynamic bioreactor simulations. Intracellular metabolism is described using positive flux variables in a parsimonious flux balance analysis, and the resulting embedded optimization problem is replaced by a neural network surrogate. The surrogate provides a smooth approximation of the embedded optimization mapping and eliminates repeated linear program solves during simulation. The approach is demonstrated for fed-batch fermentation of Escherichia coli, in which the surrogate model yields intracellular fluxes under substrate-limited conditions, whereas the underlying linear program would otherwise be infeasible. The framework provides a continuous representation of intracellular metabolism suitable for dynamic simulation of genome-scale models in bioreactor configurations.
comment: To be presented at ESCAPE36, 7 pages, 6 figures
☆ Monotone 2D Integration Scheme for Mean-CVaR Optimization via Fourier-Trained Transition Kernels
We present a strictly monotone, provably convergent two-dimensional (2D) integration method for multi-period mean-conditional value-at-risk (mean-CVaR) reward-risk stochastic control in models whose one-step increment law is specified via a closed-form characteristic function (CF). When the transition density is unavailable in closed form, we learn a nonnegative, normalized 2D transition kernel in Fourier space using a simplex-constrained Gaussian-mixture parameterization, and discretize the resulting convolution integrals with composite quadrature rules with nonnegative weights to guarantee monotonicity. The scheme is implemented efficiently using 2D fast Fourier transforms. Under mild Fourier-tail decay assumptions on the CF, we derive Fourier-domain $L_2$ kernel-approximation and truncation error estimates and translate them into real-space bounds that are used to establish $\ell_\infty$-stability, consistency, and pointwise convergence as the discretization and kernel-approximation parameters vanish. Numerical experiments for a fully coupled 2D jump--diffusion model in a multi-period portfolio optimization setting illustrate robustness and accuracy.
comment: 28 pages, 3 figures
☆ A note on first-order and transversality conditions in infinite-horizon continuous-time optimal control models
We provide simple necessary and sufficient conditions under which a path constitutes a solution to an infinite-horizon, continuous-time optimal control problem. We prove transversality conditions under standard assumptions. We also present some applications to economic models.
☆ Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach
Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an $ε$-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%--46% in convergence time and 36%--49% in energy consumption compared to AsyncSGD.
☆ Eliciting Von Neumann-Morgenstern utility from discrete choices with response error
We develop a preference elicitation method for a Von Neumann-Morgenstern (VNM)-type decision-maker from pairwise comparison data in the presence of response errors. We apply the maximum likelihood estimation (MLE) method to jointly elicit the non-parametric systematic VNM utility function and the scale parameter of the response error, assuming a Gumbel distribution. We incorporate structural preference information known in advance about the decision-maker's risk attitude through linear constraints on the utility function, including monotonicity, concavity, and Lipschitz continuity. Under discretely distributed lotteries, the resulting MLE problem can be reformulated as a convex program. We derive finite-sample error bounds between the MLE and the true parameters, and establish quantitative convergence of the MLE-based VNM utility function to the true utility function in the sense of the Kolmogorov distance under some conditions on the lotteries. These conditions may have potential applications in the design of efficient lotteries for preference elicitation. We further show that the optimization problem maximizing the expected MLE-based VNM utility is robust against the response error and estimation error in a probabilistic sense. Numerical experiments in a portfolio optimization application illustrate and support the theoretical results.
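With Gumbel-distributed response errors, the pairwise choice probability is logistic in the utility difference (a Bradley-Terry/logit form). The sketch below fits item utilities to pairwise choice counts by gradient ascent on that likelihood; it is a minimal unconstrained illustration, omitting the paper's lotteries, shape constraints, and convex reformulation.

```python
import math

# Minimal MLE sketch: Gumbel response errors make P(a chosen over b)
# logistic in (u_a - u_b)/scale. Gradient ascent on item utilities from
# pairwise choice counts; the paper's setting adds lotteries, monotonicity/
# concavity constraints, and a convex program.

def fit_utilities(wins, n_items, scale=1.0, lr=0.05, iters=3000):
    """wins[(a, b)] = number of times a was chosen over b."""
    u = [0.0] * n_items
    for _ in range(iters):
        g = [0.0] * n_items
        for (a, b), k in wins.items():
            p = 1.0 / (1.0 + math.exp(-(u[a] - u[b]) / scale))
            g[a] += k * (1 - p) / scale     # d log-likelihood / d u_a
            g[b] -= k * (1 - p) / scale
        u = [ui + lr * gi for ui, gi in zip(u, g)]
        mean = sum(u) / n_items             # fix the location gauge
        u = [ui - mean for ui in u]
    return u

# Synthetic counts: item 0 usually beats 1, 1 usually beats 2, 0 beats 2.
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
u = fit_utilities(wins, 3)
assert u[0] > u[1] > u[2]
```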
☆ The dual IRLS scheme for (hyper-)graph $p$-Laplacians and $\ell^p$ regression with large exponents
We introduce an iterative scheme for discrete convex minimization problems of $p$-Laplace type such as variational graph $p$-Laplace problems and $\ell^p$ regression. In each iteration, the scheme solves only a weighted least-squares problem. We verify linear convergence for suitably regularized problems and derive convergence to any prescribed tolerance.
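For orientation, the sketch below shows a basic *primal* IRLS iteration for scalar $\ell^p$ regression, where each step solves a weighted least-squares problem with weights $|r_i|^{p-2}$; the damping factor $1/(p-1)$ is a standard safeguard for $p > 2$. The paper's dual scheme, designed for large exponents, differs from this textbook version.

```python
# Primal IRLS sketch for scalar l^p regression: minimize
#   f(x) = sum_i |a_i * x - b_i|^p
# via repeated weighted least squares with weights |r_i|^(p-2).
# Damping by 1/(p-1) is a standard safeguard for p > 2 (for scalar x it
# coincides with Newton's method); the paper's dual scheme differs.

def irls_lp(a, b, p, iters=500):
    x = 0.0
    for _ in range(iters):
        w = [abs(ai * x - bi) ** (p - 2) + 1e-12 for ai, bi in zip(a, b)]
        num = sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))
        den = sum(wi * ai * ai for wi, ai in zip(w, a))
        x += (num / den - x) / (p - 1)      # damped update
    return x

def lp_obj(x, a, b, p):
    return sum(abs(ai * x - bi) ** p for ai, bi in zip(a, b))

a, b, p = [1.0, 2.0, 1.0], [1.0, 1.0, 3.0], 4
x = irls_lp(a, b, p)

# Compare against ternary search on the convex objective.
lo, hi = -10.0, 10.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if lp_obj(m1, a, b, p) < lp_obj(m2, a, b, p):
        hi = m2
    else:
        lo = m1
assert abs(x - (lo + hi) / 2) < 1e-4
```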
☆ On MIP Formulations for Logit-Based Multi-Purchase Choice Models and Applications
We study logit-based multi-purchase choice models and develop an exact solution methodology for the resulting assortment optimization problems, which we show are NP-hard to approximate. We introduce a hypergraph representation that captures general bundle-based choice structures and subsumes several models in the literature, including the BundleMVL-K and multivariate MNL models (Tulabandhula et al. 2023, Jasin et al. 2024). Leveraging this representation, we derive mixed-integer programming (MIP) formulations by integrating polyhedral relaxations from multilinear optimization with a perspective reformulation of the logit choice model. Our approach preserves the strength of the underlying polyhedral relaxations, yielding formulations with provably tighter linear programming (LP) bounds than the prevalent Big-M approach. We further characterize structural conditions on the hypergraph under which the formulations are locally sharp, thereby generalizing existing LP characterizations for path-based models. The framework extends naturally to heterogeneous and robust settings. Computational experiments demonstrate that the proposed formulations significantly improve both solution quality and scalability.
☆ Achieving double-logarithmic precision dependence in optimization-based quantum unstructured search
Grover's algorithm is a fundamental quantum algorithm that achieves a quadratic speedup for unstructured search problems of size $N$. Recent studies have reformulated this task as a maximization problem on the unitary manifold and solved it via linearly convergent Riemannian gradient ascent (RGA) methods, resulting in a complexity of $O(\sqrt{N}\log (1/\varepsilon))$. In this work, we adopt the Riemannian modified Newton (RMN) method to solve the quantum search problem. We show that, in the setting of quantum search, the Riemannian Newton direction is collinear with the Riemannian gradient in the sense that the Riemannian gradient is always an eigenvector of the corresponding Riemannian Hessian. As a result, without additional overhead, the proposed RMN method numerically achieves a quadratic convergence rate with respect to error $\varepsilon$, implying a complexity of $O(\sqrt{N}\log\log (1/\varepsilon))$, which is double-logarithmic in precision. Furthermore, our approach remains Grover-compatible, namely, it relies exclusively on the standard Grover oracle and diffusion operators to ensure algorithmic implementability, and its parameter update process can be efficiently precomputed on classical computers.
comment: 15 pages, 5 figures
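The baseline dynamics these optimization-based reformulations build on can be checked classically: after $k$ Grover iterations on $N$ states with one marked item, the success amplitude is $\sin((2k+1)\theta)$ with $\theta = \arcsin(1/\sqrt{N})$. The sketch below simulates this on a state vector; it illustrates the standard Grover oracle and diffusion operators, not the RMN method itself.

```python
import math

# Classical simulation of Grover iterations on N = 8 states, one marked item.
# Success amplitude after k steps is sin((2k+1)*theta), theta = asin(1/sqrt(N)).
# This shows the baseline dynamics, not the Riemannian Newton method.

N, marked = 8, 3
amp = [1 / math.sqrt(N)] * N
theta = math.asin(1 / math.sqrt(N))
k_opt = round(math.pi / (4 * theta) - 0.5)   # near-optimal iteration count

for _ in range(k_opt):
    amp[marked] *= -1                        # oracle: flip marked sign
    mean = sum(amp) / N                      # diffusion: reflect about mean
    amp = [2 * mean - a for a in amp]

p_success = amp[marked] ** 2
assert abs(p_success - math.sin((2 * k_opt + 1) * theta) ** 2) < 1e-12
assert p_success > 0.9
```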
♻ ☆ Gradient descent for unbounded convex functions on Hadamard manifolds and its applications to scaling problems
In this paper, we study the asymptotic behavior of continuous- and discrete-time gradient flows of a ``lower-unbounded'' convex function $f$ on a Hadamard manifold $M$, particularly, their convergence properties to the boundary $M^{\infty}$ at infinity of $M$. We establish a duality theorem that the infimum of the gradient-norm $\|\nabla f(x)\|$ of $f$ over $M$ is equal to the supremum of the negative of the recession function $f^{\infty}$ of $f$ over the boundary $M^{\infty}$, provided the infimum is positive. Further, the infimum and the supremum are obtained by the limit of the gradient flow of $f$. Our results feature convex-optimization ingredients of the moment-weight inequality for reductive group actions by Georgoulas, Robbin, and Salamon, and are applied to the noncommutative optimization framework of Bürgisser et al. (FOCS 2019). We show that gradient descent of the Kempf-Ness function for an unstable orbit converges to a destabilizing 1-parameter subgroup in the Hilbert-Mumford criterion, and the associated moment-map sequence converges to the minimum-norm point of the moment polytope. We show further refinements for operator scaling -- the left-right action on a matrix tuple $A= (A_1,A_2,\ldots,A_N)$. We characterize the gradient-flow limit of operator scaling by a vector-space generalization of the classical Dulmage-Mendelsohn decomposition of a bipartite graph. For a special case of $N = 2$, we reveal that the limit determines the Kronecker canonical form of a matrix pencil $s A_1+A_2$.
comment: The conference version in FOCS 2024; to appear in Mathematics of Operations Research
♻ ☆ A polynomial approximation scheme for nonlinear model reduction by moment matching
We propose a procedure for the numerical approximation of invariance equations arising in the moment matching technique associated with reduced-order modeling of high-dimensional dynamical systems. The Galerkin residual method is employed to find an approximate solution to the invariance equation using a Newton iteration on the coefficients of a monomial basis expansion of the solution. These solutions to the invariance equations can then be used to construct reduced-order models. We assess the ability of the method to solve the invariance PDE system as well as to achieve moment matching and recover the steady-state behaviour of nonlinear systems with state dimension of order 1000 driven by linear and nonlinear signal generators.
comment: 21 pages, 4 figures, submitted to SIAM Journal on Scientific Computing
♻ ☆ Human-in-the-loop: Real-time Preference Optimization
Optimization with preference feedback is an active research area with many applications in engineering systems where humans play a central role, such as building control and autonomous vehicles. While most existing studies focus on optimizing a static user utility, few have investigated the closed-loop behavior that accounts for system transients. In this work, we propose an online feedback optimization controller that optimizes user utility using pairwise comparison feedback with both optimality and closed-loop stability guarantees. By adding a random exploration signal, the controller estimates the descent direction based on the binary comparison feedback between two consecutive time steps. We analyze its closed-loop behavior when interacting with a nonlinear plant and show that, under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are further validated through numerical experiments.
♻ ☆ A CAV-based perimeter-free regional traffic control strategy utilizing existing parking infrastructure
This paper proposes a novel perimeter-free regional traffic management strategy for networks under a connected and autonomous vehicle (CAV) environment. The proposed strategy requires a subset of CAVs to temporarily wait at nearby parking facilities when the network is congested. After a designated holding time, these CAVs are allowed to re-enter the network. Doing so helps reduce congestion and improve overall operational efficiency. Unlike traditional perimeter control approaches, the proposed strategy leverages existing parking infrastructure to temporarily hold vehicles in a way that partially avoids local queue accumulation issues. Further, holding the vehicles with the longest remaining travel distances creates a self-reinforcing mechanism which helps reduce congestion more quickly than perimeter metering control. Simulation results show that the proposed strategy not only reduces travel time for vehicles that are not held, but can also reduce travel times for some of the held vehicles as well. Importantly, its performance has been demonstrated under various configurations of parking locations and capacities and CAV penetration rates.
♻ ☆ NeST-BO: Fast Local Bayesian Optimization via Newton-Step Targeting of Gradient and Hessian Information
Bayesian optimization (BO) is effective for expensive black-box problems but remains challenging in high dimensions. We propose NeST-BO, a curvature-aware local BO method that targets a (modified) Newton step by jointly learning gradient and Hessian information with Gaussian process (GP) surrogates, and selecting evaluations via a one-step lookahead bound on the Newton-step error. We show that this bound contracts with batch size, so NeST-BO drives the step error to zero; in well-behaved neighborhoods it recovers the fast local convergence behavior of inexact/modified Newton methods, while standard safeguards support global convergence to stationary points. To improve scaling with problem dimension, we optimize the acquisition in low-dimensional embedded subspaces (random or learned), reducing the dominant cost of learning curvature from $O(d^2)$ to $O(m^2)$ with $m \ll d$ while preserving step targeting. Across high-dimensional synthetic and real-world problems, including cases with thousands of variables and unknown active subspaces, NeST-BO consistently yields faster convergence and better final values than state-of-the-art local and high-dimensional BO baselines.
♻ ☆ Sensitivity-Based Distributed Programming for Non-Convex Optimization
This paper presents a novel sensitivity-based distributed programming (SBDP) approach for non-convex, large-scale nonlinear programs (NLP). The algorithm relies on first-order sensitivities to cooperatively solve the central NLP in a distributed manner with only neighbor-to-neighbor communication and parallelizable local computations. The decoupling of the subsystems is based on primal decomposition. We derive sufficient local convergence conditions for non-convex problems. Furthermore, we consider the SBDP method in a distributed optimal control context and derive favorable convergence properties in this setting. We illustrate these theoretical findings and the performance of the proposed method with a comparison to state-of-the-art algorithms and simulations of various distributed optimization and control problems.
♻ ☆ Robust Optimality of Bundling Goods Beyond Finite Variance
When selling many goods with independent valuations, we develop a distributionally robust framework, consisting of a two-player game between seller and nature. The seller has only limited knowledge about the value distribution. The seller selects a revenue-maximizing mechanism, after which nature chooses a revenue-minimizing distribution from all distributions that comply with the limited knowledge. When the seller knows the mean and variance of valuations, bundling is known to be an asymptotically optimal deterministic mechanism, achieving a normalized revenue close to the mean. Moving beyond this variance assumption, we assume knowledge of the mean absolute deviation (MAD), accommodating more dispersion and heavy-tailed valuations with infinite variance. We show for a large range of MAD values that bundling remains optimal, but the seller can only guarantee a revenue strictly smaller than the mean. Another noteworthy finding is indifference to the order of play, as both the max-min and min-max versions of the problem yield identical values. This contrasts with deterministic mechanisms and the separate sale of goods, where the order of play significantly impacts outcomes. We further underscore the universality of the optimal bundling price by demonstrating its efficacy in optimizing not only absolute revenue but also the absolute regret and ratio objective among all bundling prices.
♻ ☆ Accelerated optimization algorithms and ordinary differential equations: the convex non Euclidean case
We study the connections between ordinary differential equations and optimization algorithms in a non-Euclidean setting. We propose a novel accelerated algorithm for minimising convex functions over a convex constrained set. This algorithm is a natural generalization of Nesterov's accelerated gradient descent method to the non-Euclidean setting and can be interpreted as an additive Runge-Kutta algorithm. The algorithm can also be derived as a numerical discretization of the ODE appearing in Krichene et al. (2015a). We use Lyapunov functions to establish convergence rates for the ODE and show that the discretizations considered achieve acceleration beyond the setting studied in Krichene et al. (2015a). Finally, we discuss how the proposed algorithm connects to various equations and algorithms in the literature.
comment: 25 pages, 5 figures
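In the Euclidean special case, the algorithm generalized here reduces to Nesterov's accelerated gradient method. The sketch below contrasts that baseline with plain gradient descent on an ill-conditioned quadratic; the problem instance is illustrative and the paper's non-Euclidean, constrained setting is not reproduced.

```python
# Nesterov's accelerated gradient method (Euclidean baseline) vs. plain
# gradient descent on f(x, y) = 50*x^2 + 0.5*y^2 (condition number 100).
# The paper generalizes this scheme to non-Euclidean constrained settings.

def f(v):
    x, y = v
    return 50.0 * x * x + 0.5 * y * y

def grad(v):
    x, y = v
    return (100.0 * x, y)         # gradient; Lipschitz constant L = 100

step = 1.0 / 100.0                # standard 1/L step size

# Plain gradient descent.
v = (1.0, 1.0)
for _ in range(100):
    g = grad(v)
    v = (v[0] - step * g[0], v[1] - step * g[1])
f_gd = f(v)

# Nesterov acceleration with momentum (k-1)/(k+2).
x_prev = x_cur = (1.0, 1.0)
for k in range(1, 101):
    beta = (k - 1) / (k + 2)
    y_k = tuple(xc + beta * (xc - xp) for xc, xp in zip(x_cur, x_prev))
    g = grad(y_k)
    x_prev, x_cur = x_cur, tuple(yi - step * gi for yi, gi in zip(y_k, g))
f_nest = f(x_cur)

assert f_nest < f_gd              # acceleration wins on this problem
```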
♻ ☆ Small-time estimates for a real moment problem with two-term Weyl spectral law
In this work, we first study the solvability of moment problems involving real exponentials and provide explicit estimates of the associated control cost. The result holds when the increasing sequence of distinct real numbers satisfies a suitable two-term Weyl asymptotic law, without imposing any uniform spacing condition on blocks of its elements. We then deduce a corresponding controllability result for a linear control problem. Next, we present an exponential family fitting our hypotheses that cannot be treated by existing results of this type. Finally, we show how to deduce new exact controllability results for suitable fractional bilinear heat equations in higher-dimensional domains.
comment: 26 pages, 0 figures
♻ ☆ Framing structural identifiability in terms of parameter symmetries
A key step in mechanistic modelling of dynamical systems is to conduct a structural identifiability analysis. This entails deducing which parameter combinations can be estimated from a given set of observed outputs. The standard differential algebra approach answers this question by re-writing the model as a higher-order system of ordinary differential equations that depends solely on the observed outputs. Over the last decades, alternative approaches for analysing structural identifiability, based on Lie symmetries acting on independent and dependent variables as well as parameters, have been proposed. However, the link between the standard differential algebra approach and that using full symmetries remains elusive. In this work, we establish this link by introducing the notion of parameter symmetries, which are a special type of full symmetry that alter parameters while preserving the observed outputs. Our main result states that a parameter combination is locally structurally identifiable if and only if it is a differential invariant of all parameter symmetries of a given model. We show that the standard differential algebra approach is consistent with the concept of structural identifiability in terms of parameter symmetries. We present an alternative symmetry-based approach for analysing structural identifiability using parameter symmetries. Lastly, we demonstrate our approach on two well-known models in mathematical biology.
comment: 45 pages, 2 figures
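A toy example (invented here, not from the paper) makes the notion tangible: for $\dot{x} = abx$ with output $y = x$, the scaling $(a, b) \mapsto (\lambda a, b/\lambda)$ is a parameter symmetry, so only the invariant product $ab$ is structurally identifiable.

```python
# Toy parameter symmetry: for x' = a*b*x with output y = x, the scaling
# (a, b) -> (lam*a, b/lam) leaves the output invariant, so only the
# differential invariant a*b is identifiable. Hypothetical example.

def simulate(a, b, x0=1.0, dt=1e-3, steps=1000):
    x, out = x0, []
    for _ in range(steps):
        x += dt * a * b * x      # forward Euler for x' = a*b*x
        out.append(x)
    return out

y1 = simulate(a=2.0, b=3.0)
y2 = simulate(a=4.0, b=1.5)      # same product a*b = 6, i.e. lam = 2
assert all(abs(u - v) < 1e-12 for u, v in zip(y1, y2))
```

Both parameterizations produce identical outputs, so no amount of output data can separate $a$ from $b$; only the invariant $ab$ can be estimated.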
♻ ☆ Distributionally robust monopoly pricing: Switching from low to high prices in volatile markets
Problem definition: Traditional monopoly pricing assumes sellers have full information about consumer valuations. We consider monopoly pricing under limited information, where a seller only knows the mean, variance and support of the valuation distribution. The objective is to maximize expected revenue by selecting the optimal fixed price. Methodology/results: We adopt a distributionally robust framework, where the seller considers all valuation distributions that comply with the limited information. We formulate a maximin problem which seeks to maximize expected revenue for the worst-case valuation distribution. The minimization problem that identifies the worst-case valuation distribution is solved using primal-dual methods, and in turn leads to an explicitly solvable maximization problem. This yields a closed-form optimal pricing policy and a new fundamental principle prescribing when to use low and high robust prices. Managerial implications: We show that the optimal policy switches from low to high prices when variance becomes sufficiently large, yielding significant performance gains compared with existing robust prices that generally decay with market uncertainty. This presents guidelines for when the seller should switch from targeting mass markets to niche markets. Similar guidelines are obtained for delay-prone services with rational utility-maximizing customers, underlining the universality and wide applicability of the novel pricing policy.
♻ ☆ CACTO-SL: Using Sobolev Learning to improve Continuous Actor-Critic with Trajectory Optimization
Trajectory Optimization (TO) and Reinforcement Learning (RL) are powerful and complementary tools to solve optimal control problems. On the one hand, TO can efficiently compute locally-optimal solutions, but it tends to get stuck in local minima if the problem is not convex. On the other hand, RL is typically less sensitive to non-convexity, but it requires a much higher computational effort. Recently, we have proposed CACTO (Continuous Actor-Critic with Trajectory Optimization), an algorithm that uses TO to guide the exploration of an actor-critic RL algorithm. In turn, the policy encoded by the actor is used to warm-start TO, closing the loop between TO and RL. In this work, we present an extension of CACTO exploiting the idea of Sobolev learning. To make the training of the critic network faster and more data efficient, we enrich it with the gradient of the Value function, computed via a backward pass of the differential dynamic programming algorithm. Our results show that the new algorithm is more efficient than the original CACTO, reducing the number of TO episodes by a factor ranging from 3 to 10, and consequently the computation time. Moreover, we show that CACTO-SL helps TO to find better minima and to produce more consistent results.
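Sobolev learning in miniature: fit a model to samples of both function values *and* derivatives by minimizing a combined loss. The sketch below fits a quadratic by gradient descent on that loss; it is a generic illustration of value-plus-gradient fitting, not the paper's DDP-based critic training.

```python
# Sobolev learning sketch: fit v(x) = c0 + c1*x + c2*x^2 to samples of both
# values and derivatives by gradient descent on the combined loss
#   L(c) = sum_i (v(x_i) - y_i)^2 + (v'(x_i) - g_i)^2.
# Generic illustration, not the paper's DDP-based critic training.

xs = [0.0, 0.5, 1.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]     # values of 1 + 2x + 3x^2
gs = [2.0 + 6.0 * x for x in xs]                   # derivatives 2 + 6x

c = [0.0, 0.0, 0.0]
lr = 0.05
for _ in range(4000):
    grad = [0.0, 0.0, 0.0]
    for x, y, g in zip(xs, ys, gs):
        rv = (c[0] + c[1] * x + c[2] * x * x) - y  # value residual
        rg = (c[1] + 2.0 * c[2] * x) - g           # derivative residual
        grad[0] += 2 * rv
        grad[1] += 2 * rv * x + 2 * rg
        grad[2] += 2 * rv * x * x + 2 * rg * 2.0 * x
    c = [ci - lr * gi for ci, gi in zip(c, grad)]

assert all(abs(ci - ti) < 1e-6 for ci, ti in zip(c, [1.0, 2.0, 3.0]))
```

With derivative samples in the loss, the three coefficients are pinned down by far fewer sample points than value-only fitting would need, which is the data-efficiency argument for enriching the critic with value gradients.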
♻ ☆ Infinite-Time Mean Field FBSDEs and Viscosity Solutions to Elliptic Master Equations
This paper presents a further investigation of the properties of infinite-time mean field FBSDEs and elliptic master equations, which were introduced in [14] as mathematical tools for solving discounted infinite-time mean field games. By establishing the continuous dependence of the FBSDE solutions on their initial values, we prove the flow property of the mean field FBSDEs. We then prove that, at the Nash equilibrium, the value function of the representative player constitutes a viscosity solution to the corresponding elliptic master equation. Furthermore, we establish the uniqueness of classical solutions to the elliptic master equation under displacement monotonicity and certain growth assumptions.
♻ ☆ Unsafe Probabilities and Risk Contours for Stochastic Processes using Convex Optimization
This paper proposes an algorithm to calculate the maximal probability of unsafety with respect to trajectories of a stochastic process and a hazard set. The unsafe probability estimation problem is cast as a primal-dual pair of infinite-dimensional linear programs in occupation measures and continuous functions. This convex relaxation is nonconservative (its optimal value attains the true probability of unsafety) under compactness and regularity conditions on the dynamics. The continuous-function linear program is linked to existing probability-certifying barrier certificates of safety. Risk contours for initial conditions of the stochastic process may be generated by suitably modifying the objective of the continuous-function program, forming an interpretable and visual representation of stochastic safety for test initial conditions. All infinite-dimensional linear programs are truncated to finite dimension by the Moment-Sum-of-Squares hierarchy of semidefinite programs. Unsafe-probability estimates and risk contours are generated for example stochastic processes.
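The paper's Moment-Sum-of-Squares machinery is beyond a short snippet, but the quantity being bounded has a simple finite-state analogue: the probability that a Markov chain enters a hazard set within a horizon, computable by a backward recursion. The random walk below is purely illustrative and is not the paper's method.

```python
import numpy as np

# Finite-state analogue: probability that a symmetric random walk on {0,...,N}
# hits the hazard set {0, N} within horizon T, via the backward recursion
#   v_t(x) = 1                        if x is in the hazard set,
#   v_t(x) = sum_y P(x, y) v_{t+1}(y) otherwise,  with v_T = indicator(hazard).
N, T = 10, 50
hazard = np.zeros(N + 1, dtype=bool)
hazard[[0, N]] = True

P = np.zeros((N + 1, N + 1))
for i in range(1, N):                 # symmetric steps in the interior
    P[i, i - 1] = P[i, i + 1] = 0.5
P[0, 0] = P[N, N] = 1.0               # hazard states are absorbing

v = hazard.astype(float)              # terminal condition v_T
for _ in range(T):
    v = np.where(hazard, 1.0, P @ v)  # one backward step
print("P(unsafe within T) from the middle state:", v[N // 2])
```

Mapping initial states to these probabilities is the discrete counterpart of the paper's risk contours.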
comment: 18 pages, 5 figures, 2 tables
♻ ☆ The Subspace Flatness Conjecture and Faster Integer Programming
In a seminal paper, Kannan and Lovász (1988) considered a quantity $μ_{KL}(Λ,K)$ which denotes the best volume-based lower bound on the covering radius $μ(Λ,K)$ of a convex body $K$ with respect to a lattice $Λ$. Kannan and Lovász proved that $μ(Λ,K) \leq n \cdot μ_{KL}(Λ,K)$, and the Subspace Flatness Conjecture by Dadush (2012) claims that an $O(\log(2n))$ factor suffices, which would match the lower bound from the work of Kannan and Lovász. We settle this conjecture up to a constant in the exponent by proving that $μ(Λ,K) \leq O(\log^{3}(2n)) \cdot μ_{KL} (Λ,K)$. Our proof is based on the Reverse Minkowski Theorem due to Regev and Stephens-Davidowitz (2017). Following the work of Dadush (2012, 2019), we obtain a $(\log(2n))^{O(n)}$-time randomized algorithm to solve integer programs in $n$ variables. Another implication of our main result is a near-optimal flatness constant of $O(n \log^{2}(2n))$, improving on the previous bound of $O(n^{4/3} \log^{O(1)} (2n))$.
comment: 49 pages
♻ ☆ Stabilizing a linear system using phone calls when time is information
We consider the problem of stabilizing an undisturbed, scalar, linear system over a "timing" channel, namely a channel where information is communicated through the timestamps of the transmitted symbols. Each symbol transmitted from a sensor to a controller in a closed-loop system is received subject to a random delay. The sensor can encode messages in the waiting times between successive transmissions and the controller must decode them from the inter-reception times of successive symbols. This set-up is analogous to a telephone system where a transmitter signals a phone call to a receiver through a "ring" and, after the random delay required to establish the connection, the receiver is aware of the "ring" being received. Since there is no data payload exchange between the sensor and the controller, this set-up provides an abstraction for performing event-triggering control with zero-payload rate. We show the following requirement for stabilization: for the state of the system to converge to zero in probability, the timing capacity of the channel should be, essentially, at least as large as the entropy rate of the system. Conversely, in the case where the symbol delays are exponentially distributed, we show an "almost" tight sufficient condition using a coding strategy that refines the estimate of the decoded message every time a new symbol is received. Our results generalize previous zero-payload event-triggering control strategies, revealing a fundamental limit in using timing information for stabilization, independent of any transmission strategy.
♻ ☆ Bayesian Optimization on Networks
This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network's geometry, we adopt Whittle-Matérn Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Matérn prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.
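Whittle-Matérn priors on metric graphs require specialized SPDE machinery, but the surrounding Bayesian optimization loop is generic. The sketch below uses a plain squared-exponential kernel on a 1-D Euclidean interval as a stand-in surrogate, with an upper-confidence-bound acquisition; objective, kernel, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel (a Euclidean stand-in for the paper's
    Whittle-Matérn prior on a metric graph)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at query points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ij->j", Ks, np.linalg.solve(K, Ks))
    return mean, np.sqrt(np.maximum(var, 0.0))

f = lambda x: np.sin(3.0 * x)        # toy black-box objective on [0, 2]
grid = np.linspace(0.0, 2.0, 201)    # candidate query points
X = np.array([0.2, 1.0, 1.8])        # initial design
y = f(X)

for _ in range(10):                   # sequential acquisition loop
    mean, std = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(mean + 2.0 * std)]   # UCB acquisition
    X, y = np.append(X, x_next), np.append(y, f(x_next))

print(f"best query: x = {X[np.argmax(y)]:.3f}, f(x) = {y.max():.3f}")
```

On a network, the kernel would instead come from the graph's SPDE-defined covariance, so that distances and smoothness respect the metric-graph geometry.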
comment: 40 pages, 10 figures; includes appendices
♻ ☆ Introduction to Online Control
This text presents an introduction to an emerging paradigm in control of dynamical systems and differentiable reinforcement learning called online nonstochastic control. The new approach applies techniques from online convex optimization and convex relaxations to obtain new methods with provable guarantees for classical settings in optimal and robust control. The primary distinction between online nonstochastic control and other frameworks is the objective. In optimal control, robust control, and other control methodologies that assume stochastic noise, the goal is to perform comparably to an offline optimal strategy. In online nonstochastic control, both the cost functions as well as the perturbations from the assumed dynamical model are chosen by an adversary. Thus the optimal policy is not defined a priori. Rather, the target is to attain low regret against the best policy in hindsight from a benchmark class of policies. This objective suggests the use of the decision making framework of online convex optimization as an algorithmic methodology. The resulting methods are based on iterative mathematical optimization algorithms, and are accompanied by finite-time regret and computational complexity guarantees.
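The policy classes of online nonstochastic control are too involved for a snippet, but the underlying online-convex-optimization primitive is not: projected online gradient descent competing against the best fixed decision in hindsight. The losses and constants below are illustrative assumptions.

```python
import numpy as np

# Projected online gradient descent on adversarially chosen quadratic losses
# f_t(x) = (x - z_t)^2 over [0, 1]; regret is measured against the best
# *fixed* decision in hindsight, as in the online nonstochastic framework.
T = 1000
rng = np.random.default_rng(1)
z = rng.choice([0.0, 1.0], size=T)    # adversarial-looking targets

x, losses = 0.5, []
for t in range(1, T + 1):
    losses.append((x - z[t - 1])**2)  # suffer the loss, then observe it
    grad = 2.0 * (x - z[t - 1])
    x = np.clip(x - grad / np.sqrt(t), 0.0, 1.0)  # OGD step + projection

best_fixed = z.mean()                 # minimizer of sum_t (x - z_t)^2 on [0, 1]
regret = sum(losses) - np.sum((best_fixed - z)**2)
print(f"regret after T={T}: {regret:.2f}  (compare O(sqrt(T)) ~ {np.sqrt(T):.1f})")
```

The O(√T) regret scaling of this primitive is what the convex relaxations of online nonstochastic control inherit for richer policy classes.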
comment: Draft; comments/suggestions welcome at nonstochastic.control@gmail.com