Story

DextrAH-G presents the following contributions

Efficient sim2real scaling for vectorized RL training
Using geometry fabrics for safe policies

Problems

each object is specified by a one-hot embedding for the teacher
different random seeds can result in vastly different learned behaviors, from which the authors need to handpick the best one, meaning that there is no replacement for real-world testing (gotta train multiple teacher policies and critics, each with different behavior)

Geometry Fabrics

geometric fabrics generalize the behavior of classical mechanical systems and, thereby, can be used to model controllers with design flexibility, composability, and stability without the loss of modeling fidelity. Behavior expressed by a geometric fabric follows the form

\[M_f(q_f, \dot{q}_f)\ddot{q}_f+f_f(q_f, \dot{q}_f)+f_\pi(a)=0\]

where $M_f\in\mathbb{R}^{n\times n}$ is the positive-definite system metric (mass), which captures system prioritization, $f_f\in\mathbb{R}^n$ is a nominal path generating geometric force, and $f_\pi(a)\in\mathbb{R}^n$ is an additional driving force of some action $a\in\mathbb{R}^m$. $q_f, \dot{q}_f, \ddot{f}\in\mathbb{R}^n$ are the position, velocity, and acceleration of the fabric.

Dataset

trained on 140 diverse objects in Visual Dexterity object dataset object meshes are preprocessed by computing the mesh centroid using Trimesh and transforming the vertices such that the new mesh centroid is at the origin, so that the object positions given to the simulator would be exactly at the object centroids

Methods

Asymmetric Actor Critic training

Teacher-privileged training

The critic $V(s)$ receives privileged state information $s$, and the teacher policy $\pi_{\text{privileged}}(o_{\text{privileged}})$ is given an observation
$o_{\text{privileged}} = [\,o_{\text{robot}},\;x_{\text{goal}},\;o_{\text{obj}}\,]$.

(o_{\text{robot}}) contains
- configuration position (q \in \mathbb{R}^{N_q}),
- configuration velocity (\dot q \in \mathbb{R}^{N_q}) ((N_q = 23)),
- positions of three palm points ([\,x_{\text{palm}},\;x_{\text{palm-x}},\;x_{\text{palm-y}}\,] \in \mathbb{R}^{3\times 3}),
- fingertip positions (x_{\text{fingertips}} \in \mathbb{R}^{N_{\text{fingers}}\times 3}) with (N_{\text{fingers}} = 4),
- fabric state ([\,q_f,\;\dot q_f,\;\ddot q_f\,] \in \mathbb{R}^{3\times N_q}).
(x_{\text{goal}} \in \mathbb{R}^3) is the goal object position.
(o_{\text{obj}}) contains
- noisy object position (\tilde x_{\text{obj}} \in \mathbb{R}^3),
- noisy quaternion (\tilde q_{\text{obj}} \in \mathbb{R}^4),
- one-hot object embedding (e \in {0,1}^{N_{\text{objects}}}) with (N_{\text{objects}} = 140).

The critic’s input state is
(s = [\,o_{\text{privileged}},\;s_{\text{privileged}}\,]).
(s_{\text{privileged}}) adds

joint forces (f_{\text{dof}} \in \mathbb{R}^{N_q}),
fingertip contact forces (f_{\text{fingers}} \in \mathbb{R}^{N_{\text{fingers}}\times 3}),
true object position (x_{\text{obj}} \in \mathbb{R}^3),
true quaternion (q_{\text{obj}} \in \mathbb{R}^4),
true linear velocity (v_{\text{obj}} \in \mathbb{R}^3),
true angular velocity (w_{\text{obj}} \in \mathbb{R}^3).

The teacher-policy action (a) drives the geometric fabric:

\[a = [\,x_{f,\text{target}},\;r_{f,\text{target}},\;x_{\text{pca},\text{target}}\,] \in \mathbb{R}^{11},\]

where (x_{f,\text{target}} \in \mathbb{R}^3) is the target palm position,
(r_{f,\text{target}} \in \mathbb{R}^3) is the target palm orientation (Euler), and
(x_{\text{pca},\text{target}} \in \mathbb{R}^5) is the target PCA vector for the fingers.
The fabric integrates at 60 Hz, the simulation also steps at 60 Hz, and the teacher runs at 15 Hz (actions are repeated between updates).

Teacher network: MLP + LSTM with skip connections.
Critic network: pure MLP (has privileged state, so no temporal memory needed).

Student distillation

The distilled student policy consumes depth images at 15 Hz for reactive, real-world grasping.
During distillation the student

[ \pi_{\text{depth}}(o_{\text{depth}}) \;\longrightarrow\; (\,\hat a,\;\hat x_{\text{obj}}\,) ]

receives
(o_{\text{depth}} = [\,o_{\text{robot}},\;x_{\text{goal}},\;I\,]),
where (I \in [0.5,1.5]^{160\times 120}\,\text{m}) is a raw depth image.
The student outputs actions (\hat a \in \mathbb{R}^{11}) and object-position estimates (\hat x_{\text{obj}} \in \mathbb{R}^{3}).

The training loss is

[ \mathcal{L} \;=\; \mathcal{L}{\text{action}} \;+\; \beta\,\mathcal{L}{\text{pos}}, \qquad \begin{aligned} \mathcal{L}{\text{action}} &= \lVert \hat a - a \rVert_2,
\mathcal{L}{\text{pos}} &= \lVert \hat x_{\text{obj}} - x_{\text{obj}} \rVert_2. \end{aligned} ]

Here (a) comes from the teacher (\pi_{\text{privileged}}) and (x_{\text{obj}}) is the ground-truth simulator pose.

(o_{\text{robot}}) and (x_{\text{goal}}) → 3-layer MLPs ((512,256,128)) with ELU.
Image (I) → three conv layers (depth 16 → 32 → 64, kernel 3, stride 1, pad 1) + max-pools, then MLP ((128,128)) with ReLU.
All embeddings concat → 1-layer GRU (1024 units) → predicts (\hat a).

Rewards

Total reward is a weighted sum (r=\sum_i w_i r_i).
For any error signal (e) we define a stateful improvement reward

[ \mathrm{minimize}(e) \;=\; \max{\,e_{\text{smallest}} - e,\;0}. ]

A positive reward is given only when the current error becomes smaller than all previous errors in the episode, after which (e_{\text{smallest}}) is updated.

(You can insert your diagram here.)

The episode terminates if the object falls off the table, the success reward is earned, or the time limit is reached.

Domain randomization

Random wrench perturbations – apply random forces/torques to the object to encourage robust grasps.
Pose noise – add both correlated and uncorrelated noise to object pose; the hand learns to approach more open and tolerant of uncertainty.
Friction reduction – lower the object’s friction to (\mu = 0.7), discouraging grasps that rely too heavily on friction.
Parameter randomization – vary simulation parameters broadly to improve real-world transfer.