Story

DextrAH-G presents the following contributions:

  1. Efficient sim2real scaling for vectorized RL training
  2. Using geometry fabrics for safe policies

Problems

  1. each object is specified by a one-hot embedding for the teacher
  2. different random seeds can result in vastly different learned behaviors, so the authors have to train multiple teacher policies and critics (each with different behavior) and handpick the best one; there is no replacement for real-world testing

Geometry Fabrics

Geometric fabrics generalize the behavior of classical mechanical systems and can therefore be used to model controllers with design flexibility, composability, and stability, without loss of modeling fidelity. The behavior expressed by a geometric fabric follows the form

\[M_f(q_f, \dot{q}_f)\ddot{q}_f+f_f(q_f, \dot{q}_f)+f_\pi(a)=0\]

where $M_f\in\mathbb{R}^{n\times n}$ is the positive-definite system metric (mass), which captures system prioritization, $f_f\in\mathbb{R}^n$ is a nominal path-generating geometric force, and $f_\pi(a)\in\mathbb{R}^n$ is an additional driving force determined by an action $a\in\mathbb{R}^m$. $q_f, \dot{q}_f, \ddot{q}_f\in\mathbb{R}^n$ are the position, velocity, and acceleration of the fabric.
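
As a minimal sketch of how these dynamics could be rolled out, the snippet below solves the fabric equation for $\ddot{q}_f$ and integrates it at the fabric rate; `M_f`, `f_f`, and `f_pi` are assumed to be user-supplied callables and are not the paper's implementation.

```python
import numpy as np

def step_fabric(q, qd, M_f, f_f, f_pi, a, dt=1.0 / 60.0):
    """One semi-implicit Euler step of the fabric dynamics
    M_f(q, qd) @ qdd + f_f(q, qd) + f_pi(a) = 0.
    M_f, f_f, f_pi are illustrative callables, not the paper's API."""
    # Solve the fabric equation for the acceleration qdd.
    qdd = np.linalg.solve(M_f(q, qd), -(f_f(q, qd) + f_pi(a)))
    # Integrate velocity, then position, at the fabric rate (60 Hz here).
    qd_next = qd + dt * qdd
    q_next = q + dt * qd_next
    return q_next, qd_next
```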

Dataset

Training uses 140 diverse objects from the Visual Dexterity object dataset. Object meshes are preprocessed by computing the mesh centroid with Trimesh and translating the vertices so that the new centroid lies at the origin; that way, the object positions given to the simulator are exactly the object centroids.
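
A small sketch of that preprocessing step with Trimesh (file paths and the function name are placeholders, not from the paper):

```python
import trimesh

def center_mesh(in_path: str, out_path: str) -> trimesh.Trimesh:
    """Translate a mesh so its centroid sits at the origin, then save it."""
    mesh = trimesh.load(in_path, force="mesh")
    # Shift every vertex by the negative centroid; the new centroid is (0, 0, 0).
    mesh.apply_translation(-mesh.centroid)
    mesh.export(out_path)
    return mesh
```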

Methods

Asymmetric Actor Critic training

Teacher-privileged training

The critic $V(s)$ receives privileged state information $s$, and the teacher policy $\pi_{\text{privileged}}(o_{\text{privileged}})$ is given an observation
$o_{\text{privileged}} = [\,o_{\text{robot}},\;x_{\text{goal}},\;o_{\text{obj}}\,]$.

The critic’s input state is
$s = [\,o_{\text{privileged}},\;s_{\text{privileged}}\,]$,
where $s_{\text{privileged}}$ adds further privileged simulator state beyond the teacher's observation.
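
For concreteness, a hedged sketch of how the asymmetric inputs might be assembled; the array names and the flat concatenation are assumptions, not the paper's exact interface:

```python
import numpy as np

def build_actor_critic_inputs(o_robot, x_goal, o_obj, s_privileged):
    """The teacher (actor) sees o_privileged; the critic additionally sees
    simulator-only privileged state s_privileged."""
    o_privileged = np.concatenate([o_robot, x_goal, o_obj])  # actor input
    s = np.concatenate([o_privileged, s_privileged])         # critic input
    return o_privileged, s
```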

The teacher-policy action $a$ drives the geometric fabric:

\[a = [\,x_{f,\text{target}},\;r_{f,\text{target}},\;x_{\text{pca},\text{target}}\,] \in \mathbb{R}^{11},\]

where $x_{f,\text{target}} \in \mathbb{R}^3$ is the target palm position,
$r_{f,\text{target}} \in \mathbb{R}^3$ is the target palm orientation (Euler angles), and
$x_{\text{pca},\text{target}} \in \mathbb{R}^5$ is the target PCA vector for the fingers.
The fabric integrates at 60 Hz, the simulation also steps at 60 Hz, and the teacher runs at 15 Hz (actions are repeated between updates).
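
A sketch of that multi-rate loop; `env`, `policy`, and `fabric` are illustrative interfaces (not the paper's API), chosen only to show the 15 Hz policy / 60 Hz fabric-and-sim timing with action repeat:

```python
FABRIC_HZ = 60                            # fabric and simulation rate
POLICY_HZ = 15                            # teacher policy rate
ACTION_REPEAT = FABRIC_HZ // POLICY_HZ    # 4 fabric/sim steps per policy action

def rollout(env, policy, fabric, n_policy_steps):
    """Step the teacher at 15 Hz while the fabric and simulator run at 60 Hz."""
    obs = env.reset()
    for _ in range(n_policy_steps):
        # 11-D action: palm position (3) + palm Euler orientation (3) + finger PCA (5).
        a = policy(obs)
        for _ in range(ACTION_REPEAT):
            q_target = fabric.step(a)   # fabric integrates toward the action targets
            obs = env.step(q_target)    # simulator tracks the fabric state
    return obs
```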


Student distillation

The distilled student policy consumes depth images at 15 Hz for reactive, real-world grasping.
During distillation the student

\[ \pi_{\text{depth}}(o_{\text{depth}}) \;\longrightarrow\; (\,\hat{a},\;\hat{x}_{\text{obj}}\,) \]

receives the observation
$o_{\text{depth}} = [\,o_{\text{robot}},\;x_{\text{goal}},\;I\,]$,
where $I \in [0.5,1.5]^{160\times 120}\,\text{m}$ is a raw depth image.
The student outputs actions $\hat{a} \in \mathbb{R}^{11}$ and object-position estimates $\hat{x}_{\text{obj}} \in \mathbb{R}^{3}$.

The training loss is

\[ \mathcal{L} \;=\; \mathcal{L}_{\text{action}} \;+\; \beta\,\mathcal{L}_{\text{pos}}, \qquad \mathcal{L}_{\text{action}} = \lVert \hat{a} - a \rVert_2, \quad \mathcal{L}_{\text{pos}} = \lVert \hat{x}_{\text{obj}} - x_{\text{obj}} \rVert_2. \]

Here $a$ comes from the teacher $\pi_{\text{privileged}}$ and $x_{\text{obj}}$ is the ground-truth object position from the simulator.
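
A minimal PyTorch sketch of this loss, assuming the student network returns both outputs in one forward pass; the module interface and the default $\beta$ are assumptions:

```python
import torch

def distillation_loss(student, o_depth, a_teacher, x_obj_gt, beta=1.0):
    """L = ||a_hat - a||_2 + beta * ||x_hat_obj - x_obj||_2, averaged over the batch."""
    a_hat, x_obj_hat = student(o_depth)                    # shapes (B, 11) and (B, 3)
    loss_action = torch.norm(a_hat - a_teacher, dim=-1)    # per-sample L2 norm
    loss_pos = torch.norm(x_obj_hat - x_obj_gt, dim=-1)
    return (loss_action + beta * loss_pos).mean()
```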


Rewards

Total reward is a weighted sum $r=\sum_i w_i r_i$.
For any error signal $e$ we define a stateful improvement reward

\[ \mathrm{minimize}(e) \;=\; \max\{\,e_{\text{smallest}} - e,\;0\,\}. \]

A positive reward is given only when the current error becomes smaller than all previous errors in the episode, after which $e_{\text{smallest}}$ is updated.
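
A minimal Python sketch of this stateful reward term (the class name and interface are illustrative):

```python
class MinimizeReward:
    """Improvement reward: positive only when the error drops below the
    smallest error seen so far in the episode."""

    def __init__(self):
        self.e_smallest = float("inf")

    def __call__(self, e: float) -> float:
        reward = max(self.e_smallest - e, 0.0)
        # Update the running best so later rewards require further improvement.
        self.e_smallest = min(self.e_smallest, e)
        return reward

    def reset(self):
        """Call at the start of each episode."""
        self.e_smallest = float("inf")
```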


The episode terminates if the object falls off the table, the success reward is earned, or the time limit is reached.


Domain randomization

Tags