The authors extend generative adversarial imitation learning to enable training of generic neural network policies that produce humanlike movement patterns from limited demonstrations consisting only of partially observed state features, without access to actions.

They build sub-skill policies from motion capture data and show that these can be reused to solve tasks when controlled by a higher-level controller.

A simple reward function is not enough for the agent to learn humanlike movements.

This paper proposes a generative adversarial imitation learning approach that learns humanlike movement patterns from limited, partially observed state features, without the demonstrations having to include action information.

In addition, two decision-making modules at different levels are used, so that the movements learned by the low-level controller can be reused by a high-level controller: the high-level controller sends commands that make the low-level controller output different movements.

Methods that rely on pure reinforcement learning (RL) objectives tend to produce insufficiently humanlike and overly stereotyped movement behaviors. The uncanniness of these movements can be improved with meticulous body design (e.g. muscles), controller specialization (e.g. phase variables), and reward function engineering, but these methods require substantial domain expertise. Designing reward functions to capture the intricacies of humanoid behavior is difficult and must be repeated for every new behavior.

  • perform imitation learning from motion capture data

State-action pairs from demonstration data are compared against state-action pairs from the policy. A classifier is trained to discriminate demonstration data from the imitation data, and the imitation policy is given reward for fooling the discriminator.
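
A minimal sketch of this discriminator-based reward, written in PyTorch. The network sizes, the feature dimensionality, and the choice of -log(1 - D) as the reward are illustrative assumptions, not the paper's exact setup; here the discriminator sees only state features rather than state-action pairs, matching the partial-observation setting described in this paper.

```python
import torch
import torch.nn as nn

# Discriminator over partial state features (no actions): outputs a logit for
# "this feature vector came from the demonstration data".
feat_dim = 32  # assumed size of the partial state featurization
disc = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(demo_feats, policy_feats):
    """One classifier update: demonstration features labeled 1, imitation rollouts labeled 0."""
    logits = torch.cat([disc(demo_feats), disc(policy_feats)])
    labels = torch.cat([torch.ones(len(demo_feats), 1),
                        torch.zeros(len(policy_feats), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

def imitation_reward(policy_feats):
    """Reward the policy for features the discriminator classifies as demo-like.
    -log(1 - D) is one common GAN-style choice; the exact form is an assumption."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_feats))
    return -torch.log(1.0 - d + 1e-8)
```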

In this work, we present a pipeline for

  • (1) training low-level controllers to produce behaviors from motion capture using an extension of GAIL;
  • (2) embedding the low-level controllers into larger control systems wherein a high-level controller learns by RL to modulate the low-level controller to solve new tasks (see the wiring sketch below).
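
A rough sketch of how such a two-level system could be wired up, assuming the low-level controller was pre-trained by imitation (and is frozen here) and is conditioned on a command vector produced by the high-level controller. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class LowLevelController(nn.Module):
    """Pre-trained sub-skill policy: maps (proprioceptive state, command) to joint torques."""
    def __init__(self, obs_dim, cmd_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + cmd_dim, 128), nn.Tanh(),
                                 nn.Linear(128, act_dim))

    def forward(self, obs, cmd):
        return self.net(torch.cat([obs, cmd], dim=-1))

class HighLevelController(nn.Module):
    """Task policy trained by RL: maps a task observation to a command that
    modulates (e.g. selects or blends) the low-level behaviors."""
    def __init__(self, task_obs_dim, cmd_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(task_obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, cmd_dim))

    def forward(self, task_obs):
        return self.net(task_obs)

# Assumed dimensions, purely for illustration.
low = LowLevelController(obs_dim=40, cmd_dim=8, act_dim=20)
high = HighLevelController(task_obs_dim=50, cmd_dim=8)
for p in low.parameters():          # freeze the imitation-trained skills
    p.requires_grad_(False)

obs, task_obs = torch.zeros(1, 40), torch.zeros(1, 50)
action = low(obs, high(task_obs))   # only the high-level controller is trained by RL
```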

GAIL

The acquisition of multiple behaviors from noisy motion capture data ("real-to-sim") requires two extensions to the GAIL framework. The original presentation of GAIL was restricted to imitation of single skills from complete state-action trajectories, where the demonstrator shared the same body and policy parameterization as the imitator. We demonstrate: (a) partial state featurizations without demonstrator actions suffice for adversarial imitation; (b) the body structure and physical parameters (i.e. body dynamics) need not match between the demonstrator and the imitator; and (c) robust transitions between behaviors naturally emerge by training on multiple behaviors.
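
To make extensions (a) and (b) concrete, the sketch below shows the kind of shared, action-free featurization that could be computed both from a mocap frame and from the imitator's simulator state. The specific features chosen here (joint angles plus end-effector positions relative to the root) are an illustrative assumption, not the paper's exact feature set.

```python
import numpy as np

def partial_features(joint_angles, end_effector_positions, root_position):
    """Action-free features computable from both mocap frames and the simulator:
    joint angles plus end-effector positions expressed relative to the root."""
    rel = np.asarray(end_effector_positions) - np.asarray(root_position)
    return np.concatenate([np.asarray(joint_angles), rel.ravel()])

# Illustrative dummy data standing in for a mocap frame and a simulator state.
mocap_feat = partial_features(np.zeros(21), np.zeros((4, 3)), np.zeros(3))
sim_feat   = partial_features(np.zeros(21), np.zeros((4, 3)), np.zeros(3))
# Both are fed to the discriminator without any actions attached, so the
# demonstrator never needs to share the imitator's body parameters or action space.
```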

GAIL encourages the imitator to match the state-occupancy distribution of the demonstrator.
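
For reference, under one common convention (writing $D(\phi(s))$ for the discriminator's probability that the partial features $\phi(s)$ came from the demonstrator $\pi_E$), the adversarial objective and the induced imitation reward can be written as:

$$
\max_{D}\;\; \mathbb{E}_{s \sim \pi_E}\!\left[\log D(\phi(s))\right] \;+\; \mathbb{E}_{s \sim \pi}\!\left[\log\!\left(1 - D(\phi(s))\right)\right],
\qquad
r(s) \;=\; -\log\!\left(1 - D(\phi(s))\right).
$$

The imitation policy $\pi$ maximizes the discounted sum of $r(s)$, which pushes its visited states toward ones the discriminator judges demonstrator-like, i.e. toward matching the demonstrator's state-occupancy distribution. The restriction of $D$ to state features rather than state-action pairs reflects the extension described above.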

In the original GAIL work, demonstration data were generated by training a first policy via RL on a task with a hand-designed reward function, logging trajectories, and training a second policy of the same architecture as the first to imitate it by GAIL. While impressive, those results constitute a validation of the algorithm in the most favorable setting: the same body, simulator, and policy architecture are used to produce demonstrations as are used for imitation.
