|
# Example Learning Environments |
|
|
|
<img src="../images/example-envs.png" align="middle" width="3000"/> |
|
|
|
The Unity ML-Agents Toolkit includes an expanding set of example environments |
|
that highlight the various features of the toolkit. These environments can also |
|
serve as templates for new environments or as ways to test new ML algorithms. |
|
Environments are located in `Project/Assets/ML-Agents/Examples` and summarized |
|
below. |
|
|
|
For the environments that highlight specific features of the toolkit, we provide |
|
the pre-trained model files and the training config file that enables you to |
|
train the scene yourself. The environments that are designed to serve as |
|
challenges for researchers do not have accompanying pre-trained model files or |
|
training configs and are marked as _Optional_ below. |
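
Each of these scenes can be trained by passing its config file to `mlagents-learn`
(for example, `mlagents-learn config/ppo/3DBall.yaml --run-id=3DBall`). They can
also be driven directly from the low-level Python API, which is handy for
inspecting a scene's behaviors or for testing a new algorithm. The following is a
minimal sketch (assuming a recent `mlagents_envs` release) that attaches to
whichever example scene is currently playing in the Unity Editor and steps it with
random actions:

```python
from mlagents_envs.environment import UnityEnvironment

# file_name=None attaches to a scene already playing in the Unity Editor;
# pass the path of a built executable to run an environment standalone instead.
env = UnityEnvironment(file_name=None)
env.reset()

# Each example scene registers one or more named behaviors.
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print("Connected to behavior:", behavior_name)

for _ in range(100):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # Sample a random action for every agent that requested a decision.
    action = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, action)
    env.step()

env.close()
```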
|
|
|
This page offers only an overview of the example environments we provide. To
learn more about how to design and build your own environments, see our
|
[Making a New Learning Environment](Learning-Environment-Create-New.md) page. If |
|
you would like to contribute environments, please see our |
|
[contribution guidelines](CONTRIBUTING.md) page. |
|
|
|
## Basic |
|
|
|
 |
|
|
|
- Set-up: A linear movement task where the agent must move left or right to |
|
rewarding states. |
|
- Goal: Move to the most rewarding state.
|
- Agents: The environment contains one agent. |
|
- Agent Reward Function: |
|
- -0.01 at each step |
|
- +0.1 for arriving at suboptimal state. |
|
- +1.0 for arriving at optimal state. |
|
- Behavior Parameters: |
|
- Vector Observation space: One variable corresponding to current state. |
|
- Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move |
|
right). |
|
- Visual Observations: None |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 0.93 |
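
As a quick sanity check, the benchmark above follows from the reward function if
the optimal state sits roughly seven steps from the agent's start position (an
assumption about the scene layout rather than something stated here):

```python
# Back-of-the-envelope check of the Basic benchmark, assuming the optimal
# rewarding state is about 7 steps away from the start position.
steps_to_optimal = 7                      # assumed scene layout
episode_return = 1.0 + steps_to_optimal * -0.01
print(round(episode_return, 2))           # 0.93, matching the benchmark mean reward
```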
|
|
|
## 3DBall: 3D Balance Ball |
|
|
|
 |
|
|
|
- Set-up: A balance-ball task, where the agent balances the ball on its head.
- Goal: The agent must balance the ball on its head for as long as possible.
|
- Agents: The environment contains 12 agents of the same kind, all using the |
|
same Behavior Parameters. |
|
- Agent Reward Function: |
|
- +0.1 for every step the ball remains on its head.
|
- -1.0 if the ball falls off. |
|
- Behavior Parameters: |
|
- Vector Observation space: 8 variables corresponding to rotation of the agent |
|
cube, and position and velocity of ball. |
|
- Vector Observation space (Hard Version): 5 variables corresponding to |
|
rotation of the agent cube and position of ball. |
|
- Actions: 2 continuous actions, with one value corresponding to |
|
X-rotation, and the other to Z-rotation. |
|
- Visual Observations: Third-person view from the upper-front of the agent. Use |
|
`Visual3DBall` scene. |
|
- Float Properties: Three (see the sketch after this list for setting these from Python)
|
- scale: Specifies the scale of the ball in the 3 dimensions (equal across the |
|
three dimensions) |
|
- Default: 1 |
|
- Recommended Minimum: 0.2 |
|
- Recommended Maximum: 5 |
|
- gravity: Magnitude of gravity |
|
- Default: 9.81 |
|
- Recommended Minimum: 4 |
|
- Recommended Maximum: 105 |
|
- mass: Specifies mass of the ball |
|
- Default: 1 |
|
- Recommended Minimum: 0.1 |
|
- Recommended Maximum: 20 |
|
- Benchmark Mean Reward: 100 |
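
The Float Properties listed above (and those of the later environments) can be
overridden at runtime through the environment parameters side channel of the
Python API. A minimal sketch, assuming the parameter keys match the names listed
in this section:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.environment_parameters_channel import (
    EnvironmentParametersChannel,
)

# Override 3DBall's Float Properties before the next episode begins.
channel = EnvironmentParametersChannel()
env = UnityEnvironment(file_name=None, side_channels=[channel])

channel.set_float_parameter("scale", 0.5)    # shrink the ball
channel.set_float_parameter("gravity", 4.0)  # weaker gravity
channel.set_float_parameter("mass", 0.5)

env.reset()
```

The same properties can also be randomized or scheduled through the trainer
configuration's environment parameter settings.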
|
|
|
## GridWorld |
|
|
|
 |
|
|
|
- Set-up: A multi-goal version of the grid-world task. Scene contains agent, goal, |
|
and obstacles. |
|
- Goal: The agent must navigate the grid to the appropriate goal while |
|
avoiding the obstacles. |
|
- Agents: The environment contains nine agents with the same Behavior |
|
Parameters. |
|
- Agent Reward Function: |
|
- -0.01 for every step. |
|
- +1.0 if the agent navigates to the correct goal (episode ends). |
|
- -1.0 if the agent navigates to an incorrect goal (episode ends). |
|
- Behavior Parameters: |
|
- Vector Observation space: None |
|
- Actions: 1 discrete action branch with 5 actions, corresponding to movement in |
|
cardinal directions or not moving. Note that for this environment, |
|
[action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions) |
|
is turned on by default (this option can be toggled using the `Mask Actions` |
|
checkbox within the `trueAgent` GameObject). The trained model file provided |
|
was generated with action masking turned on. |
|
- Visual Observations: One corresponding to top-down view of GridWorld. |
|
- Goal Signal: A one-hot vector corresponding to which color is the correct goal
for the Agent.
|
- Float Properties: Three, corresponding to grid size, number of green goals, and |
|
number of red goals. |
|
- Benchmark Mean Reward: 0.8 |
|
|
|
## Push Block |
|
|
|
 |
|
|
|
- Set-up: A platforming environment where the agent can push a block around. |
|
- Goal: The agent must push the block to the goal. |
|
- Agents: The environment contains one agent. |
|
- Agent Reward Function: |
|
- -0.0025 for every step. |
|
- +1.0 if the block touches the goal. |
|
- Behavior Parameters: |
|
- Vector Observation space: (Continuous) 70 variables corresponding to 14 |
|
ray-casts each detecting one of three possible objects (wall, goal, or |
|
block). |
|
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
|
and counterclockwise, move along four different face directions, or do nothing. |
|
- Float Properties: Four |
|
- block_scale: Scale of the block along the x and z dimensions |
|
- Default: 2 |
|
- Recommended Minimum: 0.5 |
|
- Recommended Maximum: 4 |
|
- dynamic_friction: Coefficient of friction for the ground material acting on |
|
moving objects |
|
- Default: 0 |
|
- Recommended Minimum: 0 |
|
- Recommended Maximum: 1 |
|
- static_friction: Coefficient of friction for the ground material acting on |
|
stationary objects |
|
- Default: 0 |
|
- Recommended Minimum: 0 |
|
- Recommended Maximum: 1 |
|
- block_drag: Effect of air resistance on block |
|
- Default: 0.5 |
|
- Recommended Minimum: 0 |
|
- Recommended Maximum: 2000 |
|
- Benchmark Mean Reward: 4.5 |
|
|
|
## Wall Jump |
|
|
|
 |
|
|
|
- Set-up: A platforming environment where the agent can jump over a wall. |
|
- Goal: The agent must use the block to scale the wall and reach the goal. |
|
- Agents: The environment contains one agent linked to two different Models. The |
|
Policy the agent is linked to changes depending on the height of the wall. The |
|
change of Policy is done in the WallJumpAgent class. |
|
- Agent Reward Function: |
|
- -0.0005 for every step. |
|
- +1.0 if the agent touches the goal. |
|
- -1.0 if the agent falls off the platform. |
|
- Behavior Parameters: |
|
- Vector Observation space: Size of 74, corresponding to 14 ray-casts each
detecting 4 possible objects, plus the global position of the agent and
whether or not the agent is grounded.
|
- Actions: 4 discrete action branches: |
|
- Forward Motion (3 possible actions: Forward, Backwards, No Action) |
|
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action) |
|
- Side Motion (3 possible actions: Left, Right, No Action) |
|
- Jump (2 possible actions: Jump, No Action) |
|
- Visual Observations: None |
|
- Float Properties: Four |
|
- Benchmark Mean Reward (Big & Small Wall): 0.8 |
|
|
|
## Crawler |
|
|
|
 |
|
|
|
- Set-up: A creature with 4 arms and 4 forearms. |
|
- Goal: The agents must move their bodies toward the goal direction without falling.
|
- Agents: The environment contains 10 agents with the same Behavior Parameters.
|
- Agent Reward Function (independent): |
|
The reward function is geometric, meaning the per-step reward is the product of
the individual reward terms rather than their sum; this pushes the agent to
maximize every term instead of only the easiest one (a minimal sketch of this
product appears after this list). The terms are:
- Body velocity matches goal velocity. (normalized between (0,1))
- Head direction alignment with goal direction. (normalized between (0,1))
|
- Behavior Parameters: |
|
- Vector Observation space: 172 variables corresponding to position, rotation, |
|
velocity, and angular velocities of each limb plus the acceleration and |
|
angular acceleration of the body. |
|
- Actions: 20 continuous actions, corresponding to target |
|
rotations for joints. |
|
- Visual Observations: None |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 3000 |
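
A minimal sketch of the multiplicative reward described above; the names are
illustrative, and both terms are assumed to already be shaped into the [0, 1]
range:

```python
def geometric_reward(match_speed: float, look_at_target: float) -> float:
    """Illustrative product-style reward, as described for Crawler above.

    Because the terms are multiplied rather than summed, the reward is only
    large when *every* term is large, so the agent cannot neglect one
    objective while exploiting the other.
    """
    return match_speed * look_at_target
```

The Worm and Walker environments below use the same product-style shaping.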
|
|
|
## Worm |
|
|
|
 |
|
|
|
- Set-up: A worm with a head and 3 body segments. |
|
- Goal: The agents must move their bodies toward the goal direction.
|
- Agents: The environment contains 10 agents with the same Behavior Parameters.
|
- Agent Reward Function (independent): |
|
The reward function is geometric, meaning the per-step reward is the product of
the individual reward terms rather than their sum (see the sketch in the Crawler
section above); this pushes the agent to maximize every term instead of only the
easiest one. The terms are:
- Body velocity matches goal velocity. (normalized between (0,1))
- Body direction alignment with goal direction. (normalized between (0,1))
|
- Behavior Parameters: |
|
- Vector Observation space: 64 variables corresponding to position, rotation, |
|
velocity, and angular velocities of each limb plus the acceleration and |
|
angular acceleration of the body. |
|
- Actions: 9 continuous actions, corresponding to target |
|
rotations for joints. |
|
- Visual Observations: None |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 800 |
|
|
|
## Food Collector |
|
|
|
 |
|
|
|
- Set-up: A multi-agent environment where agents compete to collect food. |
|
- Goal: The agents must learn to collect as many green food spheres as possible |
|
while avoiding red spheres. |
|
- Agents: The environment contains 5 agents with the same Behavior Parameters.
|
- Agent Reward Function (independent): |
|
- +1 for interaction with green spheres |
|
- -1 for interaction with red spheres |
|
- Behavior Parameters: |
|
- Vector Observation space: 53 corresponding to velocity of agent (2), whether
agent is frozen and/or shot its laser (2), plus grid-based perception of
objects around the agent's forward direction (40 by 40 with 6 different categories).
|
- Actions: |
|
- 3 continuous actions correspond to Forward Motion, Side Motion and Rotation |
|
- 1 discrete action branch for Laser with 2 possible actions corresponding to
Shoot Laser or No Action
|
- Visual Observations (Optional): First-person camera per-agent, plus one vector |
|
flag representing the frozen state of the agent. This scene uses a combination |
|
of vector and visual observations and the training will not succeed without |
|
the frozen vector flag. Use `VisualFoodCollector` scene. |
|
- Float Properties: Two |
|
- laser_length: Length of the laser used by the agent |
|
- Default: 1 |
|
- Recommended Minimum: 0.2 |
|
- Recommended Maximum: 7 |
|
- agent_scale: Specifies the scale of the agent in the 3 dimensions (equal |
|
across the three dimensions) |
|
- Default: 1 |
|
- Recommended Minimum: 0.5 |
|
- Recommended Maximum: 5 |
|
- Benchmark Mean Reward: 10 |
|
|
|
## Hallway |
|
|
|
 |
|
|
|
- Set-up: Environment where the agent needs to find information in a room, |
|
remember it, and use it to move to the correct goal. |
|
- Goal: Move to the goal which corresponds to the color of the block in the |
|
room. |
|
- Agents: The environment contains one agent. |
|
- Agent Reward Function (independent): |
|
- +1 For moving to correct goal. |
|
- -0.1 For moving to incorrect goal. |
|
- -0.0003 Existential penalty. |
|
- Behavior Parameters: |
|
- Vector Observation space: 30 corresponding to local ray-casts detecting |
|
objects, goals, and walls. |
|
- Actions: 1 discrete action branch with 4 actions corresponding to agent
|
rotation and forward/backward movement. |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 0.7 |
|
- To train this environment, you can enable curiosity by adding the `curiosity` reward signal |
|
in `config/ppo/Hallway.yaml`.
|
|
|
## Soccer Twos |
|
|
|
 |
|
|
|
- Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game. |
|
- Goal: |
|
- Get the ball into the opponent's goal while preventing the ball from |
|
entering own goal. |
|
- Agents: The environment contains two different Multi Agent Groups with two
agents in each. Behavior Parameters: SoccerTwos.
|
- Agent Reward Function (dependent): |
|
- (1 - `accumulated time penalty`) when the ball enters the opponent's goal.
`accumulated time penalty` is incremented by (1 / `MaxStep`) every fixed
update and is reset to 0 at the beginning of an episode.
- -1 when the ball enters the team's own goal.
|
- Behavior Parameters: |
|
- Vector Observation space: 336 corresponding to 11 ray-casts forward |
|
distributed over 120 degrees and 3 ray-casts backward distributed over 90 |
|
degrees each detecting 6 possible object types, along with the object's |
|
distance. The forward ray-casts contribute 264 state dimensions and backward |
|
72 state dimensions over three observation stacks (see the breakdown after this list).
|
- Actions: 3 discrete branched actions corresponding to |
|
forward, backward, sideways movement, as well as rotation. |
|
- Visual Observations: None |
|
- Float Properties: Two |
|
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal |
|
across the three dimensions) |
|
- Default: 7.5 |
|
- Recommended minimum: 4 |
|
- Recommended maximum: 10 |
|
- gravity: Magnitude of the gravity |
|
- Default: 9.81 |
|
- Recommended minimum: 6 |
|
- Recommended maximum: 20 |
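
As a rough accounting of the 336-dimensional observation described above, assuming
each ray contributes one float per detectable tag plus a hit flag and a normalized
distance (a common ray perception layout, not something stated on this page):

```python
# Hypothetical breakdown of the SoccerTwos vector observation size.
floats_per_ray = 6 + 2                    # 6 detectable object types + hit flag + distance
stacks = 3                                # three stacked observations
forward = 11 * floats_per_ray * stacks    # 264
backward = 3 * floats_per_ray * stacks    # 72
assert forward + backward == 336
```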
|
|
|
## Strikers Vs. Goalie |
|
|
|
 |
|
|
|
- Set-up: Environment where two agents compete in a 2 vs 1 soccer variant. |
|
- Goal: |
|
- Striker: Get the ball into the opponent's goal. |
|
- Goalie: Keep the ball out of the goal. |
|
- Agents: The environment contains two different Multi Agent Groups, one with
two Strikers and the other with one Goalie. Behavior Parameters: Striker, Goalie.
|
- Striker Agent Reward Function (dependent): |
|
- +1 When ball enters opponent's goal. |
|
- -0.001 Existential penalty. |
|
- Goalie Agent Reward Function (dependent): |
|
- -1 When ball enters goal. |
|
- +0.001 Existential bonus.
|
- Behavior Parameters: |
|
- Striker Vector Observation space: 294 corresponding to 11 ray-casts forward |
|
distributed over 120 degrees and 3 ray-casts backward distributed over 90 |
|
degrees each detecting 5 possible object types, along with the object's |
|
distance. The forward ray-casts contribute 231 state dimensions and backward |
|
63 state dimensions over three observation stacks (see the note after this list).
|
- Striker Actions: 3 discrete branched actions corresponding |
|
to forward, backward, sideways movement, as well as rotation. |
|
- Goalie Vector Observation space: 738 corresponding to 41 ray-casts |
|
distributed over 360 degrees each detecting 4 possible object types, along |
|
with the object's distance and 3 observation stacks. |
|
- Goalie Actions: 3 discrete branched actions corresponding |
|
to forward, backward, sideways movement, as well as rotation. |
|
- Visual Observations: None |
|
- Float Properties: Two |
|
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal |
|
across the three dimensions) |
|
- Default: 7.5 |
|
- Recommended minimum: 4 |
|
- Recommended maximum: 10 |
|
- gravity: Magnitude of the gravity |
|
- Default: 9.81 |
|
- Recommended minimum: 6 |
|
- Recommended maximum: 20 |
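
Under the same per-ray assumption as the Soccer Twos breakdown above (one float
per detectable tag plus a hit flag and a distance), the Striker observation works
out to 11 × (5 + 2) × 3 = 231 forward plus 3 × (5 + 2) × 3 = 63 backward, totaling
294, and the Goalie observation to 41 × (4 + 2) × 3 = 738.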
|
|
|
## Walker |
|
|
|
 |
|
|
|
- Set-up: Physics-based Humanoid agents with 26 degrees of freedom. These DOFs |
|
correspond to articulation of the following body-parts: hips, chest, spine, |
|
head, thighs, shins, feet, arms, forearms and hands. |
|
- Goal: The agents must move their bodies toward the goal direction without falling.
|
- Agents: The environment contains 10 independent agents with the same Behavior
|
Parameters. |
|
- Agent Reward Function (independent): |
|
The reward function is geometric, meaning the per-step reward is the product of
the individual reward terms rather than their sum (see the sketch in the Crawler
section above); this pushes the agent to maximize every term instead of only the
easiest one. The terms are:
- Body velocity matches goal velocity. (normalized between (0,1))
- Head direction alignment with goal direction. (normalized between (0,1))
|
- Behavior Parameters: |
|
- Vector Observation space: 243 variables corresponding to position, rotation, |
|
velocity, and angular velocities of each limb, along with goal direction. |
|
- Actions: 39 continuous actions, corresponding to target |
|
rotations and strength applicable to the joints. |
|
- Visual Observations: None |
|
- Float Properties: Four |
|
- gravity: Magnitude of gravity |
|
- Default: 9.81 |
|
- Recommended Minimum: |
|
- Recommended Maximum: |
|
- hip_mass: Mass of the hip component of the walker |
|
- Default: 8 |
|
- Recommended Minimum: 7 |
|
- Recommended Maximum: 28 |
|
- chest_mass: Mass of the chest component of the walker |
|
- Default: 8 |
|
- Recommended Minimum: 3 |
|
- Recommended Maximum: 20 |
|
- spine_mass: Mass of the spine component of the walker |
|
- Default: 8 |
|
- Recommended Minimum: 3 |
|
- Recommended Maximum: 20 |
|
- Benchmark Mean Reward : 2500 |
|
|
|
|
|
## Pyramids |
|
|
|
 |
|
|
|
- Set-up: Environment where the agent needs to press a button to spawn a |
|
pyramid, then navigate to the pyramid, knock it over, and move to the gold |
|
brick at the top. |
|
- Goal: Move to the golden brick on top of the spawned pyramid. |
|
- Agents: The environment contains one agent. |
|
- Agent Reward Function (independent): |
|
- +2 For moving to golden brick (minus 0.001 per step). |
|
- Behavior Parameters: |
|
- Vector Observation space: 148 corresponding to local ray-casts detecting |
|
switch, bricks, golden brick, and walls, plus variable indicating switch |
|
state. |
|
- Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and |
|
forward/backward movement. |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 1.75 |
|
|
|
## Match 3 |
|
 |
|
|
|
- Set-up: Simple match-3 game. Matched pieces are removed, and remaining pieces |
|
drop down. New pieces are spawned randomly at the top, with a chance of being |
|
"special". |
|
- Goal: Maximize score from matching pieces. |
|
- Agents: The environment contains several independent Agents. |
|
- Agent Reward Function (independent): |
|
- +0.01 for each normal piece cleared. Special pieces are worth 2x or 3x.
|
- Behavior Parameters: |
|
- None |
|
- Observations and actions are defined with a sensor and actuator respectively. |
|
- Float Properties: None |
|
- Benchmark Mean Reward: |
|
- 39.5 for visual observations |
|
- 38.5 for vector observations |
|
- 34.2 for simple heuristic (pick a random valid move) |
|
- 37.0 for greedy heuristic (pick the highest-scoring valid move) |
|
|
|
## Sorter |
|
 |
|
|
|
- Set-up: The Agent is in a circular room with numbered tiles. The values of the |
|
tiles are random between 1 and 20. The tiles present in the room are randomized |
|
at each episode. When the Agent visits a tile, it turns green. |
|
- Goal: Visit all the tiles in ascending order. |
|
- Agents: The environment contains a single Agent |
|
- Agent Reward Function: |
|
- -0.0002 Existential penalty.
|
- +1 For visiting the right tile |
|
- -1 For visiting the wrong tile |
|
- Behavior Parameters:
- Vector Observations: 4: 2 floats for position and 2 floats for orientation
- Variable Length Observations: Between 1 and 20 entities (one for each tile),
each with 23 observations: the first 20 are a one-hot encoding of the tile's
value, the 21st and 22nd represent the position of the tile relative to the
Agent, and the 23rd is `1` if the tile was visited and `0` otherwise.
|
- Actions: 3 discrete branched actions corresponding to forward, backward, |
|
sideways movement, as well as rotation. |
|
- Float Properties: One |
|
- num_tiles: The maximum number of tiles to sample. |
|
- Default: 2 |
|
- Recommended Minimum: 1 |
|
- Recommended Maximum: 20 |
|
- Benchmark Mean Reward: Depends on the number of tiles. |
|
|
|
## Cooperative Push Block |
|
 |
|
|
|
- Set-up: Similar to Push Block, the agents are in an area with blocks that need
to be pushed into a goal. Small blocks can be pushed by a single agent and are
worth +1, medium blocks require two agents to push them in and are worth +2, and
large blocks require all 3 agents to push and are worth +3.
|
- Goal: Push all blocks into the goal. |
|
- Agents: The environment contains three Agents in a Multi Agent Group. |
|
- Agent Reward Function: |
|
- -0.0001 Existential penalty, as a group reward. |
|
- +1, +2, or +3 for pushing in a block, added as a group reward. |
|
- Behavior Parameters: |
|
- Observation space: A single Grid Sensor with separate tags for each block size, |
|
the goal, the walls, and other agents. |
|
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
|
and counterclockwise, move along four different face directions, or do nothing. |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 11 (Group Reward) |
|
|
|
## Dungeon Escape |
|
 |
|
|
|
- Set-up: Agents are trapped in a dungeon with a dragon and must work together to
escape. To retrieve the key, one of the agents must find and slay the dragon,
sacrificing itself to do so; the dragon then drops a key that the remaining
agents can pick up and use to unlock the dungeon door. If the agents take too
long, the dragon escapes through a portal and the environment resets.
|
- Goal: Unlock the dungeon door and leave. |
|
- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which |
|
moves in a predetermined pattern. |
|
- Agent Reward Function: |
|
- +1 group reward if any agent successfully unlocks the door and leaves the dungeon. |
|
- Behavior Parameters: |
|
- Observation space: A Ray Perception Sensor with separate tags for the walls,
other agents, the door, the key, the dragon, and the dragon's portal. A single
Vector Observation indicates whether the agent is holding a key.
|
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
|
and counterclockwise, move along four different face directions, or do nothing. |
|
- Float Properties: None |
|
- Benchmark Mean Reward: 1.0 (Group Reward) |
|
|