# Example Learning Environments

<img src="../images/example-envs.png" align="middle" width="3000"/>

The Unity ML-Agents Toolkit includes an expanding set of example environments
that highlight the various features of the toolkit. These environments can also
serve as templates for new environments or as ways to test new ML algorithms.
Environments are located in `Project/Assets/ML-Agents/Examples` and summarized
below.

For the environments that highlight specific features of the toolkit, we provide
the pre-trained model files and the training config file that enables you to
train the scene yourself. The environments that are designed to serve as
challenges for researchers do not have accompanying pre-trained model files or
training configs and are marked as _Optional_ below.

This page only overviews the example environments we provide. To learn more on
how to design and build your own environments see our
[Making a New Learning Environment](Learning-Environment-Create-New.md) page. If
you would like to contribute environments, please see our
[contribution guidelines](CONTRIBUTING.md) page.
## Basic |
- Set-up: A linear movement task where the agent must move left or right to |
rewarding states. |
- Goal: Move to the most rewarding state.
- Agents: The environment contains one agent. |
- Agent Reward Function: |
- -0.01 at each step |
- +0.1 for arriving at suboptimal state. |
- +1.0 for arriving at optimal state. |
- Behavior Parameters: |
- Vector Observation space: One variable corresponding to current state. |
- Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move |
right). |
- Visual Observations: None |
- Float Properties: None |
- Benchmark Mean Reward: 0.93 |
## 3DBall: 3D Balance Ball |
- Set-up: A balance-ball task, where the agent balances the ball on its head.
- Goal: The agent must balance the ball on its head for as long as possible.
- Agents: The environment contains 12 agents of the same kind, all using the |
same Behavior Parameters. |
- Agent Reward Function: |
- +0.1 for every step the ball remains on its head.
- -1.0 if the ball falls off. |
- Behavior Parameters: |
- Vector Observation space: 8 variables corresponding to rotation of the agent |
cube, and position and velocity of ball. |
- Vector Observation space (Hard Version): 5 variables corresponding to |
rotation of the agent cube and position of ball. |
- Actions: 2 continuous actions, with one value corresponding to |
X-rotation, and the other to Z-rotation. |
- Visual Observations: Third-person view from the upper-front of the agent. Use |
`Visual3DBall` scene. |
- Float Properties: Three |
- scale: Specifies the scale of the ball in the 3 dimensions (equal across the |
three dimensions) |
- Default: 1 |
- Recommended Minimum: 0.2 |
- Recommended Maximum: 5 |
- gravity: Magnitude of gravity |
- Default: 9.81 |
- Recommended Minimum: 4 |
- Recommended Maximum: 105 |
- mass: Specifies mass of the ball |
- Default: 1 |
- Recommended Minimum: 0.1 |
- Recommended Maximum: 20 |
- Benchmark Mean Reward: 100 |
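
The recommended ranges for the 3DBall float properties can be kept in a small lookup table, for example when choosing values for environment-parameter randomization. The following is a minimal sketch only: `RECOMMENDED_RANGES`, `DEFAULTS`, and `clamp_to_recommended` are hypothetical names introduced for illustration and are not part of the toolkit.

```python
# Recommended ranges and defaults for the 3DBall float properties,
# copied from the list above. All names here are hypothetical.
RECOMMENDED_RANGES = {
    "scale": (0.2, 5.0),
    "gravity": (4.0, 105.0),
    "mass": (0.1, 20.0),
}

DEFAULTS = {"scale": 1.0, "gravity": 9.81, "mass": 1.0}


def clamp_to_recommended(name: str, value: float) -> float:
    """Clamp a proposed value into the recommended range for a property."""
    low, high = RECOMMENDED_RANGES[name]
    return max(low, min(high, value))


if __name__ == "__main__":
    # A gravity of 200 lies outside the recommended range and is clamped.
    print(clamp_to_recommended("gravity", 200.0))
```

Values inside the recommended range pass through unchanged, so the defaults are returned as-is.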
## GridWorld |
- Set-up: A multi-goal version of the grid-world task. Scene contains agent, goal, |
and obstacles. |
- Goal: The agent must navigate the grid to the appropriate goal while |
avoiding the obstacles. |
- Agents: The environment contains nine agents with the same Behavior |
Parameters. |
- Agent Reward Function: |
- -0.01 for every step. |
- +1.0 if the agent navigates to the correct goal (episode ends). |
- -1.0 if the agent navigates to an incorrect goal (episode ends). |
- Behavior Parameters: |
- Vector Observation space: None |
- Actions: 1 discrete action branch with 5 actions, corresponding to movement in |
cardinal directions or not moving. Note that for this environment, |
[action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions) |
is turned on by default (this option can be toggled using the `Mask Actions` |
checkbox within the `trueAgent` GameObject). The trained model file provided |
was generated with action masking turned on. |
- Visual Observations: One corresponding to top-down view of GridWorld. |
- Goal Signal: A one-hot vector corresponding to which color is the correct goal
for the Agent.
- Float Properties: Three, corresponding to grid size, number of green goals, and |
number of red goals. |
- Benchmark Mean Reward: 0.8 |
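
To illustrate the idea behind action masking for the five GridWorld actions, here is a minimal sketch. It assumes that masking disallows moves that would leave the grid, which this page does not state. The action indices and the `action_mask` helper are hypothetical and do not reflect the toolkit's actual implementation.

```python
from typing import List

# Hypothetical action indices for the single discrete branch with 5 actions.
NO_OP, UP, DOWN, LEFT, RIGHT = range(5)


def action_mask(x: int, y: int, grid_size: int) -> List[bool]:
    """Return one flag per action; True means the action is allowed.

    Assumption for this sketch: a move is masked when it would take the
    agent off a square grid with cells indexed 0 .. grid_size - 1.
    """
    allowed = [True] * 5
    if y == grid_size - 1:
        allowed[UP] = False
    if y == 0:
        allowed[DOWN] = False
    if x == 0:
        allowed[LEFT] = False
    if x == grid_size - 1:
        allowed[RIGHT] = False
    return allowed


if __name__ == "__main__":
    # In the bottom-left corner, moving down or left is masked.
    print(action_mask(0, 0, grid_size=5))
```

A policy that respects the mask only samples from actions whose flag is `True`, so it never spends steps on moves that cannot change the state.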
## Push Block |
- Set-up: A platforming environment where the agent can push a block around. |
- Goal: The agent must push the block to the goal. |
- Agents: The environment contains one agent. |
- Agent Reward Function: |
- -0.0025 for every step. |
- +1.0 if the block touches the goal. |
- Behavior Parameters: |
- Vector Observation space: (Continuous) 70 variables corresponding to 14 |
ray-casts each detecting one of three possible objects (wall, goal, or |
block). |
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
and counterclockwise, move along four different face directions, or do nothing. |
- Float Properties: Four |
- block_scale: Scale of the block along the x and z dimensions |
- Default: 2 |
- Recommended Minimum: 0.5 |
- Recommended Maximum: 4 |
- dynamic_friction: Coefficient of friction for the ground material acting on |
moving objects |
- Default: 0 |
- Recommended Minimum: 0 |
- Recommended Maximum: 1 |
- static_friction: Coefficient of friction for the ground material acting on |
stationary objects |
- Default: 0 |
- Recommended Minimum: 0 |
- Recommended Maximum: 1 |
- block_drag: Effect of air resistance on block |
- Default: 0.5 |
- Recommended Minimum: 0 |
- Recommended Maximum: 2000 |
- Benchmark Mean Reward: 4.5 |
## Wall Jump |
- Set-up: A platforming environment where the agent can jump over a wall. |
- Goal: The agent must use the block to scale the wall and reach the goal. |
- Agents: The environment contains one agent linked to two different Models. The |
Policy the agent is linked to changes depending on the height of the wall. The |
change of Policy is done in the WallJumpAgent class. |
- Agent Reward Function: |
- -0.0005 for every step. |
- +1.0 if the agent touches the goal. |
- -1.0 if the agent falls off the platform. |
- Behavior Parameters: |
- Vector Observation space: Size of 74, corresponding to 14 ray casts each
detecting 4 possible objects, plus the global position of the agent and
whether or not the agent is grounded.
- Actions: 4 discrete action branches: |
- Forward Motion (3 possible actions: Forward, Backwards, No Action) |
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action) |
- Side Motion (3 possible actions: Left, Right, No Action) |
- Jump (2 possible actions: Jump, No Action) |
- Visual Observations: None |
- Float Properties: Four |
- Benchmark Mean Reward (Big & Small Wall): 0.8 |
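
The four Wall Jump action branches can be pictured as a vector of four integers, one index per branch. The sketch below decodes such a vector into readable labels. The branch sizes come from the list above, while the ordering of options inside each branch and the names `BRANCHES` and `decode_action` are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence

# Branch sizes follow the list above (3, 3, 3, 2). The order of the options
# inside each branch is an assumption for this sketch.
BRANCHES = {
    "forward": ("No Action", "Forward", "Backwards"),
    "rotation": ("No Action", "Rotate Left", "Rotate Right"),
    "side": ("No Action", "Left", "Right"),
    "jump": ("No Action", "Jump"),
}

# Number of distinct joint actions across all branches: 3 * 3 * 3 * 2.
NUM_JOINT_ACTIONS = math.prod(len(options) for options in BRANCHES.values())


def decode_action(action: Sequence[int]) -> Dict[str, str]:
    """Map one index per branch to a human-readable label per branch."""
    if len(action) != len(BRANCHES):
        raise ValueError(f"expected {len(BRANCHES)} branch indices, got {len(action)}")
    return {
        name: options[index]
        for (name, options), index in zip(BRANCHES.items(), action)
    }


if __name__ == "__main__":
    print(decode_action([1, 0, 2, 1]))
    print(NUM_JOINT_ACTIONS)
```

Keeping the branches separate means the policy outputs four small categorical choices rather than a single choice over every combination.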
## Crawler |
- Set-up: A creature with 4 arms and 4 forearms. |
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the reward at each step is the product of
all the reward terms rather than their sum. This encourages the agent to
maximize every term instead of only the easiest ones.
- Body velocity matches goal velocity. (normalized between (0,1)) |
- Head direction alignment with goal direction. (normalized between (0,1)) |
- Behavior Parameters: |
- Vector Observation space: 172 variables corresponding to position, rotation, |
velocity, and angular velocities of each limb plus the acceleration and |
angular acceleration of the body. |
- Actions: 20 continuous actions, corresponding to target |
rotations for joints. |
- Visual Observations: None |
- Float Properties: None |
- Benchmark Mean Reward: 3000 |
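
The difference between a geometric (product) reward and an additive (sum) reward, as used by Crawler, can be shown with the two normalized terms listed above. This is a minimal sketch with hypothetical function names, not the toolkit's implementation. It assumes both terms are already normalized to the range 0 to 1.

```python
def geometric_reward(velocity_match: float, direction_alignment: float) -> float:
    """Per-step reward as the product of the two normalized terms."""
    return velocity_match * direction_alignment


def additive_reward(velocity_match: float, direction_alignment: float) -> float:
    """Per-step reward as the sum of the same terms, for comparison."""
    return velocity_match + direction_alignment


if __name__ == "__main__":
    # Perfect velocity match while facing entirely the wrong way.
    print(geometric_reward(1.0, 0.0), additive_reward(1.0, 0.0))
    # Moderate performance on both terms.
    print(geometric_reward(0.5, 0.5), additive_reward(0.5, 0.5))
```

With a product, a term that is zero drives the whole reward to zero, so the agent cannot collect reward by satisfying only one objective. With a sum, the same behavior would still earn half of the maximum.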
## Worm |
- Set-up: A worm with a head and 3 body segments. |
- Goal: The agent must move its body toward the goal direction.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the reward at each step is the product of
all the reward terms rather than their sum. This encourages the agent to
maximize every term instead of only the easiest ones.
- Body velocity matches goal velocity. (normalized between (0,1)) |
- Body direction alignment with goal direction. (normalized between (0,1)) |
- Behavior Parameters: |
- Vector Observation space: 64 variables corresponding to position, rotation, |
velocity, and angular velocities of each limb plus the acceleration and |
angular acceleration of the body. |
- Actions: 9 continuous actions, corresponding to target |
rotations for joints. |
- Visual Observations: None |
- Float Properties: None |
- Benchmark Mean Reward: 800 |
## Food Collector |
- Set-up: A multi-agent environment where agents compete to collect food. |
- Goal: The agents must learn to collect as many green food spheres as possible |
while avoiding red spheres. |
- Agents: The environment contains 5 agents with the same Behavior Parameters.
- Agent Reward Function (independent): |
- +1 for interaction with green spheres |
- -1 for interaction with red spheres |
- Behavior Parameters: |
- Vector Observation space: 53 corresponding to velocity of agent (2), whether |
agent is frozen and/or shot its laser (2), plus grid based perception of |
objects around agent's forward direction (40 by 40 with 6 different categories). |
- Actions: |
- 3 continuous actions correspond to Forward Motion, Side Motion and Rotation |
- 1 discrete action branch for Laser with 2 possible actions corresponding to
Shoot Laser or No Action
- Visual Observations (Optional): First-person camera per-agent, plus one vector |
flag representing the frozen state of the agent. This scene uses a combination |
of vector and visual observations and the training will not succeed without |
the frozen vector flag. Use `VisualFoodCollector` scene. |
- Float Properties: Two |
- laser_length: Length of the laser used by the agent |
- Default: 1 |
- Recommended Minimum: 0.2 |
- Recommended Maximum: 7 |
- agent_scale: Specifies the scale of the agent in the 3 dimensions (equal |
across the three dimensions) |
- Default: 1 |
- Recommended Minimum: 0.5 |
- Recommended Maximum: 5 |
- Benchmark Mean Reward: 10 |
## Hallway |
- Set-up: Environment where the agent needs to find information in a room, |
remember it, and use it to move to the correct goal. |
- Goal: Move to the goal which corresponds to the color of the block in the |
room. |
- Agents: The environment contains one agent. |
- Agent Reward Function (independent): |
- +1 For moving to correct goal. |
- -0.1 For moving to incorrect goal. |
- -0.0003 Existential penalty. |
- Behavior Parameters: |
- Vector Observation space: 30 corresponding to local ray-casts detecting |
objects, goals, and walls. |
- Actions: 1 discrete action branch with 4 actions corresponding to agent
rotation and forward/backward movement.
- Float Properties: None |
- Benchmark Mean Reward: 0.7 |
- To train this environment, you can enable curiosity by adding the `curiosity` reward signal |
in `config/ppo/Hallway.yaml` |
## Soccer Twos |
- Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game. |
- Goal: |
- Get the ball into the opponent's goal while preventing the ball from |
entering own goal. |
- Agents: The environment contains two different Multi Agent Groups with two agents in each.
Behavior Parameters: SoccerTwos.
- Agent Reward Function (dependent): |
- (1 - `accumulated time penalty`) When ball enters opponent's goal. The
`accumulated time penalty` is incremented by (1 / `MaxStep`) every fixed
update and is reset to 0 at the beginning of an episode.
- -1 When ball enters team's goal. |
- Behavior Parameters: |
- Vector Observation space: 336 corresponding to 11 ray-casts forward |
distributed over 120 degrees and 3 ray-casts backward distributed over 90 |
degrees each detecting 6 possible object types, along with the object's |
distance. The forward ray-casts contribute 264 state dimensions and backward |
72 state dimensions over three observation stacks. |
- Actions: 3 discrete branched actions corresponding to |
forward, backward, sideways movement, as well as rotation. |
- Visual Observations: None |
- Float Properties: Two |
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal |
across the three dimensions) |
- Default: 7.5 |
- Recommended minimum: 4 |
- Recommended maximum: 10 |
- gravity: Magnitude of the gravity |
- Default: 9.81 |
- Recommended minimum: 6 |
- Recommended maximum: 20 |
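
The Soccer Twos scoring reward depends on how much time has elapsed in the episode. The sketch below follows the description in the reward function above: the penalty grows by `1 / MaxStep` on every fixed update and resets at the start of an episode. The class and method names are hypothetical and only illustrate the bookkeeping.

```python
class GoalRewardTracker:
    """Tracks the accumulated time penalty for one episode (hypothetical helper)."""

    def __init__(self, max_step: int) -> None:
        self.max_step = max_step
        self.time_penalty = 0.0

    def reset(self) -> None:
        """Called at the beginning of an episode."""
        self.time_penalty = 0.0

    def fixed_update(self) -> None:
        """Called once per fixed update; grows the penalty by 1 / MaxStep."""
        self.time_penalty += 1.0 / self.max_step

    def scoring_reward(self) -> float:
        """Reward for the team whose shot enters the opponent's goal."""
        return 1.0 - self.time_penalty

    def conceding_reward(self) -> float:
        """Reward for the team whose own goal was entered."""
        return -1.0


if __name__ == "__main__":
    tracker = GoalRewardTracker(max_step=1000)
    for _ in range(250):
        tracker.fixed_update()
    # A goal after a quarter of the episode earns roughly 0.75.
    print(round(tracker.scoring_reward(), 6))
```

Because the scoring reward shrinks as the episode goes on, scoring early is worth more than scoring late, while conceding always costs the full -1.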
## Strikers Vs. Goalie |
- Set-up: Environment where two Strikers compete against one Goalie in a 2 vs 1 soccer variant.
- Goal: |
- Striker: Get the ball into the opponent's goal. |
- Goalie: Keep the ball out of the goal. |
- Agents: The environment contains two different Multi Agent Groups: one with two Strikers and the other with one Goalie.
Behavior Parameters: Striker, Goalie.
- Striker Agent Reward Function (dependent): |
- +1 When ball enters opponent's goal. |
- -0.001 Existential penalty. |
- Goalie Agent Reward Function (dependent): |
- -1 When ball enters goal. |
- +0.001 Existential bonus.
- Behavior Parameters: |
- Striker Vector Observation space: 294 corresponding to 11 ray-casts forward |
distributed over 120 degrees and 3 ray-casts backward distributed over 90 |
degrees each detecting 5 possible object types, along with the object's |
distance. The forward ray-casts contribute 231 state dimensions and backward |
63 state dimensions over three observation stacks. |
- Striker Actions: 3 discrete branched actions corresponding |
to forward, backward, sideways movement, as well as rotation. |
- Goalie Vector Observation space: 738 corresponding to 41 ray-casts |
distributed over 360 degrees each detecting 4 possible object types, along |
with the object's distance and 3 observation stacks. |
- Goalie Actions: 3 discrete branched actions corresponding |
to forward, backward, sideways movement, as well as rotation. |
- Visual Observations: None |
- Float Properties: Two |
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal |
across the three dimensions) |
- Default: 7.5 |
- Recommended minimum: 4 |
- Recommended maximum: 10 |
- gravity: Magnitude of the gravity |
- Default: 9.81 |
- Recommended minimum: 6 |
- Recommended maximum: 20 |
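
The ray-cast observation sizes quoted for the soccer environments can be checked with simple arithmetic. The sketch below assumes each ray contributes one value per detectable object type plus two extra values. The text mentions the object's distance, and the second extra value is assumed here to be a hit/miss flag. This assumption reproduces the numbers on this page, but the helper name and the formula are illustrative rather than taken from the toolkit.

```python
def ray_observation_size(num_rays: int, num_object_types: int, stacks: int = 1) -> int:
    """Observation size for a set of ray-casts.

    Assumption: each ray yields one value per object type plus two extra
    values (taken here to be a hit/miss flag and the hit distance).
    """
    return num_rays * (num_object_types + 2) * stacks


if __name__ == "__main__":
    # Soccer Twos: 11 forward + 3 backward rays, 6 object types, 3 stacks.
    forward = ray_observation_size(11, 6, stacks=3)
    backward = ray_observation_size(3, 6, stacks=3)
    print(forward, backward, forward + backward)

    # Striker: same ray layout with 5 object types.
    print(ray_observation_size(11, 5, stacks=3) + ray_observation_size(3, 5, stacks=3))

    # Goalie: 41 rays over 360 degrees with 4 object types.
    print(ray_observation_size(41, 4, stacks=3))
```

The same formula with a single stack gives 14 * (3 + 2) = 70, which matches the Push Block vector observation size.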
## Walker |
- Set-up: Physics-based Humanoid agents with 26 degrees of freedom. These DOFs |
correspond to articulation of the following body-parts: hips, chest, spine, |
head, thighs, shins, feet, arms, forearms and hands. |
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 independent agents with the same Behavior
Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the reward at each step is the product of
all the reward terms rather than their sum. This encourages the agent to
maximize every term instead of only the easiest ones.
- Body velocity matches goal velocity. (normalized between (0,1)) |
- Head direction alignment with goal direction. (normalized between (0,1)) |
- Behavior Parameters: |
- Vector Observation space: 243 variables corresponding to position, rotation, |
velocity, and angular velocities of each limb, along with goal direction. |
- Actions: 39 continuous actions, corresponding to target |
rotations and strength applicable to the joints. |
- Visual Observations: None |
- Float Properties: Four |
- gravity: Magnitude of gravity |
- Default: 9.81 |
- Recommended Minimum: |
- Recommended Maximum: |
- hip_mass: Mass of the hip component of the walker |
- Default: 8 |
- Recommended Minimum: 7 |
- Recommended Maximum: 28 |
- chest_mass: Mass of the chest component of the walker |
- Default: 8 |
- Recommended Minimum: 3 |
- Recommended Maximum: 20 |
- spine_mass: Mass of the spine component of the walker |
- Default: 8 |
- Recommended Minimum: 3 |
- Recommended Maximum: 20 |
- Benchmark Mean Reward : 2500 |
## Pyramids |
- Set-up: Environment where the agent needs to press a button to spawn a |
pyramid, then navigate to the pyramid, knock it over, and move to the gold |
brick at the top. |
- Goal: Move to the golden brick on top of the spawned pyramid. |
- Agents: The environment contains one agent. |
- Agent Reward Function (independent): |
- +2 For moving to golden brick (minus 0.001 per step). |
- Behavior Parameters: |
- Vector Observation space: 148 corresponding to local ray-casts detecting |
switch, bricks, golden brick, and walls, plus variable indicating switch |
state. |
- Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and |
forward/backward movement. |
- Float Properties: None |
- Benchmark Mean Reward: 1.75 |
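
The Pyramids reward can be read as a trade-off between reaching the golden brick and the time taken to get there. The sketch below reads "minus 0.001 per step" as a penalty accumulated over every step of the episode, which is an interpretation of the text above. The constant and function names are hypothetical.

```python
GOAL_REWARD = 2.0      # reward for moving to the golden brick
STEP_PENALTY = 0.001   # penalty applied on every step (interpretation of the text)


def episode_return(steps_to_goal: int) -> float:
    """Total return for an episode that reaches the golden brick."""
    return GOAL_REWARD - STEP_PENALTY * steps_to_goal


if __name__ == "__main__":
    for steps in (100, 250, 1000):
        print(steps, round(episode_return(steps), 3))
```

Under this reading, a mean reward of 1.75 corresponds to reaching the golden brick in roughly 250 steps on average.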
## Match 3 |
- Set-up: Simple match-3 game. Matched pieces are removed, and remaining pieces |
drop down. New pieces are spawned randomly at the top, with a chance of being |
"special". |
- Goal: Maximize score from matching pieces. |
- Agents: The environment contains several independent Agents. |
- Agent Reward Function (independent): |
- +0.01 for each normal piece cleared. Special pieces are worth 2x or 3x.
- Behavior Parameters: |
- None |
- Observations and actions are defined with a sensor and actuator respectively. |
- Float Properties: None |
- Benchmark Mean Reward: |
- 39.5 for visual observations |
- 38.5 for vector observations |
- 34.2 for simple heuristic (pick a random valid move) |
- 37.0 for greedy heuristic (pick the highest-scoring valid move) |
## Sorter |
- Set-up: The Agent is in a circular room with numbered tiles. The values of the |
tiles are random between 1 and 20. The tiles present in the room are randomized |
at each episode. When the Agent visits a tile, it turns green. |
- Goal: Visit all the tiles in ascending order. |
- Agents: The environment contains a single Agent |
- Agent Reward Function: |
- -.0002 Existential penalty. |
- +1 For visiting the right tile |
- -1 For visiting the wrong tile |
- Behavior Parameters:
- Vector Observations: 4 (2 floats for position and 2 floats for orientation).
- Variable Length Observations: Between 1 and 20 entities (one for each tile),
each with 23 observations: the first 20 are a one-hot encoding of the value of the tile,
the 21st and 22nd represent the position of the tile relative to the Agent, and the 23rd
is `1` if the tile was visited and `0` otherwise.
- Actions: 3 discrete branched actions corresponding to forward, backward, |
sideways movement, as well as rotation. |
- Float Properties: One |
- num_tiles: The maximum number of tiles to sample. |
- Default: 2 |
- Recommended Minimum: 1 |
- Recommended Maximum: 20 |
- Benchmark Mean Reward: Depends on the number of tiles. |
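
The per-tile entity observation for Sorter can be sketched as a one-hot block of 20 values, followed by 2 relative-position values and 1 visited flag. The helper below is illustrative only. The function name, the choice of mapping tile value `v` to index `v - 1`, and the `rel_x` / `rel_z` axis names are assumptions not stated on this page.

```python
from typing import List

MAX_TILE_VALUE = 20  # tile values are random between 1 and 20


def encode_tile(value: int, rel_x: float, rel_z: float, visited: bool) -> List[float]:
    """Encode one tile as 20 one-hot values + 2 position values + 1 visited flag."""
    if not 1 <= value <= MAX_TILE_VALUE:
        raise ValueError(f"tile value must be in 1..{MAX_TILE_VALUE}, got {value}")
    one_hot = [0.0] * MAX_TILE_VALUE
    one_hot[value - 1] = 1.0  # assumed mapping: value v sets index v - 1
    return one_hot + [rel_x, rel_z, 1.0 if visited else 0.0]


if __name__ == "__main__":
    obs = encode_tile(value=3, rel_x=0.5, rel_z=-0.25, visited=False)
    print(len(obs), obs[:4], obs[-3:])
```

Because the number of tiles changes between episodes, the agent receives a variable-length list of such vectors rather than one fixed-size observation.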
## Cooperative Push Block |
- Set-up: Similar to Push Block, the agents are in an area with blocks that need |
to be pushed into a goal. Small blocks can be pushed by one agent and are worth
+1 value, medium blocks require two agents to push in and are worth +2, and large |
blocks require all 3 agents to push and are worth +3. |
- Goal: Push all blocks into the goal. |
- Agents: The environment contains three Agents in a Multi Agent Group. |
- Agent Reward Function: |
- -0.0001 Existential penalty, as a group reward. |
- +1, +2, or +3 for pushing in a block, added as a group reward. |
- Behavior Parameters: |
- Observation space: A single Grid Sensor with separate tags for each block size, |
the goal, the walls, and other agents. |
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
and counterclockwise, move along four different face directions, or do nothing. |
- Float Properties: None |
- Benchmark Mean Reward: 11 (Group Reward) |
## Dungeon Escape |
- Set-up: Agents are trapped in a dungeon with a dragon, and must work together to escape. |
To retrieve the key, one of the agents must find and slay the dragon, sacrificing itself |
to do so. The dragon will drop a key for the others to use. The other agents can then pick |
up this key and unlock the dungeon door. If the agents take too long, the dragon will escape |
through a portal and the environment resets. |
- Goal: Unlock the dungeon door and leave. |
- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which |
moves in a predetermined pattern. |
- Agent Reward Function: |
- +1 group reward if any agent successfully unlocks the door and leaves the dungeon. |
- Behavior Parameters: |
- Observation space: A Ray Perception Sensor with separate tags for the walls, other agents, |
the door, key, the dragon, and the dragon's portal. A single Vector Observation which indicates |
whether the agent is holding a key. |
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise |
and counterclockwise, move along four different face directions, or do nothing. |
- Float Properties: None |
- Benchmark Mean Reward: 1.0 (Group Reward) |