# Example Learning Environments
<img src="../images/example-envs.png" align="middle" width="3000"/>
The Unity ML-Agents Toolkit includes an expanding set of example environments
that highlight the various features of the toolkit. These environments can also
serve as templates for new environments or as ways to test new ML algorithms.
Environments are located in `Project/Assets/ML-Agents/Examples` and summarized
below.
For the environments that highlight specific features of the toolkit, we provide
the pre-trained model files and the training config file that enables you to
train the scene yourself. The environments that are designed to serve as
challenges for researchers do not have accompanying pre-trained model files or
training configs and are marked as _Optional_ below.
This page only overviews the example environments we provide. To learn more
about how to design and build your own environments, see our
[Making a New Learning Environment](Learning-Environment-Create-New.md) page. If
you would like to contribute environments, please see our
[contribution guidelines](CONTRIBUTING.md) page.
## Basic

- Set-up: A linear movement task where the agent must move left or right to
rewarding states.
- Goal: Move to the most rewarding state.
- Agents: The environment contains one agent.
- Agent Reward Function:
- -0.01 at each step
- +0.1 for arriving at suboptimal state.
- +1.0 for arriving at optimal state.
- Behavior Parameters:
- Vector Observation space: One variable corresponding to current state.
- Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move
right).
- Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 0.93
## 3DBall: 3D Balance Ball

- Set-up: A balance-ball task, where the agent balances a ball on its head.
- Goal: The agent must balance the ball on its head for as long as possible.
- Agents: The environment contains 12 agents of the same kind, all using the
same Behavior Parameters.
- Agent Reward Function:
- +0.1 for every step the ball remains on its head.
- -1.0 if the ball falls off.
- Behavior Parameters:
- Vector Observation space: 8 variables corresponding to rotation of the agent
cube, and position and velocity of ball.
- Vector Observation space (Hard Version): 5 variables corresponding to
rotation of the agent cube and position of ball.
- Actions: 2 continuous actions, with one value corresponding to
X-rotation, and the other to Z-rotation.
- Visual Observations: Third-person view from the upper-front of the agent. Use
`Visual3DBall` scene.
- Float Properties: Three
- scale: Specifies the scale of the ball in the 3 dimensions (equal across the
three dimensions)
- Default: 1
- Recommended Minimum: 0.2
- Recommended Maximum: 5
- gravity: Magnitude of gravity
- Default: 9.81
- Recommended Minimum: 4
- Recommended Maximum: 105
- mass: Specifies mass of the ball
- Default: 1
- Recommended Minimum: 0.1
- Recommended Maximum: 20
- Benchmark Mean Reward: 100
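The Float Properties above can be randomized during training through the trainer configuration's `environment_parameters` section. A sketch for `gravity`, using the recommended range from the table above (sampler values are illustrative):

```yaml
environment_parameters:
  gravity:
    sampler_type: uniform
    sampler_parameters:
      min_value: 4.0
      max_value: 105.0
```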
## GridWorld

- Set-up: A multi-goal version of the grid-world task. The scene contains an
agent, goals, and obstacles.
- Goal: The agent must navigate the grid to the appropriate goal while
avoiding the obstacles.
- Agents: The environment contains nine agents with the same Behavior
Parameters.
- Agent Reward Function:
- -0.01 for every step.
- +1.0 if the agent navigates to the correct goal (episode ends).
- -1.0 if the agent navigates to an incorrect goal (episode ends).
- Behavior Parameters:
- Vector Observation space: None
- Actions: 1 discrete action branch with 5 actions, corresponding to movement in
cardinal directions or not moving. Note that for this environment,
[action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions)
is turned on by default (this option can be toggled using the `Mask Actions`
checkbox within the `trueAgent` GameObject). The trained model file provided
was generated with action masking turned on.
- Visual Observations: One corresponding to top-down view of GridWorld.
- Goal Signal: A one-hot vector corresponding to which color is the correct goal
for the Agent.
- Float Properties: Three, corresponding to grid size, number of green goals, and
number of red goals.
- Benchmark Mean Reward: 0.8
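Action masking keeps the agent from even sampling moves that would walk it off the grid. A minimal sketch of the edge-masking logic; the action ordering (no-op, up, down, left, right) is an assumption for illustration, not the scene's exact layout.

```python
import numpy as np

def edge_action_mask(x, y, grid_size):
    """Return a boolean mask over the 5 GridWorld actions
    (no-op, up, down, left, right -- ordering is illustrative),
    where True means the action is masked (disallowed)."""
    mask = np.zeros(5, dtype=bool)
    if y == grid_size - 1:
        mask[1] = True  # at the top edge: can't move up
    if y == 0:
        mask[2] = True  # at the bottom edge: can't move down
    if x == 0:
        mask[3] = True  # at the left edge: can't move left
    if x == grid_size - 1:
        mask[4] = True  # at the right edge: can't move right
    return mask
```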
## Push Block

- Set-up: A platforming environment where the agent can push a block around.
- Goal: The agent must push the block to the goal.
- Agents: The environment contains one agent.
- Agent Reward Function:
- -0.0025 for every step.
- +1.0 if the block touches the goal.
- Behavior Parameters:
- Vector Observation space: (Continuous) 70 variables corresponding to 14
ray-casts each detecting one of three possible objects (wall, goal, or
block).
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
and counterclockwise, move along four different face directions, or do nothing.
- Float Properties: Four
- block_scale: Scale of the block along the x and z dimensions
- Default: 2
- Recommended Minimum: 0.5
- Recommended Maximum: 4
- dynamic_friction: Coefficient of friction for the ground material acting on
moving objects
- Default: 0
- Recommended Minimum: 0
- Recommended Maximum: 1
- static_friction: Coefficient of friction for the ground material acting on
stationary objects
- Default: 0
- Recommended Minimum: 0
- Recommended Maximum: 1
- block_drag: Effect of air resistance on block
- Default: 0.5
- Recommended Minimum: 0
- Recommended Maximum: 2000
- Benchmark Mean Reward: 4.5
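The 70 observation variables break down as 14 rays times 5 values per ray: a one-hot over the three detectable tags, a no-hit flag, and a normalized hit distance. A sketch of that per-ray encoding; the exact value ordering is an assumption for illustration.

```python
import numpy as np

def encode_ray(hit_tag, hit_fraction, tags=("wall", "goal", "block")):
    """Encode one ray-cast as: one-hot over detectable tags, a
    'nothing hit' flag, then the normalized hit distance.
    3 tags -> 5 values per ray, so Push Block's 14 rays give
    the 70-dimensional observation."""
    obs = np.zeros(len(tags) + 2, dtype=np.float32)
    if hit_tag is None:
        obs[len(tags)] = 1.0  # no-hit flag
        obs[-1] = 1.0         # distance defaults to full ray length
    else:
        obs[tags.index(hit_tag)] = 1.0
        obs[-1] = hit_fraction
    return obs
```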
## Wall Jump

- Set-up: A platforming environment where the agent can jump over a wall.
- Goal: The agent must use the block to scale the wall and reach the goal.
- Agents: The environment contains one agent linked to two different Models. The
Policy the agent is linked to changes depending on the height of the wall. The
change of Policy is done in the WallJumpAgent class.
- Agent Reward Function:
- -0.0005 for every step.
- +1.0 if the agent touches the goal.
- -1.0 if the agent falls off the platform.
- Behavior Parameters:
- Vector Observation space: Size of 74, corresponding to 14 ray casts each
detecting 4 possible objects, plus the global position of the agent and
whether or not the agent is grounded.
- Actions: 4 discrete action branches:
- Forward Motion (3 possible actions: Forward, Backwards, No Action)
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
- Side Motion (3 possible actions: Left, Right, No Action)
- Jump (2 possible actions: Jump, No Action)
- Visual Observations: None
- Float Properties: Four
- Benchmark Mean Reward (Big & Small Wall): 0.8
## Crawler

- Set-up: A creature with 4 arms and 4 forearms.
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the per-step reward is the product of the
individual reward terms rather than their sum, which pushes the agent to
improve every term instead of only the easiest one.
- Body velocity matches goal velocity. (normalized between (0,1))
- Head direction alignment with goal direction. (normalized between (0,1))
- Behavior Parameters:
- Vector Observation space: 172 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- Actions: 20 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 3000
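The geometric combination above can be sketched in a few lines; `geometric_reward` is a hypothetical helper named for illustration, not a function in the toolkit.

```python
def geometric_reward(*terms):
    """Combine per-step reward terms as a product rather than a sum.
    Each term should already be normalized to (0, 1); if any single
    term is near zero the whole step reward collapses, so the agent
    cannot ignore one objective to farm another."""
    out = 1.0
    for t in terms:
        out *= t
    return out
```

For example, velocity matching of 0.9 with direction alignment of only 0.1 yields a step reward of 0.09, far below what either term alone would suggest.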
## Worm

- Set-up: A worm with a head and 3 body segments.
- Goal: The agent must move its body toward the goal direction.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the per-step reward is the product of the
individual reward terms rather than their sum, which pushes the agent to
improve every term instead of only the easiest one.
- Body velocity matches goal velocity. (normalized between (0,1))
- Body direction alignment with goal direction. (normalized between (0,1))
- Behavior Parameters:
- Vector Observation space: 64 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- Actions: 9 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 800
## Food Collector

- Set-up: A multi-agent environment where agents compete to collect food.
- Goal: The agents must learn to collect as many green food spheres as possible
while avoiding red spheres.
- Agents: The environment contains 5 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
- +1 for interaction with green spheres
- -1 for interaction with red spheres
- Behavior Parameters:
- Vector Observation space: 53 corresponding to velocity of agent (2), whether
agent is frozen and/or shot its laser (2), plus grid based perception of
objects around agent's forward direction (40 by 40 with 6 different categories).
- Actions:
- 3 continuous actions correspond to Forward Motion, Side Motion and Rotation
- 1 discrete action branch for Laser with 2 possible actions corresponding to
Shoot Laser or No Action
- Visual Observations (Optional): First-person camera per-agent, plus one vector
flag representing the frozen state of the agent. This scene uses a combination
of vector and visual observations and the training will not succeed without
the frozen vector flag. Use `VisualFoodCollector` scene.
- Float Properties: Two
- laser_length: Length of the laser used by the agent
- Default: 1
- Recommended Minimum: 0.2
- Recommended Maximum: 7
- agent_scale: Specifies the scale of the agent in the 3 dimensions (equal
across the three dimensions)
- Default: 1
- Recommended Minimum: 0.5
- Recommended Maximum: 5
- Benchmark Mean Reward: 10
## Hallway

- Set-up: Environment where the agent needs to find information in a room,
remember it, and use it to move to the correct goal.
- Goal: Move to the goal which corresponds to the color of the block in the
room.
- Agents: The environment contains one agent.
- Agent Reward Function (independent):
- +1 For moving to correct goal.
- -0.1 For moving to incorrect goal.
- -0.0003 Existential penalty.
- Behavior Parameters:
- Vector Observation space: 30 corresponding to local ray-casts detecting
objects, goals, and walls.
- Actions: 1 discrete action branch, with 4 actions corresponding to agent
rotation and forward/backward movement.
- Float Properties: None
- Benchmark Mean Reward: 0.7
- To train this environment, you can enable curiosity by adding the `curiosity` reward signal
in `config/ppo/Hallway.yaml`
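The curiosity entry goes under the behavior's `reward_signals` block alongside the extrinsic reward; the gamma and strength values below are illustrative:

```yaml
behaviors:
  Hallway:
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        gamma: 0.99
        strength: 0.02
```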
## Soccer Twos

- Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game.
- Goal:
- Get the ball into the opponent's goal while preventing the ball from
entering own goal.
- Agents: The environment contains two different Multi Agent Groups with two agents in each.
Behavior Parameters: SoccerTwos.
- Agent Reward Function (dependent):
- (1 - `accumulated time penalty`) When ball enters opponent's goal
`accumulated time penalty` is incremented by (1 / `MaxStep`) every fixed
update and is reset to 0 at the beginning of an episode.
- -1 When ball enters team's goal.
- Behavior Parameters:
- Vector Observation space: 336 corresponding to 11 ray-casts forward
distributed over 120 degrees and 3 ray-casts backward distributed over 90
degrees each detecting 6 possible object types, along with the object's
distance. The forward ray-casts contribute 264 state dimensions and backward
72 state dimensions over three observation stacks.
- Actions: 3 discrete branched actions corresponding to
forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal
across the three dimensions)
- Default: 7.5
- Recommended minimum: 4
- Recommended maximum: 10
- gravity: Magnitude of the gravity
- Default: 9.81
- Recommended minimum: 6
- Recommended maximum: 20
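The time-penalized scoring reward above reduces to a one-liner; `goal_reward` is a hypothetical helper named for illustration.

```python
def goal_reward(steps_elapsed, max_step):
    """Reward for the scoring team when the ball enters the
    opponent's goal: 1 minus the accumulated time penalty, which
    grows by 1/max_step every fixed update. Scoring early is worth
    close to +1; scoring at the buzzer is worth close to 0."""
    return 1.0 - steps_elapsed / max_step
```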
## Strikers Vs. Goalie

- Set-up: Environment where two agents compete in a 2 vs 1 soccer variant.
- Goal:
- Striker: Get the ball into the opponent's goal.
- Goalie: Keep the ball out of the goal.
- Agents: The environment contains two different Multi Agent Groups, one with two Strikers
and the other with one Goalie. Behavior Parameters: Striker, Goalie.
- Striker Agent Reward Function (dependent):
- +1 When ball enters opponent's goal.
- -0.001 Existential penalty.
- Goalie Agent Reward Function (dependent):
- -1 When ball enters goal.
- +0.001 Existential bonus.
- Behavior Parameters:
- Striker Vector Observation space: 294 corresponding to 11 ray-casts forward
distributed over 120 degrees and 3 ray-casts backward distributed over 90
degrees each detecting 5 possible object types, along with the object's
distance. The forward ray-casts contribute 231 state dimensions and backward
63 state dimensions over three observation stacks.
- Striker Actions: 3 discrete branched actions corresponding
to forward, backward, sideways movement, as well as rotation.
- Goalie Vector Observation space: 738 corresponding to 41 ray-casts
distributed over 360 degrees each detecting 4 possible object types, along
with the object's distance and 3 observation stacks.
- Goalie Actions: 3 discrete branched actions corresponding
to forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two
- ball_scale: Specifies the scale of the ball in the 3 dimensions (equal
across the three dimensions)
- Default: 7.5
- Recommended minimum: 4
- Recommended maximum: 10
- gravity: Magnitude of the gravity
- Default: 9.81
- Recommended minimum: 6
- Recommended maximum: 20
## Walker

- Set-up: Physics-based Humanoid agents with 26 degrees of freedom. These DOFs
correspond to articulation of the following body-parts: hips, chest, spine,
head, thighs, shins, feet, arms, forearms and hands.
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 independent agents with the same Behavior
Parameters.
- Agent Reward Function (independent):
The reward function is geometric: the per-step reward is the product of the
individual reward terms rather than their sum, which pushes the agent to
improve every term instead of only the easiest one.
- Body velocity matches goal velocity. (normalized between (0,1))
- Head direction alignment with goal direction. (normalized between (0,1))
- Behavior Parameters:
- Vector Observation space: 243 variables corresponding to position, rotation,
velocity, and angular velocities of each limb, along with goal direction.
- Actions: 39 continuous actions, corresponding to target
rotations and strength applicable to the joints.
- Visual Observations: None
- Float Properties: Four
- gravity: Magnitude of gravity
- Default: 9.81
- Recommended Minimum:
- Recommended Maximum:
- hip_mass: Mass of the hip component of the walker
- Default: 8
- Recommended Minimum: 7
- Recommended Maximum: 28
- chest_mass: Mass of the chest component of the walker
- Default: 8
- Recommended Minimum: 3
- Recommended Maximum: 20
- spine_mass: Mass of the spine component of the walker
- Default: 8
- Recommended Minimum: 3
- Recommended Maximum: 20
- Benchmark Mean Reward: 2500
## Pyramids

- Set-up: Environment where the agent needs to press a button to spawn a
pyramid, then navigate to the pyramid, knock it over, and move to the gold
brick at the top.
- Goal: Move to the golden brick on top of the spawned pyramid.
- Agents: The environment contains one agent.
- Agent Reward Function (independent):
- +2 For moving to golden brick (minus 0.001 per step).
- Behavior Parameters:
- Vector Observation space: 148 corresponding to local ray-casts detecting
switch, bricks, golden brick, and walls, plus variable indicating switch
state.
- Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and
forward/backward movement.
- Float Properties: None
- Benchmark Mean Reward: 1.75
## Match 3

- Set-up: Simple match-3 game. Matched pieces are removed, and remaining pieces
drop down. New pieces are spawned randomly at the top, with a chance of being
"special".
- Goal: Maximize score from matching pieces.
- Agents: The environment contains several independent Agents.
- Agent Reward Function (independent):
- +0.01 for each normal piece cleared. Special pieces are worth 2x or 3x.
- Behavior Parameters:
- None
- Observations and actions are defined with a sensor and actuator respectively.
- Float Properties: None
- Benchmark Mean Reward:
- 39.5 for visual observations
- 38.5 for vector observations
- 34.2 for simple heuristic (pick a random valid move)
- 37.0 for greedy heuristic (pick the highest-scoring valid move)
## Sorter

- Set-up: The Agent is in a circular room with numbered tiles. The values of the
tiles are random between 1 and 20. The tiles present in the room are randomized
at each episode. When the Agent visits a tile, it turns green.
- Goal: Visit all the tiles in ascending order.
- Agents: The environment contains a single Agent
- Agent Reward Function:
- -0.0002 Existential penalty.
- +1 For visiting the right tile
- -1 For visiting the wrong tile
- Behavior Parameters:
- Vector Observations: 4, corresponding to 2 floats for position and 2 floats
for orientation.
- Variable Length Observations: Between 1 and 20 entities (one for each tile),
each with 23 observations: the first 20 are a one-hot encoding of the tile's value,
the 21st and 22nd give the position of the tile relative to the Agent, and the 23rd
is `1` if the tile was visited and `0` otherwise.
- Actions: 3 discrete branched actions corresponding to forward, backward,
sideways movement, as well as rotation.
- Float Properties: One
- num_tiles: The maximum number of tiles to sample.
- Default: 2
- Recommended Minimum: 1
- Recommended Maximum: 20
- Benchmark Mean Reward: Depends on the number of tiles.
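Each tile entity in the variable-length observation can be assembled as below; `tile_observation` is a hypothetical helper that mirrors the layout described above.

```python
import numpy as np

def tile_observation(value, rel_pos, visited, max_value=20):
    """Build one Sorter tile entity: a one-hot of the tile's value
    (20 slots, values run 1..20), the tile's position relative to
    the agent (2 floats), and a visited flag -- 23 numbers per tile."""
    one_hot = np.zeros(max_value, dtype=np.float32)
    one_hot[value - 1] = 1.0
    return np.concatenate([one_hot,
                           np.asarray(rel_pos, dtype=np.float32),
                           [1.0 if visited else 0.0]])
```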
## Cooperative Push Block

- Set-up: Similar to Push Block, the agents are in an area with blocks that need
to be pushed into a goal. Small blocks can be pushed by one agent and are worth
+1, medium blocks require two agents to push and are worth +2, and large
blocks require all three agents to push and are worth +3.
- Goal: Push all blocks into the goal.
- Agents: The environment contains three Agents in a Multi Agent Group.
- Agent Reward Function:
- -0.0001 Existential penalty, as a group reward.
- +1, +2, or +3 for pushing in a block, added as a group reward.
- Behavior Parameters:
- Observation space: A single Grid Sensor with separate tags for each block size,
the goal, the walls, and other agents.
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
and counterclockwise, move along four different face directions, or do nothing.
- Float Properties: None
- Benchmark Mean Reward: 11 (Group Reward)
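Group rewards credit every agent in the Multi Agent Group identically, so the existential penalty and the block bonuses are shared. A minimal sketch of that bookkeeping; the `AgentGroup` class here is illustrative, not the toolkit's own group API.

```python
class AgentGroup:
    """Minimal sketch of group-reward bookkeeping: every agent in
    the group receives the same reward signal, mirroring how a
    Multi Agent Group credits the whole team for a pushed-in block."""

    def __init__(self, n_agents):
        self.rewards = [0.0] * n_agents

    def add_group_reward(self, r):
        # The same reward is added to every member's accumulator.
        self.rewards = [acc + r for acc in self.rewards]
```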
## Dungeon Escape

- Set-up: Agents are trapped in a dungeon with a dragon, and must work together to escape.
To retrieve the key, one of the agents must find and slay the dragon, sacrificing itself
to do so. The dragon will drop a key for the others to use. The other agents can then pick
up this key and unlock the dungeon door. If the agents take too long, the dragon will escape
through a portal and the environment resets.
- Goal: Unlock the dungeon door and leave.
- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which
moves in a predetermined pattern.
- Agent Reward Function:
- +1 group reward if any agent successfully unlocks the door and leaves the dungeon.
- Behavior Parameters:
- Observation space: A Ray Perception Sensor with separate tags for the walls, other agents,
the door, key, the dragon, and the dragon's portal. A single Vector Observation which indicates
whether the agent is holding a key.
- Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
and counterclockwise, move along four different face directions, or do nothing.
- Float Properties: None
- Benchmark Mean Reward: 1.0 (Group Reward)
|