|
# Agents |
|
|
|
**Table of Contents:** |
|
|
|
- [Decisions](#decisions) |
|
- [Observations and Sensors](#observations-and-sensors) |
|
- [Generating Observations](#generating-observations) |
|
- [Agent.CollectObservations()](#agentcollectobservations) |
|
- [Observable Fields and Properties](#observable-fields-and-properties) |
|
- [ISensor interface and SensorComponents](#isensor-interface-and-sensorcomponents) |
|
- [Vector Observations](#vector-observations) |
|
- [One-hot encoding categorical information](#one-hot-encoding-categorical-information) |
|
- [Normalization](#normalization) |
|
- [Stacking](#stacking) |
|
- [Vector Observation Summary & Best Practices](#vector-observation-summary--best-practices) |
|
- [Visual Observations](#visual-observations) |
|
- [Visual Observation Summary & Best Practices](#visual-observation-summary--best-practices) |
|
- [Raycast Observations](#raycast-observations) |
|
- [RayCast Observation Summary & Best Practices](#raycast-observation-summary--best-practices)
- [Grid Observations](#grid-observations)
- [Grid Observation Summary & Best Practices](#grid-observation-summary--best-practices)
|
- [Variable Length Observations](#variable-length-observations) |
|
- [Variable Length Observation Summary & Best Practices](#variable-length-observation-summary--best-practices) |
|
- [Goal Signal](#goal-signal) |
|
- [Goal Signal Summary & Best Practices](#goal-signal-summary--best-practices) |
|
- [Actions and Actuators](#actions-and-actuators) |
|
- [Continuous Actions](#continuous-actions) |
|
- [Discrete Actions](#discrete-actions) |
|
- [Masking Discrete Actions](#masking-discrete-actions)
- [IActuator interface and ActuatorComponents](#iactuator-interface-and-actuatorcomponents)
|
- [Actions Summary & Best Practices](#actions-summary--best-practices) |
|
- [Rewards](#rewards) |
|
- [Examples](#examples) |
|
- [Rewards Summary & Best Practices](#rewards-summary--best-practices) |
|
- [Agent Properties](#agent-properties) |
|
- [Destroying an Agent](#destroying-an-agent) |
|
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios) |
|
- [Teams for Adversarial Scenarios](#teams-for-adversarial-scenarios) |
|
- [Groups for Cooperative Scenarios](#groups-for-cooperative-scenarios) |
|
- [Recording Demonstrations](#recording-demonstrations) |
|
|
|
An agent is an entity that can observe its environment, decide on the best |
|
course of action using those observations, and execute those actions within its |
|
environment. Agents can be created in Unity by extending the `Agent` class. The |
|
most important aspects of creating agents that can successfully learn are the |
|
observations the agent collects, and the reward you assign to estimate the value |
|
of the agent's current state toward accomplishing its tasks. |
|
|
|
An Agent passes its observations to its Policy. The Policy then makes a decision |
|
and passes the chosen action back to the agent. Your agent code must execute the |
|
action, for example, move the agent in one direction or another. In order to |
|
[train an agent using reinforcement learning](Learning-Environment-Design.md), |
|
your agent must calculate a reward value at each action. The reward is used to |
|
discover the optimal decision-making policy. |
|
|
|
The `Policy` class abstracts out the decision making logic from the Agent itself |
|
so that you can use the same Policy in multiple Agents. How a Policy makes its |
|
decisions depends on the `Behavior Parameters` associated with the agent. If you |
|
set `Behavior Type` to `Heuristic Only`, the Agent will use its `Heuristic()` |
|
method to make decisions, which can allow you to control the Agent manually or
|
write your own Policy. If the Agent has a `Model` file, its Policy will use the |
|
neural network `Model` to make decisions.
|
|
|
When you create an Agent, you should usually extend the base Agent class. This |
|
includes implementing the following methods: |
|
|
|
- `Agent.OnEpisodeBegin()` - Called at the beginning of an Agent's episode,
|
including at the beginning of the simulation. |
|
- `Agent.CollectObservations(VectorSensor sensor)` - Called every step that the Agent
|
requests a decision. This is one possible way for collecting the Agent's |
|
observations of the environment; see [Generating Observations](#generating-observations) |
|
below for more options. |
|
- `Agent.OnActionReceived()` - Called every time the Agent receives an action to
|
take. Receives the action chosen by the Agent. It is also common to assign a |
|
reward in this method. |
|
- `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in |
|
the Behavior Parameters of the Agent, the Agent will use the `Heuristic()` |
|
method to generate the actions of the Agent. As such, the `Heuristic()` method |
|
writes to the array of floats provided to it as an argument.
|
__Note__: Do not create a new float array of actions in the `Heuristic()` method,
|
as this will prevent writing floats to the original action array (see the sketch below).
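
A minimal sketch of such a `Heuristic()` override, assuming an agent with two
continuous actions mapped to the standard Unity input axes (the exact method
signature can vary between ML-Agents versions):

```csharp
public override void Heuristic(in ActionBuffers actionsOut)
{
    // Write into the provided buffer; do not allocate a new array.
    var continuousActionsOut = actionsOut.ContinuousActions;
    continuousActionsOut[0] = Input.GetAxis("Horizontal");
    continuousActionsOut[1] = Input.GetAxis("Vertical");
}
```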
|
|
|
As a concrete example, here is how the Ball3DAgent class implements these methods: |
|
|
|
- `Agent.OnEpisodeBegin()` - Resets the agent cube and ball to their starting
|
positions. The function randomizes the reset values so that the training |
|
generalizes to more than a specific starting position and agent cube |
|
orientation (see the sketch after this list).
|
- `Agent.CollectObservations(VectorSensor sensor)` - Adds information about the
|
orientation of the agent cube, the ball velocity, and the relative position |
|
between the ball and the cube. Since the `CollectObservations()` |
|
method calls `VectorSensor.AddObservation()` such that the vector size adds up
  to 8, the Behavior Parameters of the Agent are set with a vector observation
  space of size 8.
|
- `Agent.OnActionReceived()` - The action results
|
in a small change in the agent cube's rotation at each step. In this example, |
|
an Agent receives a small positive reward for each step it keeps the ball on the |
|
agent cube's head and a larger, negative reward for dropping the ball. An |
|
Agent's episode is also ended when it drops the ball so that it will reset |
|
with a new ball for the next simulation step. |
|
- `Agent.Heuristic()` - Converts the keyboard inputs into actions. |
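
A hedged sketch of what an `OnEpisodeBegin()` reset of this kind could look
like. The `ball` and `m_BallRb` fields follow the other snippets on this page,
and the specific reset values are illustrative, not the actual Ball3DAgent
numbers:

```csharp
public override void OnEpisodeBegin()
{
    // Randomize the cube's starting orientation so training generalizes
    // beyond a single starting pose.
    gameObject.transform.rotation = Quaternion.Euler(
        Random.Range(-10f, 10f), 0f, Random.Range(-10f, 10f));
    // Put the ball back above the cube and remove any leftover motion.
    ball.transform.position = gameObject.transform.position + Vector3.up * 4f;
    m_BallRb.velocity = Vector3.zero;
    m_BallRb.angularVelocity = Vector3.zero;
}
```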
|
|
|
## Decisions |
|
|
|
The observation-decision-action-reward cycle repeats each time the Agent requests
|
a decision. Agents will request a decision when `Agent.RequestDecision()` is |
|
called. If you need the Agent to request decisions on its own at regular |
|
intervals, add a `Decision Requester` component to the Agent's GameObject. |
|
Making decisions at regular step intervals is generally most appropriate for |
|
physics-based simulations. For example, an agent in a robotic simulator that |
|
must provide fine-control of joint torques should make its decisions every step |
|
of the simulation. In games such as real-time strategy, where many agents make |
|
their decisions at regular intervals, the decision timing for each agent can be |
|
staggered by setting the `DecisionStep` parameter in the `Decision Requester` |
|
component for each agent. On the other hand, an agent that only needs to make |
|
decisions when certain game or simulation events occur, such as in a turn-based |
|
game, should call `Agent.RequestDecision()` manually. |
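
As a sketch, in a turn-based game you might trigger decisions from your own
game logic instead of using a `Decision Requester` component. The
`OnTurnStarted()` hook below is hypothetical; only `RequestDecision()` comes
from the Agent API:

```csharp
// Hypothetical hook, called by your turn manager when it becomes
// this agent's turn.
public void OnTurnStarted()
{
    // Ask the Policy for a decision only when one is actually needed.
    RequestDecision();
}
```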
|
|
|
## Observations and Sensors |
|
In order for an agent to learn, the observations should include all the |
|
information an agent needs to accomplish its task. Without sufficient and |
|
relevant information, an agent may learn poorly or may not learn at all. A |
|
reasonable approach for determining what information should be included is to |
|
consider what you would need to calculate an analytical solution to the problem, |
|
or what you would expect a human to be able to use to solve the problem. |
|
|
|
### Generating Observations |
|
ML-Agents provides multiple ways for an Agent to make observations: |
|
1. Overriding the `Agent.CollectObservations()` method and passing the |
|
observations to the provided `VectorSensor`. |
|
1. Adding the `[Observable]` attribute to fields and properties on the Agent. |
|
1. Implementing the `ISensor` interface, using a `SensorComponent` attached to |
|
the Agent to create the `ISensor`. |
|
|
|
#### Agent.CollectObservations() |
|
`Agent.CollectObservations()` is best used for aspects of the environment which are
|
numerical and non-visual. The Policy class calls the |
|
`CollectObservations(VectorSensor sensor)` method of each Agent. Your |
|
implementation of this function must call `VectorSensor.AddObservation` to add |
|
vector observations. |
|
|
|
The `VectorSensor.AddObservation` method provides a number of overloads for |
|
adding common types of data to your observation vector. You can add Integers and |
|
booleans directly to the observation vector, as well as some common Unity data |
|
types such as `Vector2`, `Vector3`, and `Quaternion`. |
|
|
|
For examples of various state observation functions, you can look at the |
|
[example environments](Learning-Environment-Examples.md) included in the |
|
ML-Agents SDK. For instance, the 3DBall example uses the rotation of the |
|
platform, the relative position of the ball, and the velocity of the ball as its |
|
state observation. |
|
|
|
```csharp |
|
public GameObject ball; |
|
|
|
public override void CollectObservations(VectorSensor sensor) |
|
{ |
|
// Orientation of the cube (2 floats) |
|
sensor.AddObservation(gameObject.transform.rotation.z); |
|
sensor.AddObservation(gameObject.transform.rotation.x); |
|
// Relative position of the ball to the cube (3 floats) |
|
sensor.AddObservation(ball.transform.position - gameObject.transform.position); |
|
// Velocity of the ball (3 floats) |
|
sensor.AddObservation(m_BallRb.velocity); |
|
// 8 floats total |
|
} |
|
``` |
|
|
|
As an experiment, you can remove the velocity components from |
|
the observation and retrain the 3DBall agent. While it will learn to balance the |
|
ball reasonably well, the performance of the agent without using velocity is |
|
noticeably worse. |
|
|
|
The observations passed to `VectorSensor.AddObservation()` must always contain |
|
the same number of elements and must always be in the same order. If the number
|
of observed entities in an environment can vary, you can pad the calls |
|
with zeros for any missing entities in a specific observation, or you can limit |
|
an agent's observations to a fixed subset. For example, instead of observing |
|
every enemy in an environment, you could only observe the closest five. |
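
A hedged sketch of this "closest five" approach, padding with zeros when fewer
enemies are present. The `GetClosestEnemies()` helper is hypothetical:

```csharp
const int k_MaxObservedEnemies = 5;

public override void CollectObservations(VectorSensor sensor)
{
    // Hypothetical helper returning up to k_MaxObservedEnemies enemies,
    // sorted by distance to this agent.
    var closest = GetClosestEnemies(k_MaxObservedEnemies);
    for (var i = 0; i < k_MaxObservedEnemies; i++)
    {
        if (i < closest.Count)
        {
            // Relative position of the observed enemy (3 floats).
            sensor.AddObservation(closest[i].transform.position - transform.position);
        }
        else
        {
            // Pad with zeros so the observation size stays fixed.
            sensor.AddObservation(Vector3.zero);
        }
    }
}
```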
|
|
|
Additionally, when you set up an Agent's `Behavior Parameters` in the Unity |
|
Editor, you must set the **Vector Observations > Space Size** |
|
to equal the number of floats that are written by `CollectObservations()`. |
|
|
|
#### Observable Fields and Properties |
|
Another approach is to define the relevant observations as fields or properties |
|
on your Agent class, and annotate them with an `ObservableAttribute`. For |
|
example, in the Ball3DHardAgent, the difference between positions could be observed |
|
by adding a property to the Agent: |
|
```csharp |
|
using Unity.MLAgents.Sensors.Reflection; |
|
|
|
public class Ball3DHardAgent : Agent { |
|
|
|
[Observable(numStackedObservations: 9)] |
|
Vector3 PositionDelta |
|
{ |
|
get |
|
{ |
|
return ball.transform.position - gameObject.transform.position; |
|
} |
|
} |
|
} |
|
``` |
|
`ObservableAttribute` currently supports most basic types (e.g. floats, ints, |
|
bools), as well as `Vector2`, `Vector3`, `Vector4`, `Quaternion`, and enums. |
|
|
|
The behavior of `ObservableAttribute`s is controlled by the "Observable Attribute
Handling" setting in the Agent's `Behavior Parameters`. The possible values for this are:
|
* **Ignore** (default) - All ObservableAttributes on the Agent will be ignored. |
|
If there are no ObservableAttributes on the Agent, this will result in the |
|
fastest initialization time. |
|
* **Exclude Inherited** - Only members on the declared class will be examined; |
|
members that are inherited are ignored. This is a reasonable tradeoff between |
|
performance and flexibility. |
|
* **Examine All** - All members on the class will be examined. This can lead to
|
slower startup times. |
|
|
|
"Exclude Inherited" is generally sufficient, but if your Agent inherits from |
|
another Agent implementation that has Observable members, you will need to use |
|
"Examine All". |
|
|
|
Internally, ObservableAttribute uses reflection to determine which members of |
|
the Agent have ObservableAttributes, and also uses reflection to access the |
|
fields or invoke the properties at runtime. This may be slower than using |
|
CollectObservations or an ISensor, although this might not be enough to |
|
noticeably affect performance. |
|
|
|
**NOTE**: you do not need to adjust the Space Size in the Agent's |
|
`Behavior Parameters` when you add `[Observable]` fields or properties to an |
|
Agent, since their size can be computed before they are used. |
|
|
|
#### ISensor interface and SensorComponents |
|
The `ISensor` interface is generally intended for advanced users. The `Write()` |
|
method is used to actually generate the observation, but some other methods |
|
such as returning the shape of the observations must also be implemented. |
|
|
|
The `SensorComponent` abstract class is used to create the actual `ISensor` at |
|
runtime. It must be attached to the same `GameObject` as the `Agent`, or to a |
|
child `GameObject`. |
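
As a rough sketch, a trivial `ISensor` that writes a single constant float
might look like the following. This follows the interface shape of recent
ML-Agents releases; member names may differ in other versions:

```csharp
using Unity.MLAgents.Sensors;

public class ConstantSensor : ISensor
{
    // Shape of the observation: a vector with one element.
    public ObservationSpec GetObservationSpec() => ObservationSpec.Vector(1);

    // Write the observation and return the number of floats written.
    public int Write(ObservationWriter writer)
    {
        writer[0] = 0.5f;
        return 1;
    }

    // This sensor provides no compressed form.
    public byte[] GetCompressedObservation() => null;
    public CompressionSpec GetCompressionSpec() => CompressionSpec.Default();

    public void Update() { }
    public void Reset() { }
    public string GetName() => "ConstantSensor";
}
```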
|
|
|
There are several SensorComponents provided in the API, including: |
|
- `CameraSensorComponent` - Uses images from a `Camera` as observations. |
|
- `RenderTextureSensorComponent` - Uses the content of a `RenderTexture` as |
|
observations. |
|
- `RayPerceptionSensorComponent` - Uses the information from a set of ray casts
|
as observations. |
|
- `Match3SensorComponent` - Uses the board of a [Match-3 game](Integrations-Match3.md) |
|
as observations. |
|
- `GridSensorComponent` - Uses a set of box queries in a grid shape as |
|
observations. |
|
|
|
**NOTE**: you do not need to adjust the Space Size in the Agent's |
|
`Behavior Parameters` when using `SensorComponent`s.
|
|
|
Internally, both `Agent.CollectObservations` and the `[Observable]` attribute use
`ISensor`s to write observations, although this is mostly abstracted from the user.
|
|
|
### Vector Observations |
|
Both `Agent.CollectObservations()` and `ObservableAttribute`s produce vector |
|
observations, which are represented as lists of `float`s. `ISensor`s can
|
produce both vector observations and visual observations, which are |
|
multi-dimensional arrays of floats. |
|
|
|
Below are some additional considerations when dealing with vector observations: |
|
|
|
#### One-hot encoding categorical information |
|
|
|
Type enumerations should be encoded in the _one-hot_ style. That is, add an |
|
element to the feature vector for each element of the enumeration, setting the
|
element representing the observed member to one and set the rest to zero. For |
|
example, if your enumeration contains \[Sword, Shield, Bow\] and the agent |
|
observes that the current item is a Bow, you would add the elements: 0, 0, 1 to |
|
the feature vector. The following code example illustrates how to add this:
|
|
|
```csharp |
|
enum ItemType { Sword, Shield, Bow, LastItem } |
|
public override void CollectObservations(VectorSensor sensor) |
|
{ |
|
for (int ci = 0; ci < (int)ItemType.LastItem; ci++) |
|
{ |
|
sensor.AddObservation((int)currentItem == ci ? 1.0f : 0.0f); |
|
} |
|
} |
|
``` |
|
|
|
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as |
|
a shortcut for _one-hot_ style observations. The following example is identical |
|
to the previous one. |
|
|
|
```csharp |
|
enum ItemType { Sword, Shield, Bow, LastItem } |
|
const int NUM_ITEM_TYPES = (int)ItemType.LastItem;  // Sword, Shield, Bow
|
|
|
public override void CollectObservations(VectorSensor sensor) |
|
{ |
|
// The first argument is the selection index; the second is the |
|
// number of possibilities |
|
sensor.AddOneHotObservation((int)currentItem, NUM_ITEM_TYPES); |
|
} |
|
``` |
|
|
|
`ObservableAttribute` has built-in support for enums. Note that you don't need |
|
the `LastItem` placeholder in this case: |
|
```csharp |
|
enum ItemType { Sword, Shield, Bow } |
|
|
|
public class HeroAgent : Agent |
|
{ |
|
[Observable] |
|
ItemType m_CurrentItem; |
|
} |
|
``` |
|
|
|
#### Normalization |
|
|
|
For the best results when training, you should normalize the components of your |
|
feature vector to the range [-1, +1] or [0, 1]. When you normalize the values, |
|
the PPO neural network can often converge to a solution faster. Note that it |
|
isn't always necessary to normalize to these recommended ranges, but it is |
|
considered a best practice when using neural networks. The greater the variation |
|
in ranges between the components of your observation, the more likely that |
|
training will be affected. |
|
|
|
To normalize a value to [0, 1], you can use the following formula: |
|
|
|
```csharp |
|
normalizedValue = (currentValue - minValue)/(maxValue - minValue) |
|
``` |
|
|
|
:warning: For vectors, you should apply the above formula to each component (x, |
|
y, and z). Note that this is _not_ the same as using the `Vector3.normalized` |
|
property or `Vector3.Normalize()` method in Unity (and similar for `Vector2`). |
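
For example, normalizing an agent's `x` position before observing it might look
like this, where `minX` and `maxX` are hypothetical fields describing the
bounds of your playable area:

```csharp
// Normalize the x position from [minX, maxX] to [0, 1]
// before adding it as an observation.
var normalizedX = (transform.position.x - minX) / (maxX - minX);
sensor.AddObservation(normalizedX);
```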
|
|
|
Rotations and angles should also be normalized. For angles between 0 and 360 |
|
degrees, you can use the following formulas: |
|
|
|
```csharp |
|
Quaternion rotation = transform.rotation;
// Map Euler angles from [0, 360] to [-1, 1]:
Vector3 normalizedSigned = rotation.eulerAngles / 180.0f - Vector3.one;
// Or map them to [0, 1]:
Vector3 normalizedUnsigned = rotation.eulerAngles / 360.0f;
|
``` |
|
|
|
For angles that can be outside the range [0,360], you can either reduce the |
|
angle, or, if the number of turns is significant, increase the maximum value |
|
used in your normalization formula. |
|
|
|
#### Stacking |
|
Stacking refers to repeating observations from previous steps as part of a |
|
larger observation. For example, consider an Agent that generates these |
|
observations in four steps |
|
``` |
|
step 1: [0.1] |
|
step 2: [0.2] |
|
step 3: [0.3] |
|
step 4: [0.4] |
|
``` |
|
|
|
If we use a stack size of 3, the observations would instead be: |
|
```
|
step 1: [0.1, 0.0, 0.0] |
|
step 2: [0.2, 0.1, 0.0] |
|
step 3: [0.3, 0.2, 0.1] |
|
step 4: [0.4, 0.3, 0.2] |
|
``` |
|
(The observations are padded with zeroes for the first `stackSize-1` steps). |
|
This is a simple way to give an Agent limited "memory" without the complexity |
|
of adding a recurrent neural network (RNN). |
|
|
|
The steps for enabling stacking depend on how you generate observations:
|
* For Agent.CollectObservations(), set "Stacked Vectors" on the Agent's |
|
`Behavior Parameters` to a value greater than 1. |
|
* For ObservableAttribute, set the `numStackedObservations` parameter in the |
|
constructor, e.g. `[Observable(numStackedObservations: 2)]`. |
|
* For `ISensor`s, wrap them in a `StackingSensor` (which is also an `ISensor`). |
|
  Generally, this should happen in the `CreateSensor()` method of your
  `SensorComponent`, as in the sketch below.
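
A hedged fragment showing the wrapping step. `MyCustomSensor` is a hypothetical
`ISensor`, and the exact sensor-creation method of `SensorComponent` (for
example `CreateSensor()` vs. `CreateSensors()`) varies between ML-Agents
versions:

```csharp
// Inside your SensorComponent's sensor-creation method:
var wrapped = new MyCustomSensor();            // hypothetical ISensor
var stacked = new StackingSensor(wrapped, 4);  // stack the last 4 observations
```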
|
|
|
#### Vector Observation Summary & Best Practices |
|
|
|
- Vector Observations should include all variables relevant for allowing the |
|
agent to take the optimally informed decision, and ideally no extraneous |
|
information. |
|
- In cases where Vector Observations need to be remembered or compared over |
|
time, either an RNN should be used in the model, or the `Stacked Vectors` |
|
value in the agent GameObject's `Behavior Parameters` should be changed. |
|
- Categorical variables such as type of object (Sword, Shield, Bow) should be |
|
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done |
|
automatically using the `AddOneHotObservation()` method of the `VectorSensor`, |
|
or using `[Observable]` on an enum field or property of the Agent. |
|
- In general, all inputs should be normalized to be in the range 0 to +1 (or -1 |
|
to 1). For example, the `x` position information of an agent where the maximum |
|
possible value is `maxValue` should be recorded as |
|
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than |
|
`VectorSensor.AddObservation(transform.position.x);`. |
|
- Positional information of relevant GameObjects should be encoded in relative |
|
coordinates wherever possible. This is often relative to the agent position. |
|
|
|
### Visual Observations |
|
|
|
Visual observations are generally provided to an agent via either a `CameraSensor`
or `RenderTextureSensor`. These collect image information and transform it into
|
a 3D Tensor which can be fed into the convolutional neural network (CNN) of the |
|
agent policy. For more information on CNNs, see |
|
[this guide](http://cs231n.github.io/convolutional-networks/). This allows |
|
agents to learn from spatial regularities in the observation images. It is |
|
possible to use visual and vector observations with the same agent. |
|
|
|
Agents using visual observations can capture state of arbitrary complexity and |
|
are useful when the state is difficult to describe numerically. However, they |
|
are also typically less efficient and slower to train, and sometimes fail to
learn at all, compared to vector observations. As such, they should only be
|
used when it is not possible to properly define the problem using vector or |
|
ray-cast observations. |
|
|
|
Visual observations can be derived from Cameras or RenderTextures within your |
|
scene. To add a visual observation to an Agent, add either a Camera Sensor |
|
Component or RenderTexture Sensor Component to the Agent. Then drag the camera
|
or render texture you want to add to the `Camera` or `RenderTexture` field. You |
|
can have more than one camera or render texture and even use a combination of |
|
both attached to an Agent. For each visual observation, set the width and height |
|
of the image (in pixels) and whether or not the observation is color or |
|
grayscale. |
|
|
|
![Agent Camera](images/visual-observation.png) |
|
|
|
or |
|
|
|
![Agent RenderTexture](images/visual-observation-rendertexture.png) |
|
|
|
Each Agent that uses the same Policy must have the same number of visual |
|
observations, and they must all have the same resolutions (including whether or |
|
not they are grayscale). Additionally, each Sensor Component on an Agent must |
|
have a unique name so that they can be sorted deterministically (the name must |
|
be unique for that Agent, but multiple Agents can have a Sensor Component with |
|
the same name). |
|
|
|
Visual observations also support stacking, by specifying `Observation Stacks` |
|
to a value greater than 1. The visual observations from the last `stackSize` |
|
steps will be stacked on the last dimension (channel dimension). |
|
|
|
When using `RenderTexture` visual observations, a handy feature for debugging is |
|
adding a `Canvas`, then adding a `Raw Image` with its texture set to the
|
Agent's `RenderTexture`. This will render the agent observation on the game |
|
screen. |
|
|
|
![RenderTexture with Raw Image](images/visual-observation-rawimage.png) |
|
|
|
The [GridWorld environment](Learning-Environment-Examples.md#gridworld) is an |
|
example on how to use a RenderTexture for both debugging and observation. Note |
|
that in this example, a Camera is rendered to a RenderTexture, which is then |
|
used for observations and debugging. To update the RenderTexture, the Camera |
|
must be asked to render every time a decision is requested within the game code. |
|
When using Cameras as observations directly, this is done automatically by the |
|
Agent. |
|
|
|
![Agent RenderTexture Debug](images/gridworld.png) |
|
|
|
#### Visual Observation Summary & Best Practices |
|
|
|
- To collect visual observations, attach `CameraSensor` or `RenderTextureSensor` |
|
components to the agent GameObject. |
|
- Visual observations should generally only be used when vector observations are |
|
not sufficient. |
|
- Image size should be kept as small as possible, without the loss of needed |
|
details for decision making. |
|
- Images should be made grayscale in situations where color information is not |
|
needed for making informed decisions. |
|
|
|
### Raycast Observations |
|
|
|
Raycasts are another possible method for providing observations to an agent. |
|
This can be easily implemented by adding a `RayPerceptionSensorComponent3D` (or |
|
`RayPerceptionSensorComponent2D`) to the Agent GameObject. |
|
|
|
During observations, several rays (or spheres, depending on settings) are cast |
|
into the physics world, and the objects that are hit determine the observation |
|
vector that is produced. |
|
|
|
![Agent with two RayPerceptionSensorComponent3Ds](images/ray_perception.png) |
|
|
|
Both sensor components have several settings: |
|
|
|
- _Detectable Tags_ A list of strings corresponding to the types of objects that |
|
the Agent should be able to distinguish between. For example, in the WallJump |
|
example, we use "wall", "goal", and "block" as the list of objects to detect. |
|
- _Rays Per Direction_ Determines the number of rays that are cast. One ray is |
|
always cast forward, and this many rays are cast to the left and right. |
|
- _Max Ray Degrees_ The angle (in degrees) for the outermost rays. 90 degrees |
|
corresponds to the left and right of the agent. |
|
- _Sphere Cast Radius_ The size of the sphere used for sphere casting. If set to |
|
0, rays will be used instead of spheres. Rays may be more efficient, |
|
especially in complex scenes. |
|
- _Ray Length_ The length of the casts.
|
- _Ray Layer Mask_ The [LayerMask](https://docs.unity3d.com/ScriptReference/LayerMask.html) |
|
passed to the raycast or spherecast. This can be used to ignore certain types |
|
of objects when casting. |
|
- _Observation Stacks_ The number of previous results to "stack" with the cast |
|
results. Note that this can be independent of the "Stacked Vectors" setting in |
|
`Behavior Parameters`. |
|
- _Start Vertical Offset_ (3D only) The vertical offset of the ray start point. |
|
- _End Vertical Offset_ (3D only) The vertical offset of the ray end point. |
|
- _Alternating Ray Order_ Alternating is the default; it gives an order of (0,
  -delta, delta, -2*delta, 2*delta, ..., -n*delta, n*delta). If alternating is
  disabled, the order is left to right (-n*delta, -(n-1)*delta, ..., -delta, 0,
  delta, ..., (n-1)*delta, n*delta). For general usage there is no difference,
  but when using custom models the left-to-right layout may be preferable
  because it matches the spatial structure of the rays (e.g. for processing
  with conv nets).
|
- _Use Batched Raycasts_ (3D only) Whether to use batched raycasts. Enable to use batched raycasts and the jobs system. |
|
|
|
In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s. |
|
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components has
|
a vertical offset, so the Agent can tell whether it's clear to jump over the |
|
wall. |
|
|
|
The total size of the created observations is |
|
|
|
``` |
|
(Observation Stacks) * (1 + 2 * Rays Per Direction) * (Num Detectable Tags + 2) |
|
``` |
|
|
|
so the number of rays and tags should be kept as small as possible to reduce the |
|
amount of data used. Note that this is separate from the State Size defined in |
|
`Behavior Parameters`, so you don't need to worry about the formula above when |
|
setting the State Size. |
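
For example, a sensor with 3 Rays Per Direction, 2 Detectable Tags, and 1
Observation Stack produces `1 * (1 + 2 * 3) * (2 + 2) = 28` floats per
observation.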
|
|
|
#### RayCast Observation Summary & Best Practices |
|
|
|
- Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to |
|
use. |
|
- This observation type is best used when there is relevant spatial information |
|
for the agent that doesn't require a fully rendered image to convey. |
|
- Use as few rays and tags as necessary to solve the problem in order to improve |
|
learning stability and agent performance. |
|
- If you run into performance issues, try using batched raycasts by enabling the _Use Batched Raycasts_ setting.
|
(Only available for 3D ray perception sensors.) |
|
|
|
### Grid Observations |
|
Grid-based observations combine the advantages of the 2D spatial representation in
|
visual observations, and the flexibility of defining detectable objects in |
|
RayCast observations. The sensor uses a set of box queries in a grid shape and |
|
gives a top-down 2D view around the agent. This can be implemented by adding a |
|
`GridSensorComponent` to the Agent GameObject. |
|
|
|
During observations, the sensor detects the presence of detectable objects in |
|
each cell and encodes that into a one-hot representation. The collected information
|
from each cell forms a 3D tensor observation and will be fed into the |
|
convolutional neural network (CNN) of the agent policy just like visual |
|
observations. |
|
|
|
![Agent with GridSensorComponent](images/grid_sensor.png) |
|
|
|
The sensor component has the following settings: |
|
- _Cell Scale_ The scale of each cell in the grid. |
|
- _Grid Size_ Number of cells on each side of the grid. |
|
- _Agent Game Object_ The Agent that holds the grid sensor. This is used to |
|
disambiguate objects with the same tag as the agent so that the agent doesn't |
|
detect itself. |
|
- _Rotate With Agent_ Whether the grid rotates with the Agent. |
|
- _Detectable Tags_ A list of strings corresponding to the types of objects that |
|
the Agent should be able to distinguish between. |
|
- _Collider Mask_ The [LayerMask](https://docs.unity3d.com/ScriptReference/LayerMask.html) |
|
passed to the collider detection. This can be used to ignore certain types |
|
of objects. |
|
- _Initial Collider Buffer Size_ The initial size of the Collider buffer used |
|
in the non-allocating Physics calls for each cell. |
|
- _Max Collider Buffer Size_ The max size of the Collider buffer used in the |
|
non-allocating Physics calls for each cell. |
|
|
|
The observation for each grid cell is a one-hot encoding of the detected object. |
|
The total size of the created observations is |
|
|
|
``` |
|
GridSize.x * GridSize.z * Num Detectable Tags |
|
``` |
|
|
|
so the number of detectable tags and size of the grid should be kept as small as |
|
possible to reduce the amount of data used. This makes a trade-off between the |
|
granularity of the observation and training speed. |
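
For example, a 20x20 grid with 3 Detectable Tags produces `20 * 20 * 3 = 1200`
values per observation.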
|
|
|
To allow a greater variety of observations to be captured by the grid sensor, the
`GridSensorComponent` and the underlying `GridSensorBase` also provide interfaces
that can be overridden to collect customized observations from detected objects.
|
See the doc on |
|
[extending grid Sensors](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/com.unity.ml-agents.extensions/Documentation~/CustomGridSensors.md) |
|
for more details on custom grid sensors. |
|
|
|
__Note__: The `GridSensor` only works in 3D environments and will not behave |
|
properly in 2D environments. |
|
|
|
#### Grid Observation Summary & Best Practices |
|
|
|
- Attach `GridSensorComponent` to use. |
|
- This observation type is best used when there is relevant non-visual spatial information that |
|
can be best captured in 2D representations. |
|
- Use as small a grid size and as few tags as necessary to solve the problem in order to improve
|
learning stability and agent performance. |
|
- Do not use `GridSensor` in a 2D game. |
|
|
|
### Variable Length Observations |
|
|
|
It is possible for agents to collect observations from a varying number of |
|
GameObjects by using a `BufferSensor`. |
|
You can add a `BufferSensor` to your Agent by adding a `BufferSensorComponent` to |
|
its GameObject. |
|
The `BufferSensor` can be useful in situations in which the Agent must pay |
|
attention to a varying number of entities (for example, a varying number of |
|
enemies or projectiles). |
|
On the trainer side, the `BufferSensor` |
|
is processed using an attention module. More information about attention |
|
mechanisms can be found [here](https://arxiv.org/abs/1706.03762). Training or |
|
doing inference with variable length observations can be slower than using |
|
a flat vector observation. However, attention mechanisms enable solving |
|
problems that require comparative reasoning between entities in a scene |
|
such as our [Sorter environment](Learning-Environment-Examples.md#sorter). |
|
Note that even though the `BufferSensor` can process a variable number of |
|
entities, you still need to define a maximum number of entities. This is |
|
because our network architecture requires knowing what the shape of the
|
observations will be. If fewer entities are observed than the maximum, the |
|
observation will be padded with zeros and the trainer will ignore |
|
the padded observations. Note that attention layers are invariant to |
|
the order of the entities, so there is no need to properly "order" the |
|
entities before feeding them into the `BufferSensor`. |
|
|
|
The `BufferSensorComponent` Editor inspector has two arguments: |
|
- `Observation Size` : This is how many floats each entity will be
  represented with. This number is fixed and all entities must
  have the same representation. For example, if the relevant information
  for the entities you want to put into the `BufferSensor` is their position
  and velocity, then the `Observation Size` should be 6 floats.
|
- `Maximum Number of Entities` : This is the maximum number of entities |
|
the `BufferSensor` will be able to collect. |
|
|
|
To add an entity's observations to a `BufferSensorComponent`, you need |
|
to call `BufferSensorComponent.AppendObservation()` in the |
|
`Agent.CollectObservations()` method
|
with a float array of size `Observation Size` as argument. |
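
A hedged sketch of appending one entry per visible enemy, matching an
`Observation Size` of 6. The `m_BufferSensor` reference and the
`m_VisibleEnemies` list are hypothetical:

```csharp
public BufferSensorComponent m_BufferSensor;  // assigned in the Inspector

public override void CollectObservations(VectorSensor sensor)
{
    foreach (var enemy in m_VisibleEnemies)  // hypothetical list of enemies
    {
        var relativePosition = enemy.transform.position - transform.position;
        var velocity = enemy.GetComponent<Rigidbody>().velocity;
        // Each entity is represented by exactly Observation Size (6) floats.
        // Note: these values should be normalized to [-1, 1] manually
        // (see the note below).
        m_BufferSensor.AppendObservation(new float[]
        {
            relativePosition.x, relativePosition.y, relativePosition.z,
            velocity.x, velocity.y, velocity.z
        });
    }
}
```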
|
|
|
__Note__: Currently, the observations put into the `BufferSensor` are |
|
not normalized; you will need to normalize your observations manually
|
between -1 and 1. |
|
|
|
#### Variable Length Observation Summary & Best Practices |
|
- Attach `BufferSensorComponent` to use. |
|
- Call `BufferSensorComponent.AppendObservation()` in the |
|
`Agent.CollectObservations()` method to add the observations
|
of an entity to the `BufferSensor`. |
|
- Normalize the entities' observations before feeding them into the `BufferSensor`.
|
|
|
### Goal Signal |
|
|
|
It is possible for agents to collect observations that will be treated as a "goal signal".
|
A goal signal is used to condition the policy of the agent, meaning that if the goal |
|
changes, the policy (i.e. the mapping from observations to actions) will change |
|
as well. Note that this is true |
|
for any observation since all observations influence the policy of the Agent to |
|
some degree. But by specifying a goal signal explicitly, we can make this conditioning |
|
more important to the agent. This feature can be used in settings where an agent |
|
must learn to solve different tasks that are similar in some respects, because the
agent will learn to reuse what it learns from different tasks to generalize better.
|
In Unity, you can specify that a `VectorSensor` or |
|
a `CameraSensor` is a goal by attaching a `VectorSensorComponent` or a |
|
`CameraSensorComponent` to the Agent and selecting `Goal Signal` as `Observation Type`. |
|
On the trainer side, there are two different ways to condition the policy. This |
|
setting is determined by the |
|
[conditioning_type parameter](Training-Configuration-File.md#common-trainer-configurations). |
|
If set to `hyper` (default) a [HyperNetwork](https://arxiv.org/pdf/1609.09106.pdf) |
|
will be used to generate some of the |
|
weights of the policy using the goal observations as input. Note that using a |
|
HyperNetwork requires a lot of computation; it is recommended to use a smaller
|
number of hidden units in the policy to alleviate this. |
|
If set to `none` the goal signal will be considered as regular observations. |
|
For an example on how to use a goal signal, see the |
|
[GridWorld example](Learning-Environment-Examples.md#gridworld). |
|
|
|
#### Goal Signal Summary & Best Practices |
|
- Attach a `VectorSensorComponent` or `CameraSensorComponent` to an agent and |
|
set the observation type to goal to use the feature. |
|
- Set the conditioning_type parameter in the training configuration. |
|
- Reduce the number of hidden units in the network when using the HyperNetwork |
|
conditioning type. |
|
|
|
## Actions and Actuators |
|
|
|
An action is an instruction from the Policy that the agent carries out. The |
|
action is passed to an `IActionReceiver` (either an `Agent` or an `IActuator`)
|
as the `ActionBuffers` parameter when the Academy invokes the |
|
`IActionReceiver.OnActionReceived()` function.
|
There are two types of actions supported: **Continuous** and **Discrete**. |
|
|
|
Neither the Policy nor the training algorithm know anything about what the |
|
action values themselves mean. The training algorithm simply tries different |
|
values for the action list and observes the effect on the accumulated rewards
|
over time and many training episodes. Thus, the only place actions are defined |
|
for an Agent is in the `OnActionReceived()` function. |
|
|
|
For example, if you designed an agent to move in two dimensions, you could use |
|
either continuous or discrete actions. In the continuous case, you
|
would set the action size to two (one for each dimension), and the |
|
agent's Policy would output an action with two floating point values. In the |
|
discrete case, you would use one Branch with a size of four (one for each |
|
direction), and the Policy would create an action array containing a single |
|
element with a value ranging from zero to three. Alternatively, you could create |
|
two branches of size two (one for horizontal movement and one for vertical |
|
movement), and the Policy would output an action array containing two elements |
|
with values ranging from zero to one. You could alternatively use a combination of continuous |
|
and discrete actions, e.g. using one continuous action for horizontal movement
|
and a discrete branch of size two for the vertical movement. |
|
|
|
Note that when you are programming actions for an agent, it is often helpful to |
|
test your action logic using the `Heuristic()` method of the Agent, which lets |
|
you map keyboard commands to actions. |
|
|
|
### Continuous Actions |
|
|
|
When an Agent's Policy has **Continuous** actions, the |
|
`ActionBuffers.ContinuousActions` passed to the Agent's `OnActionReceived()` function |
|
is an array with length equal to the `Continuous Action Size` property value. The |
|
individual values in the array have whatever meanings you ascribe to them.
|
If you assign an element in the array as the speed of an Agent, for example, the |
|
training process learns to control the speed of the Agent through this |
|
parameter. |
|
|
|
The [3DBall example](Learning-Environment-Examples.md#3dball-3d-balance-ball) uses |
|
continuous actions with two control values. |
|
|
|
![3DBall](images/balance.png) |
|
|
|
These control values are applied as rotation to the cube: |
|
|
|
```csharp |
|
public override void OnActionReceived(ActionBuffers actionBuffers) |
|
{ |
|
var actionZ = 2f * Mathf.Clamp(actionBuffers.ContinuousActions[0], -1f, 1f); |
|
var actionX = 2f * Mathf.Clamp(actionBuffers.ContinuousActions[1], -1f, 1f); |
|
|
|
gameObject.transform.Rotate(new Vector3(0, 0, 1), actionZ); |
|
gameObject.transform.Rotate(new Vector3(1, 0, 0), actionX); |
|
} |
|
``` |
|
|
|
By default, the output from our provided PPO algorithm pre-clamps the values of
|
`ActionBuffers.ContinuousActions` into the [-1, 1] range. It is a best practice to manually clip |
|
these as well, if you plan to use a 3rd party algorithm with your environment. |
|
As shown above, you can scale the control values as needed after clamping them. |
|
|
|
### Discrete Actions |
|
|
|
When an Agent's Policy uses **discrete** actions, the |
|
`ActionBuffers.DiscreteActions` passed to the Agent's `OnActionReceived()` function |
|
is an array of integers with length equal to the number of discrete branches. When defining the discrete actions, `Branches`
is an array of integers in which each value corresponds to the number of possibilities for each branch.
|
|
|
For example, if we wanted an Agent that can move in a plane and jump, we could |
|
define two branches (one for motion and one for jumping) because we want our |
|
agent to be able to move **and** jump concurrently. We define the first branch to
|
have 5 possible actions (don't move, go left, go right, go backward, go forward) |
|
and the second one to have 2 possible actions (don't jump, jump). The |
|
`OnActionReceived()` method would look something like: |
|
|
|
```csharp |
|
// Get the action index for movement |
|
int movement = actionBuffers.DiscreteActions[0]; |
|
// Get the action index for jumping |
|
int jump = actionBuffers.DiscreteActions[1]; |
|
|
|
// Look up the index in the movement action list: |
|
if (movement == 1) { directionX = -1; } |
|
if (movement == 2) { directionX = 1; } |
|
if (movement == 3) { directionZ = -1; } |
|
if (movement == 4) { directionZ = 1; } |
|
// Look up the index in the jump action list: |
|
if (jump == 1 && IsGrounded()) { directionY = 1; } |
|
|
|
// Apply the action results to move the Agent |
|
gameObject.GetComponent<Rigidbody>().AddForce( |
|
new Vector3( |
|
directionX * 40f, directionY * 300f, directionZ * 40f)); |
|
``` |
|
|
|
#### Masking Discrete Actions |
|
|
|
When using Discrete Actions, it is possible to specify that some actions are |
|
impossible for the next decision. When the Agent is controlled by a neural |
|
network, the Agent will be unable to perform the specified action. Note that |
|
when the Agent is controlled by its Heuristic, the Agent will still be able to |
|
decide to perform the masked action. In order to disallow an action, override |
|
the `Agent.WriteDiscreteActionMask()` virtual method, and call |
|
`SetActionEnabled()` on the provided `IDiscreteActionMask`: |
|
|
|
```csharp |
|
public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask) |
|
{ |
|
actionMask.SetActionEnabled(branch, actionIndex, isEnabled); |
|
} |
|
``` |
|
|
|
Where: |
|
|
|
- `branch` is the index (starting at 0) of the branch on which you want to |
|
allow or disallow the action |
|
- `actionIndex` is the index of the action that you want to allow or disallow. |
|
- `isEnabled` is a bool indicating whether the action should be allowed or not.
|
|
|
For example, say you have an Agent with 2 branches, and on the first branch
(branch 0) there are 4 possible actions: _"do nothing"_, _"jump"_, _"shoot"_
and _"change weapon"_. With the code below, the Agent will either _"do
nothing"_ or _"change weapon"_ for its next decision (since action indices 1 and 2
are masked):
|
|
|
```csharp |
|
actionMask.SetActionEnabled(0, 1, false); |
|
actionMask.SetActionEnabled(0, 2, false); |
|
``` |
|
|
|
Notes: |
|
|
|
- You can call `SetActionEnabled` multiple times if you want to put masks on multiple |
|
branches. |
|
- At each step, the state of an action is reset and enabled by default. |
|
- You cannot mask all the actions of a branch. |
|
- You cannot mask actions in continuous control. |
|
|
|
|
|
### IActuator interface and ActuatorComponents |
|
The Actuator API allows users to abstract behavior out of Agents and into
|
components (similar to the ISensor API). The `IActuator` interface and `Agent` |
|
class both implement the `IActionReceiver` interface to allow for backward compatibility |
|
with the current `Agent.OnActionReceived`. |
|
This means you will not have to change your code until you decide to use the `IActuator` API. |
|
|
|
Like the `ISensor` interface, the `IActuator` interface is intended for advanced users. |
|
|
|
The `ActuatorComponent` abstract class is used to create the actual `IActuator` at |
|
runtime. It must be attached to the same `GameObject` as the `Agent`, or to a |
|
child `GameObject`. Actuators and all of their data structures are initialized |
|
during `Agent.Initialize`. This was done to prevent unexpected allocations at runtime.
|
|
|
You can find an example of an `IActuator` implementation in the `Basic` example scene. |
|
**NOTE**: you do not need to adjust the Actions in the Agent's |
|
`Behavior Parameters` when using an `IActuator` and `ActuatorComponents`. |
|
|
|
Internally, `Agent.OnActionReceived` uses an `IActuator` to send actions to the Agent, |
|
although this is mostly abstracted from the user. |
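
As a rough sketch, a minimal `IActuator` that applies one continuous action as
forward thrust to a `Rigidbody` might look like the following. Interface member
names follow recent ML-Agents releases and may differ in other versions; the
class, field names, and force scale are illustrative:

```csharp
using Unity.MLAgents.Actuators;
using UnityEngine;

public class ThrustActuator : IActuator
{
    readonly Rigidbody m_Rigidbody;

    public ThrustActuator(Rigidbody rigidbody)
    {
        m_Rigidbody = rigidbody;
    }

    // One continuous action: forward thrust in [-1, 1].
    public ActionSpec ActionSpec => ActionSpec.MakeContinuous(1);
    public string Name => "ThrustActuator";

    public void OnActionReceived(ActionBuffers actionBuffers)
    {
        var thrust = Mathf.Clamp(actionBuffers.ContinuousActions[0], -1f, 1f);
        m_Rigidbody.AddForce(m_Rigidbody.transform.forward * thrust * 10f);
    }

    public void Heuristic(in ActionBuffers actionBuffersOut)
    {
        var continuousActionsOut = actionBuffersOut.ContinuousActions;
        continuousActionsOut[0] = Input.GetAxis("Vertical");
    }

    public void WriteDiscreteActionMask(IDiscreteActionMask actionMask) { }
    public void ResetData() { }
}
```

An accompanying `ActuatorComponent` attached to the Agent's GameObject would
construct this actuator at runtime.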
|
|
|
|
|
### Actions Summary & Best Practices |
|
|
|
- Agents can use `Discrete` and/or `Continuous` actions. |
|
- Discrete actions can have multiple action branches, and it's possible to mask |
|
certain actions so that they won't be taken. |
|
- In general, fewer actions will make for easier learning. |
|
- Be sure to set the Continuous Action Size and Discrete Branch Size to the desired |
|
number for each type of action, and not greater, as doing the latter can interfere with the |
|
efficiency of the training process. |
|
- Continuous action values should be clipped to an |
|
appropriate range. The provided PPO model automatically clips these values |
|
between -1 and 1, but third party training systems may not do so. |
|
|
|
## Rewards |
|
|
|
In reinforcement learning, the reward is a signal that the agent has done |
|
something right. The PPO reinforcement learning algorithm works by optimizing |
|
the choices an agent makes such that the agent earns the highest cumulative |
|
reward over time. The better your reward mechanism, the better your agent will |
|
learn. |
|
|
|
**Note:** Rewards are not used during inference by an Agent using a trained |
|
model, and are also not used during imitation learning.
|
|
|
Perhaps the best advice is to start simple and only add complexity as needed. In |
|
general, you should reward results rather than actions you think will lead to |
|
the desired results. You can even use the Agent's Heuristic to control the Agent |
|
while watching how it accumulates rewards. |
|
|
|
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` |
|
methods on the agent. The reward assigned between each decision should be in the |
|
range [-1,1]. Values outside this range can lead to unstable training. The |
|
`reward` value is reset to zero when the agent receives a new decision. If there |
|
are multiple calls to `AddReward()` for a single agent decision, the rewards |
|
will be summed together to evaluate how good the previous decision was. Calling
`SetReward()` will override all previous rewards given to an agent since the
|
previous decision. |
|
|
|
### Examples |
|
|
|
You can examine the `OnActionReceived()` functions defined in the |
|
[example environments](Learning-Environment-Examples.md) to see how those |
|
projects allocate rewards. |
|
|
|
The `GridAgent` class in the |
|
[GridWorld example](Learning-Environment-Examples.md#gridworld) uses a very |
|
simple reward system: |
|
|
|
```csharp |
|
Collider[] hitObjects = Physics.OverlapBox(trueAgent.transform.position, |
|
new Vector3(0.3f, 0.3f, 0.3f)); |
|
if (hitObjects.Where(col => col.gameObject.tag == "goal").ToArray().Length == 1) |
|
{ |
|
AddReward(1.0f); |
|
EndEpisode(); |
|
} |
|
else if (hitObjects.Where(col => col.gameObject.tag == "pit").ToArray().Length == 1) |
|
{ |
|
AddReward(-1f); |
|
EndEpisode(); |
|
} |
|
``` |
|
|
|
The agent receives a positive reward when it reaches the goal and a negative |
|
reward when it falls into the pit. Otherwise, it gets no rewards. This is an |
|
example of a _sparse_ reward system. The agent must explore a lot to find the |
|
infrequent reward. |
|
|
|
In contrast, the `AreaAgent` in the |
|
[Area example](Learning-Environment-Examples.md#push-block) gets a small |
|
negative reward every step. In order to get the maximum reward, the agent must |
|
finish its task of reaching the goal square as quickly as possible: |
|
|
|
```csharp |
|
AddReward(-0.005f);
|
MoveAgent(act); |
|
|
|
if (gameObject.transform.position.y < 0.0f || |
|
Mathf.Abs(gameObject.transform.position.x - area.transform.position.x) > 8f || |
|
Mathf.Abs(gameObject.transform.position.z + 5 - area.transform.position.z) > 8) |
|
{ |
|
AddReward(-1f); |
|
EndEpisode(); |
|
} |
|
``` |
|
|
|
The agent also gets a larger negative penalty if it falls off the playing |
|
surface. |
|
|
|
The `Ball3DAgent` in the |
|
[3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) takes a |
|
similar approach, but allocates a small positive reward as long as the agent |
|
balances the ball. The agent can maximize its rewards by keeping the ball on the |
|
platform: |
|
|
|
```csharp |
|
|
|
SetReward(0.1f); |
|
|
|
// When ball falls mark Agent as finished and give a negative penalty |
|
if ((ball.transform.position.y - gameObject.transform.position.y) < -2f || |
|
Mathf.Abs(ball.transform.position.x - gameObject.transform.position.x) > 3f || |
|
Mathf.Abs(ball.transform.position.z - gameObject.transform.position.z) > 3f) |
|
{ |
|
SetReward(-1f); |
|
EndEpisode(); |
|
|
|
} |
|
``` |
|
|
|
The `Ball3DAgent` also assigns a negative penalty when the ball falls off the |
|
platform. |
|
|
|
Note that all of these environments make use of the `EndEpisode()` method, which |
|
manually terminates an episode when a termination condition is reached. This can |
|
be called independently of the `Max Step` property. |
|
|
|
### Rewards Summary & Best Practices |
|
|
|
- Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()` |
|
to overwrite any previous rewards accumulated between decisions.
|
- The magnitude of any given reward should typically not be greater than 1.0 in |
|
order to ensure a more stable learning process. |
|
- Positive rewards are often more helpful to shaping the desired behavior of an |
|
agent than negative rewards. Excessive negative rewards can result in the |
|
agent failing to learn any meaningful behavior. |
|
- For locomotion tasks, a small positive reward (+0.1) for forward velocity is |
|
typically used. |
|
- If you want the agent to finish a task quickly, it is often helpful to provide |
|
a small penalty every step (-0.05) that the agent does not complete the task. |
|
In this case completion of the task should also coincide with the end of the |
|
episode by calling `EndEpisode()` on the agent when it has accomplished its |
|
goal. |
|
|
|
## Agent Properties |
|
|
|
![Agent Inspector](images/3dball_learning_brain.png) |
|
|
|
- `Behavior Parameters` - The parameters dictating what Policy the Agent will |
|
receive. |
|
- `Behavior Name` - The identifier for the behavior. Agents with the same |
|
behavior name will learn the same policy. |
|
- `Vector Observation` |
|
- `Space Size` - Length of vector observation for the Agent. |
|
- `Stacked Vectors` - The number of previous vector observations that will |
|
be stacked and used collectively for decision making. This results in the |
|
effective size of the vector observation being passed to the Policy being: |
|
_Space Size_ x _Stacked Vectors_. |
|
- `Actions` |
|
- `Continuous Actions` - The number of concurrent continuous actions that |
|
the Agent can take. |
|
- `Discrete Branches` - An array of integers, defines multiple concurrent |
|
discrete actions. The values in the `Discrete Branches` array correspond |
|
to the number of possible discrete values for each action branch. |
|
- `Model` - The neural network model used for inference (obtained after |
|
training) |
|
- `Inference Device` - Whether to use CPU or GPU to run the model during |
|
inference |
|
- `Behavior Type` - Determines whether the Agent will do training, inference, |
|
or use its Heuristic() method: |
|
- `Default` - the Agent will train if it is connected to a python trainer;
  otherwise it will perform inference.
|
- `Heuristic Only` - the Agent will always use the `Heuristic()` method. |
|
- `Inference Only` - the Agent will always perform inference. |
|
- `Team ID` - Used to define the team for self-play |
|
- `Use Child Sensors` - Whether to use all Sensor components attached to child |
|
GameObjects of this Agent. |
|
- `Max Step` - The per-agent maximum number of steps. Once this number is |
|
reached, the Agent will be reset. |
|
|
|
## Destroying an Agent |
|
|
|
You can destroy an Agent GameObject during the simulation. Make sure that there |
|
is at least one Agent training at all times by either spawning a new
|
Agent every time one is destroyed or by re-spawning new Agents when the whole |
|
environment resets. |
|
|
|
## Defining Multi-agent Scenarios |
|
|
|
### Teams for Adversarial Scenarios |
|
|
|
Self-play is triggered by including the self-play hyperparameter hierarchy in |
|
the [trainer configuration](Training-ML-Agents.md#training-configurations). To |
|
distinguish opposing agents, set the team ID to different integer values in the |
|
behavior parameters script on the agent prefab. |
|
|
|
<p align="center"> |
|
<img src="images/team_id.png" |
|
alt="Team ID" |
|
width="375" border="10" /> |
|
</p> |
|
|
|
**_Team ID must be a non-negative integer._**
|
|
|
In symmetric games, since all agents (even on opposing teams) will share the |
|
same policy, they should have the same 'Behavior Name' in their Behavior |
|
Parameters Script. In asymmetric games, they should have a different Behavior |
|
Name in their Behavior Parameters script. Note, in asymmetric games, the agents |
|
must have both different Behavior Names _and_ different team IDs! |
|
|
|
For examples of how to use this feature, you can see the trainer configurations |
|
and agent prefabs for our Tennis and Soccer environments. Tennis and Soccer |
|
provide examples of symmetric games. To train an asymmetric game, specify |
|
trainer configurations for each of your behavior names and include the self-play |
|
hyperparameter hierarchy in both. |
|
|
|
### Groups for Cooperative Scenarios |
|
|
|
Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`, |
|
typically in an environment controller or similar script, and adding agents to it |
|
using the `RegisterAgent(Agent agent)` method. Note that all agents added to the same `SimpleMultiAgentGroup` |
|
must have the same behavior name and Behavior Parameters. Using `SimpleMultiAgentGroup` enables the |
|
agents within a group to learn how to work together to achieve a common goal (i.e., |
|
maximize a group-given reward), even if one or more of the group members are removed |
|
before the episode ends. You can then use this group to add/set rewards, end or interrupt episodes |
|
at a group level using the `AddGroupReward()`, `SetGroupReward()`, `EndGroupEpisode()`, and |
|
`GroupEpisodeInterrupted()` methods. For example: |
|
|
|
```csharp |
|
// Create a Multi Agent Group in Start() or Initialize() |
|
m_AgentGroup = new SimpleMultiAgentGroup(); |
|
|
|
// Register agents in group at the beginning of an episode |
|
foreach (var agent in AgentList)
|
{ |
|
m_AgentGroup.RegisterAgent(agent); |
|
} |
|
|
|
// if the team scores a goal |
|
m_AgentGroup.AddGroupReward(rewardForGoal); |
|
|
|
// If the goal is reached and the episode is over |
|
m_AgentGroup.EndGroupEpisode(); |
|
ResetScene(); |
|
|
|
// If time ran out and we need to interrupt the episode |
|
m_AgentGroup.GroupEpisodeInterrupted(); |
|
ResetScene(); |
|
``` |
|
|
|
Multi Agent Groups should be used with the MA-POCA trainer, which is explicitly designed to train |
|
cooperative environments. This can be enabled by using the `poca` trainer - see the |
|
[training configurations](Training-Configuration-File.md) doc for more information on |
|
configuring MA-POCA. When using MA-POCA, agents which are deactivated or removed from the Scene |
|
during the episode will still learn to contribute to the group's long term rewards, even |
|
if they are not active in the scene to experience them. |
|
|
|
See the [Cooperative Push Block](Learning-Environment-Examples.md#cooperative-push-block) environment |
|
for an example of how to use Multi Agent Groups, and the |
|
[Dungeon Escape](Learning-Environment-Examples.md#dungeon-escape) environment for an example of |
|
how the Multi Agent Group can be used with agents that are removed from the scene mid-episode. |
|
|
|
**NOTE**: Groups differ from Teams (for competitive settings) in the following way - Agents |
|
working together should be added to the same Group, while agents playing against each other |
|
should be given different Team Ids. If in the Scene there is one playing field and two teams, |
|
there should be two Groups, one for each team, and each team should be assigned a different |
|
Team Id. If this playing field is duplicated many times in the Scene (e.g. for training |
|
speedup), there should be two Groups _per playing field_, and two unique Team Ids |
|
_for the entire Scene_. In environments with both Groups and Team Ids configured, MA-POCA and |
|
self-play can be used together for training. In the diagram below, there are two agents on each team, |
|
and two playing fields where teams are pitted against each other. All the blue agents should share a Team Id |
|
(and the orange ones a different ID), and there should be four group managers, one per pair of agents. |
|
|
|
<p align="center"> |
|
<img src="images/groupmanager_teamid.png" |
|
alt="Group Manager vs Team Id" |
|
width="650" border="10" /> |
|
</p> |
|
|
|
Please see the [SoccerTwos](Learning-Environment-Examples.md#soccer-twos) environment for an example. |
|
|
|
#### Cooperative Behaviors Notes and Best Practices |
|
* An agent can only be registered to one MultiAgentGroup at a time. If you want to re-assign an |
|
agent from one group to another, you have to unregister it from the current group first. |
|
|
|
* Agents with different behavior names in the same group are not supported. |
|
|
|
* Agents within groups should always set the `Max Step` parameter in the Agent script to 0.
|
Instead, handle Max Steps using the MultiAgentGroup by ending the episode for the entire |
|
Group using `GroupEpisodeInterrupted()`. |
|
|
|
* `EndGroupEpisode` and `GroupEpisodeInterrupted` do the same job in the game, but have
  slightly different effects on training. If the episode is completed, you would want to call
  `EndGroupEpisode`. But if the episode is not over and it has been running for enough steps, i.e.
  reaching max step, you would call `GroupEpisodeInterrupted`.
|
|
|
* If an agent finishes early, e.g. it completes its task or is removed or killed in the game, do not call
|
`EndEpisode()` on the Agent. Instead, disable the agent and re-enable it when the next episode starts, |
|
or destroy the agent entirely. This is because calling `EndEpisode()` will call `OnEpisodeBegin()`, which |
|
will reset the agent immediately. While it is possible to call `EndEpisode()` in this way, it is usually not the |
|
desired behavior when training groups of agents. |
|
|
|
* If an agent that was disabled in a scene needs to be re-enabled, it must be re-registered to the MultiAgentGroup. |
|
|
|
* Group rewards are meant to reinforce agents to act in the group's best interest instead of |
|
individual ones, and are treated differently than individual agent rewards during |
|
training. So calling `AddGroupReward()` is not equivalent to calling `Agent.AddReward()` on each agent
|
in the group. |
|
|
|
* You can still add incremental rewards to agents using `Agent.AddReward()` if they are |
|
in a Group. These rewards will only be given to those agents and are received when the |
|
Agent is active. |
|
|
|
* Environments which use Multi Agent Groups can be trained using PPO or SAC, but agents will |
|
not be able to learn from group rewards after deactivation/removal, nor will they behave as cooperatively. |
|
|
|
## Recording Demonstrations |
|
|
|
In order to record demonstrations from an agent, add the |
|
`Demonstration Recorder` component to a GameObject in the scene which contains |
|
an `Agent` component. Once added, it is possible to name the demonstration that |
|
will be recorded from the agent. |
|
|
|
<p align="center"> |
|
<img src="images/demo_component.png" |
|
alt="Demonstration Recorder" |
|
width="650" border="10" /> |
|
</p> |
|
|
|
When `Record` is checked, a demonstration will be created whenever the scene is |
|
played from the Editor. Depending on the complexity of the task, anywhere from a |
|
few minutes to a few hours of demonstration data may be necessary to be useful
|
for imitation learning. To specify an exact number of steps you want to record |
|
use the `Num Steps To Record` field and the editor will end your play session |
|
automatically once that many steps are recorded. If you set `Num Steps To Record` |
|
to `0` then recording will continue until you manually end the play session. Once |
|
the play session ends a `.demo` file will be created in the `Assets/Demonstrations` |
|
folder (by default). This file contains the demonstrations. Clicking on the file will |
|
provide metadata about the demonstration in the inspector. |
|
|
|
<p align="center"> |
|
<img src="images/demo_inspector.png" |
|
alt="Demonstration Inspector" |
|
width="375" border="10" /> |
|
</p> |
|
|
|
You can then specify the path to this file in your |
|
[training configurations](Training-Configuration-File.md#behavioral-cloning). |
|
|