|
# Unity ML-Agents Gym Wrapper |
|
|
|
A common way in which machine learning researchers interact with simulation |
|
environments is via a wrapper provided by OpenAI called `gym`. For more |
|
information on the gym interface, see [here](https://github.com/openai/gym). |
|
|
|
We provide a gym wrapper and instructions for using it with existing machine |
|
learning algorithms which utilize gym. Our wrapper provides interfaces on top of |
|
our `UnityEnvironment` class, which is the default way of interfacing with a |
|
Unity environment via Python. |
|
|
|
## Installation |
|
|
|
The gym wrapper is part of the `mlagents_envs` package. Please refer to the
|
[mlagents_envs installation instructions](ML-Agents-Envs-README.md). |
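
Once installed, you can confirm that the wrapper is importable. A minimal check, assuming a recent `mlagents_envs` release where the wrapper lives in `mlagents_envs.envs.unity_gym_env`:

```python
# Quick import check after installing mlagents_envs; should not raise ImportError.
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

print(UnityToGymWrapper.__name__)
```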
|
|
|
|
|
## Using the Gym Wrapper |
|
|
|
The gym interface is available from `mlagents_envs.envs.unity_gym_env`. To
launch an environment from the root of the project repository, use:
|
|
|
```python |
|
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper |
|
|
|
env = UnityToGymWrapper(unity_env, uint8_visual, flatten_branched, allow_multiple_obs) |
|
``` |
|
|
|
- `unity_env` refers to the Unity environment to be wrapped. |
|
|
|
- `uint8_visual` refers to whether to output visual observations as `uint8` |
|
values (0-255). Many common Gym environments (e.g. Atari) do this. By default |
|
they will be floats (0.0-1.0). Defaults to `False`. |
|
|
|
- `flatten_branched` will flatten a branched discrete action space into a Gym
`Discrete` space. Otherwise, it will be converted into a `MultiDiscrete` space.
Defaults to `False`.
|
|
|
- `allow_multiple_obs` will return a list of observations. The first elements
contain the visual observations and the last element contains the array of
vector observations. If `False`, the environment returns a single array
(containing the first visual observation, if present, otherwise the vector
observation). Defaults to `False`.
|
|
|
- `action_space_seed` is the optional seed for action sampling. If not `None`,
it will be used to set the random seed on all created `gym.Space` instances.
|
|
|
The returned environment `env` will function as a gym environment.
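
For illustration, here is a minimal interaction loop. This is a sketch, not part of the official examples; the executable path is a placeholder for your own build:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Hypothetical path to a built Unity executable; replace with your own build.
unity_env = UnityEnvironment("./envs/GridWorld")
env = UnityToGymWrapper(unity_env, uint8_visual=True)

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()          # sample a random action from the gym space
    obs, reward, done, info = env.step(action)  # classic 4-tuple gym step
    if done:
        obs = env.reset()

env.close()  # shuts down the underlying Unity environment
```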
|
|
|
## Limitations |
|
|
|
- It is only possible to use an environment with a **single** Agent. |
|
- By default, the first visual observation is provided as the `observation`, if
present. Otherwise, vector observations are provided. You can receive all
visual and vector observations by using the `allow_multiple_obs=True` option in
the gym parameters. If set to `True`, you will receive a list of observations
instead of a single one.
|
- The `TerminalSteps` or `DecisionSteps` output from the environment can still
be accessed from the `info` dictionary returned by `env.step(action)` (see the
sketch after this list).
|
- Stacked vector observations are not supported. |
|
- Environment registration for use with `gym.make()` is currently not supported. |
|
- Calling `env.render()` will not render a new frame of the environment. It will
return the latest visual observation if using visual observations.
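
As an example of the `info` point above, you can inspect the dictionary returned by `env.step()`. This is a sketch; the exact key under which the raw step object is stored may differ between releases, so print the dictionary to see what your version exposes:

```python
# Assumes `env` is a UnityToGymWrapper instance that has already been reset.
obs, reward, done, info = env.step(env.action_space.sample())

# Inspect what the wrapper places in `info`; recent versions include the raw
# DecisionSteps/TerminalSteps object from mlagents_envs.
for key, value in info.items():
    print(key, type(value))
```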
|
|
|
## Running OpenAI Baselines Algorithms |
|
|
|
OpenAI provides a set of open-source, maintained, and tested reinforcement
learning algorithms called [Baselines](https://github.com/openai/baselines).
|
|
|
Using the provided Gym wrapper, it is possible to train ML-Agents environments |
|
using these algorithms. This requires the creation of custom training scripts to |
|
launch each algorithm. In most cases these scripts can be created by making |
|
slight modifications to the ones provided for Atari and Mujoco environments. |
|
|
|
These examples were tested with baselines version 0.1.6. |
|
|
|
### Example - DQN Baseline |
|
|
|
In order to train an agent to play the `GridWorld` environment using the |
|
Baselines DQN algorithm, you first need to install the baselines package using |
|
pip: |
|
|
|
``` |
|
pip install git+https://github.com/openai/baselines
|
``` |
|
|
|
Next, create a file called `train_unity.py`. Then create an `/envs/` directory |
|
and build the environment to that directory. For more information on |
|
building Unity environments, see |
|
[here](../docs/Learning-Environment-Executable.md). Note that because of |
|
limitations of the DQN baseline, the environment must have a single visual |
|
observation, a single discrete action and a single Agent in the scene. |
|
Add the following code to the `train_unity.py` file: |
|
|
|
```python |
|
import gym |
|
|
|
from baselines import deepq |
|
from baselines import logger |
|
|
|
from mlagents_envs.environment import UnityEnvironment |
|
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper |
|
|
|
|
|
def main(): |
|
    unity_env = UnityEnvironment(<path-to-environment>)
|
env = UnityToGymWrapper(unity_env, uint8_visual=True) |
|
logger.configure('./logs') # Change to log in a different directory |
|
act = deepq.learn( |
|
env, |
|
"cnn", # For visual inputs |
|
lr=2.5e-4, |
|
total_timesteps=1000000, |
|
buffer_size=50000, |
|
exploration_fraction=0.05, |
|
exploration_final_eps=0.1, |
|
print_freq=20, |
|
train_freq=5, |
|
learning_starts=20000, |
|
target_network_update_freq=50, |
|
gamma=0.99, |
|
prioritized_replay=False, |
|
checkpoint_freq=1000, |
|
checkpoint_path='./logs', # Change to save model in a different directory |
|
dueling=True |
|
) |
|
print("Saving model to unity_model.pkl") |
|
act.save("unity_model.pkl") |
|
|
|
|
|
if __name__ == '__main__': |
|
main() |
|
``` |
|
|
|
To start the training process, run the following from the directory containing |
|
`train_unity.py`: |
|
|
|
```sh |
|
python -m train_unity |
|
``` |
|
|
|
### Other Algorithms |
|
|
|
Other algorithms in the Baselines repository can be run using scripts similar to |
|
the examples from the baselines package. In most cases, the primary changes |
|
needed to use a Unity environment are to import `UnityToGymWrapper`, and to |
|
replace the environment creation code, typically `gym.make()`, with a call to |
|
`UnityToGymWrapper(unity_environment)` passing the environment as input. |
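
For instance, a script that originally builds its environment with `gym.make()` would change roughly as follows; the executable path below is a placeholder, not a real example environment:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Before (typical Baselines script):
#   env = gym.make("PongNoFrameskip-v4")
# After: wrap a built Unity executable instead (path is a placeholder).
unity_env = UnityEnvironment("./envs/YourEnvironment")
env = UnityToGymWrapper(unity_env, uint8_visual=True)
```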
|
|
|
As a rule of thumb, start from the Atari training scripts for vision-based
environments and from the Mujoco scripts for vector-observation environments.
|
|
|
Some algorithms make use of `make_env()` or `make_mujoco_env()` functions. You
can define a similar function for Unity environments. Here is an example of
such a function, using the PPO2 baseline:
|
|
|
```python |
|
from mlagents_envs.environment import UnityEnvironment |
|
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
|
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv |
|
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv |
|
from baselines.bench import Monitor |
|
from baselines import logger |
|
import baselines.ppo2.ppo2 as ppo2 |
|
|
|
import os |
|
|
|
try: |
|
from mpi4py import MPI |
|
except ImportError: |
|
MPI = None |
|
|
|
|
|
def make_unity_env(env_directory, num_env, visual, start_index=0): |
|
""" |
|
Create a wrapped, monitored Unity environment. |
|
""" |
|
|
|
def make_env(rank, use_visual=True): # pylint: disable=C0111 |
|
def _thunk(): |
|
unity_env = UnityEnvironment(env_directory, base_port=5000 + rank) |
|
env = UnityToGymWrapper(unity_env, uint8_visual=True) |
|
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank))) |
|
return env |
|
|
|
return _thunk |
|
|
|
if visual: |
|
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)]) |
|
else: |
|
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0 |
|
return DummyVecEnv([make_env(rank, use_visual=False)]) |
|
|
|
|
|
def main(): |
|
    env = make_unity_env(<path-to-environment>, 4, True)
|
ppo2.learn( |
|
network="mlp", |
|
env=env, |
|
total_timesteps=100000, |
|
lr=1e-3, |
|
) |
|
|
|
|
|
if __name__ == '__main__': |
|
main() |
|
``` |
|
|
|
## Running Google Dopamine Algorithms
|
|
|
Google provides the [Dopamine](https://github.com/google/dopamine) framework,
along with implementations of algorithms such as DQN, Rainbow, and the C51
variant of Rainbow. Using the Gym wrapper, we can run Unity environments using
Dopamine.
|
|
|
First, after installing the Gym wrapper, clone the Dopamine repository. |
|
|
|
``` |
|
git clone https://github.com/google/dopamine |
|
``` |
|
|
|
Then, follow the appropriate install instructions as specified on
[Dopamine's homepage](https://github.com/google/dopamine). Note that the
Dopamine guide specifies using a virtualenv. If you choose to do so, make sure
the `mlagents_envs` package is also installed within the same virtualenv as
Dopamine.
|
|
|
### Adapting Dopamine's Scripts |
|
|
|
First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire |
|
`atari` folder, and name it something else (e.g. `unity`). If you choose the |
|
copy approach, be sure to change the package names in the import statements in |
|
`train.py` to your new directory. |
|
|
|
Within `run_experiment.py`, we will need to make changes to which environment is |
|
instantiated, just as in the Baselines example. At the top of the file, insert |
|
|
|
```python |
|
from mlagents_envs.environment import UnityEnvironment |
|
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
|
``` |
|
|
|
to import the Gym Wrapper. Navigate to the `create_atari_environment` method in |
|
the same file, and switch to instantiating a Unity environment by replacing the |
|
method with the following code. |
|
|
|
```python |
|
game_version = 'v0' if sticky_actions else 'v4' |
|
full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version) |
|
unity_env = UnityEnvironment(<path-to-environment>) |
|
env = UnityToGymWrapper(unity_env, uint8_visual=True) |
|
return env |
|
``` |
|
|
|
`<path-to-environment>` is the path to your built Unity executable. For more |
|
information on building Unity environments, see |
|
[here](../docs/Learning-Environment-Executable.md), and note the Limitations |
|
section below. |
|
|
|
Note that we are not using the preprocessor from Dopamine, as it uses many |
|
Atari-specific calls. Furthermore, frame-skipping can be done from within Unity, |
|
rather than on the Python side. |
|
|
|
### Limitations |
|
|
|
Since Dopamine is designed around variants of DQN, it is only compatible with |
|
discrete action spaces, and specifically the Discrete Gym space. For |
|
environments that use branched discrete action spaces, you can enable the |
|
`flatten_branched` parameter in `UnityToGymWrapper`, which treats each |
|
combination of branched actions as separate actions. |
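
Conceptually, flattening maps every combination of branch values to a single discrete index. The sketch below illustrates the idea (it is not the wrapper's internal code) for a hypothetical action space with branch sizes 2 and 3:

```python
import itertools

branch_sizes = (2, 3)  # hypothetical branched action space with two branches

# Every combination of branch values becomes one flattened action index.
combinations = list(itertools.product(*(range(size) for size in branch_sizes)))
print(len(combinations))  # 6 -> exposed to Dopamine as a Discrete(6) space
print(combinations[4])    # flattened index 4 corresponds to branch values (1, 1)
```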
|
|
|
Furthermore, when building your environments, ensure that your Agent is using
visual observations with greyscale enabled, and that the dimensions of the
visual observations are 84 by 84 (matching the parameters found in
`dqn_agent.py` and `rainbow_agent.py`). Dopamine's agents currently do not
automatically adapt to the observation dimensions or number of channels.
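
A quick way to confirm your build matches these expectations is to inspect the wrapped spaces before launching a long run. A minimal sketch, assuming a single greyscale visual observation:

```python
# Assumes `env` is a UnityToGymWrapper around your built executable.
print(env.observation_space.shape)  # expect (84, 84, 1) for a single 84x84 greyscale camera
print(env.action_space)             # should be a Discrete space for Dopamine's agents
```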
|
|
|
### Hyperparameters |
|
|
|
The hyperparameters provided by Dopamine are tailored to the Atari games, and |
|
you will likely need to adjust them for ML-Agents environments. Here is a sample |
|
`dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with |
|
a simple GridWorld. |
|
|
|
```python |
|
import dopamine.agents.rainbow.rainbow_agent |
|
import dopamine.unity.run_experiment |
|
import dopamine.replay_memory.prioritized_replay_buffer |
|
import gin.tf.external_configurables |
|
|
|
RainbowAgent.num_atoms = 51 |
|
RainbowAgent.stack_size = 1 |
|
RainbowAgent.vmax = 10. |
|
RainbowAgent.gamma = 0.99 |
|
RainbowAgent.update_horizon = 3 |
|
RainbowAgent.min_replay_history = 20000 # agent steps |
|
RainbowAgent.update_period = 5 |
|
RainbowAgent.target_update_period = 50 # agent steps |
|
RainbowAgent.epsilon_train = 0.1 |
|
RainbowAgent.epsilon_eval = 0.01 |
|
RainbowAgent.epsilon_decay_period = 50000 # agent steps |
|
RainbowAgent.replay_scheme = 'prioritized' |
|
RainbowAgent.tf_device = '/cpu:0' # use '/cpu:*' for non-GPU version |
|
RainbowAgent.optimizer = @tf.train.AdamOptimizer() |
|
|
|
tf.train.AdamOptimizer.learning_rate = 0.00025 |
|
tf.train.AdamOptimizer.epsilon = 0.0003125 |
|
|
|
Runner.game_name = "Unity" # any name can be used here |
|
Runner.sticky_actions = False |
|
Runner.num_iterations = 200 |
|
Runner.training_steps = 10000 # agent steps |
|
Runner.evaluation_steps = 500 # agent steps |
|
Runner.max_steps_per_episode = 27000 # agent steps |
|
|
|
WrappedPrioritizedReplayBuffer.replay_capacity = 1000000 |
|
WrappedPrioritizedReplayBuffer.batch_size = 32 |
|
``` |
|
|
|
This example assumes you copied `atari` to a separate folder named `unity`.
Replace `unity` in `import dopamine.unity.run_experiment` with the folder you
copied your `run_experiment.py` and `trainer.py` files to. If you directly
modified the existing files, use `atari` here.
|
|
|
### Starting a Run |
|
|
|
You can now run Dopamine as you would normally: |
|
|
|
``` |
|
python -um dopamine.unity.train \ |
|
--agent_name=rainbow \ |
|
--base_dir=/tmp/dopamine \ |
|
--gin_files='dopamine/agents/rainbow/configs/rainbow.gin' |
|
``` |
|
|
|
Again, we assume that you've copied `atari` into a separate folder. Remember to |
|
replace `unity` with the directory you copied your files into. If you edited the |
|
Atari files directly, this should be `atari`. |
|
|
|
### Example: GridWorld |
|
|
|
As a baseline, here are rewards over time for the three algorithms provided with
Dopamine, as run on the GridWorld example environment. All Dopamine (DQN,
Rainbow, C51) runs were done with the same epsilon, epsilon decay, replay
history, training steps, and buffer settings as specified above. Note that the
first 20000 steps are used to pre-fill the training buffer, during which no
learning happens.
|
|
|
We provide results from our PPO implementation and the DQN from Baselines for
reference. Note that all runs used the same greyscale GridWorld as Dopamine. For
PPO, `num_layers` was set to 2, and all other hyperparameters are the defaults
for GridWorld in `config/ppo/GridWorld.yaml`. For Baselines DQN, the
hyperparameters provided in the previous section were used. Note that Baselines
implements certain features (e.g. dueling-Q) that are not enabled in Dopamine
DQN.
|
|
|
|
|
![Dopamine on GridWorld](images/dopamine_gridworld_plot.png) |
|
|
|
|