# Unity ML-Agents Gym Wrapper

A common way in which machine learning researchers interact with simulation
environments is via a wrapper provided by OpenAI called `gym`. For more
information on the gym interface, see [here](https://github.com/openai/gym).

We provide a gym wrapper and instructions for using it with existing machine
learning algorithms which utilize gym. Our wrapper provides interfaces on top of
our `UnityEnvironment` class, which is the default way of interfacing with a
Unity environment via Python.
## Installation

The gym wrapper is part of the `mlagents_envs` package. Please refer to the
[mlagents_envs installation instructions](ML-Agents-Envs-README.md).
## Using the Gym Wrapper

The gym interface is available from `mlagents_envs.envs.unity_gym_env`. To
launch an environment from the root of the project repository use:

```python
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

env = UnityToGymWrapper(unity_env, uint8_visual, flatten_branched, allow_multiple_obs)
```
- `unity_env` refers to the Unity environment to be wrapped.
- `uint8_visual` refers to whether to output visual observations as `uint8`
  values (0-255). Many common Gym environments (e.g. Atari) do this. By default
  they will be floats (0.0-1.0). Defaults to `False`.
- `flatten_branched` will flatten a branched discrete action space into a Gym
  `Discrete`. Otherwise, it will be converted into a `MultiDiscrete`. Defaults to
  `False`.
- `allow_multiple_obs` will return a list of observations. The first elements
  contain the visual observations and the last element contains the array of
  vector observations. If `False`, the environment returns a single array
  (containing the first visual observation, if present, otherwise the vector
  observation). Defaults to `False`.
- `action_space_seed` is the optional seed for action sampling. If not `None`, it
  will be used to set the random seed on any created `gym.Space` instances.

The returned environment `env` will function as a gym environment.
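
For example, a wrapped environment can be driven with the standard gym loop. The
following is a minimal sketch, assuming a built executable with visual
observations at a placeholder path and keeping the single-observation default:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Placeholder path to a built Unity executable.
unity_env = UnityEnvironment("<path-to-environment>")
env = UnityToGymWrapper(unity_env, uint8_visual=True)

obs = env.reset()
for _ in range(100):
    # Sample a random action from the wrapper's gym action space.
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```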
## Limitations

- It is only possible to use an environment with a **single** Agent.
- By default, the first visual observation is provided as the `observation`, if
  present. Otherwise, vector observations are provided. You can receive all
  visual and vector observations by using the `allow_multiple_obs=True` option in
  the gym parameters. If set to `True`, you will receive a list of observations
  instead of a single one.
- The `TerminalSteps` or `DecisionSteps` output from the environment can still
  be accessed from the `info` provided by `env.step(action)`, as sketched below.
- Stacked vector observations are not supported.
- Environment registration for use with `gym.make()` is currently not supported.
- Calling `env.render()` will not render a new frame of the environment. It will
  return the latest visual observation if using visual observations.
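
As a rough sketch of the `info` access mentioned above (the exact key under
which the `DecisionSteps`/`TerminalSteps` object is stored is an assumption
here; inspect `info` in your ML-Agents version to confirm):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

env = UnityToGymWrapper(UnityEnvironment("<path-to-environment>"))  # placeholder path
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
# Assumption: the wrapper exposes the underlying DecisionSteps/TerminalSteps in
# the info dict (commonly under a "step" key); print(info.keys()) to confirm.
steps = info.get("step")
env.close()
```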
## Running OpenAI Baselines Algorithms

OpenAI provides a set of open-source, maintained, and tested Reinforcement
Learning algorithms called [Baselines](https://github.com/openai/baselines).

Using the provided Gym wrapper, it is possible to train ML-Agents environments
using these algorithms. This requires the creation of custom training scripts to
launch each algorithm. In most cases these scripts can be created by making
slight modifications to the ones provided for Atari and Mujoco environments.

These examples were tested with Baselines version 0.1.6.
### Example - DQN Baseline

In order to train an agent to play the `GridWorld` environment using the
Baselines DQN algorithm, you first need to install the baselines package using
pip:

```
pip install git+https://github.com/openai/baselines
```

Next, create a file called `train_unity.py`. Then create an `/envs/` directory
and build the environment to that directory. For more information on
building Unity environments, see
[here](../docs/Learning-Environment-Executable.md). Note that because of
limitations of the DQN baseline, the environment must have a single visual
observation, a single discrete action and a single Agent in the scene.

Add the following code to the `train_unity.py` file:
```python
import gym

from baselines import deepq
from baselines import logger

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper


def main():
    unity_env = UnityEnvironment(<path-to-environment>)
    env = UnityToGymWrapper(unity_env, uint8_visual=True)
    logger.configure('./logs')  # Change to log in a different directory
    act = deepq.learn(
        env,
        "cnn",  # For visual inputs
        lr=2.5e-4,
        total_timesteps=1000000,
        buffer_size=50000,
        exploration_fraction=0.05,
        exploration_final_eps=0.1,
        print_freq=20,
        train_freq=5,
        learning_starts=20000,
        target_network_update_freq=50,
        gamma=0.99,
        prioritized_replay=False,
        checkpoint_freq=1000,
        checkpoint_path='./logs',  # Change to save model in a different directory
        dueling=True
    )
    print("Saving model to unity_model.pkl")
    act.save("unity_model.pkl")


if __name__ == '__main__':
    main()
```
To start the training process, run the following from the directory containing
`train_unity.py`:

```sh
python -m train_unity
```
### Other Algorithms

Other algorithms in the Baselines repository can be run using scripts similar to
the examples from the baselines package. In most cases, the primary changes
needed to use a Unity environment are to import `UnityToGymWrapper`, and to
replace the environment creation code, typically `gym.make()`, with a call to
`UnityToGymWrapper(unity_environment)`, passing the environment as input.

A typical rule of thumb is that for vision-based environments, modifications
should be made to Atari training scripts, and for vector observation
environments, modifications should be made to Mujoco scripts.
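
For instance, in a Mujoco-style training script the only environment-related
change is usually the construction call. A minimal sketch (the executable path
is a placeholder):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Before (typical Baselines script): env = gym.make("Hopper-v2")
# After: wrap a built Unity executable instead of calling gym.make().
unity_env = UnityEnvironment(file_name="<path-to-environment>")  # placeholder path
env = UnityToGymWrapper(unity_env)  # vector observations remain float arrays
```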
Some algorithms will make use of `make_env()` or `make_mujoco_env()` functions.
You can define a similar function for Unity environments. An example of such a
method using the PPO2 baseline:
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.bench import Monitor
from baselines import logger
import baselines.ppo2.ppo2 as ppo2

import os

try:
    from mpi4py import MPI
except ImportError:
    MPI = None


def make_unity_env(env_directory, num_env, visual, start_index=0):
    """
    Create a wrapped, monitored Unity environment.
    """
    def make_env(rank, use_visual=True):  # pylint: disable=C0111
        def _thunk():
            # Each worker gets its own port so parallel Unity instances do not collide.
            unity_env = UnityEnvironment(env_directory, base_port=5000 + rank)
            env = UnityToGymWrapper(unity_env, uint8_visual=True)
            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
            return env
        return _thunk
    if visual:
        return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
    else:
        rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
        return DummyVecEnv([make_env(rank, use_visual=False)])


def main():
    env = make_unity_env(<path-to-environment>, 4, True)
    ppo2.learn(
        network="mlp",
        env=env,
        total_timesteps=100000,
        lr=1e-3,
    )


if __name__ == '__main__':
    main()
```
## Run Google Dopamine Algorithms

Google provides a framework [Dopamine](https://github.com/google/dopamine), and
implementations of algorithms, e.g. DQN, Rainbow, and the C51 variant of
Rainbow. Using the Gym wrapper, we can run Unity environments using Dopamine.

First, after installing the Gym wrapper, clone the Dopamine repository.

```
git clone https://github.com/google/dopamine
```

Then, follow the appropriate install instructions as specified on
[Dopamine's homepage](https://github.com/google/dopamine). Note that the
Dopamine guide specifies using a virtualenv. If you choose to do so, make sure
the `mlagents_envs` package is also installed within the same virtualenv as
Dopamine.
### Adapting Dopamine's Scripts

First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire
`atari` folder, and name it something else (e.g. `unity`). If you choose the
copy approach, be sure to change the package names in the import statements in
`train.py` to your new directory.

Within `run_experiment.py`, we will need to change which environment is
instantiated, just as in the Baselines example. At the top of the file, insert
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
```
to import the Gym Wrapper. Navigate to the `create_atari_environment` method in
the same file, and switch to instantiating a Unity environment by replacing the
body of the method with the following code.
```python
game_version = 'v0' if sticky_actions else 'v4'
full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version)
unity_env = UnityEnvironment(<path-to-environment>)
env = UnityToGymWrapper(unity_env, uint8_visual=True)
return env
```
`<path-to-environment>` is the path to your built Unity executable. For more
information on building Unity environments, see
[here](../docs/Learning-Environment-Executable.md), and note the Limitations
section below.
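
Putting this together, the replaced method might look roughly like the sketch
below. The `create_atari_environment` signature is an assumption based on older
Dopamine releases, the unused Atari naming lines are dropped, and the path is a
placeholder; adapt it to the version you cloned:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper


def create_atari_environment(game_name=None, sticky_actions=True):
    # game_name and sticky_actions are ignored; the Unity executable defines the task.
    unity_env = UnityEnvironment("<path-to-environment>")  # placeholder path
    return UnityToGymWrapper(unity_env, uint8_visual=True)
```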
Note that we are not using the preprocessor from Dopamine, as it uses many
Atari-specific calls. Furthermore, frame-skipping can be done from within Unity,
rather than on the Python side.
### Limitations

Since Dopamine is designed around variants of DQN, it is only compatible with
discrete action spaces, and specifically the Discrete Gym space. For
environments that use branched discrete action spaces, you can enable the
`flatten_branched` parameter in `UnityToGymWrapper`, which treats each
combination of branched actions as a separate action.
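
For example, an agent with two discrete branches of sizes 3 and 2 would normally
be exposed as `MultiDiscrete([3, 2])`; with `flatten_branched=True` the wrapper
instead exposes `Discrete(6)`, one action per combination of the two branches. A
minimal sketch (placeholder executable path):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

unity_env = UnityEnvironment("<path-to-environment>")  # placeholder path
env = UnityToGymWrapper(unity_env, uint8_visual=True, flatten_branched=True)
print(env.action_space)  # e.g. Discrete(6) for branches of sizes 3 and 2
```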
Furthermore, when building your environments, ensure that your Agent is using
visual observations with greyscale enabled, and that the dimensions of the
visual observations are 84 by 84 (matching the parameters found in `dqn_agent.py`
and `rainbow_agent.py`). Dopamine's agents currently do not automatically adapt
to the observation dimensions or number of channels.
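
As a quick sanity check before handing the environment to Dopamine, you can
print the shape the wrapper reports (a sketch; the expected shape assumes an
84 by 84 greyscale visual observation and a placeholder executable path):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

env = UnityToGymWrapper(UnityEnvironment("<path-to-environment>"), uint8_visual=True)
print(env.observation_space.shape)  # expect (84, 84, 1) for an 84x84 greyscale observation
env.close()
```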
### Hyperparameters

The hyperparameters provided by Dopamine are tailored to the Atari games, and
you will likely need to adjust them for ML-Agents environments. Here is a sample
`dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with
a simple GridWorld.
```python
import dopamine.agents.rainbow.rainbow_agent
import dopamine.unity.run_experiment
import dopamine.replay_memory.prioritized_replay_buffer
import gin.tf.external_configurables

RainbowAgent.num_atoms = 51
RainbowAgent.stack_size = 1
RainbowAgent.vmax = 10.
RainbowAgent.gamma = 0.99
RainbowAgent.update_horizon = 3
RainbowAgent.min_replay_history = 20000  # agent steps
RainbowAgent.update_period = 5
RainbowAgent.target_update_period = 50  # agent steps
RainbowAgent.epsilon_train = 0.1
RainbowAgent.epsilon_eval = 0.01
RainbowAgent.epsilon_decay_period = 50000  # agent steps
RainbowAgent.replay_scheme = 'prioritized'
RainbowAgent.tf_device = '/gpu:0'  # use '/cpu:*' for non-GPU version
RainbowAgent.optimizer = @tf.train.AdamOptimizer()

tf.train.AdamOptimizer.learning_rate = 0.00025
tf.train.AdamOptimizer.epsilon = 0.0003125

Runner.game_name = "Unity"  # any name can be used here
Runner.sticky_actions = False
Runner.num_iterations = 200
Runner.training_steps = 10000  # agent steps
Runner.evaluation_steps = 500  # agent steps
Runner.max_steps_per_episode = 27000  # agent steps

WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
WrappedPrioritizedReplayBuffer.batch_size = 32
```
This example assumes you copied `atari` to a separate folder named `unity`.
Replace `unity` in `import dopamine.unity.run_experiment` with the folder you
copied your `run_experiment.py` and `train.py` files to. If you directly
modified the existing files, then use `atari` here.
### Starting a Run

You can now run Dopamine as you would normally:

```
python -um dopamine.unity.train \
  --agent_name=rainbow \
  --base_dir=/tmp/dopamine \
  --gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
```
Again, we assume that you've copied `atari` into a separate folder. Remember to
replace `unity` with the directory you copied your files into. If you edited the
Atari files directly, this should be `atari`.
### Example: GridWorld

As a baseline, here are rewards over time for the three algorithms provided with
Dopamine as run on the GridWorld example environment. All Dopamine (DQN,
Rainbow, C51) runs were done with the same epsilon, epsilon decay, replay
history, training steps, and buffer settings as specified above. Note that the
first 20000 steps are used to pre-fill the training buffer, and no learning
happens during that period.
We provide results from our PPO implementation and the DQN from Baselines as
reference. Note that all runs used the same greyscale GridWorld as Dopamine. For
PPO, `num_layers` was set to 2, and all other hyperparameters are the default
for GridWorld in `config/ppo/GridWorld.yaml`. For Baselines DQN, the provided
hyperparameters in the previous section are used. Note that Baselines implements
certain features (e.g. dueling-Q) that are not enabled in Dopamine DQN.

![Dopamine on GridWorld](images/dopamine_gridworld_plot.png)