|
# Training on Amazon Web Services |
|
|
|
:warning: **Note:** We no longer use this guide ourselves and so it may not work |
|
correctly. We've decided to keep it up just in case it is helpful to you. |
|
|
|
This page contains instructions for setting up an EC2 instance on Amazon Web |
|
Services for training ML-Agents environments. |
|
|
|
## Pre-configured AMI |
|
|
|
We've prepared a pre-configured AMI for you with the ID: `ami-016ff5559334f8619` |
|
in the `us-east-1` region. It was created as a modification of |
|
[Deep Learning AMI (Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C). |
|
The AMI has been tested with a p2.xlarge instance. If you want to |
|
train without headless mode, you need to enable X Server. |
|
|
|
After launching your EC2 instance from the AMI and connecting to it over SSH, run |
|
the following commands to enable X Server: |
|
|
|
```sh |
|
# Start the X Server, press Enter to come back to the command line |
|
$ sudo /usr/bin/X :0 & |
|
|
|
# Check if Xorg process is running |
|
# You will have a list of processes running on the GPU, Xorg should be in the |
|
# list, as shown below |
|
$ nvidia-smi |
|
|
|
# Thu Jun 14 20:27:26 2018 |
|
# +-----------------------------------------------------------------------------+ |
|
# | NVIDIA-SMI 390.67 Driver Version: 390.67 | |
|
# |-------------------------------+----------------------+----------------------+ |
|
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | |
|
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |
|
# |===============================+======================+======================| |
|
# | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 | |
|
# | N/A 35C P8 31W / 149W | 9MiB / 11441MiB | 0% Default | |
|
# +-------------------------------+----------------------+----------------------+ |
|
# |
|
# +-----------------------------------------------------------------------------+ |
|
# | Processes: GPU Memory | |
|
# | GPU PID Type Process name Usage | |
|
# |=============================================================================| |
|
# | 0 2331 G /usr/lib/xorg/Xorg 8MiB | |
|
# +-----------------------------------------------------------------------------+ |
|
|
|
# Make Ubuntu use the X Server for display |
|
$ export DISPLAY=:0 |
|
``` |
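
The manual check above (looking for Xorg in the `nvidia-smi` process list) can also be scripted. The following is a minimal sketch, not part of the ML-Agents Toolkit: the function names `xorg_is_listed` and `check_x_server` are hypothetical, and the parsing assumes the report layout shown in the sample output above, where process entries follow a `Processes:` header.

```python
import subprocess


def xorg_is_listed(report: str) -> bool:
    """Return True if an Xorg entry appears after the 'Processes:' header.

    Assumption: the report follows the layout of the sample nvidia-smi
    output in this guide.
    """
    in_processes = False
    for line in report.splitlines():
        if "Processes:" in line:
            in_processes = True
        elif in_processes and "Xorg" in line:
            return True
    return False


def check_x_server() -> bool:
    """Run nvidia-smi and report whether Xorg is using the GPU."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # nvidia-smi is not installed or not on the PATH.
        return False
    return result.returncode == 0 and xorg_is_listed(result.stdout)


if __name__ == "__main__":
    print("Xorg running on GPU:", check_x_server())
```

If the script reports `False` right after starting the X Server, wait a moment and run it again, since Xorg may take a few seconds to appear in the list.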
|
|
|
## Configuring your own instance |
|
|
|
You can also configure your own instance. To begin with, you will |
|
need an EC2 instance which contains the latest NVIDIA drivers, CUDA 9, and cuDNN. |
|
In this tutorial we used the |
|
[Deep Learning AMI (Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C) |
|
listed under AWS Marketplace with a p2.xlarge instance. |
|
|
|
### Installing the ML-Agents Toolkit on the instance |
|
|
|
After launching your EC2 instance from the AMI and connecting to it over SSH: |
|
|
|
1. Activate the python3 environment |
|
|
|
```sh |
|
source activate python3 |
|
``` |
|
|
|
2. Clone the ML-Agents repo and install the required Python packages |
|
|
|
```sh |
|
git clone --branch release_20 https://github.com/Unity-Technologies/ml-agents.git |
|
cd ml-agents/ml-agents/ |
|
pip3 install -e . |
|
``` |
|
|
|
### Setting up X Server (optional) |
|
|
|
X Server setup is only necessary if you want to do training that requires visual |
|
observation input. _Instructions here are adapted from this |
|
[Medium post](https://medium.com/towards-data-science/how-to-run-unity-on-amazon-cloud-or-without-monitor-3c10ce022639) |
|
on running general Unity applications in the cloud._ |
|
|
|
Current limitations of the Unity Engine require that a screen be available to |
|
render to when using visual observations. In order to make this possible when |
|
training on a remote server, a virtual screen is required. We can do this by |
|
installing Xorg and creating a virtual screen. Once installed and created, we |
|
can display the Unity environment in the virtual environment, and train as we |
|
would on a local machine. Ensure that `headless` mode is disabled when building |
|
Linux executables which use visual observations. |
|
|
|
#### Install and setup Xorg: |
|
|
|
```sh |
|
# Install Xorg |
|
$ sudo apt-get update |
|
$ sudo apt-get install -y xserver-xorg mesa-utils |
|
$ sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024 |
|
|
|
# Get the BusID information |
|
$ nvidia-xconfig --query-gpu-info |
|
|
|
# Add the BusID information to your /etc/X11/xorg.conf file |
|
$ sudo sed -i 's/ BoardName "Tesla K80"/ BoardName "Tesla K80"\n BusID "0:30:0"/g' /etc/X11/xorg.conf |
|
|
|
# Remove the "Files" section from the /etc/X11/xorg.conf file, |
|
# including the two lines that contain Section "Files" and EndSection |
|
$ sudo vim /etc/X11/xorg.conf |
|
``` |
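
The `BusID` value in the `sed` command (`0:30:0`) is written in decimal, whereas `nvidia-smi` prints the bus ID in hexadecimal (`00000000:00:1E.0` in the sample output in this guide). If your instance reports a different bus ID, the sketch below shows one way to convert it. The helper name `pci_bus_id_to_xorg` is hypothetical, and the `PCI:bus:device:function` output form is an assumption based on the xorg.conf `BusID` convention; check it against what `nvidia-xconfig --query-gpu-info` prints on your instance.

```python
def pci_bus_id_to_xorg(bus_id: str) -> str:
    """Convert a hexadecimal 'domain:bus:device.function' ID to decimal.

    Example input (as printed by nvidia-smi): '00000000:00:1E.0'.
    The PCI domain is dropped; bus, device, and function are converted
    from hexadecimal to decimal.
    """
    parts = bus_id.strip().split(":")
    if len(parts) < 2 or "." not in parts[-1]:
        raise ValueError(f"unexpected bus ID format: {bus_id!r}")
    bus_hex = parts[-2]
    device_hex, function_hex = parts[-1].split(".", 1)
    bus = int(bus_hex, 16)
    device = int(device_hex, 16)
    function = int(function_hex, 16)
    return f"PCI:{bus}:{device}:{function}"


if __name__ == "__main__":
    # Bus ID taken from the sample nvidia-smi output in this guide.
    print(pci_bus_id_to_xorg("00000000:00:1E.0"))
```

For the bus ID in the sample output, device `1E` in hexadecimal is `30` in decimal, which matches the `0:30:0` used in the `sed` command.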
|
|
|
#### Update and setup Nvidia driver: |
|
|
|
```sh |
|
# Download and install the latest Nvidia driver for ubuntu |
|
# Please refer to http://download.nvidia.com/XFree86/Linux-x86_64/latest.txt |
|
$ wget http://download.nvidia.com/XFree86/Linux-x86_64/390.87/NVIDIA-Linux-x86_64-390.87.run |
|
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-390.87.run --accept-license --no-questions --ui=none |
|
|
|
# Disable Nouveau as it will clash with the Nvidia driver |
|
$ echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/blacklist.conf |
|
$ echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/blacklist.conf |
|
$ echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/nouveau-kms.conf |
|
$ sudo update-initramfs -u |
|
``` |
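
Before rebooting, it can be worth confirming that the Nouveau blacklist entries were actually written. The following is a small sketch under stated assumptions: the function names are hypothetical, and it only inspects the text of the files under `/etc/modprobe.d`; it does not check which kernel modules are currently loaded.

```python
from pathlib import Path


def nouveau_is_blacklisted(config_text: str) -> bool:
    """Return True if the text contains an active 'blacklist nouveau' line."""
    for raw_line in config_text.splitlines():
        # Ignore anything after a '#' so commented-out entries do not count.
        tokens = raw_line.split("#", 1)[0].split()
        if tokens == ["blacklist", "nouveau"]:
            return True
    return False


def any_config_blacklists_nouveau(directory: str = "/etc/modprobe.d") -> bool:
    """Check every *.conf file in the given directory."""
    for path in sorted(Path(directory).glob("*.conf")):
        try:
            if nouveau_is_blacklisted(path.read_text()):
                return True
        except OSError:
            # Skip files that cannot be read.
            continue
    return False


if __name__ == "__main__":
    print("nouveau blacklisted:", any_config_blacklists_nouveau())
```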
|
|
|
#### Restart the EC2 instance: |
|
|
|
```sh |
|
sudo reboot now |
|
``` |
|
|
|
#### Make sure there are no Xorg processes running: |
|
|
|
```sh |
|
# Kill any possible running Xorg processes |
|
# Note that you might have to run this command multiple times depending on |
|
# how Xorg is configured. |
|
$ sudo killall Xorg |
|
|
|
# Check if there is any Xorg process left |
|
# You will have a list of processes running on the GPU, Xorg should not be in |
|
# the list, as shown below. |
|
$ nvidia-smi |
|
|
|
# Thu Jun 14 20:21:11 2018 |
|
# +-----------------------------------------------------------------------------+ |
|
# | NVIDIA-SMI 390.67 Driver Version: 390.67 | |
|
# |-------------------------------+----------------------+----------------------+ |
|
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | |
|
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |
|
# |===============================+======================+======================| |
|
# | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 | |
|
# | N/A 37C P8 31W / 149W | 0MiB / 11441MiB | 0% Default | |
|
# +-------------------------------+----------------------+----------------------+ |
|
# |
|
# +-----------------------------------------------------------------------------+ |
|
# | Processes: GPU Memory | |
|
# | GPU PID Type Process name Usage | |
|
# |=============================================================================| |
|
# | No running processes found | |
|
# +-----------------------------------------------------------------------------+ |
|
|
|
``` |
|
|
|
#### Start X Server and use it for display: |
|
|
|
```console |
|
# Start the X Server, press Enter to come back to the command line |
|
$ sudo /usr/bin/X :0 & |
|
|
|
# Check if Xorg process is running |
|
# You will have a list of processes running on the GPU, Xorg should be in the list. |
|
$ nvidia-smi |
|
|
|
# Make Ubuntu use the X Server for display |
|
$ export DISPLAY=:0 |
|
``` |
|
|
|
#### Ensure Xorg is correctly configured: |
|
|
|
```sh |
|
# For more information on glxgears, see ftp://www.x.org/pub/X11R6.8.1/doc/glxgears.1.html. |
|
$ glxgears |
|
# If Xorg is configured correctly, you should see the following message |
|
|
|
# Running synchronized to the vertical refresh. The framerate should be |
|
# approximately the same as the monitor refresh rate. |
|
# 137296 frames in 5.0 seconds = 27459.053 FPS |
|
# 141674 frames in 5.0 seconds = 28334.779 FPS |
|
# 141490 frames in 5.0 seconds = 28297.875 FPS |
|
|
|
``` |
|
|
|
## Training on EC2 instance |
|
|
|
1. In the Unity Editor, load a project containing an ML-Agents environment (you |
|
can use one of the example environments if you have not created your own). |
|
2. Open the Build Settings window (menu: File > Build Settings). |
|
3. Select Linux as the Target Platform, and x86_64 as the target architecture |
|
(the default x86 currently does not work). |
|
4. Check Headless Mode if you have not set up the X Server. (If you do not use |
|
Headless Mode, you have to set up the X Server to enable training.) |
|
5. Click Build to build the Unity environment executable. |
|
6. Upload the executable to your EC2 instance, into the `ml-agents` folder. |
|
7. Change the permissions of the executable. |
|
|
|
```sh |
|
chmod +x <your_env>.x86_64 |
|
``` |
|
|
|
8. (Without Headless Mode) Start X Server and use it for display: |
|
|
|
```sh |
|
# Start the X Server, press Enter to come back to the command line |
|
$ sudo /usr/bin/X :0 & |
|
|
|
# Check if Xorg process is running |
|
# You will have a list of processes running on the GPU, Xorg should be in the list. |
|
$ nvidia-smi |
|
|
|
# Make Ubuntu use the X Server for display |
|
$ export DISPLAY=:0 |
|
``` |
|
|
|
9. Test the instance setup from Python using: |
|
|
|
```python |
|
from mlagents_envs.environment import UnityEnvironment |
|
|
|
env = UnityEnvironment(file_name="<your_env>") |
|
``` |
|
|
|
Where `<your_env>` corresponds to the path to your environment executable. |
|
|
|
You should receive a message confirming that the environment was loaded |
|
successfully. |
|
|
|
10. Train your models |
|
|
|
```console |
|
mlagents-learn <trainer-config-file> --env=<your_env> --train |
|
``` |
|
|
|
## FAQ |
|
|
|
### The <Executable_Name>\_Data folder hasn't been copied over |
|
|
|
If you've built your Linux executable but forgot to copy over the corresponding |
|
<Executable_Name>\_Data folder, you will see an error message like the following: |
|
|
|
```sh |
|
Set current directory to /home/ubuntu/ml-agents/ml-agents |
|
Found path: /home/ubuntu/ml-agents/ml-agents/3dball_linux.x86_64 |
|
no boot config - using default values |
|
|
|
(Filename: Line: 403) |
|
|
|
There is no data folder |
|
``` |
|
|
|
### Unity Environment not responding |
|
|
|
If you didn't set up X Server or haven't launched it properly, if your environment |
|
crashes, or if you haven't run `chmod +x` on your Unity environment executable, the |
|
connection between Unity and Python will fail. You will then see |
|
something like this: |
|
|
|
```console |
|
Logging to /home/ubuntu/.config/unity3d/<Some_Path>/Player.log |
|
Traceback (most recent call last): |
|
File "<stdin>", line 1, in <module> |
|
File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/environment.py", line 63, in __init__ |
|
aca_params = self.send_academy_parameters(rl_init_parameters_in) |
|
File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/environment.py", line 489, in send_academy_parameters |
|
return self.communicator.initialize(inputs).rl_initialization_output |
|
File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/rpc_communicator.py", line 60, in initialize |
|
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that : |
|
The environment does not need user interaction to launch |
|
The environment and the Python interface have compatible versions. |
|
``` |
|
|
|
It is also helpful to check your |
|
/home/ubuntu/.config/unity3d/<Some_Path>/Player.log to see what happens with |
|
your Unity environment. |
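
Several of the causes covered in this FAQ (a missing execute permission, a missing `<Executable_Name>_Data` folder, or no display when not using Headless Mode) can be checked before launching the environment. The following is a minimal sketch, not part of the ML-Agents Toolkit: the function name `preflight_problems` is hypothetical, and it assumes the data folder sits next to the executable and is named after it without the `.x86_64` suffix.

```python
import os
from pathlib import Path
from typing import List, Optional


def preflight_problems(
    executable: str, headless: bool, display: Optional[str]
) -> List[str]:
    """Return a list of likely setup problems (empty if none are found)."""
    problems = []
    exe = Path(executable)

    if not exe.is_file():
        problems.append(f"executable not found: {exe}")
    elif not os.access(exe, os.X_OK):
        problems.append(f"executable bit not set, run: chmod +x {exe}")

    # Assumption: '3dball_linux.x86_64' pairs with '3dball_linux_Data'.
    data_dir = exe.with_name(exe.stem + "_Data")
    if not data_dir.is_dir():
        problems.append(f"data folder not found: {data_dir}")

    if not headless and not display:
        problems.append("DISPLAY is not set, start X Server and export DISPLAY=:0")

    return problems


if __name__ == "__main__":
    # Hypothetical path; replace it with your own executable.
    issues = preflight_problems(
        "3dball_linux.x86_64", headless=False, display=os.environ.get("DISPLAY")
    )
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("No obvious setup problems found.")
```

An empty result does not guarantee that the environment will start, so the `Player.log` file remains the most reliable source of information.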
|
|
|
### Could not launch X Server |
|
|
|
When you execute: |
|
|
|
```sh |
|
sudo /usr/bin/X :0 & |
|
``` |
|
|
|
You might see something like: |
|
|
|
```sh |
|
X.Org X Server 1.18.4 |
|
... |
|
(==) Log file: "/var/log/Xorg.0.log", Time: Thu Oct 11 21:10:38 2018 |
|
(==) Using config file: "/etc/X11/xorg.conf" |
|
(==) Using system config directory "/usr/share/X11/xorg.conf.d" |
|
(EE) |
|
Fatal server error: |
|
(EE) no screens found(EE) |
|
(EE) |
|
Please consult the X.Org Foundation support |
|
at http://wiki.x.org |
|
for help. |
|
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information. |
|
(EE) |
|
(EE) Server terminated with error (1). Closing log file. |
|
``` |
|
|
|
And when you execute: |
|
|
|
```sh |
|
nvidia-smi |
|
``` |
|
|
|
You might see something like: |
|
|
|
```sh |
|
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. |
|
``` |
|
|
|
This means the NVIDIA driver needs to be updated. Refer to |
|
[this section](Training-on-Amazon-Web-Service.md#update-and-setup-nvidia-driver) |
|
for more information. |
|
|