# gym-hybrid
A collection of reinforcement learning environments with a discrete-continuous hybrid action space.
## "Sliding-v0" and "Moving-v0"
<img align="right" width="300" src="moving_v0.gif">
"Moving-v0" and "Sliding-v0" are sandbox environments for parameterized action-space algorithms. The goal of the agent is to stop inside a target area.
The field is a square with a side length of 2, and the target area is a circle with a radius of 0.1. There are three discrete actions: turn, accelerate, and brake. Two of them take a continuous parameter: the acceleration value for accelerate and the rotation value for turn.
The episode terminates if one of the following three conditions is met:
* the agent stops inside the target area,
* the agent leaves the field,
* the step count is higher than the limit (200 by default).
The moving environment doesn't take into account the conservation of inertia, while the sliding environment does. `Sliding-v0` is therefore more realistic than `Moving-v0`.
All the parameters, actions, states and rewards are the same between the two environments. Only the underlying physics changes.
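
The three termination conditions listed above can be sketched as a single check. This is a minimal illustration, not the repository's code: the function name, the `stopped` flag, and the assumption that the field is centered on the origin are all hypothetical.

```python
import math

FIELD_SIZE = 2.0     # side length of the square field (from the text)
TARGET_RADIUS = 0.1  # radius of the target circle (from the text)
MAX_STEP = 200       # default step limit (from the text)


def termination_reason(x, y, target_x, target_y, stopped, step,
                       field_size=FIELD_SIZE, target_radius=TARGET_RADIUS,
                       max_step=MAX_STEP):
    """Return why the episode ends, or None if it continues.

    Assumptions (not taken from the source): the field is centered on the
    origin, and `stopped` is True when the agent has just come to a halt.
    """
    distance = math.hypot(x - target_x, y - target_y)
    if stopped and distance <= target_radius:
        return "reached_target"
    if abs(x) > field_size / 2 or abs(y) > field_size / 2:
        return "left_field"
    if step > max_step:
        return "step_limit"
    return None
```

Returning a label instead of a plain boolean keeps the success case distinguishable from the two failure cases, which matters for the reward.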
### State
The [state](https://github.com/thomashirtz/gym-hybrid/blob/fee4bf5de2dc1dd0d2a5431498124b2c071a2344/gym_hybrid/environments.py#L126) is a list of 10 elements. The environment-related values are the current step divided by the maximum step and the position of the target (x and y). The player-related values are the position (x and y), the speed, the direction (cosine and sine), the distance to the target, and an indicator that becomes 1 when the player is inside the target zone.
```python
state = [
    agent.x,
    agent.y,
    agent.speed,
    np.cos(agent.theta),
    np.sin(agent.theta),
    target.x,
    target.y,
    distance,
    0 if distance > target_radius else 1,
    current_step / max_step
]
```
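As a runnable illustration of the layout above, the following sketch builds the same 10-element list from plain numbers. The function `build_state` and its argument names are hypothetical, and it uses the standard `math` module in place of NumPy to stay dependency-free.

```python
import math


def build_state(agent_x, agent_y, agent_speed, agent_theta,
                target_x, target_y, target_radius, current_step, max_step):
    """Assemble the 10-element state in the order shown above (illustrative)."""
    distance = math.hypot(agent_x - target_x, agent_y - target_y)
    return [
        agent_x,
        agent_y,
        agent_speed,
        math.cos(agent_theta),
        math.sin(agent_theta),
        target_x,
        target_y,
        distance,
        0 if distance > target_radius else 1,  # 1 when inside the target zone
        current_step / max_step,               # episode progress in [0, 1]
    ]


# Agent at the origin facing along +x, target at (0.3, 0.4), step 50 of 200.
state = build_state(0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.1, 50, 200)
```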
### Reward
The [reward](https://github.com/thomashirtz/gym-hybrid/blob/fee4bf5de2dc1dd0d2a5431498124b2c071a2344/gym_hybrid/environments.py#L141) is the agent's distance to the target at the previous step minus its current distance, so moving closer yields a positive reward. A small per-step penalty (0.001 by default) is subtracted to encourage the agent to reach the target as quickly as possible. A bonus of 1 is added if the player stops inside the target area, and a penalty of 1 is applied if the step count exceeds the limit or the player leaves the field.
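
The reward described above can be written out as a short function. This is one reading of the prose rather than the linked implementation: the function name and the two boolean flags are hypothetical, and the exact way the bonus and the failure penalty combine with the distance term is an assumption.

```python
def compute_reward(last_distance, distance, reached_target, failed,
                   penalty=0.001):
    """Distance improvement minus a per-step penalty, plus terminal terms.

    `reached_target` and `failed` are illustrative flags: the first marks a
    stop inside the target area, the second a step-limit or out-of-field end.
    """
    reward = last_distance - distance - penalty
    if reached_target:
        reward += 1.0  # bonus for stopping inside the target area
    elif failed:
        reward -= 1.0  # penalty for leaving the field or running out of steps
    return reward
```

For example, moving from a distance of 0.5 to 0.4 in an ordinary step gives roughly 0.1 - 0.001 = 0.099 under these assumptions.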
### Actions
**The action ids are (counted from 0, as in the examples below):**
0. Accelerate
1. Turn
2. Brake
**The parameters are:**
1. Acceleration value
2. Rotation value
**There are two distinct ways to format an action:**
Action with all the parameters (convenient if the model outputs all the parameters):
```python
action = (action_id, [acceleration_value, rotation_value])
```
Examples of valid actions:
```python
action = (0, [0.1, 0.4])
action = (1, [0.0, 0.2])
action = (2, [0.1, 0.3])
```
Note: Only the parameter related to the chosen action is used.
Action with only the parameter related to the action id (convenient for algorithms that output only the parameter of the chosen action, since the action does not need to be padded):
```python
action = (0, [acceleration_value])
action = (1, [rotation_value])
action = (2, [])
```
Examples of valid actions:
```python
action = (0, [0.1])
action = (1, [0.2])
action = (2, [])
```
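To show how the two formats relate, the sketch below pads the short form into the full `(action_id, [acceleration_value, rotation_value])` form. The helper `to_full_action` and the constant names are hypothetical and not part of the package; filling the unused slot with 0.0 relies on the note above that only the parameter of the chosen action is used.

```python
ACCELERATE, TURN, BRAKE = 0, 1, 2  # action ids as used in the examples above


def to_full_action(action):
    """Convert either action format to (action_id, [acceleration, rotation])."""
    action_id, parameters = action
    parameters = list(parameters)
    if len(parameters) == 2:
        return action_id, parameters  # already in the full format
    if action_id == ACCELERATE and len(parameters) == 1:
        return action_id, [parameters[0], 0.0]
    if action_id == TURN and len(parameters) == 1:
        return action_id, [0.0, parameters[0]]
    if action_id == BRAKE and len(parameters) == 0:
        return action_id, [0.0, 0.0]
    raise ValueError(f"Malformed action: {action!r}")


full_turn = to_full_action((1, [0.2]))
```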
### Basics
Make and initialize an environment:
```python
import gym
import gym_hybrid  # importing the package registers the environments with gym

sliding_env = gym.make('Sliding-v0')
sliding_env.reset()

moving_env = gym.make('Moving-v0')
moving_env.reset()
```
Get the action space and the observation space:
```python
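# `env` is any environment created above, e.g. env = moving_env
ACTION_SPACE = env.action_space[0].n  # number of discrete actions
PARAMETERS_SPACE = env.action_space[1].shape[0]  # number of continuous parameters
OBSERVATION_SPACE = env.observation_space.shape[0]  # length of the state list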
```
Run a random agent:
```python
env.reset()  # start a new episode before stepping
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())
    print(f'State: {state} Reward: {reward} Done: {done}')
```
### Parameters
The parameters that can be modified at initialization are:
* `seed` (default = None)
* `max_turn`, maximum angle in radians that can be turned in one step (default = np.pi/2)
* `max_acceleration`, acceleration achieved in one step when the input parameter is 1 (default = 0.5)
* `delta_t`, duration of one time step (default = 0.005)
* `max_step`, maximum number of steps before the episode ends (default = 200)
* `penalty`, value subtracted from the reward at each step to incentivize the agent to finish the episode more quickly (default = 0.001)
Initialization with custom parameters:
```python
env = gym.make(
    'Moving-v0',
    seed=0,
    max_turn=1,
    max_acceleration=1.0,
    delta_t=0.001,
    max_step=500,
    penalty=0.01
)
```
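The descriptions of `max_acceleration` and `max_turn` suggest that the action parameters are normalized and then scaled by these limits. The sketch below illustrates that reading only. The function `scale_parameters` is hypothetical, and the clamping ranges (0 to 1 for acceleration, -1 to 1 for rotation) are assumptions that the text above does not state.

```python
import math


def scale_parameters(acceleration_value, rotation_value,
                     max_acceleration=0.5, max_turn=math.pi / 2):
    """Map normalized action parameters to physical per-step quantities."""
    # Assumed ranges: acceleration in [0, 1], rotation in [-1, 1].
    acceleration_value = min(max(acceleration_value, 0.0), 1.0)
    rotation_value = min(max(rotation_value, -1.0), 1.0)
    return acceleration_value * max_acceleration, rotation_value * max_turn


# With the defaults, a parameter of 1 reaches the per-step limits.
default_limits = scale_parameters(1.0, 1.0)
# With the custom values from the example above (max_acceleration=1.0, max_turn=1).
custom = scale_parameters(0.2, -0.5, max_acceleration=1.0, max_turn=1)
```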
### Render & Recording
Two test files are available to show users how to render and record the environment:
* [Python file example for recording](tests/moving_record.py)
* [Python file example for rendering](tests/moving_render.py)
## Disclaimer
The mechanics of the environment are complete, but the hyperparameters may still need further adjustment.
## Reference
This environment is described in several papers, such as:
* [Parametrized Deep Q-Networks Learning, Xiong et al., 2018](https://arxiv.org/pdf/1810.06394.pdf)
* [Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space, Fan et al., 2019](https://arxiv.org/pdf/1903.01344.pdf)
## Installation
Install directly from GitHub with pip:
```shell
pip install git+https://github.com/thomashirtz/gym-hybrid#egg=gym-hybrid
```