# Table of Contents

* [mlagents.trainers.optimizer.torch\_optimizer](#mlagents.trainers.optimizer.torch_optimizer)
  * [TorchOptimizer](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer)
    * [create\_reward\_signals](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals)
    * [get\_trajectory\_value\_estimates](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates)
* [mlagents.trainers.optimizer.optimizer](#mlagents.trainers.optimizer.optimizer)
  * [Optimizer](#mlagents.trainers.optimizer.optimizer.Optimizer)
    * [update](#mlagents.trainers.optimizer.optimizer.Optimizer.update)

<a name="mlagents.trainers.optimizer.torch_optimizer"></a>
# mlagents.trainers.optimizer.torch\_optimizer

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer"></a>
## TorchOptimizer Objects

```python
class TorchOptimizer(Optimizer)
```

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals"></a>
#### create\_reward\_signals

```python
 | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
```

Create reward signals.

**Arguments**:

- `reward_signal_configs`: Reward signal config.
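
For illustration, a minimal sketch of what a `reward_signal_configs` argument might look like, using `RewardSignalType` and `RewardSignalSettings` from `mlagents.trainers.settings`. The settings values shown are illustrative, and the optimizer instance is assumed to be constructed elsewhere.

```python
from mlagents.trainers.settings import RewardSignalType, RewardSignalSettings

# A minimal reward-signal configuration: a single extrinsic signal.
reward_signal_configs = {
    RewardSignalType.EXTRINSIC: RewardSignalSettings(gamma=0.99, strength=1.0),
}

# An already-constructed TorchOptimizer instance (not shown here) would then call:
# optimizer.create_reward_signals(reward_signal_configs)
```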

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates"></a>
#### get\_trajectory\_value\_estimates

```python
 | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
```

Get value estimates and memories for a trajectory, in batch form.

**Arguments**:

- `batch`: An AgentBuffer that consists of a trajectory.
- `next_obs`: The next observation (after the trajectory). Used for bootstrapping
    if this is not a terminal trajectory.
- `done`: Set true if this is a terminal trajectory.
- `agent_id`: Agent ID of the agent that this trajectory belongs to.

**Returns**:

A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)],
    the final value estimate as a Dict of [name, float], and optionally (if using memories)
    an AgentBufferField of initial critic memories to be used during update.
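
A hedged sketch of how a trainer might consume this method, assuming an already-built `optimizer` (a `TorchOptimizer`) and an ML-Agents `Trajectory` named `trajectory`; these names are illustrative and not part of this API doc.

```python
# Assumed to exist: `optimizer` (a TorchOptimizer) and `trajectory`
# (an mlagents.trainers.trajectory.Trajectory). Names are illustrative.
agent_buffer_trajectory = trajectory.to_agentbuffer()

value_estimates, value_next, value_memories = optimizer.get_trajectory_value_estimates(
    agent_buffer_trajectory,
    trajectory.next_obs,
    done=trajectory.done_reached and not trajectory.interrupted,
    agent_id=trajectory.agent_id,
)

# `value_estimates` maps each reward signal name to a per-step value array,
# `value_next` holds the bootstrap value for each signal, and `value_memories`
# (if not None) holds initial critic memories for a later update.
```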

<a name="mlagents.trainers.optimizer.optimizer"></a>
# mlagents.trainers.optimizer.optimizer

<a name="mlagents.trainers.optimizer.optimizer.Optimizer"></a>
## Optimizer Objects

```python
class Optimizer(abc.ABC)
```

Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training.
Provides methods to update the Policy.

<a name="mlagents.trainers.optimizer.optimizer.Optimizer.update"></a>
#### update

```python
 | @abc.abstractmethod
 | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
```

Update the Policy based on the batch that was passed in.

**Arguments**:

- `batch`: AgentBuffer that contains the minibatch of data used for this update.
- `num_sequences`: Number of recurrent sequences found in the minibatch.

**Returns**:

A Dict containing statistics (name, value) from the update (e.g. loss)
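
To make the abstract contract concrete, here is a minimal, hypothetical subclass sketch. The actual implementations in ML-Agents are the trainer-specific optimizers (e.g. for PPO or SAC); the class below exists only to show the expected `update` signature and return value.

```python
from typing import Dict

from mlagents.trainers.buffer import AgentBuffer
from mlagents.trainers.optimizer.optimizer import Optimizer


class NoOpOptimizer(Optimizer):
    """Toy subclass illustrating the abstract contract only; not a real optimizer."""

    def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
        # A real optimizer would compute losses from `batch` and step its
        # torch optimizers; here we only return the required statistics dict.
        return {"Losses/Dummy Loss": 0.0}
```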