ppo-Pyramids-Training / docs /Python-Optimizer-Documentation.md
AnnaMats's picture
Second Push
05c9ac2
# Table of Contents
* [mlagents.trainers.optimizer.torch\_optimizer](#mlagents.trainers.optimizer.torch_optimizer)
* [TorchOptimizer](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer)
* [create\_reward\_signals](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals)
* [get\_trajectory\_value\_estimates](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates)
* [mlagents.trainers.optimizer.optimizer](#mlagents.trainers.optimizer.optimizer)
* [Optimizer](#mlagents.trainers.optimizer.optimizer.Optimizer)
* [update](#mlagents.trainers.optimizer.optimizer.Optimizer.update)
<a name="mlagents.trainers.optimizer.torch_optimizer"></a>
# mlagents.trainers.optimizer.torch\_optimizer
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer"></a>
## TorchOptimizer Objects
```python
class TorchOptimizer(Optimizer)
```
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals"></a>
#### create\_reward\_signals
```python
| create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
```
Create reward signals
**Arguments**:
- `reward_signal_configs`: Reward signal config.
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates"></a>
#### get\_trajectory\_value\_estimates
```python
| get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
```
Get value estimates and memories for a trajectory, in batch form.
**Arguments**:
- `batch`: An AgentBuffer that consists of a trajectory.
- `next_obs`: the next observation (after the trajectory). Used for boostrapping
if this is not a termiinal trajectory.
- `done`: Set true if this is a terminal trajectory.
- `agent_id`: Agent ID of the agent that this trajectory belongs to.
**Returns**:
A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)],
the final value estimate as a Dict of [name, float], and optionally (if using memories)
an AgentBufferField of initial critic memories to be used during update.
<a name="mlagents.trainers.optimizer.optimizer"></a>
# mlagents.trainers.optimizer.optimizer
<a name="mlagents.trainers.optimizer.optimizer.Optimizer"></a>
## Optimizer Objects
```python
class Optimizer(abc.ABC)
```
Creates loss functions and auxillary networks (e.g. Q or Value) needed for training.
Provides methods to update the Policy.
<a name="mlagents.trainers.optimizer.optimizer.Optimizer.update"></a>
#### update
```python
| @abc.abstractmethod
| update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
```
Update the Policy based on the batch that was passed in.
**Arguments**:
- `batch`: AgentBuffer that contains the minibatch of data used for this update.
- `num_sequences`: Number of recurrent sequences found in the minibatch.
**Returns**:
A Dict containing statistics (name, value) from the update (e.g. loss)