ppo-Pyramids-Training / docs /Python-Optimizer-Documentation.md
AnnaMats's picture
Second Push
05c9ac2

Table of Contents

mlagents.trainers.optimizer.torch_optimizer

TorchOptimizer Objects

class TorchOptimizer(Optimizer)

create_reward_signals

 | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None

Create reward signals

Arguments:

  • reward_signal_configs: Reward signal config.

get_trajectory_value_estimates

 | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]

Get value estimates and memories for a trajectory, in batch form.

Arguments:

  • batch: An AgentBuffer that consists of a trajectory.
  • next_obs: the next observation (after the trajectory). Used for boostrapping if this is not a termiinal trajectory.
  • done: Set true if this is a terminal trajectory.
  • agent_id: Agent ID of the agent that this trajectory belongs to.

Returns:

A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)], the final value estimate as a Dict of [name, float], and optionally (if using memories) an AgentBufferField of initial critic memories to be used during update.

mlagents.trainers.optimizer.optimizer

Optimizer Objects

class Optimizer(abc.ABC)

Creates loss functions and auxillary networks (e.g. Q or Value) needed for training. Provides methods to update the Policy.

update

 | @abc.abstractmethod
 | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]

Update the Policy based on the batch that was passed in.

Arguments:

  • batch: AgentBuffer that contains the minibatch of data used for this update.
  • num_sequences: Number of recurrent sequences found in the minibatch.

Returns:

A Dict containing statistics (name, value) from the update (e.g. loss)