PE Audio Video (Perception Encoder Audio-Video)
This model was added to Hugging Face Transformers on 2025-12-16.
Overview
PE Audio Video (Perception Encoder Audio-Video) is the audio-video variant of Meta's Perception Encoder. As the configuration classes below show, it pairs a text model with a joint audio-video encoder that fuses a dedicated audio backbone and a video backbone, and its forward pass accepts text, video, and audio inputs and can optionally return a contrastive loss (`return_loss=True`). The default configuration matches the facebook/pe-av-large checkpoint.
Usage
Basic usage
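A minimal sketch of the expected workflow, assuming the facebook/pe-av-large checkpoint named in the configuration docs below; the dummy input shapes (8 RGB frames at 224x224, one second of 16 kHz mono audio) are placeholders, not documented requirements.

```python
import numpy as np
import torch

from transformers import PeAudioVideoModel, PeAudioVideoProcessor

processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")

# Placeholder media: 8 RGB frames at 224x224 and ~1 s of 16 kHz mono audio.
video = np.random.randint(0, 255, size=(8, 224, 224, 3), dtype=np.uint8)
audio = np.random.randn(16000).astype(np.float32)

# The processor tokenizes the text and turns the video frames and the
# waveform into the tensors the model expects.
inputs = processor(
    text=["a dog barking"],
    videos=[video],
    audio=[audio],
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)
```

For real data, videos can also be passed as file paths or URLs instead of raw frame arrays (see the processor reference below).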
PeAudioVideoProcessor
__call__
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None text: str | list[str] | list[list[str]] | None = None videos: typing.Union[list['PIL.Image.Image'], numpy.ndarray, ForwardRef('torch.Tensor'), list[numpy.ndarray], list['torch.Tensor'], list[list['PIL.Image.Image']], list[list[numpy.ndarray]], list[list['torch.Tensor']], transformers.video_utils.URL, list[transformers.video_utils.URL], list[list[transformers.video_utils.URL]], transformers.video_utils.Path, list[transformers.video_utils.Path], list[list[transformers.video_utils.Path]], NoneType] = None audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor'], NoneType] = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ProcessingKwargs] ) → BatchFeature
Parameters
- images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.
- text (`TextInput`, `PreTokenizedInput`, `list[TextInput]`, `list[PreTokenizedInput]`, *optional*) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
- videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`) — The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
- audio (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`) — The audio or batch of audio to be prepared. Each audio can be a NumPy array or PyTorch tensor.
- return_tensors (`str` or `TensorType`, *optional*) — If set, will return tensors of a particular framework. Acceptable values are:
  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return NumPy `np.ndarray` objects.
Returns
A BatchFeature object with processed inputs in a dict format.
Main method to prepare inputs for the model. This method forwards each modality argument to its own processor, along with kwargs. Please refer to the docstrings of the corresponding processor attributes for more information.
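As an illustration of the return value, a processor-only sketch; the key names in the comment are assumptions inferred from the forward() arguments of PeAudioVideoModel further down, not documented output.

```python
import numpy as np

from transformers import PeAudioVideoProcessor

processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")  # assumed checkpoint

audio = np.random.randn(16000).astype(np.float32)  # placeholder 1 s mono clip
batch = processor(text="a dog barking", audio=[audio], return_tensors="pt")

# Expected (not guaranteed) keys: input_ids, attention_mask, input_values, padding_mask.
print(list(batch.keys()))
```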
PeAudioVideoConfig
class transformers.PeAudioVideoConfig
< source >( text_config = None audio_video_config = None **kwargs )
This is the configuration class to store the configuration of a PeAudioVideoModel. It is used to instantiate a PE Audio Video model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of facebook/pe-av-large.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
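As an example, the usual configuration-to-model pattern; this is a sketch, and the resulting weights are randomly initialized rather than loaded from the checkpoint.

```python
from transformers import PeAudioVideoConfig, PeAudioVideoModel

# A configuration with default values, similar to facebook/pe-av-large.
configuration = PeAudioVideoConfig()

# A model with randomly initialized weights built from that configuration.
model = PeAudioVideoModel(configuration)

# The configuration can be read back from the model.
configuration = model.config
```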
PeAudioVideoEncoderConfig
class transformers.PeAudioVideoEncoderConfig
< source >( audio_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None video_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None hidden_size: int | None = 1792 intermediate_size: int | None = 4800 num_hidden_layers: int | None = 6 num_attention_heads: int | None = 14 num_key_value_heads: int | None = None head_dim: int | None = 128 hidden_act: str | None = 'silu' max_position_embeddings: int | None = 10000 initializer_range: float | None = 0.02 rms_norm_eps: float | None = 1e-05 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict | None = {'rope_theta': 20000} attention_bias: bool | None = False attention_dropout: float | None = 0.0 **kwargs )
Parameters
- audio_config (`Union[dict, PreTrainedConfig]`, *optional*) — The config object or dictionary of the audio backbone.
- video_config (`Union[dict, PreTrainedConfig]`, *optional*) — Configuration for the video encoder. If a dictionary is provided, it is used to instantiate a PeVideoEncoderConfig.

This is the configuration class to store the configuration of a PeAudioVideoEncoder. It is used to instantiate a PE Audio Video encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of facebook/pe-av-large.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

```python
from transformers import PeAudioVideoEncoder, PeAudioVideoEncoderConfig

# Initializing an encoder configuration with default values.
configuration = PeAudioVideoEncoderConfig()

# Initializing an encoder (with random weights) from the configuration.
model = PeAudioVideoEncoder(configuration)
```
PeAudioVideoModel
forward
< source >( input_ids: torch.Tensor | None = None pixel_values_videos: torch.Tensor | None = None input_values: torch.Tensor | None = None attention_mask: torch.Tensor | None = None padding_mask_videos: torch.Tensor | None = None padding_mask: torch.Tensor | None = None return_loss = False **kwargs )
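Continuing the Basic usage sketch above, a hypothetical training-style call. That return_loss=True produces a contrastive loss exposed as outputs.loss is an assumption based on the signature, not documented behavior.

```python
# `model` and `inputs` as prepared in the Basic usage sketch above.
outputs = model(**inputs, return_loss=True)  # assumed: contrastive text/audio-video loss
loss = outputs.loss  # assumed attribute name, as in other contrastive models
loss.backward()
```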
PeAudioVideoEncoder
class transformers.PeAudioVideoEncoder
< source >( config: PeAudioVideoEncoderConfig )
Parameters
- config (PeAudioVideoEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PeAudioVideo Encoder model.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None = None pixel_values_videos: torch.Tensor | None = None padding_mask: torch.Tensor | None = None padding_mask_videos: torch.Tensor | None = None **kwargs )
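To illustrate the encoder on its own, a sketch that builds it from a default configuration (randomly initialized weights) and feeds it the audio and video tensors prepared by the processor; the input key names come from the signature above, while reusing `inputs` from the Basic usage sketch is an assumption.

```python
import torch

from transformers import PeAudioVideoEncoder, PeAudioVideoEncoderConfig

# Randomly initialized encoder with the default configuration.
encoder = PeAudioVideoEncoder(PeAudioVideoEncoderConfig())

# `inputs` as produced by PeAudioVideoProcessor in the Basic usage sketch above.
with torch.no_grad():
    features = encoder(
        input_values=inputs["input_values"],
        pixel_values_videos=inputs["pixel_values_videos"],
    )
```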