---
tags:
- pytorch
- safetensors
- pose-estimation
- head-pose
- landmark-to-pose
- distillation
- py-feat
library_name: py-feat
pipeline_tag: image-feature-extraction
license: mit
---

# Py-Feat Pose-MLP v2 — Landmark-to-6DoF Head Pose

A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace
layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head
pose calibrated to img2pose's coordinate frame. Designed for `py-feat`
pipelines that use a face detector without a built-in pose head (e.g.
RetinaFace in `py-feat ≥ 0.7`).

## Model Description

`py-feat`'s v0.6 production pipeline used `img2pose` as its face detector.
Because `img2pose` performs face localization and 6DoF head pose regression
jointly, pose came "for free" from the detector. In v0.7 the default face
detector became `RetinaFace`, which scores much higher on WIDER FACE Hard AP
but only detects faces. To preserve the `Fex` schema (`pitch`, `roll`, `yaw`,
`x`, `y`, `z` columns), `py-feat` distills img2pose's pose regression into a
small MLP that operates entirely on already-computed landmarks.

The MLP is bbox-free: it normalizes incoming landmarks by their centroid
and inter-eye distance, so the same model works regardless of whether
the upstream detector produced loose (img2pose) or tight (RetinaFace)
face crops.
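
The normalization described above can be sketched in plain Python. This is an illustration only, not the code in `feat.utils.face_pose_mlp.normalize_landmarks`: the function name `normalize_landmarks_sketch`, the eye index ranges (36-41 and 42-47, the usual eye contours in the 68-point layout), and the choice of eye-contour centers to measure inter-eye distance are all assumptions.

```python
import math

# Assumed 0-based index ranges for the two eye contours in the 68-point
# layout. The real helper may anchor the inter-eye distance differently.
EYE_A = range(36, 42)
EYE_B = range(42, 48)


def _mean_point(points):
    """Average a sequence of (x, y) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def normalize_landmarks_sketch(landmarks):
    """Center 68 (x, y) points on their centroid, then scale by inter-eye distance."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    cx, cy = _mean_point(landmarks)
    ax, ay = _mean_point([landmarks[i] for i in EYE_A])
    bx, by = _mean_point([landmarks[i] for i in EYE_B])
    eye_dist = math.hypot(ax - bx, ay - by)
    if eye_dist == 0:
        raise ValueError("degenerate landmarks: zero inter-eye distance")
    return [((x - cx) / eye_dist, (y - cy) / eye_dist) for x, y in landmarks]


# Synthetic shape placed at two scales and offsets, mimicking a loose and a
# tight crop of the same face.
base = [(math.cos(i * 0.7) * (1 + i % 5), math.sin(i * 1.3) * (2 + i % 3))
        for i in range(68)]
loose = [(100 + 2.0 * x, 50 + 2.0 * y) for x, y in base]
tight = [(10 + 0.5 * x, 20 + 0.5 * y) for x, y in base]

norm_loose = normalize_landmarks_sketch(loose)
norm_tight = normalize_landmarks_sketch(tight)
max_diff = max(abs(a - b)
               for p, q in zip(norm_loose, norm_tight)
               for a, b in zip(p, q))
print(max_diff)  # differs from zero only by floating-point rounding
```

Because translation and uniform scale cancel out, both crops map to the same normalized input, which is what makes the model independent of the detector's bounding-box convention.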

## Model Details

- **Model type**: Multi-layer perceptron (MLP)
- **Architecture**: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15)
  → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) →
  LayerNorm → GELU → Dropout → Linear(128→6)`
- **Parameter count**: 236,934 (~0.9 MB safetensors)
- **Input**: 68 2D landmarks, normalized by landmark centroid and
  inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
- **Output**: 6 values — `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits
  z-scored values; the loader de-normalizes using `mean`/`std` stored in
  the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to
  img2pose's coordinate frame.
- **Framework**: PyTorch (safetensors weight file, no pickle).
- **Inference cost**: ~10 µs / face on CPU (batched), negligible vs.
  the upstream face/landmark stages.
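
The parameter count can be checked by hand from the architecture line. The sketch below assumes every `Linear` layer has a bias and every `LayerNorm` carries a per-unit scale and shift (the PyTorch defaults); the file-size estimate further assumes float32 storage and ignores the safetensors header.

```python
# Layer widths from the architecture line: 136 inputs (68 points x 2),
# three hidden layers, 6 outputs.
widths = [136, 512, 256, 128, 6]

# Each Linear layer: weight matrix plus bias vector.
linear_params = sum(n_in * n_out + n_out
                    for n_in, n_out in zip(widths, widths[1:]))

# Each hidden layer is followed by a LayerNorm with one scale and one
# shift per unit (assumed affine).
layernorm_params = sum(2 * n for n in widths[1:-1])

total_params = linear_params + layernorm_params
size_bytes_fp32 = total_params * 4  # assumes float32, header excluded

print(total_params)     # 236934
print(size_bytes_fp32)  # 947736
```

The total matches the 236,934 parameters stated above, and 947,736 bytes is consistent with the quoted ~0.9 MB weight file.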

## Training Details

- **Teacher**: `img2pose` (Albiero et al., 2021). The MLP is trained to
  match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
- **Training corpus**: CelebV-HQ — `n_clips = 35,445`,
  `n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with
  `FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher
  signal on degenerate poses).
- **Loss**: MSE on z-scored 6D output.
- **Optimizer**: Adam, `lr=1e-3`, `batch_size=1024`.
- **Epochs**: 40 (best val loss at last epoch — see `pose_mlp_v2.json`
  for per-epoch history).
- **Hardware**: single GPU (training takes ~2 hr).
- **Seed**: 42.
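
To make the "MSE on z-scored 6D output" objective concrete, here is a minimal dependency-free sketch of the target normalization, the loss, and the inverse transform a loader would apply with the stored `mean`/`std`. The helper names and the teacher values are invented for illustration; the actual training code may compute its statistics differently (for example, sample rather than population standard deviation).

```python
def zscore_stats(targets):
    """Per-column mean and population standard deviation of 6-D targets."""
    n = len(targets)
    dims = len(targets[0])
    mean = [sum(row[d] for row in targets) / n for d in range(dims)]
    std = [
        (sum((row[d] - mean[d]) ** 2 for row in targets) / n) ** 0.5
        for d in range(dims)
    ]
    return mean, std


def zscore(row, mean, std):
    """Map raw pose values to z-scores."""
    return [(v - m) / s for v, m, s in zip(row, mean, std)]


def denormalize(z_row, mean, std):
    """Inverse of zscore: recover pose values from z-scored outputs."""
    return [z * s + m for z, m, s in zip(z_row, mean, std)]


def mse(pred, target):
    """Mean squared error over the output dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Made-up teacher outputs: [Pitch, Roll, Yaw, X, Y, Z], angles in radians.
teacher = [
    [0.10, -0.02, 0.30, 1.0, -2.0, 40.0],
    [-0.05, 0.04, -0.20, 0.5, -1.5, 55.0],
    [0.02, 0.00, 0.05, -0.8, 0.3, 48.0],
]
mean, std = zscore_stats(teacher)
z_target = zscore(teacher[0], mean, std)
z_pred = [z + 0.1 for z in z_target]  # pretend student output, off by 0.1 per axis
loss = mse(z_pred, z_target)
restored = denormalize(z_target, mean, std)
```

Z-scoring puts angles (fractions of a radian) and translations (much larger magnitudes) on a common scale, so no single output dominates the loss.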

### Held-out validation MAE on CelebV-HQ (clip-disjoint split)

| Axis | MAE (°) |
|---|---|
| Pitch | 2.66 |
| Roll | 2.34 |
| Yaw | 1.58 |

For reference, img2pose's reported MAE on the AFLW2000-3D and BIWI test
sets averages about 4°. Because the MLP is trained only on the teacher's
outputs, it is not expected to exceed the teacher's accuracy. The values
above measure the gap between the MLP's predictions and the teacher's
predictions, not error against a ground-truth motion-capture rig.
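
As an illustration of how a student-versus-teacher MAE of this kind can be computed, the sketch below takes rotation predictions in radians and reports per-axis error in degrees. The function name and the numbers are hypothetical, and no angle wrap-around is handled, which is assumed to be acceptable only because extreme poses are filtered out of the data.

```python
import math


def angular_mae_deg(student_rad, teacher_rad):
    """Per-axis mean absolute error in degrees for rows of [pitch, roll, yaw] in radians."""
    n = len(student_rad)
    return [
        math.degrees(
            sum(abs(s[axis] - t[axis]) for s, t in zip(student_rad, teacher_rad)) / n
        )
        for axis in range(3)
    ]


# Made-up teacher predictions and student offsets (in degrees) per axis.
teacher = [[0.10, 0.00, -0.20], [0.00, 0.10, 0.30]]
offsets_deg = [[2.0, -1.0, 3.0], [-2.0, 1.0, -3.0]]
student = [
    [t + math.radians(o) for t, o in zip(t_row, o_row)]
    for t_row, o_row in zip(teacher, offsets_deg)
]

mae = angular_mae_deg(student, teacher)
print(mae)  # approximately [2.0, 1.0, 3.0]
```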

### v1 → v2 changelog

| Aspect | v1 | v2 |
|---|---|---|
| Hidden | 256→128→64 | 512→256→128 |
| Block layout | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
| Dropout | 0.10 | 0.15 |
| Training frames | 569,678 | 2,783,134 |
| Epochs | 30 | 40 |
| Best val loss | 0.0809 | 0.0777 |
| Roll MAE (°) | 2.530 | 2.335 |

## Intended Use

- **Primary**: Drop-in replacement for img2pose's pose head when using
  `py-feat` with a face detector that doesn't predict pose
  (`face_model='retinaface'` in `feat.Detector`, MediaPipe in
  `feat.MPDetector`).
- **Secondary**: Any pipeline that produces 68 dlib-style face landmarks
  and wants img2pose-compatible head pose without re-running img2pose.

### Out of scope

- Eye / gaze direction: use `L2CS-Net` for gaze.
- MediaPipe-478 landmarks: translate to 68 dlib landmarks first.
- Head-pose inference from a partial landmark set (fewer than 68 points).

## Usage

The MLP is loaded automatically by `feat.Detector` when
`face_model != 'img2pose'`. To call it directly:

```python
import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

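# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
# Random values stand in for real detector output so the snippet runs as-is;
# replace them with actual landmarks to get a meaningful pose.
landmarks = torch.rand(1, 68, 2) * 256.0  # [1, 68, 2], float32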

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)
```
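
Since the angles come back in radians, it can help to label the six outputs and convert the rotations to degrees before comparing them with the MAE table. The helper below is a hypothetical convenience, not part of `py-feat`; it works on a plain list such as `pose[0].tolist()`, and the example values are invented.

```python
import math

# Column order taken from this card: (Pitch, Roll, Yaw, X, Y, Z).
POSE_COLUMNS = ("Pitch", "Roll", "Yaw", "X", "Y", "Z")


def pose_row_to_dict(row, degrees=True):
    """Label one 6-value pose row; optionally convert the three angles to degrees."""
    if len(row) != len(POSE_COLUMNS):
        raise ValueError("expected 6 pose values")
    values = list(row)
    if degrees:
        # Only the first three entries are angles; X, Y, Z are left untouched.
        values[:3] = [math.degrees(v) for v in values[:3]]
    return dict(zip(POSE_COLUMNS, values))


# Made-up values standing in for one row of the model output.
example = [math.radians(10.0), 0.0, math.radians(-25.0), 0.5, -1.2, 42.0]
labeled = pose_row_to_dict(example)
print(labeled)
```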

Weights resolve from (in order):
1. `FEAT_POSE_MLP_PATH` environment variable
2. `models/pose_mlp_v2.safetensors` in the repo
3. This HuggingFace repo (`py-feat/pose_mlp_v2`)
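
The lookup order above can be sketched as a small resolver. This is a hypothetical illustration of the precedence only: the function name `resolve_pose_mlp_source` and its return convention are assumptions, it does not download anything, and whether the real loader validates the environment-variable path is not stated in this card.

```python
import os
from pathlib import Path


def resolve_pose_mlp_source(repo_root, environ=os.environ):
    """Return (kind, location) following the documented lookup order."""
    # 1. An explicit override always wins.
    override = environ.get("FEAT_POSE_MLP_PATH")
    if override:
        return ("env", Path(override))

    # 2. A weight file checked out alongside the repo.
    local = Path(repo_root) / "models" / "pose_mlp_v2.safetensors"
    if local.is_file():
        return ("local", local)

    # 3. Fall back to the hosted repo; the caller would fetch from there.
    return ("hub", "py-feat/pose_mlp_v2")


print(resolve_pose_mlp_source("/path/that/does/not/exist", environ={}))
```

Passing `environ` explicitly keeps the function easy to test without mutating the process environment.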

## Limitations

- The MLP cannot improve on img2pose's accuracy; it only approximates it
  more cheaply and with bbox-free input. Use img2pose directly if you
  need img2pose's exact behavior: a distillation gap of roughly 1.6° to
  2.7° MAE per rotation axis remains (see the validation table above).
- Trained on CelebV-HQ — performance on non-frontal, occluded, or
  heavily-rotated faces (>75°) is degraded by both the teacher and the
  data filter.
- Output coordinates are img2pose's frame, not a standard FACS / BIWI
  frame. Pose values are interpretable across the `py-feat` pipeline
  but may need recalibration to compare with other tools.

## Citation

If you use `py-feat` and this pose-MLP, please cite both `py-feat` and
img2pose:

```bibtex
@article{cheong2023pyfeat,
  title={Py-Feat: Python Facial Expression Analysis Toolbox},
  author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
  journal={Affective Science},
  volume={4},
  pages={781--796},
  year={2023}
}

@inproceedings{albiero2021img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7617--7627},
  year={2021}
}

@inproceedings{zhu2022celebvhq,
  title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
  author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}
```

## License

MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the
training corpus (CelebV-HQ) is released for non-commercial research
use — please honor each upstream license if you re-train or
re-distribute.

## Files

- `pose_mlp_v2.safetensors` — model weights (1 MB)
- `pose_mlp_v2.json` — architecture, output-normalization stats, training
  history, validation MAE per epoch
- `README.md` — this card

## Acknowledgments

Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA),
trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and
maintained by [Cosanlab](https://cosanlab.com) at Dartmouth.