---
tags:
- pytorch
- safetensors
- pose-estimation
- head-pose
- landmark-to-pose
- distillation
- py-feat
library_name: py-feat
pipeline_tag: image-feature-extraction
license: mit
---

# Py-Feat Pose-MLP v2: Landmark-to-6DoF Head Pose

A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace
layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head
pose calibrated to img2pose's coordinate frame. Designed for `py-feat`
pipelines that use a face detector without a built-in pose head (e.g.
RetinaFace in `py-feat ≥ 0.7`).

## Model Description

`py-feat`'s v0.6 production pipeline used `img2pose` as its face detector,
which multi-tasks face localization with 6DoF head pose regression, so
pose came "for free" from the detector. In v0.7 the default face detector
became `RetinaFace` (much higher WIDERFACE Hard AP), which only detects
faces. To preserve the `Fex` schema (`pitch`, `roll`, `yaw`, `x`, `y`,
`z` columns), `py-feat` distills img2pose's pose regression into a small
MLP that operates entirely on already-computed landmarks.

The MLP is bbox-free: it normalizes incoming landmarks by their centroid
and inter-eye distance, so the same model works regardless of whether
the upstream detector produced loose (img2pose) or tight (RetinaFace)
face crops.
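
The normalization described above can be sketched roughly as follows. This is an illustrative approximation, not the code in `feat.utils.face_pose_mlp.normalize_landmarks`: the function name `normalize_landmarks_sketch`, the dlib-68 eye index ranges (36-41 and 42-47), the use of eye centers for the inter-eye distance, and the `[x0, y0, x1, y1, ...]` flattening order are all assumptions made for the example.

```python
import numpy as np

# Assumed dlib-68 index ranges for the two eyes (not taken from py-feat).
RIGHT_EYE = slice(36, 42)
LEFT_EYE = slice(42, 48)


def normalize_landmarks_sketch(landmarks):
    """Map pixel landmarks [N, 68, 2] to a bbox-free feature vector [N, 136]."""
    pts = np.asarray(landmarks, dtype=np.float64)

    # Translation invariance: subtract the per-face landmark centroid.
    centroid = pts.mean(axis=1, keepdims=True)          # [N, 1, 2]

    # Scale invariance: divide by the distance between the eye centers.
    right = pts[:, RIGHT_EYE].mean(axis=1)              # [N, 2]
    left = pts[:, LEFT_EYE].mean(axis=1)                # [N, 2]
    inter_eye = np.linalg.norm(left - right, axis=1)    # [N]
    inter_eye = np.maximum(inter_eye, 1e-6)             # guard against division by zero

    normed = (pts - centroid) / inter_eye[:, None, None]
    return normed.reshape(len(pts), -1)                 # [N, 136]


# The same face, rescaled and shifted, should give the same features.
rng = np.random.default_rng(0)
face = rng.uniform(0, 200, size=(1, 68, 2))             # synthetic landmarks
moved = face * 3.0 + np.array([100.0, 50.0])
a = normalize_landmarks_sketch(face)
b = normalize_landmarks_sketch(moved)
```

Because neither step refers to a bounding box, a loose crop and a tight crop of the same face yield the same input vector, which is the property the paragraph above relies on.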

## Model Details

- **Model type**: Multi-layer perceptron (MLP)
- **Architecture**: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15)
  → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) →
  LayerNorm → GELU → Dropout → Linear(128→6)`
- **Parameter count**: 236,934 (~0.9 MB safetensors)
- **Input**: 68 2D landmarks, normalized by landmark centroid and
  inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
- **Output**: 6 values: `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits
  z-scored values; the loader de-normalizes using `mean`/`std` stored in
  the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to
  img2pose's coordinate frame.
- **Framework**: PyTorch (safetensors weight file, no pickle).
- **Inference cost**: ~10 µs / face on CPU (batched), negligible vs.
  the upstream face/landmark stages.
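
As a sanity check on the stated parameter count, the layer widths above can be tallied by hand. The sketch assumes every `Linear` layer has a bias and every `LayerNorm` has a learnable scale and shift (the PyTorch defaults), and that weights are stored as float32; none of these details are stated in the card.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


def layernorm_params(width):
    # Learnable scale and shift (assumes the affine form).
    return 2 * width


widths = [136, 512, 256, 128]   # 68 * 2 inputs, then the three hidden widths
n_outputs = 6

total = 0
for n_in, n_out in zip(widths[:-1], widths[1:]):
    total += linear_params(n_in, n_out) + layernorm_params(n_out)
total += linear_params(widths[-1], n_outputs)

# Assumed float32 storage, ignoring the safetensors header.
size_mb = total * 4 / 1e6
print(total, round(size_mb, 2))
```

Under these assumptions the tally comes to 236,934, which agrees with the figure listed above, and the float32 footprint is a little under 1 MB.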

## Training Details

- **Teacher**: `img2pose` (Albiero et al., 2021). The MLP is trained to
  match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
- **Training corpus**: CelebV-HQ, with `n_clips = 35,445`,
  `n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with
  `FaceScore < 0.8` or `|pose| > 75°` are dropped to filter out
  unreliable teacher signal on degenerate poses.
- **Loss**: MSE on z-scored 6D output.
- **Optimizer**: Adam, `lr=1e-3`, `batch_size=1024`.
- **Epochs**: 40 (best val loss at the last epoch; see `pose_mlp_v2.json`
  for per-epoch history).
- **Hardware**: single GPU (training takes ~2 hr).
- **Seed**: 42.
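
The z-scored regression target and its inverse can be illustrated with a short sketch. The helper names, the synthetic teacher values, and the noisy stand-in for the student output are all placeholders for the example; the real statistics live in `pose_mlp_v2.json`, whose key layout is not assumed here.

```python
import numpy as np


def fit_output_stats(teacher_pose):
    """Per-axis mean and std of the teacher's [N, 6] pose outputs."""
    mean = teacher_pose.mean(axis=0)
    std = teacher_pose.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std


def zscore(pose, mean, std):
    return (pose - mean) / std


def denormalize(z, mean, std):
    # Inverse of zscore: what a loader would apply to raw MLP outputs.
    return z * std + mean


def mse(pred_z, target_z):
    return float(np.mean((pred_z - target_z) ** 2))


# Synthetic teacher poses: small angles in radians, larger translations.
rng = np.random.default_rng(42)
teacher = rng.normal(size=(1000, 6)) * np.array([0.2, 0.1, 0.4, 5.0, 5.0, 50.0])

mean, std = fit_output_stats(teacher)
target_z = zscore(teacher, mean, std)

# Stand-in for the student's z-scored predictions (not a trained model).
student_z = target_z + rng.normal(scale=0.1, size=target_z.shape)

loss = mse(student_z, target_z)
recovered = denormalize(target_z, mean, std)
```

Z-scoring puts the angle and translation axes on a comparable scale, so a single MSE term does not end up dominated by whichever axis has the largest numeric range.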

### Held-out validation MAE on CelebV-HQ (clip-disjoint split)

| Axis | MAE (°) |
|---|---|
| Pitch | 2.66 |
| Roll | 2.34 |
| Yaw | 1.58 |

For reference, img2pose's reported MAE on the AFLW2000-3D / BIWI test
sets is ~4° average. The MLP cannot exceed its teacher; values here are
the gap between the MLP and the teacher's predictions, not against a
ground-truth motion-capture rig.
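
A student-versus-teacher MAE of this kind could be computed along these lines. The function name and the toy angles are illustrative only; the column order `(Pitch, Roll, Yaw)` follows the card, and no angle wrap-around is handled, which is assumed to be acceptable given the ±75° training filter.

```python
import numpy as np


def angular_mae_degrees(student_rad, teacher_rad):
    """Per-axis mean absolute error in degrees for [N, 3] angles in radians."""
    diff = np.abs(np.asarray(student_rad) - np.asarray(teacher_rad))
    return np.degrees(diff).mean(axis=0)


# Toy example: two frames, columns are (Pitch, Roll, Yaw).
teacher = np.radians([[10.0, 0.0, -20.0],
                      [5.0, 2.0, 30.0]])
student = np.radians([[12.0, 1.0, -21.0],
                      [4.0, 2.0, 33.0]])

mae = angular_mae_degrees(student, teacher)
print(dict(zip(["pitch", "roll", "yaw"], mae.round(2))))
```

Note that the reference here is the teacher's prediction, so the result measures distillation error rather than error against ground truth.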

### v1 → v2 changelog

| Aspect | v1 | v2 |
|---|---|---|
| Hidden | 256→128→64 | 512→256→128 |
| Activation | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
| Dropout | 0.10 | 0.15 |
| Training frames | 569,678 | 2,783,134 |
| Epochs | 30 | 40 |
| Best val loss | 0.0809 | 0.0777 |
| Roll MAE (°) | 2.530 | 2.335 |

## Intended Use

- **Primary**: Drop-in replacement for img2pose's pose head when using
  `py-feat` with a face detector that doesn't predict pose
  (`face_model='retinaface'` in `feat.Detector`, MediaPipe in
  `feat.MPDetector`).
- **Secondary**: Any pipeline that produces 68 dlib-style face landmarks
  and wants img2pose-compatible head pose without re-running img2pose.

### Out of scope

- Eye / gaze direction: use `L2CS-Net` for gaze.
- MediaPipe-478 landmarks: translate to 68 dlib landmarks first.
- Head-pose inference from a partial landmark set (fewer than 68 points).

## Usage

The MLP is loaded automatically by `feat.Detector` when
`face_model != 'img2pose'`. To call it directly:

```python
import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
landmarks = torch.tensor([
    # ... [68, 2] ...
], dtype=torch.float32).unsqueeze(0)  # [1, 68, 2]

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)
```

Weights resolve from (in order):
1. `FEAT_POSE_MLP_PATH` environment variable
2. `models/pose_mlp_v2.safetensors` in the repo
3. This HuggingFace repo (`py-feat/pose_mlp_v2`)
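
The lookup order above might be sketched as follows. This is a hypothetical helper, not the loader shipped with `py-feat`: the function name, the `env` parameter, the returned `(source, location)` tuple, and the choice not to verify that the environment-variable path exists are all assumptions, and the final branch only reports the repo id rather than downloading anything.

```python
import os
import tempfile
from pathlib import Path


def resolve_pose_mlp_weights(repo_root, env=None):
    """Return (source, location) following the three-step order above."""
    env = os.environ if env is None else env

    # 1. An explicit override via the environment variable wins.
    override = env.get("FEAT_POSE_MLP_PATH")
    if override:
        return ("env", Path(override))

    # 2. Otherwise, use weights bundled in the repo checkout if present.
    local = Path(repo_root) / "models" / "pose_mlp_v2.safetensors"
    if local.is_file():
        return ("local", local)

    # 3. Otherwise, fall back to the Hub repo id (download left to the caller).
    return ("hub", "py-feat/pose_mlp_v2")


# Demonstration in a throwaway directory with an empty environment.
with tempfile.TemporaryDirectory() as repo:
    source_without_local = resolve_pose_mlp_weights(repo, env={})
    weights = Path(repo) / "models" / "pose_mlp_v2.safetensors"
    weights.parent.mkdir()
    weights.touch()
    source_with_local = resolve_pose_mlp_weights(repo, env={})
```

Passing `env` explicitly keeps the demonstration from touching the real process environment; in normal use the default reads `os.environ`.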

## Limitations

- The MLP cannot improve on img2pose's accuracy; it only matches it
  more efficiently with bbox-free input. Use img2pose directly if you
  need img2pose's exact behavior (a tiny ~1° distillation gap may remain).
- Trained on CelebV-HQ, so performance on non-frontal, occluded, or
  heavily rotated faces (>75°) is degraded by both the teacher and the
  data filter.
- Output coordinates are img2pose's frame, not a standard FACS / BIWI
  frame. Pose values are interpretable across the `py-feat` pipeline
  but may need recalibration to compare with other tools.

## Citation

If you use `py-feat` and this pose-MLP, please cite both `py-feat` and
img2pose:

```bibtex
@article{cheong2023pyfeat,
  title={Py-Feat: Python Facial Expression Analysis Toolbox},
  author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
  journal={Affective Science},
  volume={4},
  pages={781--796},
  year={2023}
}

@inproceedings{albiero2021img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7617--7627},
  year={2021}
}

@inproceedings{zhu2022celebvhq,
  title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
  author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}
```

## License

MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the
training corpus (CelebV-HQ) is released for non-commercial research
use. Please honor each upstream license if you re-train or
re-distribute.

## Files

- `pose_mlp_v2.safetensors`: model weights (1 MB)
- `pose_mlp_v2.json`: architecture, output-normalization stats, training
  history, validation MAE per epoch
- `README.md`: this card

## Acknowledgments

Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA),
trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and
maintained by [Cosanlab](https://cosanlab.com) at Dartmouth.