---
tags:
- pytorch
- safetensors
- pose-estimation
- head-pose
- landmark-to-pose
- distillation
- py-feat
library_name: py-feat
pipeline_tag: image-feature-extraction
license: mit
---

# Py-Feat Pose-MLP v2: Landmark-to-6DoF Head Pose

A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head pose calibrated to img2pose's coordinate frame. Designed for `py-feat` pipelines that use a face detector without a built-in pose head (e.g. RetinaFace in `py-feat ≥ 0.7`).

## Model Description

`py-feat`'s v0.6 production pipeline used `img2pose` as its face detector, which multi-tasks face localization with 6DoF head pose regression, so pose came "for free" from the detector. In v0.7 the default face detector became `RetinaFace` (much higher WIDER FACE Hard AP), which only detects faces. To preserve the `Fex` schema (`pitch`, `roll`, `yaw`, `x`, `y`, `z` columns), `py-feat` distills img2pose's pose regression into a small MLP that operates entirely on already-computed landmarks.

The MLP is bbox-free: it normalizes incoming landmarks by their centroid and inter-eye distance, so the same model works regardless of whether the upstream detector produced loose (img2pose) or tight (RetinaFace) face crops.

## Model Details

- **Model type**: Multi-layer perceptron (MLP)
- **Architecture**: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15) → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) → LayerNorm → GELU → Dropout → Linear(128→6)`
- **Parameter count**: 236,934 (~0.9 MB safetensors)
- **Input**: 68 2D landmarks, normalized by landmark centroid and inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
- **Output**: 6 values, `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits z-scored values; the loader de-normalizes using `mean`/`std` stored in the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to img2pose's coordinate frame.
- **Framework**: PyTorch (safetensors weight file, no pickle).
- **Inference cost**: ~10 µs / face on CPU (batched), negligible vs. the upstream face/landmark stages.

## Training Details

- **Teacher**: `img2pose` (Albiero et al., 2021). The MLP is trained to match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
- **Training corpus**: CelebV-HQ, with `n_clips = 35,445`, `n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with `FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher signal on degenerate poses).
- **Loss**: MSE on z-scored 6D output.
- **Optimizer**: Adam, `lr=1e-3`, `batch_size=1024`.
- **Epochs**: 40 (best val loss at last epoch; see `pose_mlp_v2.json` for per-epoch history).
- **Hardware**: single GPU (training takes ~2 hr).
- **Seed**: 42.

### Held-out validation MAE on CelebV-HQ (clip-disjoint split)

| Axis | MAE (°) |
|---|---|
| Pitch | 2.66 |
| Roll | 2.34 |
| Yaw | 1.58 |

For reference, img2pose's reported MAE on the AFLW2000-3D / BIWI test sets is ~4° average. The MLP cannot exceed its teacher; values here are the gap between the MLP and the teacher's predictions, not against a ground-truth motion-capture rig.

### v1 → v2 changelog

| Aspect | v1 | v2 |
|---|---|---|
| Hidden widths | 256→128→64 | 512→256→128 |
| Hidden block | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
| Dropout | 0.10 | 0.15 |
| Training frames | 569,678 | 2,783,134 |
| Epochs | 30 | 40 |
| Best val loss | 0.0809 | 0.0777 |
| Roll MAE (°) | 2.530 | 2.335 |

## Intended Use

- **Primary**: Drop-in replacement for img2pose's pose head when using `py-feat` with a face detector that doesn't predict pose (`face_model='retinaface'` in `feat.Detector`, MediaPipe in `feat.MPDetector`).
- **Secondary**: Any pipeline that produces 68 dlib-style face landmarks and wants img2pose-compatible head pose without re-running img2pose.

### Out of scope

- Eye / gaze direction: use `L2CS-Net` for gaze.
- MediaPipe-478 landmarks: translate to 68 dlib landmarks first.
- Static head-pose inference from a partial landmark set (fewer than 68 points).

## Usage

The MLP is loaded automatically by `feat.Detector` when `face_model != 'img2pose'`. To call it directly:

```python
import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
landmarks = torch.tensor([
    # ... [68, 2] ...
], dtype=torch.float32).unsqueeze(0)  # [1, 68, 2]

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)
```

Weights resolve from (in order):

1. `FEAT_POSE_MLP_PATH` environment variable
2. `models/pose_mlp_v2.safetensors` in the repo
3. This HuggingFace repo (`py-feat/pose_mlp_v2`)

## Limitations

- The MLP cannot improve on img2pose's accuracy; it only matches it more efficiently with bbox-free input. Use img2pose directly if you need img2pose's exact behavior (a tiny ~1° distillation gap may remain).
- Trained on CelebV-HQ: performance on non-frontal, occluded, or heavily rotated faces (>75°) is degraded by both the teacher and the data filter.
- Output coordinates are img2pose's frame, not a standard FACS / BIWI frame. Pose values are interpretable across the `py-feat` pipeline but may need recalibration to compare with other tools.

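
For readers wiring the MLP into their own pipeline, the pre- and post-processing described under Model Details (centroid and inter-eye normalization on the way in, z-score de-normalization on the way out) might look roughly like the sketch below. It is not the library's implementation: `normalize_landmarks_sketch` and `denormalize_pose_sketch` are hypothetical names, the eye indices assume the usual 0-indexed dlib-68 layout (36-41 and 42-47), and the inter-eye distance is assumed to be the distance between the two eye centers. The real `feat.utils.face_pose_mlp.normalize_landmarks` may define these differently, and the real loader reads `mean`/`std` from `pose_mlp_v2.json` rather than taking them as arguments.

```python
import math

# Assumption: standard 0-indexed dlib-68 eye indices.
EYE_A = range(36, 42)
EYE_B = range(42, 48)


def _mean_point(points):
    """Average a sequence of (x, y) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def normalize_landmarks_sketch(landmarks):
    """Center 68 (x, y) landmarks on their centroid, scale by inter-eye distance.

    Hypothetical stand-in for the library's normalization; the inter-eye
    distance is assumed to be measured between the two eye centers.
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    cx, cy = _mean_point(landmarks)
    ax, ay = _mean_point([landmarks[i] for i in EYE_A])
    bx, by = _mean_point([landmarks[i] for i in EYE_B])
    scale = math.hypot(ax - bx, ay - by)
    if scale == 0:
        raise ValueError("degenerate landmarks: zero inter-eye distance")
    return [((x - cx) / scale, (y - cy) / scale) for x, y in landmarks]


def denormalize_pose_sketch(z_scored, mean, std):
    """Undo z-scoring per output channel: value = z * std + mean."""
    return [z * s + m for z, m, s in zip(z_scored, mean, std)]
```

Because the centroid is subtracted and the result is divided by a distance measured on the face itself, shifting or uniformly rescaling the input landmarks leaves the normalized values unchanged. That is the property that lets one model serve both loose and tight face crops. Since the de-normalized angles are in radians, `math.degrees` converts them for comparison with the MAE table.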
## Citation

If you use `py-feat` and this pose-MLP, please cite both `py-feat` and img2pose:

```bibtex
@article{cheong2023pyfeat,
  title={Py-Feat: Python Facial Expression Analysis Toolbox},
  author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
  journal={Affective Science},
  volume={4},
  pages={781--796},
  year={2023}
}

@inproceedings{albiero2021img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7617--7627},
  year={2021}
}

@inproceedings{zhu2022celebvhq,
  title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
  author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}
```

## License

MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the training corpus (CelebV-HQ) is released for non-commercial research use. Please honor each upstream license if you re-train or re-distribute.

## Files

- `pose_mlp_v2.safetensors`: model weights (1 MB)
- `pose_mlp_v2.json`: architecture, output-normalization stats, training history, validation MAE per epoch
- `README.md`: this card

## Acknowledgments

Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA), trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and maintained by [Cosanlab](https://cosanlab.com) at Dartmouth.