| --- |
| license: mit |
| library_name: pytorch |
| pipeline_tag: image-classification |
| tags: |
| - blur-detection |
| - image-quality |
| - mobilenet |
| - magika |
| - laplacian |
| metrics: |
| - f1 |
| - accuracy |
| - precision |
| - recall |
| - roc_auc |
| model-index: |
| - name: MagikaDocumentFromPixel — Lightweight Blur Detector |
| results: |
| - task: |
| type: image-classification |
| name: Blur Detection (sharp / blurred) |
| dataset: |
| type: gopro-large |
| name: GoPro Large (test split) |
| metrics: |
| - type: f1 |
| value: 0.9803 |
| - type: accuracy |
| value: 0.9806 |
| - type: precision |
| value: 0.9981 |
| - type: recall |
| value: 0.9631 |
| - type: roc_auc |
| value: 0.9989 |
| --- |
| |
| # MagikaDocumentFromPixel — Lightweight Blur Detector |
|
|
| A **Magika-inspired image quality gate** that classifies images as `sharp`, `blurred`, or `uncertain` in about 17 ms per image on CPU (single-scale). It is built to sit at the front of vision pipelines so that expensive downstream models (OCR, detection, classification, VLMs) never waste compute on unusable input. |
|
|
| GitHub repo (training code, Dockerfile, full README): **[bradduy/MagikaDocumentFromPixel](https://github.com/bradduy/MagikaDocumentFromPixel)** |
|
|
| ## Results on the GoPro Large test split |
|
|
| | Metric | Value | |
| |---|---| |
| | F1 | **0.9803** | |
| | Accuracy | 0.9806 | |
| | Precision | 0.9981 | |
| | Recall | 0.9631 | |
| | ROC AUC | 0.9989 | |
| | Model size | 17 MB | |
| | Inference latency | ~17 ms / image (CPU, single-scale) | |
|
|
| ## Recipe |
|
|
| - **Backbone**: MobileNetV3-Large, ImageNet-pretrained, 2-class softmax head (~3.3M parameters). |
| - **Frequency-domain auxiliary channel (Freq-Aux)**: a per-image-standardized Laplacian magnitude map is concatenated to the RGB tensor as a **4th input channel**. The first conv is expanded from 3→4 channels (pretrained RGB weights preserved; the new slice is initialized from the mean of the RGB kernels). The Laplacian gives the network an explicit, scale-invariant edge-energy cue. |
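| - **Training**: 384×384 input, AdamW (lr=1e-4), cosine-annealing schedule, cross-entropy loss, 25 epochs, medium augmentation, mixed precision, GoPro Large with the `blur_gamma` images as extra positives. |
| - **Inference**: multi-scale test-time augmentation (TTA) at five input sizes: 256, 320, 384, 448, and 512. The latency reported in the results table is for a single scale. |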
| - **Routing**: return `sharp` or `blurred` when max softmax ≥ 0.60, otherwise return `uncertain`. |
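The inference and routing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `average_probs` and `route` are hypothetical helper names, the class order `(sharp, blurred)` is assumed, and averaging softmax outputs across scales is one common way to combine TTA passes (the repo may combine them differently). The per-scale probabilities are made-up numbers.

```python
from typing import List, Sequence, Tuple

LABELS = ("sharp", "blurred")  # assumed class order; check the repo's label map
THRESHOLD = 0.60               # routing threshold from the recipe above


def average_probs(per_scale_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class softmax probabilities over the TTA scales."""
    n_scales = len(per_scale_probs)
    n_classes = len(per_scale_probs[0])
    return [
        sum(probs[c] for probs in per_scale_probs) / n_scales
        for c in range(n_classes)
    ]


def route(probs: Sequence[float], threshold: float = THRESHOLD) -> Tuple[str, float]:
    """Return (label, confidence); the label is 'uncertain' below the threshold."""
    confidence = max(probs)
    if confidence >= threshold:
        return LABELS[list(probs).index(confidence)], confidence
    return "uncertain", confidence


# Made-up softmax outputs for one image at the five scales (256 ... 512).
scales_out = [
    [0.20, 0.80],
    [0.30, 0.70],
    [0.25, 0.75],
    [0.35, 0.65],
    [0.40, 0.60],
]
label, confidence = route(average_probs(scales_out))
print(label, round(confidence, 2))
```

Keeping the routing rule separate from the averaging step makes it easy to change the threshold without touching the model code.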
|
|
| ## Files |
|
|
| - `best.pt` — PyTorch state dict for the `FreqAuxModel(MobileNetV3-Large)` 4-channel-input model. |
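Because the checkpoint expects a 4-channel input, it may help to see how such an input and first convolution could be built. The following is a rough PyTorch sketch based on the recipe's description, not the code in the repository: `laplacian_channel` and `expand_first_conv` are hypothetical names, the 4-neighbour Laplacian kernel and the channel-mean grayscale conversion are assumptions, and the stand-in convolution only mimics the shape of a MobileNetV3 stem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def laplacian_channel(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: (N, 3, H, W). Returns (N, 1, H, W) per-image standardized |Laplacian|."""
    gray = rgb.mean(dim=1, keepdim=True)  # simple luminance proxy (assumption)
    kernel = torch.tensor(
        [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]],
        dtype=rgb.dtype, device=rgb.device,
    ).view(1, 1, 3, 3)
    mag = F.conv2d(gray, kernel, padding=1).abs()
    # Standardize each image independently (zero mean, unit variance).
    mean = mag.mean(dim=(1, 2, 3), keepdim=True)
    std = mag.std(dim=(1, 2, 3), keepdim=True)
    return (mag - mean) / (std + 1e-6)


def expand_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    """Copy a 3-channel conv into a 4-channel one; 4th slice = mean of RGB kernels.

    Assumes groups=1 and dilation=1, as in a standard stem convolution.
    """
    new_conv = nn.Conv2d(
        4, conv.out_channels, kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight[:, :3] = conv.weight  # keep the (pretrained) RGB weights
        new_conv.weight[:, 3:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


torch.manual_seed(0)
rgb = torch.rand(2, 3, 64, 64)                       # random stand-in images
x = torch.cat([rgb, laplacian_channel(rgb)], dim=1)  # RGB + Laplacian channel
stem = expand_first_conv(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1, bias=False)
)
print(x.shape, stem(x).shape)
```

Initializing the new slice from the mean of the RGB kernels keeps early activations in a range similar to the pretrained network's, which is presumably why the recipe uses it instead of a random initialization.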
|
|
| ## Usage |
|
|
| Clone the GitHub repo for the inference scripts, then load this checkpoint. |
|
|
| ```bash |
| git clone https://github.com/bradduy/MagikaDocumentFromPixel.git |
| cd MagikaDocumentFromPixel |
| pip install -r blur_detector/requirements.txt |
| |
| # Download this checkpoint |
| pip install huggingface_hub |
| python -c "from huggingface_hub import hf_hub_download; \ |
| hf_hub_download('bradduy/MagikaDocumentFromPixel', 'best.pt', \ |
| local_dir='blur_detector/outputs/checkpoints/champion')" |
| |
| # Run inference |
| python blur_detector/scripts/predict.py \ |
| --checkpoint blur_detector/outputs/checkpoints/champion/best.pt --freq_aux \ |
| path/to/image.jpg |
| ``` |
|
|
| Or in Python: |
|
|
| ```python |
| import torch |
| |
| from blur_detector.src.models.blur_detector import build_model |
| from blur_detector.src.datasets.freq_aux import FreqAuxModel |
| from blur_detector.src.inference.predictor import BlurPredictor |
| |
| # Rebuild the 4-channel (RGB + Laplacian) architecture, then load the weights. |
| backbone = build_model("mobilenet_v3_large", pretrained=False, in_channels=4) |
| model = FreqAuxModel(backbone) |
| model.load_state_dict(torch.load("best.pt", map_location="cpu")) |
| model.eval()  # inference mode: fixed BatchNorm statistics, dropout disabled |
| |
| # The five sizes match the multi-scale TTA described in the recipe. |
| predictor = BlurPredictor(model, image_size=[256, 320, 384, 448, 512]) |
| pred = predictor.predict("receipt.jpg") |
| print(pred.label, pred.confidence) |
| ``` |
|
|
| ## Intended use |
|
|
| - Pre-check before OCR / VLM / paid vision API calls. |
| - Upload-time quality filter ("please retake the photo"). |
| - Dataset curation for ML projects. |
| - Edge / on-device inference (single-scale 384px → ONNX → mobile/browser). |
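As an illustration of the pre-check use case, here is a small sketch of a gate placed in front of an expensive step. Everything in it is hypothetical: `gate`, `fake_predict`, and `fake_ocr` are stand-ins so the example runs without the model or an OCR engine, and sending `uncertain` images to manual review is only one possible policy.

```python
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def gate(
    path: str,
    predict: Callable[[str], Prediction],
    downstream: Callable[[str], str],
) -> Dict[str, object]:
    """Run `downstream` only when the blur gate says the image is sharp."""
    label, confidence = predict(path)
    if label == "sharp":
        return {"status": "processed", "confidence": confidence,
                "result": downstream(path)}
    if label == "blurred":
        return {"status": "rejected", "confidence": confidence, "result": None}
    # 'uncertain': the right policy is application-specific; here we flag it.
    return {"status": "needs_review", "confidence": confidence, "result": None}


# Stand-ins with made-up outputs, used in place of the real predictor and OCR.
def fake_predict(path: str) -> Prediction:
    table = {"sharp.jpg": ("sharp", 0.97), "shaky.jpg": ("blurred", 0.91)}
    return table.get(path, ("uncertain", 0.55))


def fake_ocr(path: str) -> str:
    return f"text extracted from {path}"


for name in ("sharp.jpg", "shaky.jpg", "dim.jpg"):
    print(name, gate(name, fake_predict, fake_ocr)["status"])
```

In a real pipeline, `predict` would wrap the predictor from the usage section and `downstream` would be the OCR, VLM, or paid API call being protected.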
|
|
| ## Limitations |
|
|
| - Trained on GoPro motion blur. Retraining or fine-tuning on in-domain data is recommended for defocus blur, low-light, scanner skew, or compression artifacts. |
| - The threshold (0.60) is a product-level knob: sweep it on a small hand-labeled slice of your own traffic to set the precision/recall trade-off. |
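A threshold sweep of the kind suggested above might look like the sketch below. The samples are made-up numbers standing in for a hand-labeled slice, `sweep` is a hypothetical helper, and the choice to report coverage (the fraction of images not routed to `uncertain`) alongside precision and recall for the `blurred` class is an assumption about what is useful to track.

```python
from typing import List, Tuple

# (p_blurred, true_label) pairs; made-up values, not real model outputs.
SAMPLES: List[Tuple[float, str]] = [
    (0.98, "blurred"), (0.91, "blurred"), (0.72, "blurred"), (0.58, "blurred"),
    (0.66, "sharp"), (0.45, "sharp"), (0.12, "sharp"), (0.03, "sharp"),
]


def sweep(samples, thresholds):
    """For each threshold, return (t, coverage, precision, recall) for 'blurred'."""
    n_blurred = sum(truth == "blurred" for _, truth in samples)
    rows = []
    for t in thresholds:
        tp = fp = decided = 0
        for p_blur, truth in samples:
            confidence = max(p_blur, 1.0 - p_blur)  # max softmax for 2 classes
            if confidence < t:
                continue  # routed to 'uncertain', so no decision is made
            decided += 1
            if p_blur >= 0.5:  # predicted 'blurred'
                tp += truth == "blurred"
                fp += truth == "sharp"
        precision = tp / (tp + fp) if tp + fp else float("nan")
        rows.append((t, decided / len(samples), precision, tp / n_blurred))
    return rows


for t, coverage, precision, recall in sweep(SAMPLES, [0.60, 0.70, 0.80, 0.90]):
    print(f"t={t:.2f} coverage={coverage:.2f} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold typically trades coverage and recall for precision, so the right setting depends on how costly a missed blurred image is compared with an unnecessary retake request.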
|
|
| ## Citation |
|
|
| If you use this work in research or production, please cite: |
|
|
| > Duy, Tran Thanh (2026). *Edges Before Embeddings: A Confidence-Aware Blur Gate for Vision-Language Pipelines.* Zenodo. https://doi.org/10.5281/zenodo.19765336 |
|
|
| BibTeX: |
|
|
| ```bibtex |
| @misc{duy2026edges, |
| author = {Duy, Tran Thanh}, |
| title = {Edges Before Embeddings: A Confidence-Aware Blur Gate for Vision-Language Pipelines}, |
| year = {2026}, |
| publisher = {Zenodo}, |
| doi = {10.5281/zenodo.19765336}, |
| url = {https://doi.org/10.5281/zenodo.19765336} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT — see [LICENSE](https://github.com/bradduy/MagikaDocumentFromPixel/blob/main/LICENSE). Copyright © 2026 Duy Tran Thanh (Brad Duy). |
|
|
| ## Author |
|
|
| **Duy Tran Thanh (Brad Duy)** — Sr. Applied AI Engineer |
|
|
| - GitHub: [@bradduy](https://github.com/bradduy) |
| - Hugging Face: [@bradduy](https://huggingface.co/bradduy) |
|
|