	---
	license: mit
	library_name: pytorch
	pipeline_tag: image-classification
	tags:
	- blur-detection
	- image-quality
	- mobilenet
	- magika
	- laplacian
	metrics:
	- f1
	- accuracy
	- precision
	- recall
	- roc_auc
	model-index:
	- name: MagikaDocumentFromPixel — Lightweight Blur Detector
	  results:
	  - task:
	      type: image-classification
	      name: Blur Detection (sharp / blurred)
	    dataset:
	      type: gopro-large
	      name: GoPro Large (test split)
	    metrics:
	    - type: f1
	      value: 0.9803
	    - type: accuracy
	      value: 0.9806
	    - type: precision
	      value: 0.9981
	    - type: recall
	      value: 0.9631
	    - type: roc_auc
	      value: 0.9989
	---

	# MagikaDocumentFromPixel — Lightweight Blur Detector

	A Magika-inspired image quality gate that classifies images as `sharp`, `blurred`, or `uncertain` in roughly 17 ms per image on CPU (single-scale). Built to sit at the front of vision pipelines so expensive downstream models (OCR, detection, classification, VLMs) never waste compute on unusable input.

	GitHub repo (training code, Dockerfile, full README): [bradduy/MagikaDocumentFromPixel](https://github.com/bradduy/MagikaDocumentFromPixel)

	## Results on the GoPro Large test split

	| Metric | Value |
	|---|---|
	| F1 | 0.9803 |
	| Accuracy | 0.9806 |
	| Precision | 0.9981 |
	| Recall | 0.9631 |
	| AUC | 0.9989 |
	| Model size | 17 MB |
	| Inference latency | ~17 ms / image (CPU, single-scale) |

	## Recipe

	- Backbone: MobileNetV3-Large, ImageNet-pretrained, 2-class softmax head (~3.3M parameters).
	- Frequency-domain auxiliary channel (Freq-Aux): a per-image-standardized Laplacian magnitude map is concatenated to the RGB tensor as a 4th input channel. The first conv is expanded from 3→4 channels (pretrained RGB weights preserved; the new slice is initialized from the mean of the RGB kernels). The Laplacian gives the network an explicit, scale-invariant edge-energy cue.
	- Training: 384×384 input, AdamW lr=1e-4, CosineAnnealing, CrossEntropy, 25 epochs, medium augmentation, mixed-precision, GoPro Large with `blur_gamma` extra positives.
	- Inference: multi-scale test-time augmentation (TTA) over five input sizes: 256, 320, 384, 448, and 512 px.
	- Routing: return the predicted class (`sharp` or `blurred`) when the maximum softmax probability is ≥ 0.60; otherwise return `uncertain`.
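
The Freq-Aux input and the 3→4 channel expansion of the first conv can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's code: the 4-neighbour Laplacian kernel, the channel-mean grayscale conversion, the edge padding, and the helper names `laplacian_magnitude`, `add_freq_aux_channel`, and `expand_first_conv` are all hypothetical choices made for the example.

```python
import numpy as np

# 3x3 discrete Laplacian (4-neighbour variant). The kernel actually used by
# the author is not stated in this card, so this choice is an assumption.
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float32)


def laplacian_magnitude(gray: np.ndarray) -> np.ndarray:
    """Absolute Laplacian response of an (H, W) image, same shape as the input."""
    padded = np.pad(gray.astype(np.float32), 1, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float32)
    # Correlate with the 3x3 kernel by summing shifted, weighted views.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.abs(out)


def add_freq_aux_channel(rgb: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Turn a (3, H, W) float RGB array into (4, H, W) with an edge-energy channel."""
    gray = rgb.mean(axis=0)  # simple luminance proxy (assumption)
    lap = laplacian_magnitude(gray)
    lap = (lap - lap.mean()) / (lap.std() + eps)  # per-image standardization
    return np.concatenate([rgb, lap[None]], axis=0)


def expand_first_conv(weight_rgb: np.ndarray) -> np.ndarray:
    """Expand (out, 3, k, k) conv weights to (out, 4, k, k).

    The RGB slices are kept unchanged; the new slice is the mean of the
    RGB kernels, as described in the recipe.
    """
    extra = weight_rgb.mean(axis=1, keepdims=True)
    return np.concatenate([weight_rgb, extra], axis=1)
```

Initializing the new slice from the RGB mean keeps the first-layer activations on a familiar scale at the start of fine-tuning, which is presumably why the recipe prefers it to a random or zero initialization.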

	## Files

	- `best.pt` — PyTorch state dict for the `FreqAuxModel(MobileNetV3-Large)` 4-channel-input model.

	## Usage

	Clone the GitHub repo for the inference scripts, then load this checkpoint.

	```bash
	git clone https://github.com/bradduy/MagikaDocumentFromPixel.git
	cd MagikaDocumentFromPixel
	pip install -r blur_detector/requirements.txt

	# Download this checkpoint
	pip install huggingface_hub
	python -c "from huggingface_hub import hf_hub_download; \
	hf_hub_download('bradduy/MagikaDocumentFromPixel', 'best.pt', \
	local_dir='blur_detector/outputs/checkpoints/champion')"

	# Run inference
	python blur_detector/scripts/predict.py \
	--checkpoint blur_detector/outputs/checkpoints/champion/best.pt --freq_aux \
	path/to/image.jpg
	```

	Or in Python:

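	```python
	import torch

	from blur_detector.src.models.blur_detector import build_model
	from blur_detector.src.datasets.freq_aux import FreqAuxModel
	from blur_detector.src.inference.predictor import BlurPredictor

	# Build the 4-channel (RGB + Laplacian) MobileNetV3-Large and wrap it.
	backbone = build_model("mobilenet_v3_large", pretrained=False, in_channels=4)
	model = FreqAuxModel(backbone)

	# map_location lets the checkpoint load on CPU-only machines.
	model.load_state_dict(torch.load("best.pt", map_location="cpu"))
	model.eval()  # inference mode for BatchNorm and dropout

	# Passing five sizes enables the multi-scale TTA from the recipe.
	predictor = BlurPredictor(model, image_size=[256, 320, 384, 448, 512])
	pred = predictor.predict("receipt.jpg")
	print(pred.label, pred.confidence)
	```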
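
For readers who want to see what multi-scale TTA followed by confidence routing amounts to, here is a minimal standalone sketch. It assumes the per-scale predictions are combined by averaging softmax probabilities and that class index 0 is `sharp`; neither detail is confirmed by this card, and `BlurPredictor` may aggregate differently. The function names `softmax` and `route` are hypothetical.

```python
import numpy as np

LABELS = ("sharp", "blurred")  # class order is an assumption
THRESHOLD = 0.60               # routing threshold from the recipe


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def route(per_scale_logits, threshold: float = THRESHOLD):
    """Average softmax probabilities over scales, then apply the confidence gate.

    per_scale_logits: sequence of [logit_sharp, logit_blurred] pairs, one per
    input size (for example 256, 320, 384, 448, 512).
    """
    probs = softmax(np.asarray(per_scale_logits, dtype=np.float64)).mean(axis=0)
    best = int(probs.argmax())
    confidence = float(probs[best])
    label = LABELS[best] if confidence >= threshold else "uncertain"
    return label, confidence
```

When the scales disagree, the averaged probability drifts toward 0.5 and the image is routed to `uncertain` instead of being forced into a class.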

	## Intended use

	- Pre-check before OCR / VLM / paid vision API calls.
	- Upload-time quality filter ("please retake the photo").
	- Dataset curation for ML programs.
	- Edge / on-device inference (single-scale 384px → ONNX → mobile/browser).

	## Limitations

	- Trained on GoPro motion blur. Domain-shift retraining is recommended for defocus blur, low-light, scanner skew, or compression artifacts.
	- Threshold (0.60) is a product-level knob — sweep on a small hand-labeled slice of your traffic to set the precision/recall trade-off.
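
A threshold sweep on a hand-labeled slice could look like the sketch below. It assumes you have already collected, for each image, the model's probability for the `blurred` class and a ground-truth flag; the function name `sweep_thresholds` and the choice to count a blurred image routed to `uncertain` as a miss are assumptions made for the example, not part of the repository.

```python
def sweep_thresholds(records, thresholds=(0.50, 0.60, 0.70, 0.80, 0.90)):
    """Summarize the precision/recall/coverage trade-off per threshold.

    records: iterable of (p_blurred, is_blurred) pairs from a labeled slice.
    Returns one dict per threshold. Coverage is the fraction of images that
    receive a definite label; precision and recall refer to `blurred`.
    """
    records = list(records)
    rows = []
    for t in thresholds:
        tp = fp = fn = decided = 0
        for p_blurred, is_blurred in records:
            # Max softmax of a 2-class model.
            confidence = max(p_blurred, 1.0 - p_blurred)
            if confidence < t:
                # Routed to `uncertain`: a blurred image here counts as a miss.
                fn += int(is_blurred)
                continue
            decided += 1
            predicted_blurred = p_blurred >= 0.5
            tp += int(predicted_blurred and is_blurred)
            fp += int(predicted_blurred and not is_blurred)
            fn += int(not predicted_blurred and is_blurred)
        rows.append({
            "threshold": t,
            "coverage": decided / len(records) if records else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return rows
```

Raising the threshold typically trades coverage and recall for precision, so pick the row whose balance matches what an `uncertain` result costs in your product.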

	## Citation

	If you use this work in research or production, please cite:

	> Duy, Tran Thanh (2026). Edges Before Embeddings: A Confidence-Aware Blur Gate for Vision-Language Pipelines. Zenodo. https://doi.org/10.5281/zenodo.19765336

	BibTeX:

	```bibtex
	@misc{duy2026edges,
	author = {Duy, Tran Thanh},
	title = {Edges Before Embeddings: A Confidence-Aware Blur Gate for Vision-Language Pipelines},
	year = {2026},
	publisher = {Zenodo},
	doi = {10.5281/zenodo.19765336},
	url = {https://doi.org/10.5281/zenodo.19765336}
	}
	```

	## License

	MIT — see [LICENSE](https://github.com/bradduy/MagikaDocumentFromPixel/blob/main/LICENSE). Copyright © 2026 Duy Tran Thanh (Brad Duy).

	## Author

	Duy Tran Thanh (Brad Duy) — Sr. Applied AI Engineer

	- GitHub: [@bradduy](https://github.com/bradduy)
	- Hugging Face: [@bradduy](https://huggingface.co/bradduy)