---
license: apache-2.0
language:
- it
- en
pipeline_tag: text-classification
---

# Sentinella: Lightweight Content Safety Guardian

## Model Overview
Sentinella is a compact content safety classifier designed for moderating Italian-language text. It is intended as an efficient first line of defense against harmful content.

### Key Metrics
- **Size**: 32M parameters
- **Accuracy**: 93% on the test set
- **Max Input Length**: 8,192 tokens
- **Training Data**: More than 100,000 examples, balanced between harmful and safe

## Technical Specifications
### Base Architecture
- **Base Model**: jinaai/jina-embeddings-v2-small-en
- **Model Adaptation**:
  - Custom two-layer classifier head
  - Dropout rate of 0.1 for regularization
  - CLS token pooling for the sequence representation
  - Trained with cross-entropy loss for binary classification
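
The adaptation listed above can be pictured with a minimal, dependency-free sketch. It is not the author's implementation: the hidden width, the placement of ReLU and dropout between the two linear layers, and every name used here (`classifier_head`, `params`, `w1`, `w2`, and so on) are assumptions made for illustration. Only the two-layer shape, the 0.1 dropout rate, CLS pooling, and the cross-entropy loss come from this card.

```python
import math
import random


def linear(weights, bias, vector):
    """Dense layer: one row of `weights` per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def dropout(vector, rate, training, rng):
    """Inverted dropout; a no-op at inference time."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def classifier_head(token_embeddings, params, training=False, rng=None):
    """Map encoder outputs (one vector per token) to two logits."""
    cls_vector = token_embeddings[0]  # CLS pooling: use the first token only
    hidden = relu(linear(params["w1"], params["b1"], cls_vector))
    hidden = dropout(hidden, 0.1, training, rng or random.Random())
    return linear(params["w2"], params["b2"], hidden)  # [harmful, safe]


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


# Toy parameters (hypothetical): 2-dim embeddings and 2 hidden units.
toy_params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [-1.0, 0.0]], "b2": [0.0, 0.0],
}
toy_logits = classifier_head([[1.0, -2.0], [9.0, 9.0]], toy_params)
toy_loss = cross_entropy(toy_logits, target=0)
```

In a real setup the same structure would be expressed in a tensor framework on top of the encoder's output, with the first layer narrower than the embedding width to provide the dimensionality reduction mentioned elsewhere in this card.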

### Classification Details
- **Output Labels**:
  - NEGATIVE (0): Harmful content
  - POSITIVE (1): Safe content
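
Note that `POSITIVE` means safe here, so it can help to make the mapping explicit when post-processing outputs. The sketch below assumes the model emits two raw logits ordered by label index; `ID2LABEL`, `interpret`, and the `harmful_threshold` parameter are illustrative names and are not part of the released model.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # 0 = harmful, 1 = safe


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits, harmful_threshold=0.5):
    """Return (label, harmful_probability, is_harmful) for one input."""
    harmful_prob = softmax(logits)[0]
    is_harmful = harmful_prob >= harmful_threshold
    label = ID2LABEL[0] if is_harmful else ID2LABEL[1]
    return label, harmful_prob, is_harmful
```

Lowering `harmful_threshold` flags more text as harmful, which trades extra false alarms for fewer missed cases; the right value depends on the application.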

## Key Features
- **Lightweight**: At 32M parameters, Sentinella is designed for efficient deployment
- **Long Context**: Handles inputs of up to 8,192 tokens
- **High Performance**: 93% accuracy on the test set
- **Optimized Architecture**: Custom classification head with dimensionality reduction for improved efficiency
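
This card does not say how to handle inputs longer than the maximum length. One common approach, shown here purely as an assumption, is to split the token sequence into overlapping windows and classify each window separately. The function name `split_into_windows` and the default overlap of 256 tokens are hypothetical, and if your tokenizer adds special tokens you may need a smaller `max_tokens` to leave room for them.

```python
MAX_TOKENS = 8192  # maximum input length stated in this card


def split_into_windows(token_ids, max_tokens=MAX_TOKENS, overlap=256):
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    if len(token_ids) <= max_tokens:
        return [list(token_ids)]
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_tokens]))
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows
```

A conservative way to combine the per-window results is to treat the whole text as harmful if any single window is classified as harmful.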

## Use Cases
- Content moderation for Italian text
- Safe content filtering
- Automated content screening
- Real-time text analysis

## Training Details
- **Training Dataset**: More than 100,000 examples
  - Balanced distribution of safe and harmful content
  - Focused on Italian-language text
- **Training Strategy**:
  - Fine-tuned embedding representation
  - Dimensionality reduction in the intermediate layer
  - ReLU activation for non-linearity
  - Dropout for regularization

## Performance Considerations
- Optimized for real-time classification
- Low memory footprint
- Efficient inference time
- Suitable for both CPU and GPU deployment
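
These points are qualitative, and actual latency depends on hardware, batch size, and input length. A small harness like the following can measure it for a given deployment. `classify` is a hypothetical callable wrapping whatever inference code you use; nothing in this sketch is specific to Sentinella.

```python
import statistics
import time


def measure_latency(classify, texts, warmup=3, repeats=20):
    """Return median and worst per-call latency in milliseconds."""
    if not texts:
        raise ValueError("texts must not be empty")
    for text in texts[:warmup]:
        classify(text)  # warm caches before timing
    timings_ms = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            classify(text)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
        "calls": len(timings_ms),
    }
```

Running it with texts of realistic length on the target CPU or GPU gives a more useful picture than a single timing.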

## Citation
If you use Sentinella in your research or application, please cite this work as:
```
@misc{sentinella,
  title={Sentinella: Lightweight Italian Content Safety Classifier},
  year={2024},
  publisher={Michele Montebovi},
  note={32M parameter content safety model}
}
```