metadata
license: apache-2.0
language:
  - it
  - en
pipeline_tag: text-classification

πŸ›‘οΈ Sentinella: Lightweight Content Safety Guardian

🎯 Model Overview

Sentinella is a compact content safety classifier designed for Italian-language moderation. It is intended as an efficient first line of defense against harmful content.

📊 Key Metrics

  • Size: 32M parameters
  • Accuracy: 93% on test set
  • Max Input Length: 8,192 tokens
  • Training Data: more than 100,000 balanced examples (harmful/safe)

🔧 Technical Specifications

Base Architecture

  • Base Model: jinaai/jina-embeddings-v2-small-en
  • Model Adaptation:
    • Custom two-layer classifier head on top of the encoder
    • Dropout rate of 0.1 for regularization
    • CLS token pooling for the sequence representation
    • Trained with cross-entropy loss for binary classification
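
The adaptation listed above can be pictured as a small forward pass. What follows is a minimal sketch in plain Python, not the released implementation: the layer sizes (`EMBED_DIM`, `HIDDEN_DIM`), the random weights, and the function names are all hypothetical, and the position of the dropout layer is an assumption since the card does not state it.

```python
import math
import random

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])

def classify(token_embeddings, w1, b1, w2, b2):
    """Two-layer head over the CLS (first-token) embedding; returns 2 logits."""
    cls_vector = token_embeddings[0]           # CLS pooling
    hidden = relu(linear(cls_vector, w1, b1))  # layer 1 + ReLU
    # Assumption: dropout (p=0.1) would sit here in training; it is a no-op at inference.
    return linear(hidden, w2, b2)              # layer 2 -> logits for labels 0 and 1

# Toy sizes and random weights, purely illustrative (not the real dimensions).
rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_LABELS = 8, 4, 2
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(HIDDEN_DIM)]
b1 = [0.0] * HIDDEN_DIM
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)] for _ in range(NUM_LABELS)]
b2 = [0.0] * NUM_LABELS
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(5)]

logits = classify(tokens, w1, b1, w2, b2)
probs = softmax(logits)
loss = cross_entropy(logits, target=1)  # loss if the true label were POSITIVE (1)
```

Because only the first token's embedding reaches the head, the encoder has to pack the whole sequence's meaning into that one vector during fine-tuning.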

Classification Details

  • Output Labels:
    • NEGATIVE (0): Harmful content
    • POSITIVE (1): Safe content
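
Note that the label polarity is the reverse of what "positive" means in many detection setups: here NEGATIVE marks harmful text. The sketch below shows one way to turn a prediction into a moderation decision. It assumes the prediction is a dict with `label` and `score` keys, the shape returned by a Hugging Face text-classification pipeline; the helper names and the `threshold` parameter are hypothetical.

```python
# Label mapping as stated in this card.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}
HARMFUL_LABEL = "NEGATIVE"

def harmful_probability(prediction):
    """Return the probability of the harmful class from a top-class prediction.

    `prediction` is assumed to look like {"label": "NEGATIVE", "score": 0.97},
    where `score` is the probability of the reported label.
    """
    label, score = prediction["label"], prediction["score"]
    if label not in ID2LABEL.values():
        raise ValueError(f"unexpected label: {label!r}")
    # With two classes, the other class's probability is 1 - score.
    return score if label == HARMFUL_LABEL else 1.0 - score

def should_block(prediction, threshold=0.5):
    """Hypothetical policy: block when the harmful probability reaches the threshold."""
    return harmful_probability(prediction) >= threshold
```

Lowering `threshold` makes the filter stricter, at the cost of more false positives on safe text.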

💫 Key Features

  • Lightweight: At just 32M parameters, Sentinella is designed for efficiency
  • Long Context: Handles up to 8,192 tokens of input text
  • High Performance: 93% accuracy on the test set
  • Optimized Architecture: Custom classification head with dimensionality reduction for improved efficiency
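
Text longer than the 8,192-token limit has to be truncated or split before classification. A minimal sketch of the splitting option follows, assuming the text has already been tokenized into a list of ids. The `overlap` value and the function name are hypothetical, and any special tokens the tokenizer adds would also count against the limit.

```python
MAX_TOKENS = 8192  # maximum input length stated in this card

def split_into_windows(token_ids, max_tokens=MAX_TOKENS, overlap=128):
    """Split a token id list into windows of at most `max_tokens` ids.

    Consecutive windows share `overlap` ids so that content near a boundary
    is seen with some context on both sides.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    if not token_ids:
        return []
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows
```

One conservative way to combine the per-window results is to treat the document as harmful if any window is classified as harmful.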

🚀 Use Cases

  • Content moderation for Italian text
  • Safe content filtering
  • Automated content screening
  • Real-time text analysis

🎓 Training Details

  • Training Dataset: more than 100,000 examples
    • Balanced distribution of safe and harmful content
    • Focused on Italian language text
  • Training Strategy:
    • Fine-tuned embedding representation
    • Intermediate layer dimensionality reduction
    • ReLU activation for non-linearity
    • Optimized dropout for regularization
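
For readers unfamiliar with the dropout step in the strategy above, here is a minimal sketch of inverted dropout, the standard formulation used by most deep learning frameworks. The rate of 0.1 comes from this card; the function itself is illustrative and not taken from the training code.

```python
import random

def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest.

    Scaling the kept values by 1 / (1 - p) keeps the expected activation
    unchanged, so nothing needs to be adjusted at inference time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)  # identity at inference
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]
```

During training, roughly 10% of the activations are dropped on each step; at inference the layer passes its input through unchanged.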

📈 Performance Considerations

  • Optimized for real-time classification
  • Low memory footprint
  • Efficient inference time
  • Suitable for both CPU and GPU deployment
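
As a rough back-of-the-envelope check on the memory footprint, the size of the weights can be estimated from the parameter count. The sketch below uses the rounded 32M figure from this card and covers the weights only; activations and runtime overhead are excluded, and the card does not say which precisions the checkpoint is distributed or served in.

```python
PARAMS = 32_000_000  # rounded parameter count from this card
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_megabytes(num_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_megabytes(PARAMS, dtype):.0f} MB")
```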

πŸ“ Citation

If you use Sentinella in your research or application, please cite this work as:

@model{sentinella,
  title={Sentinella: Lightweight Italian Content Safety Classifier},
  year={2024},
  publisher={Michele Montebovi},
  note={32M parameter content safety model}
}