
5CD-AI/visocial-T5-base

Overview

We trimmed the vocabulary to 50,589 tokens and continued pretraining google/mt5-base[1] on a merged 20GB dataset. The training data includes:

  • Crawled data (100M comments and 15M posts on Facebook)
  • UIT data[2], which was used to pretrain uitnlp/visobert[2]
  • MC4 ecommerce
  • 10.7M comments on VOZ Forum from tarudesu/VOZ-HSD[7]
  • 3.6M reviews from Amazon[3] translated into Vietnamese from 5CD-AI/Vietnamese-amazon_polarity-gg-translated
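
A minimal sketch of loading the checkpoint with the transformers library. It assumes the repository loads through the generic `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes, as mT5-based checkpoints normally do; the helper name `load_visocial_t5` is illustrative and not part of this model card.

```python
MODEL_ID = "5CD-AI/visocial-T5-base"


def load_visocial_t5(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for the checkpoint; downloads weights on first call."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


# Usage (requires network access and the transformers package; mT5-style
# tokenizers typically also need the sentencepiece package):
#   tokenizer, model = load_visocial_t5()
#   print(len(tokenizer))  # vocabulary size as seen by the tokenizer
```

Since this is a pretrained checkpoint rather than a task-specific one, it is meant to be fine-tuned on a downstream task before use.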

Results on three downstream tasks on Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS). Acc is accuracy, WF1 is weighted F1, and MF1 is macro F1:

                                      Hate Speech Detection   Toxic Speech Detection  Hate Spans Detection
Model                    Average MF1  Acc    WF1    MF1       Acc    WF1    MF1       Acc    WF1    MF1
PhoBERT[4]               69.63        86.75  86.52  64.76     90.78  90.27  71.31     84.65  81.12  72.81
PhoBERT_v2[4]            70.50        87.42  87.33  66.60     90.23  89.78  71.39     84.92  81.51  73.51
viBERT[5]                67.80        86.33  85.79  62.85     88.81  88.17  67.65     84.63  81.28  72.91
ViSoBERT[6]              75.07        88.17  87.86  67.71     90.35  90.16  71.45     90.16  90.07  86.04
ViHateT5[7]              75.56        88.76  89.14  68.67     90.80  91.78  71.63     91.00  90.20  86.37
visocial-T5-base (Ours)  78.01        89.51  89.78  71.19     92.20  93.47  73.81     92.57  92.20  89.04
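
The Average MF1 column is consistent with the unweighted mean of the three per-task MF1 scores. The short check below recomputes it from the table; the dictionary layout and the function name `average_mf1` are illustrative only.

```python
# Per-task macro F1 (hate speech, toxic speech, hate spans), copied from the table above.
TASK_MF1 = {
    "PhoBERT": (64.76, 71.31, 72.81),
    "PhoBERT_v2": (66.60, 71.39, 73.51),
    "viBERT": (62.85, 67.65, 72.91),
    "ViSoBERT": (67.71, 71.45, 86.04),
    "ViHateT5": (68.67, 71.63, 86.37),
    "visocial-T5-base": (71.19, 73.81, 89.04),
}

# Average MF1 as reported in the same table.
REPORTED_AVG = {
    "PhoBERT": 69.63,
    "PhoBERT_v2": 70.50,
    "viBERT": 67.80,
    "ViSoBERT": 75.07,
    "ViHateT5": 75.56,
    "visocial-T5-base": 78.01,
}


def average_mf1(scores):
    """Unweighted mean of the per-task macro F1 scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for name, scores in TASK_MF1.items():
    print(f"{name}: computed {average_mf1(scores):.2f}, reported {REPORTED_AVG[name]:.2f}")
```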

visocial-T5-base compared with other T5-based models on Vietnamese hate speech detection (HSD) related tasks, by macro F1-score:

                         Macro F1 (MF1)
Model                    Hate Speech Detection  Toxic Speech Detection  Hate Spans Detection
mT5[1]                   66.76                  69.93                   86.60
ViT5[8]                  66.95                  64.82                   86.90
ViHateT5[7]              68.67                  71.63                   86.37
visocial-T5-base (Ours)  71.19                  73.81                   89.04
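
For reference, macro F1 (MF1) averages the per-class F1 scores with equal weight, whereas weighted F1 (WF1) weights each class by its support, so MF1 is more sensitive to performance on minority classes. The pure-Python sketch below illustrates the difference on made-up labels; it is not the evaluation code or data behind the tables, and the function names are illustrative.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Equal-weight mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy, imbalanced example: three "clean" (0) comments and one "hate" (1) comment.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(round(macro_f1(y_true, y_pred), 4))     # 0.7333
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```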

Fine-tune Configuration

We fine-tuned 5CD-AI/visocial-T5-base on the three downstream tasks using the transformers library with the following configuration:

  • seed: 42
  • training_epochs: 4
  • train_batch_size: 4
  • gradient_accumulation_steps: 8
  • learning_rate: 3e-4
  • lr_scheduler_type: linear
  • model_max_length: 256
  • metric_for_best_model: eval_loss
  • evaluation_strategy: steps
  • eval_steps: 0.1
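
As a rough sketch, the settings above could be collected into keyword arguments for `Seq2SeqTrainingArguments`. The mapping is an assumption: `training_epochs` and `train_batch_size` are read here as `num_train_epochs` and `per_device_train_batch_size`, and `model_max_length` is treated as a tokenizer setting rather than a trainer argument. With a batch size of 4 and 8 gradient accumulation steps, each optimizer step covers 4 × 8 = 32 examples per device.

```python
# Values copied from the configuration list. Key names follow the transformers
# TrainingArguments API and are an assumed mapping, not the authors' script.
TRAINING_KWARGS = {
    "seed": 42,
    "num_train_epochs": 4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 3e-4,
    "lr_scheduler_type": "linear",
    "metric_for_best_model": "eval_loss",
    # Called `evaluation_strategy` (as in the list above) in older transformers releases.
    "eval_strategy": "steps",
    # A float below 1 is interpreted as a fraction of the total training steps.
    "eval_steps": 0.1,
}

# Applied on the tokenizer side, e.g. as max_length with truncation when encoding inputs.
MODEL_MAX_LENGTH = 256


def effective_batch_size(kwargs, num_devices=1):
    """Number of examples contributing to one optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(TRAINING_KWARGS))  # 32

# Usage (requires transformers; "out" is a placeholder output directory):
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(output_dir="out", **TRAINING_KWARGS)
```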

References

[1] mT5: A massively multilingual pre-trained text-to-text transformer

[2] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[3] The Amazon Polarity dataset

[4] PhoBERT: Pre-trained language models for Vietnamese

[5] Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models

[6] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[7] ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model

[8] ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation

Model size: 276M params (Safetensors, tensor type BF16)