punct_restore_fr

This model is a fine-tuned version of camembert-base on a raw French OpenSubtitles dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0301
  • Precision: 0.9601
  • Recall: 0.9527
  • F1: 0.9564
  • Accuracy: 0.9915
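
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values; it only reproduces the arithmetic and does not show how the evaluation itself was run.

```python
# Reported evaluation metrics, copied from the list above.
precision = 0.9601
recall = 0.9527

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")
```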

Model description

Classifies each token as either the beginning of a French sentence (B-SENT) or anything else (O).

Intended uses & limitations

This model is intended to support punctuation restoration on auto-generated French YouTube subtitles. Once sentence boundaries are restored, corpus-level measures such as words per sentence or grammatical structures per sentence become possible.
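
A minimal sketch of how predicted tags could be turned back into sentences and a words-per-sentence figure. It assumes the word-level tags are already available (here they are written by hand, not produced by the model), and the helper names `split_sentences` and `render` are hypothetical. The model only marks sentence starts, so closing every sentence with a period is an assumption of this sketch.

```python
from typing import List


def split_sentences(words: List[str], tags: List[str]) -> List[List[str]]:
    """Start a new sentence at every B-SENT tag (and at the very first word)."""
    sentences: List[List[str]] = []
    for word, tag in zip(words, tags):
        if tag == "B-SENT" or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


def render(sentence: List[str]) -> str:
    """Capitalize the first letter and close with a period (an assumption)."""
    text = " ".join(sentence)
    return text[:1].upper() + text[1:] + "."


# Hand-written example tags, standing in for model predictions.
words = "bonjour tout le monde je m'appelle marie et j'habite à lyon".split()
tags = ["B-SENT", "O", "O", "O", "B-SENT", "O", "O", "O", "O", "O", "O"]

sentences = split_sentences(words, tags)
restored = " ".join(render(s) for s in sentences)
mean_words = sum(len(s) for s in sentences) / len(sentences)

print(restored)
print(mean_words)
```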

Training and evaluation data

1 million French OpenSubtitles sentences, split 80%/10%/10% into training/validation/test sets.

The sentences:

  • were lower-cased
  • had end punctuation (.?!) removed
  • were of length between 7 and 70 words
  • had the first word of each sentence tagged with B-SENT.
    • All other words were tagged with O.
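
The preprocessing steps listed above can be sketched roughly as follows. This is an illustration, not the original preparation script: the function name `prepare` is hypothetical, the length bounds are assumed to be inclusive, and only trailing end punctuation is stripped.

```python
from typing import List, Optional, Tuple

END_PUNCT = ".?!"


def prepare(
    sentence: str, min_words: int = 7, max_words: int = 70
) -> Optional[List[Tuple[str, str]]]:
    """Lower-case, strip trailing end punctuation, filter by length, and tag."""
    # Also strip spaces so French spacing before "!" or "?" is handled.
    cleaned = sentence.lower().strip().rstrip(END_PUNCT + " ")
    words = cleaned.split()
    # Assumption: the 7 to 70 word bounds are inclusive.
    if not (min_words <= len(words) <= max_words):
        return None
    # First word gets B-SENT, every other word gets O.
    return [(w, "B-SENT" if i == 0 else "O") for i, w in enumerate(words)]


pairs = prepare("Je ne sais pas ce que tu veux dire !")
too_short = prepare("Bonjour.")

print(pairs)
print(too_short)
```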

Token/tag pairs were batched together in groups of 64. This exposes the model to B-SENT and O tags in a variety of positions and keeps a training example from consisting of a single sentence. Otherwise, the first word of every sequence, and only the first word, would be labeled B-SENT.
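
One possible reading of this grouping step, sketched below, is that the tagged sentences are concatenated into a single stream and cut into fixed-size windows of 64 token/tag pairs. That interpretation, and the helper name `chunk_pairs`, are assumptions; the demo uses a window of 4 so the effect is easy to see.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def chunk_pairs(sentences: List[List[Pair]], group_size: int = 64) -> List[List[Pair]]:
    """Concatenate tagged sentences and cut the stream into fixed-size groups."""
    stream = [pair for sentence in sentences for pair in sentence]
    return [stream[i:i + group_size] for i in range(0, len(stream), group_size)]


tagged = [
    [("a", "B-SENT"), ("b", "O"), ("c", "O")],
    [("d", "B-SENT"), ("e", "O"), ("f", "O")],
    [("g", "B-SENT"), ("h", "O")],
]

chunks = chunk_pairs(tagged, group_size=4)

# Positions of B-SENT inside each group: no longer always index 0.
b_sent_positions = [
    [i for i, (_, tag) in enumerate(chunk) if tag == "B-SENT"] for chunk in chunks
]
print(b_sent_positions)
```

Because sentence lengths vary, B-SENT lands at different offsets in each group, which is the variety the paragraph above describes.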

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 1
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
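
For readers who want to reproduce a similar setup, the hyperparameters above map onto keyword arguments of `transformers.TrainingArguments` roughly as sketched here. This is an assumption-laden sketch, not the original training script: it assumes a single device (so the reported batch sizes equal the per-device values), and the dictionary name `training_kwargs` and the output directory shown in the comment are hypothetical.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
# Assumption: one device, so train/eval batch sizes are per-device values.
training_kwargs = {
    "learning_rate": 5e-05,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 8,
    "seed": 1,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3,
}

# With transformers installed, these could be passed along as (not run here):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="punct_restore_fr_out", **training_kwargs)

for name, value in training_kwargs.items():
    print(f"{name}: {value}")
```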

Training results

Framework versions

  • Transformers 4.8.1
  • PyTorch 1.9.0+cu102
  • Datasets 1.8.0
  • Tokenizers 0.10.3