TriLingual-BERT-Distil / README.md

luanafelbarros

metadata

tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:3560698
  - loss:ModifiedMatryoshkaLoss
base_model: google-bert/bert-base-multilingual-cased
widget:
  - source_sentence: And then finally, turn it back to the real world.
    sentences:
      - Y luego, finalmente, devolver eso al mundo real.
      - Parece que el único rasgo que sobrevive a la decapitación es la vanidad.
      - y yo digo que no estoy seguro. Voy a pensarlo a groso modo.
  - source_sentence: Figure out some of the other options that are much better.
    sentences:
      - Piensen en otras de las opciones que son mucho mejores.
      - >-
        Éste solía ser un tema bipartidista, y sé que en este grupo realmente lo
        es.
      - >-
        El acuerdo general de paz para Sudán firmado en 2005 resultó ser menos
        amplio que lo previsto, y sus disposiciones aún podrían engendrar un
        retorno a gran escala de la guerra entre el norte y el sur.
  - source_sentence: >-
      The call to action I offer today -- my TED wish -- is this: Honor the
      treaties.
    sentences:
      - Esta es la intersección más directa, obvia, de las dos cosas.
      - >-
        El llamado a la acción que propongo hoy, mi TED Wish, es el siguiente:
        Honrar los tratados.
      - >-
        Los restaurantes del condado se pueden contar con los dedos de una
        mano... Barbacoa Bunn es mi favorito.
  - source_sentence: So for us, this was a graphic public campaign called Connect Bertie.
    sentences:
      - Para nosotros esto era una campaña gráfica llamada Conecta a Bertie.
      - >-
        En cambio, los líderes locales se comprometieron a revisarlos más
        adelante.
      - Con el tiempo, la gente hace lo que se le paga por hacer.
  - source_sentence: >-
      And in the audio world that's when the microphone gets too close to its
      sound source, and then it gets in this self-destructive loop that creates
      a very unpleasant sound.
    sentences:
      - Esta es una mina de Zimbabwe en este momento.
      - Estábamos en la I-40.
      - >-
        Y, en el mundo del audio, es cuando el micrófono se acerca demasiado a
        su fuente de sonido, y entra en este bucle autodestructivo que crea un
        sonido muy desagradable.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - negative_mse
model-index:
  - name: SentenceTransformer based on google-bert/bert-base-multilingual-cased
    results:
      - task:
          type: knowledge-distillation
          name: Knowledge Distillation
        dataset:
          name: MSE val en es
          type: MSE-val-en-es
        metrics:
          - type: negative_mse
            value: -29.5114666223526
            name: Negative Mse
      - task:
          type: knowledge-distillation
          name: Knowledge Distillation
        dataset:
          name: MSE val en pt
          type: MSE-val-en-pt
        metrics:
          - type: negative_mse
            value: -29.913604259490967
            name: Negative Mse
      - task:
          type: knowledge-distillation
          name: Knowledge Distillation
        dataset:
          name: MSE val en pt br
          type: MSE-val-en-pt-br
        metrics:
          - type: negative_mse
            value: -27.732226252555847
            name: Negative Mse

SentenceTransformer based on google-bert/bert-base-multilingual-cased

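This is a sentence-transformers model fine-tuned from google-bert/bert-base-multilingual-cased. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It was trained by knowledge distillation on pairs of English and non-English sentences and evaluated on en-es, en-pt, and en-pt-br validation sets.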

Model Details

Model Description

Model Type: Sentence Transformer
Base model: google-bert/bert-base-multilingual-cased
Maximum Sequence Length: 128 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
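The Pooling module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of its token embeddings with padding excluded. The following is a minimal NumPy sketch of that idea, not the library's implementation; the function name `mean_pool` and the toy shapes are illustrative assumptions.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) output of the Transformer module.
    attention_mask:   (batch, seq_len) with 1 for real tokens and 0 for padding.
    """
    # Broadcast the mask over the embedding dimension: (batch, seq_len, 1).
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    # Sum only the unmasked token vectors: (batch, dim).
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence, clipped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: 2 sentences, 4 token slots, 3 dimensions (the real model uses 768).
tokens = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

The first toy sentence has two real tokens, so only those two vectors contribute to its average.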

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("luanafelbarros/TriLingual-BERT-Distil")
# Run inference
sentences = [
    "And in the audio world that's when the microphone gets too close to its sound source, and then it gets in this self-destructive loop that creates a very unpleasant sound.",
    'Y, en el mundo del audio, es cuando el micrófono se acerca demasiado a su fuente de sonido, y entra en este bucle autodestructivo que crea un sonido muy desagradable.',
    'Esta es una mina de Zimbabwe en este momento.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
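The loss configuration under Training Details lists Matryoshka dimensions 768, 512, 256, 128, and 64, which suggests that a prefix of each embedding may still be usable on its own. This card reports no evaluation at reduced dimensions, so treat the sketch below as an assumption to verify on your own data. It uses a random array as a stand-in for the output of `model.encode`, and the helper name `truncate_and_normalize` is hypothetical.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random stand-in for the (3, 768) array returned by model.encode(sentences).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768)).astype(np.float32)

small = truncate_and_normalize(embeddings, 256)
# Rows are unit length, so the dot product equals cosine similarity.
cosine = small @ small.T
print(small.shape, cosine.shape)
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument when loading a model, which applies the truncation inside `encode`.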

Evaluation

Metrics

Knowledge Distillation

Datasets: MSE-val-en-es, MSE-val-en-pt and MSE-val-en-pt-br
Evaluated with MSEEvaluator

Metric	MSE-val-en-es	MSE-val-en-pt	MSE-val-en-pt-br
negative_mse	-29.5115	-29.9136	-27.7322
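`negative_mse` is the negated mean squared error between teacher and student embeddings, so values closer to zero are better. The sketch below shows the arithmetic on toy vectors. It assumes the evaluator's convention of scaling the MSE by 100 before negating, which is worth checking against the MSEEvaluator source for your installed version; the function and variable names are illustrative.

```python
import numpy as np

def negative_mse(teacher_embeddings: np.ndarray, student_embeddings: np.ndarray) -> float:
    """Negated mean squared error, scaled by 100 (assumed evaluator convention)."""
    mse = float(np.mean((teacher_embeddings - student_embeddings) ** 2))
    return -(mse * 100)

# Toy 2-dimensional embeddings for two sentences.
teacher = np.array([[0.0, 1.0], [1.0, 0.0]])
student = np.array([[0.0, 0.5], [1.0, 0.0]])

score = negative_mse(teacher, student)
print(score)  # -6.25
```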

Training Details

Training Dataset

Unnamed Dataset

Size: 3,560,698 training samples
Columns: english, non_english, and label
Approximate statistics based on the first 1000 samples:

	english	non_english	label
type	string	string	list
details	min: 4 tokens mean: 25.46 tokens max: 128 tokens	min: 4 tokens mean: 26.67 tokens max: 128 tokens	size: 768 elements

Samples:

english	non_english	label
`And then there are certain conceptual things that can also benefit from hand calculating, but I think they're relatively small in number.`	`Y luego hay ciertas aspectos conceptuales que pueden beneficiarse del cálculo a mano pero creo que son relativamente pocos.`	`[-0.04180986061692238, 0.12620249390602112, -0.14501447975635529, 0.09695684909820557, -0.10850819200277328, ...]`
`One thing I often ask about is ancient Greek and how this relates.`	`Algo que pregunto a menudo es sobre el griego antiguo y cómo se relaciona.`	`[0.0034368489868938923, -0.02741478756070137, -0.09426739811897278, 0.04873204976320267, -0.008266829885542393, ...]`
`See, the thing we're doing right now is we're forcing people to learn mathematics.`	`Vean, lo que estamos haciendo ahora es forzar a la gente a aprender matemáticas.`	`[-0.05048828944563866, 0.2713043689727783, 0.024581076577305794, -0.07316197454929352, -0.044288791716098785, ...]`

Loss: __main__.ModifiedMatryoshkaLoss with these parameters:

{
    "loss": "MSELoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
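The implementation of ModifiedMatryoshkaLoss is not included in this card. One plausible reading of the parameters above is that an MSE term is computed on each listed prefix length and the terms are combined with the given weights. The sketch below shows that reading in NumPy; it assumes the teacher label is truncated the same way as the student embedding, which the card does not confirm, and the name `matryoshka_mse` is hypothetical.

```python
import numpy as np

def matryoshka_mse(
    student: np.ndarray,
    teacher: np.ndarray,
    dims=(768, 512, 256, 128, 64),
    weights=(1, 1, 1, 1, 1),
) -> float:
    """Weighted sum of MSE terms over truncated prefixes (assumed formulation)."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        # Compare only the first `dim` components of both embeddings.
        diff = student[:, :dim] - teacher[:, :dim]
        total += weight * float(np.mean(diff ** 2))
    return total

# Toy batch: every component differs by exactly 1, so each term contributes 1.0.
student = np.zeros((2, 768))
teacher = np.ones((2, 768))
loss = matryoshka_mse(student, teacher)
print(loss)  # 5.0
```

In the stock MatryoshkaLoss, `n_dims_per_step: -1` means every listed dimension is used at each step, which matches the loop over all `dims` here.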

Evaluation Dataset

Unnamed Dataset

Size: 6,974 evaluation samples
Columns: english, non_english, and label
Approximate statistics based on the first 1000 samples:

	english	non_english	label
type	string	string	list
details	min: 4 tokens mean: 25.68 tokens max: 128 tokens	min: 4 tokens mean: 27.31 tokens max: 128 tokens	size: 768 elements

Samples:

english	non_english	label
`Thank you so much, Chris.`	`Muchas gracias Chris.`	`[-0.1432434469461441, -0.10335833579301834, -0.07549277693033218, -0.1542435735464096, 0.009247343055903912, ...]`
`And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.`	`Y es en verdad un gran honor tener la oportunidad de venir a este escenario por segunda vez. Estoy extremadamente agradecido.`	`[0.02740730345249176, -0.0601208470761776, -0.023767368867993355, 0.02245006151497364, 0.007412586361169815, ...]`
`I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.`	`He quedado conmovido por esta conferencia, y deseo agradecer a todos ustedes sus amables comentarios acerca de lo que tenía que decir la otra noche.`	`[-0.09117366373538971, 0.08627621084451675, -0.05912208557128906, -0.007647979073226452, 0.0008422975661233068, ...]`

Loss: __main__.ModifiedMatryoshkaLoss with these parameters:

{
    "loss": "MSELoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 200
per_device_eval_batch_size: 200
learning_rate: 2e-05
num_train_epochs: 2
warmup_ratio: 0.1
fp16: True
label_names: ['label']
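These settings determine the step counts that appear in the Epoch and Step columns of the training logs. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept (`dataloader_drop_last: False`); the variable names are illustrative.

```python
import math

train_samples = 3_560_698   # size of the training dataset
batch_size = 200            # per_device_train_batch_size
epochs = 2                  # num_train_epochs
warmup_ratio = 0.1

# The last partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
# Number of steps over which the learning rate ramps up to 2e-05.
warmup_steps = math.ceil(total_steps * warmup_ratio)

def epoch_at(step: int) -> float:
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch

print(steps_per_epoch, total_steps, warmup_steps)  # 17804 35608 3561
print(round(epoch_at(1000), 4))                    # 0.0562
```

Under these assumptions, step 1000 corresponds to epoch 0.0562 and step 35000 to epoch 1.9659.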

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 200
per_device_eval_batch_size: 200
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: ['label']
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	MSE-val-en-es_negative_mse	MSE-val-en-pt_negative_mse	MSE-val-en-pt-br_negative_mse
0.0562	1000	0.0626	0.0513	-21.2968	-20.7534	-24.2460
0.1123	2000	0.0478	0.0432	-22.1192	-21.8663	-23.2775
0.1685	3000	0.0423	0.0391	-21.6697	-21.5869	-21.6856
0.0562	1000	0.0396	0.0376	-21.7666	-21.7181	-21.6779
0.1123	2000	0.0381	0.0358	-23.4969	-23.5022	-22.9817
0.1685	3000	0.0362	0.0339	-24.7639	-24.8878	-23.8888
0.2247	4000	0.0347	0.0323	-26.5721	-26.7422	-25.4072
0.2808	5000	0.0332	0.0310	-27.6024	-27.8268	-26.4132
0.3370	6000	0.0321	0.0299	-27.7974	-28.0294	-26.6213
0.3932	7000	0.0312	0.0292	-28.2719	-28.4834	-27.0468
0.4493	8000	0.0305	0.0285	-28.2561	-28.5574	-26.8752
0.5055	9000	0.0299	0.0280	-28.6342	-28.9112	-27.2933
0.5617	10000	0.0294	0.0275	-28.5512	-28.8469	-27.1072
0.6178	11000	0.029	0.0271	-28.6788	-28.9608	-27.2056
0.6740	12000	0.0286	0.0267	-29.0159	-29.3281	-27.4770
0.7302	13000	0.0283	0.0264	-28.9224	-29.2461	-27.3500
0.7863	14000	0.028	0.0261	-29.1044	-29.4303	-27.4377
0.8425	15000	0.0277	0.0259	-29.2340	-29.5758	-27.6223
0.8987	16000	0.0275	0.0257	-29.1356	-29.4699	-27.4667
0.9548	17000	0.0273	0.0255	-29.3281	-29.6671	-27.7174
1.0110	18000	0.0271	0.0253	-29.2991	-29.6635	-27.6675
1.0672	19000	0.0268	0.0251	-29.3581	-29.7326	-27.6587
1.1233	20000	0.0266	0.0250	-29.4233	-29.7941	-27.7913
1.1795	21000	0.0265	0.0248	-29.3941	-29.7583	-27.6951
1.2357	22000	0.0264	0.0247	-29.5963	-29.9737	-27.9191
1.2918	23000	0.0262	0.0245	-29.4587	-29.8472	-27.7702
1.3480	24000	0.0262	0.0244	-29.4977	-29.8868	-27.8142
1.4042	25000	0.026	0.0244	-29.5356	-29.9184	-27.8426
1.4603	26000	0.0259	0.0243	-29.5614	-29.9388	-27.8360
1.5165	27000	0.0259	0.0242	-29.5362	-29.9353	-27.8223
1.5727	28000	0.0258	0.0241	-29.5088	-29.9043	-27.7884
1.6288	29000	0.0258	0.0241	-29.4550	-29.8543	-27.6788
1.6850	30000	0.0257	0.0240	-29.5373	-29.9282	-27.7855
1.7412	31000	0.0256	0.0239	-29.5195	-29.9096	-27.7866
1.7973	32000	0.0256	0.0239	-29.5292	-29.9266	-27.7579
1.8535	33000	0.0256	0.0239	-29.5202	-29.9196	-27.7408
1.9097	34000	0.0255	0.0239	-29.5090	-29.9126	-27.7311
1.9659	35000	0.0255	0.0238	-29.5115	-29.9136	-27.7322

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.3.1
Transformers: 4.46.3
PyTorch: 2.5.1+cu121
Accelerate: 1.1.1
Datasets: 3.2.0
Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}