---
license: cc-by-nc-4.0
---
# BLASER 2.0
[[Paper]]()
BLASER 2.0 is the new version of BLASER ([Chen et al., 2023](https://aclanthology.org/2023.acl-long.504/)),
a family of models for automatic evaluation of machine translation quality.
BLASER 2.0 is based on [SONAR](https://huggingface.co/facebook/SONAR) sentence embeddings
and works with both speech and text modalities.
This model predicts a similarity score for a translated sentence from the translation and the source sentence alone.
Thus, it can be applied in settings where reference translations are missing or of questionable quality.
In contrast, its sibling model, [BLASER 2.0-referenced](https://huggingface.co/facebook/blaser-2.0-ref), also requires a reference translation.
Supervised BLASER models are trained to predict cross-lingual semantic similarity scores,
XSTS ([Licht et al., 2022](https://aclanthology.org/2022.amta-research.24/)),
on a scale where 1 corresponds to completely unrelated sentences and
5 corresponds to fully semantically equivalent sentences.
The models' predictions, though, are unbounded and can occasionally fall outside this range.
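If downstream code expects scores strictly within the XSTS range, the raw predictions can be clamped after scoring. The snippet below is a minimal sketch of that post-processing step: `clip_xsts` is a hypothetical helper, not part of the SONAR package, and the raw scores are made-up values for illustration only.

```python
def clip_xsts(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Clamp a raw prediction to the nominal XSTS range [low, high].

    Hypothetical helper for illustration; not provided by SONAR.
    """
    return max(low, min(high, score))


# Made-up raw predictions: one below the range, one inside, one above.
raw_scores = [0.62, 3.4, 5.13]
print([clip_xsts(s) for s in raw_scores])  # [1.0, 3.4, 5.0]
```

Whether clamping is appropriate depends on the use case; for ranking systems or computing correlations, the unclamped scores may be preferable.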
## Installation
See the SONAR GitHub [repo](https://github.com/facebookresearch/SONAR) for installation instructions.
## Usage
BLASER 2.0 models accept 1024-dimensional SONAR sentence embeddings as inputs,
and produce a single score as an output.
The code below illustrates their usage with text embeddings:
```Python
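from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

# Load the reference-free (quality estimation) BLASER 2.0 model in evaluation mode.
blaser = load_blaser_model("blaser_2_0_qe").eval()
# The SONAR text encoder maps sentences to 1024-dimensional embeddings.
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
# Embed the source sentence and its machine translation, each with its own language code.
src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
# Score the translation against the source; no reference translation is needed.
print(blaser(src=src_embs, mt=mt_embs).item()) # 4.708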
```
With BLASER 2.0 models, SONAR text and speech embeddings can be used interchangeably.
## Model details
- **Developed by:** Seamless Communication et al.
- **License:** CC-BY-NC 4.0
- **Citation:** If you use BLASER 2.0 in your work, please cite:
```bibtex
@article{seamlessm4t2023,
title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
author={{Seamless Communication} and Lo\"{i}c Barrault and Yu-An Chung and Mariano Cora Meglioli and David Dale and Ning Dong and Paul-Ambroise Duquenne and Hady Elsahar and Hongyu Gong and Kevin Heffernan and John Hoffman and Christopher Klaiber and Pengwei Li and Daniel Licht and Jean Maillard and Alice Rakotoarison and Kaushik Ram Sadagopan and Guillaume Wenzek and Ethan Ye and Bapi Akula and Peng-Jen Chen and Naji El Hachem and Brian Ellis and Gabriel Mejia Gonzalez and Justin Haaheim and Prangthip Hansanti and Russ Howes and Bernie Huang and Min-Jae Hwang and Hirofumi Inaguma and Somya Jain and Elahe Kalbassi and Amanda Kallet and Ilia Kulikov and Janice Lam and Daniel Li and Xutai Ma and Ruslan Mavlyutov and Benjamin Peloquin and Mohamed Ramadan and Abinesh Ramakrishnan and Anna Sun and Kevin Tran and Tuan Tran and Igor Tufanov and Vish Vogeti and Carleigh Wood and Yilin Yang and Bokai Yu and Pierre Andrews and Can Balioglu and Marta R. Costa-juss\`{a} and Onur \c{C}elebi and Maha Elbayad and Cynthia Gao and Francisco Guzm\'an and Justine Kao and Ann Lee and Alexandre Mourachko and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang},
journal={ArXiv},
year={2023}
}
```