---
license: apache-2.0
language:
- es
- en
pipeline_tag: translation
---

# Spanish-English Translation Model for the Scientific Domain

## Description

This is a CTranslate2 Spanish-English translation model for the scientific domain. Its base model is the CA+OC+ES-EN OPUS-MT Transformer-Big [(link)](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/cat%2Boci%2Bspa-eng).
It has been fine-tuned on a large parallel corpus of scientific texts, with a special focus on the four pilot domains of the [SciLake](https://scilake.eu/) project:
- Neuroscience
- Cancer
- Transportation
- Energy

## Dataset

The fine-tuning dataset consists of 4,145,412 EN-ES parallel sentences extracted from parallel theses and abstracts acquired from multiple academic repositories.

## Evaluation

We evaluated the base and the fine-tuned models on 5 test sets:
- Four test sets, one for each pilot domain (Neuroscience, Cancer, Transportation, Energy), each containing 1,000 parallel sentences.
- A general scientific test set containing 3,000 parallel sentences from a wide range of scientific texts in other domains.

| Model       | Average of 4 domains |         |       | General Scientific |         |       |
|-------------|----------------------|---------|-------|--------------------|---------|-------|
|             | SacreBLEU            | chrF2++ | COMET | SacreBLEU          | chrF2++ | COMET |
| Base        | 49.7                 | 70.5    | 69.5  | 51                 | 71.7    | 68.9  |
| Fine-Tuned  | 51.9                 | 71.7    | 70.9  | 54                 | 73.1    | 71    |
| Improvement | +2.2                 | +1.2    | +1.4  | +3                 | +1.4    | +2.1  |
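
The Improvement row is the difference between the Fine-Tuned and Base scores for each metric. As a minimal sketch, the snippet below recomputes it from the table values; the metric labels and variable names are illustrative and not part of the original evaluation setup.

```python
# Scores copied from the table above. Column order:
# SacreBLEU, chrF2++, COMET for the 4-domain average, then for General Scientific.
metrics = [
    "avg/SacreBLEU", "avg/chrF2++", "avg/COMET",
    "general/SacreBLEU", "general/chrF2++", "general/COMET",
]
base = [49.7, 70.5, 69.5, 51, 71.7, 68.9]
fine_tuned = [51.9, 71.7, 70.9, 54, 73.1, 71]

# Improvement = fine-tuned score minus base score, rounded to one decimal
improvement = {
    name: round(ft - b, 1) for name, b, ft in zip(metrics, base, fine_tuned)
}

for name, delta in improvement.items():
    print(f"{name}: {delta:+.1f}")
```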

## Usage

```
pip install ctranslate2 sentencepiece huggingface_hub
```

```python
import ctranslate2
import sentencepiece as spm
from huggingface_hub import snapshot_download

repo_id = "ilsp/opus-mt-big-es-en_ct2_ft-SciLake"

# REPLACE WITH ACTUAL LOCAL DIRECTORY WHERE THE MODEL WILL BE DOWNLOADED
local_dir = ""

model_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)

translator = ctranslate2.Translator(model_path, compute_type="auto")

# SentencePiece models for the source (Spanish) and target (English) sides
sp_enc = spm.SentencePieceProcessor()
sp_enc.load(f"{model_path}/source.spm")

sp_dec = spm.SentencePieceProcessor()
sp_dec.load(f"{model_path}/target.spm")

def translate_text(input_text, sp_enc=sp_enc, sp_dec=sp_dec, translator=translator, beam_size=6):
    # Tokenize into subword pieces, translate, then detokenize the best hypothesis
    input_tokens = sp_enc.encode(input_text, out_type=str)
    results = translator.translate_batch([input_tokens],
                                         beam_size=beam_size,
                                         length_penalty=0,
                                         max_decoding_length=512,
                                         replace_unknowns=True)
    output_tokens = results[0].hypotheses[0]
    output_text = sp_dec.decode(output_tokens)
    return output_text

input_text = "La energía eléctrica es un insumo base de alta difusión, derivado de su capacidad para satisfacer todo tipo de necesidades."
print(translate_text(input_text))

# OUTPUT
# Electric power is a base input of high diffusion, derived from its ability to satisfy all types of needs.
```
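
The function above translates one sentence per call. When translating many sentences, it may be more efficient to pass them to `translate_batch` together. The sketch below is one possible way to do that, not part of the original model card: the helper name `translate_texts` and the default `max_batch_size=32` are assumptions, while the decoding options mirror the ones used above.

```python
def translate_texts(texts, translator, sp_enc, sp_dec, beam_size=6, max_batch_size=32):
    """Translate a list of Spanish sentences with a single translate_batch call.

    `translator`, `sp_enc` and `sp_dec` are expected to be the objects
    created in the usage snippet above.
    """
    if not texts:
        return []
    # Tokenize each sentence into SentencePiece subword pieces
    batch_tokens = [sp_enc.encode(text, out_type=str) for text in texts]
    # max_batch_size caps how many examples are processed together
    # (32 is an illustrative value, not a tuned one)
    results = translator.translate_batch(
        batch_tokens,
        max_batch_size=max_batch_size,
        beam_size=beam_size,
        length_penalty=0,
        max_decoding_length=512,
        replace_unknowns=True,
    )
    # Keep the best hypothesis for each sentence and detokenize it
    return [sp_dec.decode(result.hypotheses[0]) for result in results]
```

With the objects from the usage snippet, `translate_texts([sentence_1, sentence_2], translator, sp_enc, sp_dec)` should return one English string per input sentence, in the same order as the input.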

## Acknowledgements

This work was created within the [SciLake](https://scilake.eu/) project. We are grateful to the SciLake project for providing the resources and support that made this work possible. This project has received funding from the European Union’s Horizon Europe framework programme under grant agreement No. 101058573.