---
language:
- en
- zh
tags:
- clir
- colbertx
- plaidx
- xlm-roberta-large
datasets:
- ms_marco
- hltcoe/tdist-msmarco-scores
task_categories:
- text-retrieval
- information-retrieval
task_ids:
- passage-retrieval
- cross-language-retrieval
license: mit
---

# ColBERT-X for English-Chinese CLIR using Translate-Distill

## CLIR Model Setting

- Query language: English
- Query length: 32 tokens max
- Document language: Chinese
- Document length: 180 tokens max (for longer documents, split them into passages and use MaxP to aggregate the passage scores)
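
The MaxP aggregation mentioned above can be sketched as follows. This is a minimal illustration, not part of PLAID-X: `split_into_passages`, `maxp_score`, and the 90-token stride are hypothetical names and values chosen for the example, and the token list stands in for whatever tokenizer output is actually used.

```python
from typing import List, Sequence


def split_into_passages(
    tokens: Sequence[str], max_len: int = 180, stride: int = 90
) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    The stride of 90 is an assumption for illustration; the model card only
    states the 180-token document limit.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    passages = []
    start = 0
    while True:
        passages.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return passages


def maxp_score(passage_scores: Sequence[float]) -> float:
    """MaxP: the document score is the maximum of its passage scores."""
    if not passage_scores:
        raise ValueError("a document needs at least one passage score")
    return max(passage_scores)
```

Each passage would be scored against the query by the retrieval model, and `maxp_score` would then be applied to the per-passage scores of one document to obtain its document-level score.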

## Model Description

Translate-Distill is a training technique that produces a state-of-the-art CLIR dense retrieval model through translation and distillation.
`plaidx-large-zho-tdist-mt5xxl-zhozho` is trained with a KL-divergence loss against scores from the mt5xxl MonoT5 reranker, which was run on
Chinese translations of the MS MARCO training queries and passages.
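
As a rough illustration of the distillation objective, the sketch below computes a KL-divergence loss between teacher and student score distributions for one query. It assumes both score lists are softmax-normalized over the same candidate passages and that the divergence is taken as KL(teacher || student); the function names are hypothetical and this is not the PLAID-X training code.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one query's candidate passages."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(
    teacher_scores: Sequence[float], student_scores: Sequence[float]
) -> float:
    """KL(teacher || student) over the passages scored for a single query.

    The direction of the divergence is an assumption made for this sketch.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    log_p = log_softmax(teacher_scores)  # teacher (reranker) distribution
    log_q = log_softmax(student_scores)  # student (retriever) distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student ranks and spaces the passages exactly as the teacher does, and grows as the two distributions diverge.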

### Teacher Models

- `t53b`: [`castorini/monot5-3b-msmarco-10k`](https://huggingface.co/castorini/monot5-3b-msmarco-10k)
- `mt5xxl`: [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)

### Training Parameters

- learning rate: 5e-6
- update steps: 200,000
- nway (number of passages per query): 6 (randomly selected from 50)
- per-device batch size (number of query-passage sets): 8
- training GPUs: 8 NVIDIA V100 GPUs with 32 GB memory each
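
To make the batch arithmetic concrete, the following sketch samples `nway` passages for a query and counts the passages seen per update step. It assumes uniform sampling without replacement and no gradient accumulation, neither of which is stated above; `sample_nway` and `passages_per_step` are hypothetical helpers.

```python
import random
from typing import List, Optional, Sequence


def sample_nway(
    candidates: Sequence[str], nway: int = 6, rng: Optional[random.Random] = None
) -> List[str]:
    """Pick nway distinct passages from a query's candidate pool (assumed uniform)."""
    if len(candidates) < nway:
        raise ValueError("not enough candidate passages to sample from")
    rng = rng or random.Random()
    return rng.sample(list(candidates), nway)


def passages_per_step(num_gpus: int = 8, per_device_batch: int = 8, nway: int = 6) -> int:
    """Passages processed in one update step, assuming no gradient accumulation."""
    return num_gpus * per_device_batch * nway
```

Under these assumptions, each update step would cover 8 x 8 = 64 queries and 64 x 6 = 384 passages.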

## Usage

To load ColBERT-X models from the Hugging Face Hub correctly, install the following version of PLAID-X.
```bash
pip install PLAID-X==0.3.1
```

The following code snippet loads the model through the Hugging Face API.
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Download the checkpoint from the Hugging Face Hub and load it with a default configuration.
model = Checkpoint('hltcoe/plaidx-large-zho-tdist-mt5xxl-zhozho', colbert_config=ColBERTConfig())
```

For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

## BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

```bibtex
@inproceedings{colbert-x,
    author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
    title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
    booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
    year = {2022},
    url = {https://arxiv.org/abs/2201.08471}
}
```

```bibtex
@inproceedings{translate-distill,
    author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
    title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
    booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
    year = {2024},
    url = {https://arxiv.org/abs/2401.04810}
}
```