
	---
	language:
	- en
	- zh
	tags:
	- clir
	- colbertx
	- plaidx
	- xlm-roberta-large
	datasets:
	- ms_marco
	- hltcoe/tdist-msmarco-scores
	task_categories:
	- text-retrieval
	- information-retrieval
	task_ids:
	- passage-retrieval
	- cross-language-retrieval
	license: mit
	---

	# ColBERT-X for English-Chinese CLIR using Translate-Distill

	## CLIR Model Setting

	- Query language: English
	- Query length: 32 tokens max
	- Document language: Chinese
	- Document length: 180 tokens max (use MaxP to aggregate passage scores for longer documents)
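
MaxP scores a long document by its best-scoring passage. The sketch below illustrates only the aggregation step; the `maxp_aggregate` helper, the document IDs, and the scores are hypothetical and not part of PLAID-X.

```python
def maxp_aggregate(passage_scores):
    """Collapse passage-level scores into document-level scores with MaxP.

    passage_scores: iterable of (doc_id, score) pairs, one pair per passage.
    Returns (doc_id, score) pairs sorted by descending score.
    """
    best = {}
    for doc_id, score in passage_scores:
        # Keep only the highest passage score seen for each document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: two documents, each split into two passages.
scores = [("doc_a", 12.5), ("doc_a", 18.0), ("doc_b", 15.25), ("doc_b", 9.0)]
ranking = maxp_aggregate(scores)
print(ranking)
```

In practice the passage scores would come from searching an index built over passages of at most 180 tokens.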

	## Model Description

	Translate-Distill is a training technique that produces state-of-the-art CLIR dense retrieval models through translation and distillation.
	`plaidx-large-zho-tdist-mt5xxl-zhozho` is trained with a KL-divergence loss against scores from the mt5xxl MonoT5 reranker,
	which was run on Chinese-translated MS MARCO training queries and Chinese-translated passages.
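
A minimal sketch of a KL-divergence distillation loss for a single query, in plain Python. It assumes the teacher and student scores over the query's passages are each normalized with a softmax and that the loss is KL(teacher || student); the function names and example scores are hypothetical, and the actual PLAID-X training code may differ in detail.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over the passages scored for one query."""
    log_p = log_softmax(teacher_scores)  # teacher distribution (reranker)
    log_q = log_softmax(student_scores)  # student distribution (retriever)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical scores for the 6 passages sampled for one training query.
teacher = [4.2, 1.0, 0.3, -0.5, -1.2, -2.0]
student = [2.0, 1.5, 0.1, 0.0, -0.3, -1.0]
print(kl_distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution over the passages and grows as the two distributions diverge.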

	### Teacher Models

	- `t53b`: [`castorini/monot5-3b-msmarco-10k`](https://huggingface.co/castorini/monot5-3b-msmarco-10k)
	- `mt5xxl`: [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)

	### Training Parameters

	- learning rate: 5e-6
	- update steps: 200,000
	- nway (number of passages per query): 6 (randomly selected from 50)
	- per-device batch size (number of query-passage sets): 8
	- training GPUs: 8 NVIDIA V100 (32 GB memory)
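
As a rough illustration of what these settings imply, the sketch below works out the per-step workload and shows one way to draw 6 passages from 50 candidates. It assumes plain data parallelism across all 8 GPUs with no gradient accumulation, which the list above does not state; `sample_nway` is a hypothetical helper.

```python
import random

NUM_GPUS = 8
PER_DEVICE_BATCH = 8       # query-passage sets per GPU
NWAY = 6                   # passages per query
CANDIDATES_PER_QUERY = 50  # pool the passages are drawn from
UPDATE_STEPS = 200_000

# Assumption: one optimizer step consumes one batch from every GPU.
sets_per_step = NUM_GPUS * PER_DEVICE_BATCH   # 64 query-passage sets
passages_per_step = sets_per_step * NWAY      # 384 passages
total_sets = sets_per_step * UPDATE_STEPS     # 12,800,000 sets over training


def sample_nway(candidates, nway=NWAY, rng=random):
    """Pick `nway` distinct passages at random from a query's candidates."""
    return rng.sample(candidates, nway)


candidate_ids = list(range(CANDIDATES_PER_QUERY))  # hypothetical passage IDs
print(sets_per_step, passages_per_step, total_sets)
print(sample_nway(candidate_ids))
```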

	## Usage

	To load ColBERT-X models from the Hugging Face Hub correctly, use the following version of PLAID-X.
	```bash
	pip install PLAID-X==0.3.1
	```

	The following code snippet loads the model through the Hugging Face API.
	```python
	from colbert.infra import ColBERTConfig
	from colbert.modeling.checkpoint import Checkpoint

	# Load the model from the Hugging Face Hub with a default ColBERT configuration.
	checkpoint = Checkpoint('hltcoe/plaidx-large-zho-tdist-mt5xxl-zhozho', colbert_config=ColBERTConfig())
	```

	For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
	which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

	## BibTeX entry and Citation Info

	Please cite the following two papers if you use the model.


	```bibtex
	@inproceedings{colbert-x,
	author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
	title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022},
	url = {https://arxiv.org/abs/2201.08471}
	}
	```

	```bibtex
	@inproceedings{translate-distill,
	author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
	title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
	booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
	year = {2024},
	url = {https://arxiv.org/abs/2401.04810}
	}
	```