mPMR: A Multilingual Pre-trained Machine Reader at Scale

Multilingual Pre-trained Machine Reader (mPMR) is a multilingual extension of PMR. mPMR is pre-trained on 18 million Machine Reading Comprehension (MRC) examples constructed from Wikipedia hyperlinks. It was introduced in the paper mPMR: A Multilingual Pre-trained Machine Reader at Scale by Weiwen Xu, Xin Li, Wai Lam, and Lidong Bing, and was first released in this repository.

This model is initialized with xlm-roberta-base and further pre-trained with an MRC objective.

Model description

The model is pre-trained with distantly labeled data using a learning objective called Wiki Anchor Extraction (WAE). Specifically, we constructed a large volume of general-purpose and high-quality MRC-style training data based on Wikipedia anchors (i.e., hyperlinked texts). For each Wikipedia anchor, we composed a pair of correlated articles. One side of the pair is the Wikipedia article that contains detailed descriptions of the hyperlinked entity, which we defined as the definition article. The other side of the pair is the article that mentions the specific anchor text, which we defined as the mention article. We composed an MRC-style training instance in which the anchor is the answer, the surrounding passage of the anchor in the mention article is the context, and the definition of the anchor entity in the definition article is the query. Based on the above data, we then introduced a novel WAE problem as the pre-training task of mPMR. In this task, mPMR determines whether the context and the query are relevant. If so, mPMR extracts the answer from the context that satisfies the query description.
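The pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' data pipeline: the `WAEInstance` class, the `build_wae_instance` helper, the character-window context, and the `window` parameter are all hypothetical names and choices introduced here. The sketch assumes the anchor's character offsets in the mention article are already known.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WAEInstance:
    """One MRC-style example (hypothetical container, for illustration only)."""
    query: str                              # definition text from the definition article
    context: str                            # passage around the anchor in the mention article
    answer_span: Optional[Tuple[int, int]]  # [start, end) offsets in context; None if irrelevant


def build_wae_instance(definition: str, mention_article: str,
                       anchor_start: int, anchor_end: int,
                       window: int = 50, relevant: bool = True) -> WAEInstance:
    """Compose a query/context/answer triple for a single Wikipedia anchor.

    Assumption: the context is a fixed character window around the anchor.
    With relevant=False, the caller passes the definition of an unrelated
    entity, and the instance carries no answer span.
    """
    ctx_start = max(0, anchor_start - window)
    ctx_end = min(len(mention_article), anchor_end + window)
    context = mention_article[ctx_start:ctx_end]
    if not relevant:
        return WAEInstance(definition, context, None)
    # Shift the anchor offsets so they index into the cropped context.
    span = (anchor_start - ctx_start, anchor_end - ctx_start)
    return WAEInstance(definition, context, span)


# Toy example: the anchor text "Paris" appears in a mention article.
mention = "The capital of France is Paris, a city on the Seine."
start = mention.index("Paris")
positive = build_wae_instance(
    "Paris is the capital and most populous city of France.",
    mention, start, start + len("Paris"), window=10)
negative = build_wae_instance(
    "Berlin is the capital and largest city of Germany.",
    mention, start, start + len("Paris"), window=10, relevant=False)
```

In the positive instance, slicing the context with `answer_span` recovers the anchor text. In the negative instance, the query describes an unrelated entity, so the expected behavior is to judge the pair irrelevant and extract nothing.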

During fine-tuning, we unified downstream NLU tasks in our MRC formulation. These tasks typically fall into four categories: (1) span extraction with pre-defined labels (e.g., NER), in which each task label is treated as a query to search for the corresponding answers in the input text (context); (2) span extraction with natural questions (e.g., EQA), in which the question is treated as the query for answer extraction from the given passage (context); (3) sequence classification with pre-defined task labels (e.g., sentiment analysis), in which each task label is used as a query for the input text (context); and (4) sequence classification with natural questions on multiple choices (e.g., multi-choice QA, MCQA), in which the concatenation of the question and one choice is treated as the query for the given passage (context). In the output space, we tackle span extraction problems by predicting the probability of each context span being the answer, and we tackle sequence classification problems by conducting relevance classification on [CLS] (extracting [CLS] if relevant).
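As a rough illustration of the four categories, the sketch below turns each task type into (query, context) pairs. It is not the released fine-tuning code: the `MRCInput` class, the helper function names, the label descriptions, and the use of a single space to concatenate a question with a choice are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MRCInput:
    """A unified MRC-style input (hypothetical container, for illustration)."""
    query: str
    context: str
    output: str  # "span" for span extraction, "relevance" for classification on [CLS]


def ner_inputs(text: str, label_queries: Dict[str, str]) -> List[MRCInput]:
    # (1) Span extraction with pre-defined labels: one query per label.
    return [MRCInput(q, text, "span") for q in label_queries.values()]


def eqa_inputs(question: str, passage: str) -> List[MRCInput]:
    # (2) Span extraction with a natural question: the question is the query.
    return [MRCInput(question, passage, "span")]


def classification_inputs(text: str, label_queries: Dict[str, str]) -> List[MRCInput]:
    # (3) Sequence classification with pre-defined labels: one query per label,
    # each scored for relevance against the same input text.
    return [MRCInput(q, text, "relevance") for q in label_queries.values()]


def mcqa_inputs(question: str, choices: List[str], passage: str) -> List[MRCInput]:
    # (4) Multi-choice QA: question + one choice forms each query.
    # Assumption: a single space joins the question and the choice.
    return [MRCInput(f"{question} {c}", passage, "relevance") for c in choices]


# Toy usage with made-up label descriptions.
ner = ner_inputs("Wai Lam works in Hong Kong.",
                 {"PER": "person", "LOC": "location"})
sentiment = classification_inputs("A wonderful film.",
                                  {"pos": "positive", "neg": "negative"})
mcqa = mcqa_inputs("Where is the Seine?", ["France", "Spain"],
                   "The Seine flows through Paris.")
```

Under this formulation every task shares the same input shape, so only the interpretation of the output differs: span probabilities for categories (1) and (2), and a relevance decision on [CLS] for categories (3) and (4).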

Model variations

Two model versions are released:

Model	Backbone	#params
mPMR-base (this checkpoint)	xlm-roberta-base	270M
mPMR-large	xlm-roberta-large	550M

Intended uses & limitations

The models need to be fine-tuned on downstream task data. During fine-tuning, no task-specific layer is required.

How to use

You can try the code from this repo.

BibTeX entry and citation info

@article{xu2022clozing,
  title={From Clozing to Comprehending: Retrofitting Pre-trained Language Model to Pre-trained Machine Reader},
  author={Xu, Weiwen and Li, Xin and Zhang, Wenxuan and Zhou, Meng and Bing, Lidong and Lam, Wai and Si, Luo},
  journal={arXiv preprint arXiv:2212.04755},
  year={2022}
}
@inproceedings{xu2022mpmr,
    title = "mPMR: A Multilingual Pre-trained Machine Reader at Scale",
    author = "Xu, Weiwen  and
      Li, Xin  and
      Lam, Wai  and
      Bing, Lidong",
    booktitle = "The 61st Annual Meeting of the Association for Computational Linguistics",
    year = "2023",
}