---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
---
# Model Card for mEdIT-xxl

The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.

**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning

**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`

### Model Sources

- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1

## How to use

Given an edit instruction and an original text, our model can generate the edited version of the text.

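
As a minimal sketch of how an edit instruction and an original text are combined into one prompt string, the helper below assembles the three `###` sections used by the example prompts in this card. The function name `build_prompt` and its parameter names are hypothetical and are not part of the mEdIT code base; the layout is inferred from the example prompt shown under "Run the model".

```python
def build_prompt(instruction_token: str, task_description: str,
                 input_token: str, text: str, output_token: str) -> str:
    """Assemble a three-section prompt (hypothetical helper, not from the mEdIT repo)."""
    return (
        f"### {instruction_token}:\n{task_description}\n"  # instruction section
        f"### {input_token}:\n{text}\n"                    # text to be edited
        f"### {output_token}:\n\n"                         # left empty for the model to fill
    )


# Cross-lingual case: English input text with Japanese tokens and task description.
prompt = build_prompt("命令", "文章を文法的にする", "入力", "I has small cat ,", "出力")
print(repr(prompt))
```

Keeping the tokens and the task description as separate arguments makes it easy to switch the instruction language without touching the input text.
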
![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)

Specifically, our models support both multilingual and cross-lingual text revision. Note that the input and output texts are always in the same language. Whether a request is monolingual or cross-lingual depends on whether the edit instruction is written in the same language as the input text.

### Instruction format

Follow the instruction format below; prompts that deviate from it may produce lower-quality results.

```
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",  # <-- Paraphrasing
    ...
]
```

**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**

```
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input_text>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).

### Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

# Tokenize the prompt, generate, and decode the result
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# --> I have a small cat ,

# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

# ...
# --> Ich habe eine kleine Katze ,
```

#### Software

https://github.com/vipulraheja/medit

## Citation

**BibTeX:**

```
@article{raheja2023medit,
      title={mEdIT: Multilingual Text Editing via Instruction Tuning},
      author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
      year={2024},
      eprint={2402.16472v1},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

**APA:**

Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472