<h1 align="center">💊 Remedy-R: Generative Reasoning Models for MT Evaluation</h1>
<p align="center"><b>Reasoning-driven, reinforcement-trained metrics for machine translation evaluation</b></p>
---

## ✨ What is Remedy-R?

**Remedy-R** is a family of **reasoning-based MT evaluation models** trained with **reinforcement learning with verifiable rewards (RLVR)** on **pairwise human translation preferences**.

Instead of directly regressing a scalar score, Remedy-R:

- Generates **step-by-step analyses** of *accuracy*, *fluency*, and *completeness*.
- Outputs a **final numeric score in [0, 100]** that can be parsed and used like a standard metric (see the parsing sketch after this list).
- Is trained with **PPO + rule-based rewards** that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both **reference-based** and **reference-free (QE)** evaluation.
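The score arrives as text at the end of the model's reasoning, so downstream use only needs a small parser. Below is a minimal sketch, assuming the generation ends with a line like `Final Score: 87`; the exact output template is defined by the Remedy-R prompts, so adapt the pattern as needed.

```python
import re
from typing import Optional

def parse_remedy_r_score(generation: str) -> Optional[float]:
    """Extract the final [0, 100] score from a Remedy-R generation.

    Assumes the reasoning ends with a numeric verdict such as
    "Final Score: 87"; adjust the regex if the template differs.
    """
    numbers = re.findall(r"\d{1,3}(?:\.\d+)?", generation)
    if not numbers:
        return None
    score = float(numbers[-1])          # take the last number in the text
    return min(max(score, 0.0), 100.0)  # clamp to the documented range
```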
On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:

- **Surpasses** strong LLM-as-judge methods.
- **Matches** top-performing scalar SOTA metrics.
- Remains **robust under OOD conditions** such as source copies, empty translations, wrong-language outputs, and mixed-language outputs.
- Enables **Test-Time Scaling (TTS)** via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers **Remedy-R Agent**, an evaluate–revise pipeline that improves translations for diverse base systems.

---

## 📖 Contents

- [✨ What is Remedy-R?](#-what-is-remedy-r)
- [📖 Contents](#-contents)
- [📦 Installation](#-installation)
  - [From PyPI (unavailable for now)](#from-pypi-unavailable-for-now)
  - [From source](#from-source)
- [⚙️ Requirements](#️-requirements)
- [🧠 Model Zoo](#-model-zoo)
- [🚀 Quickstart](#-quickstart)
  - [CLI: Local vLLM Inference](#cli-local-vllm-inference)
  - [Reference-Free / QE Mode](#reference-free--qe-mode)
  - [Test-Time Scaling (TTS)](#test-time-scaling-tts)
- [🌐 Optional: vLLM Online Serving](#-optional-vllm-online-serving)
- [📁 Outputs](#-outputs)
- [📚 Citation](#-citation)
---

## 📦 Installation

### From PyPI (unavailable for now)

```bash
pip install --upgrade pip
pip install remedy-r-mt-eval
```

This installs the `remedy_r` package and the CLI entrypoint `remedy-r-score` (plus related tools).

### From source

```bash
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
```
---

## ⚙️ Requirements

Core runtime dependencies (see `pyproject.toml` for exact versions):

* Python ≥ 3.10 (tested mostly with 3.12)
* [PyTorch](https://pytorch.org/) with GPU support
* [vLLM](https://github.com/vllm-project/vllm) for efficient batched inference
* `transformers`, `numpy`, `pandas`, `tqdm`

You also need (a quick GPU check follows this list):

* At least **1 GPU (16–24 GB)** for 7B models
* More memory/GPUs for 14B/32B models or large batch sizes
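Before a long run, it can help to confirm that PyTorch actually sees your GPUs and how much memory each one has. This is a generic PyTorch check, not part of the Remedy-R CLI:

```python
# Generic PyTorch sanity check (not part of the remedy-r CLI):
# confirm CUDA GPUs are visible and report their total memory.
import torch

assert torch.cuda.is_available(), "Remedy-R inference expects at least one CUDA GPU"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```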
---

## 🧠 Model Zoo

Remedy-R models are hosted on HuggingFace under `ShaomuTan/`:

| Model        | Size | Base model  | Mode     | Link                                                            |
| ------------ | ---- | ----------- | -------- | --------------------------------------------------------------- |
| Remedy-R-7B  | 7B   | Qwen2.5-7B  | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-7B)  |
| Remedy-R-14B | 14B  | Qwen2.5-14B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-14B) |
| Remedy-R-32B | 32B  | Qwen2.5-32B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-32B) |

You can cache them locally:

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B
```

Then point `--model` to either the **HF ID** or the **local path**.
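If you prefer staying in Python, `huggingface_hub` (pulled in as a dependency of `transformers`) offers an equivalent download; here is a small sketch mirroring the CLI example above:

```python
# Python alternative to the huggingface-cli download above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ShaomuTan/Remedy-R-14B",
    local_dir="Models/Remedy-R-14B",
)
print(local_path)  # pass this directory (or the HF ID) to --model
```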
---

## 🚀 Quickstart

### CLI: Local vLLM Inference

The main entrypoint is:

```bash
remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file "$MT_FILE" \
  --lp "$LP"
```
**Key arguments** (a minimal input-file example follows this list)

* `--model` : HF repo ID or local checkpoint
* `--src-file` : Source sentences (one per line)
* `--mt-file` : MT outputs (one per line)
* `--ref-file` : Reference translations (optional; enables ref-based mode)
* `--lp` : Language-pair code (e.g., `en-de`)
* `--output_dir` : Output folder
* `--temperature` : Generation temperature
* `--tp_size` : Tensor parallel size
* `--dp_size` : Data parallel size
* `--num-seqs` : Max parallel sequences per step
* `--max-tokens` : Maximum number of generated tokens
* `--gpu-memory-utilization` : vLLM memory ratio (e.g., 0.9)
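To make the line-aligned format concrete, here is a purely illustrative smoke-test setup. The paths and sentences are placeholders; the only real requirement is that the source, MT, and optional reference files contain the same number of lines:

```python
# Illustrative only: build minimal line-aligned inputs for a smoke test.
# Paths and sentences are placeholders, not files shipped with the repo.
from pathlib import Path

Path("testcase").mkdir(exist_ok=True)
Path("testcase/en.src").write_text("The weather is nice today.\n", encoding="utf-8")
Path("testcase/en-de.hyp").write_text("Das Wetter ist heute schön.\n", encoding="utf-8")
Path("testcase/de.ref").write_text("Heute ist das Wetter schön.\n", encoding="utf-8")
```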
You can also call the CLI via Python:

```bash
python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...
```

---

### Reference-Free / QE Mode

If you don't have references, just drop `--ref-file` and add `--no-ref`:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models
```

The prompt automatically switches to **reference-free quality estimation** while keeping the same [0, 100] score scale.
---

### Test-Time Scaling (TTS)

Remedy-R supports **Test-Time Scaling** by averaging multiple independent evaluation passes with different seeds:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42
```

* `--TTS` : Enable multi-pass evaluation
* `--best-of-n` : Number of independent passes (e.g., 2–6)
* Scores are averaged across passes; detailed per-pass scores can optionally be logged.

TTS typically improves **segment-level pairwise accuracy** and stabilizes scores for difficult segments.
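Conceptually, the aggregation is a per-segment mean over passes. The sketch below illustrates the idea with made-up numbers; it is not the package's internal code:

```python
# Conceptual TTS aggregation: average each segment's score over N passes.
import statistics

def aggregate_tts(per_pass_scores: list[list[float]]) -> list[float]:
    """per_pass_scores[i][j] = score of segment j in pass i."""
    return [statistics.mean(segment) for segment in zip(*per_pass_scores)]

passes = [[78.0, 64.0], [82.0, 60.0], [80.0, 62.0], [76.0, 66.0]]  # 4 passes, 2 segments
print(aggregate_tts(passes))  # -> [79.0, 63.0]
```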
---

## 🌐 Optional: vLLM Online Serving

To avoid re-loading the model for every scoring run, you can:

1. **Start a local vLLM server** (OpenAI-compatible):

```bash
remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```

2. **Score via the server**:

```bash
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1
```

Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating `LLM()` in every process.
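Since the server speaks the OpenAI-compatible API, you can also query it directly when debugging. The snippet below is a sketch using the `openai` client; the user message is a stand-in, because the real evaluation prompt is constructed by `remedy-r-score`:

```python
# Debugging sketch: talk to the OpenAI-compatible vLLM server directly.
# The user message is a placeholder; remedy-r-score builds the real prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
response = client.chat.completions.create(
    model="ShaomuTan/Remedy-R-14B",
    messages=[{"role": "user", "content": "<Remedy-R evaluation prompt goes here>"}],
    temperature=0.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```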
---

## 📁 Outputs

For each language pair `SRC-TGT`, Remedy-R writes (a loading sketch follows this list):

* `results.jsonl`
* `segment_scores.tsv`
* `system_score.txt`
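For downstream analysis, the segment-level TSV can be loaded with `pandas` (already a core dependency). The column layout assumed below is a guess; inspect the file header for the actual schema:

```python
# Hedged sketch: inspect the per-segment scores with pandas.
# Column names/order are assumptions; check the real header first.
import pandas as pd

segments = pd.read_csv("segment_scores.tsv", sep="\t")
print(segments.head())
# Assuming the last column holds the [0, 100] segment score:
print("mean segment score:", segments.iloc[:, -1].mean())
```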
---

## 📚 Citation

If you use Remedy-R or this codebase, please cite:

arXiv preprint coming soon...