
πŸš€ Remedy-R: Generative Reasoning Models for MT Evaluation

Reasoning-driven, reinforcement-trained metrics for machine translation evaluation


✨ What is Remedy-R?

Remedy-R is a family of reasoning-based MT evaluation models trained via reinforcement learning with verifiable rewards (RLVR) on pairwise human translation preferences.

Instead of directly regressing a scalar score, Remedy-R:

  • Generates step-by-step analyses of accuracy, fluency, and completeness.
  • Outputs a final numeric score in [0, 100] that can be parsed and used like a standard metric (see the parsing sketch below).
  • Is trained with PPO + rule-based rewards that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
  • Supports both reference-based and reference-free (QE) evaluation.
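
Because each judgment is free text, the final score must be parsed out of the generation before it can be used as a metric. Below is a minimal parsing sketch in Python; the "Final Score:" pattern is an assumption for illustration, not the package's actual output format:

import re

def parse_remedy_r_score(generation: str) -> float | None:
    # Assumed format: the reasoning trace ends with e.g. "Final Score: 87".
    matches = re.findall(
        r"final\s+score\s*[:=]\s*(\d{1,3}(?:\.\d+)?)",
        generation,
        flags=re.IGNORECASE,
    )
    if not matches:
        return None
    score = float(matches[-1])  # take the last mention in the trace
    return score if 0.0 <= score <= 100.0 else None

print(parse_remedy_r_score("...step-by-step analysis...\nFinal Score: 87"))  # 87.0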

On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:

  • Surpasses strong LLM-as-judge methods.
  • Matches top-performing scalar SOTA metrics.
  • Remains robust under OOD conditions such as source copy, empty translations, wrong language, and mixed-language outputs.
  • Enables Test-Time Scaling (TTS) via multiple reasoning passes, improving segment-level meta-evaluation.
  • Powers Remedy-R Agent, an evaluate–revise pipeline that improves translations for diverse base systems.

πŸ“š Contents


πŸ“¦ Installation

From PyPI (not yet available)

pip install --upgrade pip
pip install remedy-r-mt-eval

This installs the remedy_r package and the CLI entrypoint remedy-r-score (plus related tools).

From source

git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .

βš™οΈ Requirements

Core runtime dependencies (see pyproject.toml for exact versions):

  • Python β‰₯ 3.10 (tested mostly with 3.12)
  • PyTorch with GPU support
  • vLLM for efficient batched inference
  • transformers, numpy, pandas, tqdm

You also need:

  • At least 1 GPU (16–24 GB) for 7B models
  • More memory/GPUs for 14B/32B models or large batch sizes
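
Before launching a large run, a quick sanity check of the GPU environment can save time. A short snippet using standard PyTorch calls:

import torch

# Verify that CUDA is visible and report per-device memory.
assert torch.cuda.is_available(), "Remedy-R inference expects at least one GPU"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")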

🧠 Model Zoo

Remedy-R models are hosted on HuggingFace under ShaomuTan/:

| Model | Size | Base model | Mode | Link |
|---|---|---|---|---|
| Remedy-R-7B | 7B | Qwen2.5-7B | Ref + QE | 🤗 HuggingFace |
| Remedy-R-14B | 14B | Qwen2.5-14B | Ref + QE | 🤗 HuggingFace |
| Remedy-R-32B | 32B | Qwen2.5-32B | Ref + QE | 🤗 HuggingFace |

You can cache them locally:

HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B

Then point --model to either the HF ID or the local path.
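
The same download can be done from Python with the huggingface_hub client, which the huggingface-cli command above wraps:

from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the 14B checkpoint.
local_path = snapshot_download(
    repo_id="ShaomuTan/Remedy-R-14B",
    local_dir="Models/Remedy-R-14B",
)
print(local_path)  # pass this path (or the repo ID) to --model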


πŸš€ Quickstart

CLI: Local vLLM Inference

The main entrypoint is:

remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file  "$MT_FILE" \
  --lp "$LP" \

Key arguments

  • --model : HF repo ID or local checkpoint
  • --src-file : Source sentences (one per line)
  • --mt-file : MT outputs (one per line)
  • --ref-file : Reference translations (optional; enables ref-based mode)
  • --lp : Language-pair codes (e.g., en-de)
  • --output_dir : Output folder
  • --temperature : Generation temperature
  • --tp_size : Tensor parallel size
  • --dp_size : Data parallel size
  • --num-seqs : Max parallel sequences per step
  • --max-tokens : Maximum number of generated tokens
  • --gpu-memory-utilization : vLLM memory ratio (e.g. 0.9)

You can also invoke the CLI as a Python module:

python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...

Reference-Free / QE Mode

If you don’t have references, just drop --ref-file and add --no-ref:

remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models

The prompt automatically switches to reference-free quality estimation while keeping the same [0, 100] score scale.
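
Conceptually, the switch just selects between two prompt templates, as in the sketch below. These templates are illustrative placeholders only; the actual Remedy-R prompts live inside the remedy_r package and are more detailed:

# Illustrative placeholders, not the real Remedy-R prompt templates.
REF_TEMPLATE = (
    "Evaluate this {src_lang}-to-{tgt_lang} translation.\n"
    "Source: {src}\nReference: {ref}\nTranslation: {mt}\n"
    "Analyze accuracy, fluency, and completeness step by step, "
    "then give a final score in [0, 100]."
)
QE_TEMPLATE = (
    "Evaluate this {src_lang}-to-{tgt_lang} translation without a reference.\n"
    "Source: {src}\nTranslation: {mt}\n"
    "Analyze accuracy, fluency, and completeness step by step, "
    "then give a final score in [0, 100]."
)

def build_prompt(src, mt, ref=None, src_lang="en", tgt_lang="de"):
    # Reference-free (QE) mode whenever no reference is supplied.
    template = QE_TEMPLATE if ref is None else REF_TEMPLATE
    return template.format(src_lang=src_lang, tgt_lang=tgt_lang,
                           src=src, mt=mt, ref=ref)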


Test-Time Scaling (TTS)

Remedy-R supports Test-Time Scaling by averaging multiple independent evaluation passes with different seeds:

remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42

  • --TTS : Enable multi-pass evaluation
  • --best-of-n : Number of independent passes (e.g., 2–6)
  • Scores are averaged; detailed per-pass scores can optionally be logged.

TTS typically improves segment-level pairwise accuracy and stabilizes scores for difficult segments.
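
The averaging itself is simple. A sketch of the idea, where score_once is a hypothetical stand-in for a single Remedy-R evaluation pass:

from statistics import mean

def tts_score(score_once, src, mt, ref=None, best_of_n=4, base_seed=42):
    # Run best_of_n independent passes with different seeds and average.
    passes = [score_once(src, mt, ref=ref, seed=base_seed + i)
              for i in range(best_of_n)]
    return mean(passes), passes  # averaged score plus per-pass detail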


🌐 Optional: vLLM Online Serving

To avoid re-loading the model for every scoring run, you can:

  1. Start a local vLLM server (OpenAI-compatible):

remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9

  2. Score via the server:
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1

Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating LLM() in every process.
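
Since the server speaks the OpenAI API, you can also query it directly from Python. A sketch using the openai client; the message content below is a simplified stand-in for the actual Remedy-R evaluation prompt:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ShaomuTan/Remedy-R-14B",
    messages=[{"role": "user",
               "content": "Evaluate this translation ... final score in [0, 100]."}],
    temperature=0.6,
    max_tokens=2048,
)
print(response.choices[0].message.content)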


πŸ“„ Outputs

For each language pair SRC-TGT, Remedy-R writes:

  • results.jsonl
  • segment_scores.tsv
  • system_score.txt
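
These files are easy to consume downstream. A reading sketch; the record layout and TSV columns are assumptions about the schema, so inspect the actual files first:

import json
import pandas as pd

seg = pd.read_csv("segment_scores.tsv", sep="\t")   # per-segment scores
print(seg.head())

with open("results.jsonl") as f:
    record = json.loads(f.readline())  # full record, incl. reasoning trace
print(record.keys())

system_score = open("system_score.txt").read().strip()
print("system-level score:", system_score)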

πŸ“š Citation

If you use Remedy-R or this codebase, please cite:

arXiv preprint coming soon...