MADLAD-400 3B-MT — MLX (Apple Silicon)

Quantized MLX port of google/madlad400-3b-mt for on-device, many-to-many translation across 400+ languages on Apple Silicon. Apache 2.0.

Architecture

T5 v1.1 encoder-decoder with relative position bias:

32 encoder + 32 decoder layers
d_model = 1024, d_kv = 128, num_heads = 16, d_ff = 8192
Gated GeLU FFN (wi_0, wi_1, wo)
RMSNorm pre-norm, no biases
Relative position bias (32 buckets, max distance 128) — first layer of each stack
Separate lm_head (NOT tied to embeddings)
SentencePiece vocabulary, 256,000 tokens (includes 400+ <2xx> target-language tokens)
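To make the "32 buckets, max distance 128" item concrete, here is a minimal Python sketch of T5-style relative position bucketing. It follows the scheme used in the public T5 reference implementations; that this port matches it exactly is an assumption, and the function name is illustrative only.

```python
import math


def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map a signed offset (key position - query position) to a bucket index."""
    bucket = 0
    if bidirectional:
        # Encoder: half of the buckets cover keys after the query,
        # the other half cover keys at or before it.
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        distance = abs(relative_position)
    else:
        # Decoder self-attention: only past positions matter,
        # so offsets pointing at future keys collapse to distance 0.
        distance = max(-relative_position, 0)

    max_exact = num_buckets // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance

    # Larger distances share logarithmically spaced buckets up to max_distance;
    # anything beyond max_distance lands in the last bucket.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


if __name__ == "__main__":
    for offset in (-200, -8, -1, 0, 1, 200):
        print(offset, relative_position_bucket(offset))
```

The bucket index selects a row of the learned bias table, which holds one scalar per attention head; that scalar is added to the attention logits before the softmax.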

Variants

Variant	Size	Path
INT4	~1.6 GB	`int4/model.safetensors`
INT8	~3.1 GB	`int8/model.safetensors`

Each variant includes config.json, tokenizer.json, tokenizer_config.json, special_tokens_map.json, and spiece.model.
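As a rough sanity check on the sizes above, the following Python sketch estimates the on-disk footprint of group-wise quantization at group_size=64 (see Conversion). It assumes each group stores one fp16 scale and one fp16 bias next to the packed weights, and that about 2.94 billion weights are quantized; both are assumptions made for this estimate, not figures stated in this card.

```python
def bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight: packed value plus per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_gb(n_weights, bits, group_size=64):
    """Approximate size in decimal gigabytes of the quantized weights alone."""
    return n_weights * bits_per_weight(bits, group_size) / 8 / 1e9


# Assumption: rough count of weights that get quantized (not from this card).
N_QUANTIZED = 2.94e9

if __name__ == "__main__":
    for bits in (4, 8):
        print(f"INT{bits}: {bits_per_weight(bits)} bits/weight, "
              f"about {estimated_gb(N_QUANTIZED, bits):.2f} GB")
```

Under these assumptions the estimate comes to about 1.65 GB for INT4 and 3.12 GB for INT8, close to the listed sizes. The small fp16 tensors (norm scales, relative position bias) are ignored here.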

Usage

import MADLADTranslation

let translator = try await MADLADTranslator.fromPretrained(quantization: .int4)
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"

let zh = try translator.translate("Where is the library?", to: "zh")
// → "图书馆在哪里？"

The only parameter required besides the input text is the target language; no source language is passed, because MADLAD infers it from the input. Specify the target as an ISO 639-1 code (or any of MADLAD's supported language tags); the tokenizer turns it into a leading <2{lang}> token.
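Independent of the Swift API, the prefix format itself can be sketched in a few lines of Python. The helper name `build_madlad_input` is hypothetical, and the single space between the tag and the text follows the examples on the upstream google/madlad400-3b-mt model card; how this package assembles the input internally is not shown here.

```python
def build_madlad_input(text, target_lang):
    """Prefix the source text with MADLAD's <2xx> target-language tag."""
    tag = target_lang.strip()
    if not tag:
        raise ValueError("target_lang must be a non-empty language tag")
    # The model reads the leading tag to decide which language to produce.
    return f"<2{tag}> {text}"


if __name__ == "__main__":
    print(build_madlad_input("Hello, how are you?", "es"))
    # prints: <2es> Hello, how are you?
```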

CLI:

audio translate "Good morning" --to fr
audio transcribe meeting.wav | audio translate --to es

Part of the soniqo speech toolkit for Apple Silicon.

Conversion

Quantized directly from google/madlad400-3b-mt using mx.quantize() (group_size=64). The duplicate encoder.embed_tokens.weight and decoder.embed_tokens.weight keys are dropped; both encoder and decoder reuse shared.weight directly. The linear projections (q/k/v/o, wi_0/wi_1/wo, lm_head) and the shared embedding table are quantized; RMSNorm scales and the relative-position-bias table stay in fp16.
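The key handling described above can be sketched in plain Python. The key names follow the Hugging Face T5 checkpoint layout, which is assumed here rather than taken from this repository, and the helper names are hypothetical. The actual mx.quantize() call is left out so the sketch runs without MLX.

```python
# Assumption: Hugging Face T5 key naming; the real conversion script may differ.
DUPLICATE_EMBEDDING_KEYS = {
    "encoder.embed_tokens.weight",
    "decoder.embed_tokens.weight",
}
KEEP_FP16_MARKERS = ("layer_norm", "relative_attention_bias")


def drop_duplicate_embeddings(weights):
    """Remove the per-stack embedding copies; shared.weight is kept."""
    return {k: v for k, v in weights.items() if k not in DUPLICATE_EMBEDDING_KEYS}


def should_quantize(name):
    """True for weight matrices, False for norm scales and position-bias tables."""
    if not name.endswith(".weight"):
        return False
    return not any(marker in name for marker in KEEP_FP16_MARKERS)


def plan_conversion(weights):
    """Return {key: 'quantize' | 'fp16'} for every tensor that is kept."""
    kept = drop_duplicate_embeddings(weights)
    return {name: ("quantize" if should_quantize(name) else "fp16") for name in kept}


if __name__ == "__main__":
    example = {
        "shared.weight": None,
        "encoder.embed_tokens.weight": None,
        "decoder.embed_tokens.weight": None,
        "encoder.block.0.layer.0.SelfAttention.q.weight": None,
        "encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight": None,
        "encoder.block.0.layer.0.layer_norm.weight": None,
        "encoder.block.0.layer.1.DenseReluDense.wi_0.weight": None,
        "lm_head.weight": None,
    }
    for name, action in plan_conversion(example).items():
        print(f"{action:9s} {name}")
```

Each tensor marked "quantize" would then go through mx.quantize() with group_size=64 and the variant's bit width, while the "fp16" tensors are saved unchanged.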