lfm2-350M-med
Small medical fine-tune on top of LiquidAI’s LFM2-350M.
This checkpoint specializes the 350M LFM2 base for medical Q&A and tool-augmented search, using a lightweight recipe designed for laptops and edge boxes.
⚠️ Medical safety: This model is not a clinician. It may hallucinate and should not be used for diagnosis or treatment. Always seek qualified medical supervision.
TL;DR
- Base: LiquidAI/LFM2-350M.
- Training:
  - SFT on open-source medical data + tool-calling (search) traces
  - DPO preference alignment using MedMCQA as a preference signal
  - Post-merge with the base via Arcee Fusion (MergeKit) for controlled weight fusion
- Eval (author’s harness):
  - MMLU-Pro: 19.46 (vs 18.76 base in same harness)
  - IFEVAL: 52.595 (vs 61.72 base in same harness)
  Note: LFM2’s official IFEVAL uses a different internal harness and reports ~65 on IFEVAL for the base; numbers are not directly comparable across harnesses.
What’s inside
Base model: LFM2-350M
- Designed for on-device inference, with strong CPU latency and a ChatML-like template.
- Supports tool use with dedicated special tokens (<tool_call>,</tool_call>, etc.).
 See the base card for the full template and examples.
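As a rough illustration of how a caller might pick tool calls out of generated text, here is a minimal sketch. It assumes the delimiters are the `<tool_call>` and `</tool_call>` strings named above; the helper `extract_tool_calls` and the JSON payload in `sample` are hypothetical, and the base card remains the authority on the real token names and payload format.

```python
import re


def extract_tool_calls(text, open_tag="<tool_call>", close_tag="</tool_call>"):
    """Return the raw payload strings found between open_tag and close_tag.

    The tags are parameters because the exact special tokens should be
    taken from the base model card; the defaults here are an assumption.
    """
    pattern = re.escape(open_tag) + r"(.*?)" + re.escape(close_tag)
    # DOTALL lets a payload span several lines.
    return [payload.strip() for payload in re.findall(pattern, text, flags=re.DOTALL)]


# Hypothetical model output; the payload format is illustrative only.
sample = (
    "Let me look that up."
    '<tool_call>{"name": "search", "arguments": {"query": "erysipelas vs cellulitis"}}</tool_call>'
)

calls = extract_tool_calls(sample)
print(calls)
```

When decoding with Transformers, pass `skip_special_tokens=False` if the delimiters are registered as special tokens; otherwise they are stripped from the decoded text before a parser like this can see them.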
Specialization steps
- Domain SFT (medical + tools)
  - Instruction-style Q&A from open medical sources and synthetic conversions.
  - Tool-use (search) supervised traces to teach function-calling patterns.
- Preference alignment (DPO)
  - Direct Preference Optimization with MedMCQA-derived preferences to bias toward clinically reasonable short answers.
  - Rationale: DPO is simple, stable at a small scale, and works well for short-form medical responses.
- Model fusion (Arcee Fusion)
  - Final merge uses Arcee Fusion in MergeKit, which selectively fuses parameters to avoid over-averaging and can be configured via merge_method: arcee_fusion.
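To make the preference-alignment step more concrete, the sketch below shows one plausible way to derive preference pairs from a multiple-choice record and how the standard DPO loss scores a single pair. This is not the author's pipeline: the field names (`question`, `opa` to `opd`, `cop`) assume the public MedMCQA layout, the sample record is illustrative, and `medmcqa_to_pairs`, `dpo_loss`, and `beta=0.1` are hypothetical choices.

```python
import math


def medmcqa_to_pairs(record):
    """Turn one MedMCQA-style record into prompt/chosen/rejected dicts.

    Assumption: 'cop' is the 0-based index of the correct option and the
    correct option is preferred over each incorrect one.
    """
    options = [record["opa"], record["opb"], record["opc"], record["opd"]]
    chosen = options[record["cop"]]
    return [
        {"prompt": record["question"], "chosen": chosen, "rejected": wrong}
        for i, wrong in enumerate(options)
        if i != record["cop"]
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are summed log-probabilities of each response under the policy
    and under the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative record, not taken from the dataset.
record = {
    "question": "Which organism most commonly causes erysipelas?",
    "opa": "Streptococcus pyogenes",
    "opb": "Staphylococcus epidermidis",
    "opc": "Escherichia coli",
    "opd": "Candida albicans",
    "cop": 0,
}

pairs = medmcqa_to_pairs(record)
print(len(pairs), pairs[0]["chosen"], "|", pairs[0]["rejected"])

# Loss falls below log(2) once the policy favors the chosen answer more than the reference does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In practice a library trainer would compute the log-probabilities and batch the loss; the scalar version here only shows which quantities the objective compares.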
Intended use & limitations
Use: education, research.
Don’t use: for medical advice, diagnosis, or treatment decisions.
Evaluation
All results below were run with the author’s harness; they will differ from LiquidAI’s internal suite and Open LLM Leaderboard settings.
| Benchmark | lfm2-350M-med | LFM2-350M (same harness) | 
|---|---|---|
| MMLU-Pro | 19.46 | 18.76 | 
| IFEVAL | 52.595 | 61.72 | 
- MMLU-Pro raises difficulty with 10 answer choices and more reasoning-heavy items. Small models typically score lower on it than on standard MMLU, so even small absolute changes can be informative.
- IFEVAL measures verifiable instruction-following; scores depend heavily on prompt templates and verification scripts.
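For readers who want the size of each change at a glance, the short sketch below derives absolute and relative differences from the table values. The helper name `deltas` is hypothetical, and the calculation says nothing about statistical significance, since run-to-run variance is not reported.

```python
# Scores copied from the table: (lfm2-350M-med, LFM2-350M in the same harness).
scores = {
    "MMLU-Pro": (19.46, 18.76),
    "IFEVAL": (52.595, 61.72),
}


def deltas(tuned, base):
    """Return (absolute change in points, relative change in percent of base)."""
    absolute = tuned - base
    relative_pct = 100.0 * absolute / base
    return round(absolute, 3), round(relative_pct, 2)


for name, (tuned, base) in scores.items():
    abs_d, rel_d = deltas(tuned, base)
    print(f"{name}: {abs_d:+.3f} points ({rel_d:+.2f}% relative to base)")
```

The MMLU-Pro gain is under one point while the IFEVAL drop is about nine points, which is the trade-off the table summarizes.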
Quickstart (Transformers)
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "mkurman/lfm2-350M-med"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
messages = [
  {"role": "system", "content": "You are a careful medical assistant. Cite sources and warn that outputs are not medical advice."},
  {"role": "user", "content": "Briefly explain the difference between cellulitis and erysipelas."}
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)  # tokenize once via the template so special tokens are not added twice
out = model.generate(**inputs, max_new_tokens=256)
new_tokens = out[0][inputs["input_ids"].shape[-1]:]  # drop the prompt, keep only the generated continuation
print(tok.decode(new_tokens, skip_special_tokens=True))