FRANKENSTALLM 3B

⚠️ v2 모델 교체 공지 (2026-03-26)

v2 GGUF 및 safetensors 파일이 변환 과정의 오류로 **1.2B 모델(hidden_size=2048, 24 layers)**로 잘못 배포되었습니다. 2026-03-26에 올바른 **3B ORPO 체크포인트(hidden_size=3072, 28 layers, vocab_size=64256, byte-fallback 적용)**로 교체 완료했습니다. 이전에 다운로드한 v2 파일이 있다면 재다운로드를 권장합니다.

한국어 3B LLM을 처음부터 직접 만들었습니다 — 토크나이저 학습부터 사전학습, SFT, ORPO까지, 8× NVIDIA B200 GPU 위에서.


개발자	pathcosmos
파라미터	~24억 (weight tying 적용, 3B급)
언어	한국어 (주), 영어 (부)
라이선스	Apache 2.0
학습	3단계: 사전학습 → SFT → ORPO
하드웨어	8× NVIDIA B200 (FP8), 총 ~86시간

빠른 시작

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "pathcosmos/frankenstallm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "한국의 전통 음식 중 김치에 대해 설명해주세요.",
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.2,  # 권장
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Ollama (GGUF)

# GGUF + Modelfile 다운로드
huggingface-cli download pathcosmos/frankenstallm \
  gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
  gguf/Modelfile.3b-v2-Q4_K_M \
  --local-dir ./frankenstallm

# Modelfile 내 FROM 경로 수정 후 생성
ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M

# 실행
ollama run frankenstallm

파일 다운로드 링크

모델 파일

파일	크기	설명	다운로드
`model.safetensors`	5.7 GB	HF Transformers 네이티브 (3B ORPO, byte-fallback)	Download
`config.json`	1 KB	모델 설정 (hidden=3072, 28L, vocab=64256)	Download
`tokenizer.json`	4 MB	토크나이저 (SentencePiece Unigram)	Download
`tokenizer.model`	1.4 MB	SentencePiece 모델 (GGUF 변환용)	Download
`sampling_config.json`	1 KB	권장 샘플링 파라미터	Download

GGUF (Ollama / llama.cpp)

파일	크기	양자화	다운로드
`frankenstallm-3b-v2-Q4_K_M.gguf`	1.8 GB	Q4_K_M (권장)	Download
`frankenstallm-3b-v2-Q8_0.gguf`	3.0 GB	Q8_0 (고품질)	Download
`frankenstallm-3b-v2-f16.gguf`	5.7 GB	F16 (무손실)	Download
`Modelfile.3b-v2-Q4_K_M`	1 KB	Ollama Modelfile (Q4)	Download
`Modelfile.3b-v2-Q8_0`	1 KB	Ollama Modelfile (Q8)	Download

v1 GGUF (byte-fallback 미적용)도 gguf/frankenstallm-3b-*.gguf로 제공되지만, v2 사용을 권장합니다.

학습 데이터 (SFT / ORPO 재현용)

파일	크기	용도	다운로드
`train_filtered.jsonl`	7.5 GB	SFT 학습 데이터 (24개 소스, 240만 샘플, 필터링 완료)	Download
`val_filtered.jsonl`	157 MB	SFT 검증 데이터	Download
`combined_preference.jsonl`	2.6 GB	ORPO 학습 데이터 (7개 소스 통합, 63만 쌍)	Download

ORPO Preference 데이터 개별 소스 (7종)

파일	크기	다운로드
`nayohan_preference-collection-ko-full.jsonl`	4.9 GB	Download
`heegyu_orca-math-korean-preference-cleaned.jsonl`	1.6 GB	Download
`kuotient_orca-math-korean-dpo-pairs.jsonl`	750 MB	Download
`maywell_ko_Ultrafeedback_binarized.jsonl`	394 MB	Download
`tellang_yeji-preference-ko-v1.jsonl`	171 MB	Download
`jojo0217_korean_rlhf_dataset.jsonl`	137 MB	Download
`lemon-mint_korean-realqa-reasoning-v01-preference.jsonl`	58 MB	Download

데이터 파이프라인 스크립트

파일	설명
`prepare_sft_data.py`	HF 데이터셋 → JSONL 정규화 (Alpaca 포맷)
`filter_sft_v2.py`	SFT 품질 필터링 (중복 제거, 반복률 필터)
`prepare_preference_combined.py`	Preference 데이터 통합 (DPO/ORPO용)
`tokenize_extra.py`	대용량 데이터 병렬 토크나이징
`sft_dataset.py`	SFT 데이터셋 로더 (Alpaca/대화 포맷)
`dataset.py`	사전학습 데이터셋 로더 (memmap .bin)
`build_korean_dataset.sh`	한국어 데이터 전체 파이프라인

Phase별 보고서

보고서	내용
`PROJECT_COMPLETION_REPORT`	프로젝트 최종 완료 보고서
`ORPO_EVALUATION_REPORT`	ORPO 10차원 종합 평가
`ORPO_TRAINING_JOURNEY`	ORPO 학습 여정 (HP sweep, 디버깅)
`SFT_COMPLETION_AND_EVAL`	SFT 완료 및 평가
`3B_BASE_EVALUATION`	사전학습 베이스 모델 평가
`Phase0_Optimization`	FP8 최적화 보고서

모델 특징

처음부터 만든 한국어 토크나이저: SentencePiece Unigram, 64K 어휘, 한국어 문자 커버리지 99.95%
3단계 학습 파이프라인: 사전학습 (57K 스텝, ~600억 토큰) → SFT (25.5K 스텝, 240만 샘플) → ORPO (10K 스텝, 63만 선호도 쌍)
B200 FP8 네이티브 학습: TransformerEngine MXFP8 — BF16 대비 이론적 2배 처리량
GGUF 배포 지원: Q4_K_M (1.8GB), Q8_0 (3.0GB), F16 (5.7GB) + Ollama Modelfile 제공

아키텍처

구성 요소	값
구조	Decoder-only Transformer (LLaMA 스타일)
Hidden size	3,072
레이어 수	28
어텐션 헤드	24
KV 헤드	8 (GQA 3:1)
FFN 차원	8,192 (SwiGLU)
어휘 크기	64,256 (byte-fallback 적용)
컨텍스트 길이	4,096 (학습 시 2,048)
위치 인코딩	RoPE (θ=500,000)
정규화	Pre-norm RMSNorm
어텐션 구현	FlashAttention-2
정밀도	FP8 (TransformerEngine MXFP8)
Weight tying	적용 (embedding ↔ lm_head)

학습 파이프라인

Phase 1: 사전학습

항목	값
스텝 수	57,000
최종 loss	1.466
학습 토큰	~600억 (385억 고유 × ~1.5 에폭)
소요 시간	~63시간
데이터	CC-100 KO, HPLT KO, C4 KO, 나무위키, 위키피디아 KO, Cosmopedia (EN)
배치 크기	5 × 8 GPU × 8 accum × 2,048 seq = ~65만 토큰/스텝

Phase 2: SFT (지도 미세조정)

항목	값
스텝 수	25,500 (77.3% 지점에서 조기 종료)
최적 val_loss	1.8851 (step 23,000)
소요 시간	~15.5시간
데이터	24개 소스, 243만 9,397 샘플 (7.48 GB)
구성	SFT 70% + 사전학습 리플레이 30% (치명적 망각 방지)
지식 망각률	0.9% (19개 데이터셋 기준)

Phase 3: ORPO (선호도 최적화)

항목	값
스텝 수	9,997 (조기 수렴)
최적 eval_loss	1.625
선호도 정확도	76.02%
보상 마진	0.6100
소요 시간	~7시간
데이터	한국어 HF 데이터셋 7종, ~63만 선호도 쌍
하이퍼파라미터	beta=0.25, lr=1.2e-5, eff_batch=128

총 학습 시간: 8× B200에서 약 86시간

벤치마크

학습 단계별 성능 변화 (Base → SFT → ORPO)

벤치마크	Base	SFT	ORPO	변화 (Base→ORPO)
KoBEST 평균 (0-shot)	43.7%	43.3%	52.8%	+9.1pp
KoBEST COPA	49.3%	48.6%	63.9%	+14.6pp
KoBEST HellaSwag-KO	21.6%	19.8%	38.0%	+16.4pp
KoBEST SentiNeg	48.6%	49.1%	62.5%	+13.9pp
KoBEST BoolQ	50.3%	50.1%	50.6%	+0.3pp
PIQA	52.5%	52.6%	59.9%	+7.3pp
ARC-Easy	25.6%	25.9%	36.0%	+10.4pp
HAE-RAE	19.7%	19.9%	21.8%	+2.1pp
HellaSwag EN	26.2%	26.1%	29.2%	+3.0pp
Greedy 3-gram 반복률	61.0%	73.0%	30.9%	-30.1pp
EOS 종료율	0%	60%	67%	+67pp
PPL 망각률	—	0.9%	4.1%	15% 이내 ✅

3B급 모델 비교 (Ollama, 35개 테스트)

모델	파라미터	한국어 NLU	지식	지시 수행	추론	평균 점수
Qwen 2.5 3B	3B	100.0	20.8	55.6	62.5	63.4
Phi-4 Mini	3.8B	66.7	29.2	33.3	87.5	60.6
FRANKENSTALLM 3B	3B	100.0	75.0	66.7	50.0	46.7

FRANKENSTALLM은 한국어 NLU (Qwen과 동률), 한국어 지식 (75.0 vs 20.8/29.2), 지시 수행 (66.7 vs 55.6/33.3)에서 앞섭니다.

추론 속도 (Ollama, Q4_K_M)

모델	평균 TTFT	TPS	비고
FRANKENSTALLM 3B	16.7ms	142.5	가장 빠름
Phi-4 Mini 3.8B	25.6ms	100.4
Qwen 2.5 3B	28.2ms	93.8

Perplexity 보존율 (ORPO 지식 유지)

데이터셋	Base PPL	ORPO PPL	망각률
Korean C4	5.72	5.87	+2.7%
Korean Wiki	11.84	12.21	+3.2%
최대 망각률	—	—	4.1% ✅

학습 데이터

사전학습 (~385억 토큰)

분류	소스	추정 토큰 수
한국어 웹 크롤	C4 KO, CC-100 KO, HPLT KO	~172억
한국어 백과사전	위키피디아 KO, 나무위키 (2개 버전)	~28억
영어 교육	Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan)	~57억
영어 수학·과학	AutoMathText, OpenWebMath, Proof-Pile-2	~85억
코드	StarCoder (필터링)	~43억

SFT (240만 샘플, 24개 소스)

영역	비율	주요 데이터셋
추론/CoT	38%	reasoning_r1_1.4m, magpie_reasoning
한국어 지시문	23%	korean_instruction_mix, open_korean_instructions, kullm_v2
영어 일반	16%	openhermes_2.5, ultrachat_200k
수학	12%	NuminaMath-CoT, orca-math-ko
대화/코드/기타	11%	smol-koreantalk, Evol-Instruct-Code-80k-ko

ORPO (~63만 선호도 쌍, 7개 소스)

데이터셋	용량	영역
nayohan/preference-collection-ko-full	4.9GB	일반 선호도
heegyu/orca-math-korean-preference-cleaned	1.6GB	수학 추론
kuotient/orca-math-korean-dpo-pairs	750MB	수학 DPO
maywell/ko_Ultrafeedback_binarized	394MB	피드백 정렬
tellang/yeji-preference-ko-v1	171MB	일반 선호도
jojo0217/korean_rlhf_dataset	137MB	RLHF 쌍
lemon-mint/korean-realqa-reasoning-v01-preference	58MB	QA 추론

GGUF & Ollama

제공 양자화 파일

파일	크기	설명
`gguf/frankenstallm-3b-v2-Q4_K_M.gguf`	1.8GB	권장 — 크기 대비 최적 품질
`gguf/frankenstallm-3b-v2-Q8_0.gguf`	3.0GB	높은 품질
`gguf/frankenstallm-3b-v2-f16.gguf`	5.7GB	전체 정밀도
`model.safetensors`	5.7GB	Transformers 네이티브 (3B ORPO best, byte-fallback 수정, vocab=64256)

권장 샘플링 파라미터

파라미터	값	비고
`temperature`	0.7	한국어 생성 품질 최적
`repeat_penalty`	1.2	필수 — 미적용 시 greedy 반복률 30.9%
`top_p`	0.9	Nucleus 샘플링
`top_k`	50	Top-k 후보 수
`max_tokens`	512	최대 생성 길이
`num_ctx`	4096	컨텍스트 윈도우 (초과 금지)

⚠️ 반드시 repeat_penalty >= 1.2를 사용하세요. 적용하면 반복률이 0% 로 떨어집니다. 미적용 시 greedy 디코딩에서 ~31% 3-gram 반복이 발생합니다.

제한 사항

영어 성능 제한: MMLU-EN ~23%, HellaSwag-EN ~29% — 한국어 특화 모델입니다
코드 생성: 거의 불가능 (학습 데이터에 코드 비중이 낮음)
Greedy 반복: repeat_penalty 미사용 시 30.9% 3-gram 반복 — 반드시 repeat_penalty >= 1.2 사용
안전성: 안전 정렬(safety alignment) 데이터가 학습에 포함되지 않았으므로 적절한 가드레일과 함께 사용하세요
규모 차이: 수조 토큰으로 학습된 상용 3B 모델 대비 ~600억 토큰으로 학습 — 전반적 벤치마크 점수는 낮을 수 있습니다

하드웨어 및 학습 환경

구성 요소	사양
GPU	8× NVIDIA B200 (183GB HBM3e × 8, 총 ~1.47TB)
FP8 연산	2,250 TFLOPS/GPU (총 18,000 TFLOPS)
인터커넥트	NVLink 5.0, NVSwitch all-to-all mesh
CPU	2× AMD EPYC 9365 (72코어, Zen 5)
RAM	2.21 TB DDR5
PyTorch	2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA 커스텀)
TransformerEngine	2.10.0
FlashAttention	2.7.4
NCCL	2.28.9
CUDA	13.1
총 학습 시간	~86시간 (사전학습 63h + SFT 15.5h + ORPO 7h)

인용

@misc{frankenstallm2026,
  title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
  author={pathcosmos},
  year={2026},
  url={https://huggingface.co/pathcosmos/frankenstallm},
  note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
}

링크 및 연락처

GitHub: pathcosmos/FRANKENSTALLM — 전체 소스코드, 학습 스크립트, 빌더 로그
HuggingFace: pathcosmos/frankenstallm
연락처: pathcosmos@gmail.com

감사의 글

이 프로젝트는 과학기술정보통신부의 「첨단 GPU 활용 지원 사업」 (과학기술정보통신부 공고 제2025-1068호)을 통해 제공된 GPU 컴퓨팅 자원을 활용하여 수행되었습니다.

국가 AI컴퓨팅자원 지원포털: https://aiinfrahub.kr

주관: 과학기술정보통신부 (MSIT), 정보통신산업진흥원 (NIPA)

운영: 한국정보통신진흥협회 (KAIT)

대한민국 정부의 AI 인프라 지원 사업 덕분에 8× NVIDIA B200 GPU 환경에서 한국어 3B LLM을 처음부터 학습할 수 있었습니다. 국가 차원의 AI 컴퓨팅 자원 지원에 깊이 감사드립니다.

🇺🇸 English version below

FRANKENSTALLM 3B

⚠️ v2 Model Replacement Notice (2026-03-26)

The v2 GGUF and safetensors files were incorrectly deployed as a 1.2B model (hidden_size=2048, 24 layers) due to a conversion pipeline error. On 2026-03-26, they were replaced with the correct 3B ORPO checkpoint (hidden_size=3072, 28 layers, vocab_size=64256, byte-fallback applied). If you downloaded v2 files before this date, please re-download.

A Korean 3B LLM built entirely from scratch — tokenizer, pretraining, SFT, and ORPO — on 8× NVIDIA B200 GPUs.


Developer	pathcosmos
Parameters	~2.4B (3B-class with weight tying)
Languages	Korean (primary), English (secondary)
License	Apache 2.0
Training	3-phase: Pretrain → SFT → ORPO
Hardware	8× NVIDIA B200 (FP8), ~86 hours total

Quick Start

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "pathcosmos/frankenstallm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "한국의 전통 음식 중 김치에 대해 설명해주세요.",
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.2,  # recommended
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Ollama (GGUF)

# Download GGUF + Modelfile
huggingface-cli download pathcosmos/frankenstallm \
  gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
  gguf/Modelfile.3b-v2-Q4_K_M \
  --local-dir ./frankenstallm

# Edit the FROM line in the Modelfile so it points at the downloaded GGUF, then create the model
ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M

# Run
ollama run frankenstallm

File Downloads

Model Files

File	Size	Description	Download
`model.safetensors`	5.7 GB	HF Transformers native (3B ORPO, byte-fallback)	Download
`config.json`	1 KB	Model config (hidden=3072, 28L, vocab=64256)	Download
`tokenizer.json`	4 MB	Tokenizer (SentencePiece Unigram)	Download
`tokenizer.model`	1.4 MB	SentencePiece model (for GGUF conversion)	Download
`sampling_config.json`	1 KB	Recommended sampling parameters	Download

GGUF (Ollama / llama.cpp)

File	Size	Quantization	Download
`frankenstallm-3b-v2-Q4_K_M.gguf`	1.8 GB	Q4_K_M (Recommended)	Download
`frankenstallm-3b-v2-Q8_0.gguf`	3.0 GB	Q8_0 (High quality)	Download
`frankenstallm-3b-v2-f16.gguf`	5.7 GB	F16 (Lossless)	Download
`Modelfile.3b-v2-Q4_K_M`	1 KB	Ollama Modelfile (Q4)	Download
`Modelfile.3b-v2-Q8_0`	1 KB	Ollama Modelfile (Q8)	Download

v1 GGUF files (without byte-fallback) are also available as gguf/frankenstallm-3b-*.gguf, but v2 is recommended.

Training Data (for SFT / ORPO reproduction)

File	Size	Purpose	Download
`train_filtered.jsonl`	7.5 GB	SFT training data (24 sources, 2.4M samples, filtered)	Download
`val_filtered.jsonl`	157 MB	SFT validation data	Download
`combined_preference.jsonl`	2.6 GB	ORPO training data (7 sources, 630K pairs)	Download

Individual ORPO Preference Sources (7 datasets)

File	Size	Download
`nayohan_preference-collection-ko-full.jsonl`	4.9 GB	Download
`heegyu_orca-math-korean-preference-cleaned.jsonl`	1.6 GB	Download
`kuotient_orca-math-korean-dpo-pairs.jsonl`	750 MB	Download
`maywell_ko_Ultrafeedback_binarized.jsonl`	394 MB	Download
`tellang_yeji-preference-ko-v1.jsonl`	171 MB	Download
`jojo0217_korean_rlhf_dataset.jsonl`	137 MB	Download
`lemon-mint_korean-realqa-reasoning-v01-preference.jsonl`	58 MB	Download

Data Pipeline Scripts

File	Description
`prepare_sft_data.py`	HF datasets → JSONL normalization (Alpaca format)
`filter_sft_v2.py`	SFT quality filtering (dedup, repetition filter)
`prepare_preference_combined.py`	Preference data merging (DPO/ORPO format)
`tokenize_extra.py`	Large-scale parallel tokenization
`sft_dataset.py`	SFT dataset loader (Alpaca/conversation format)
`dataset.py`	Pretraining dataset loader (memmap .bin)
`build_korean_dataset.sh`	End-to-end Korean data pipeline

Phase Reports

Report	Content
`PROJECT_COMPLETION_REPORT`	Final project completion report
`ORPO_EVALUATION_REPORT`	ORPO 10-dimension evaluation
`ORPO_TRAINING_JOURNEY`	ORPO training journey (HP sweep, debugging)
`SFT_COMPLETION_AND_EVAL`	SFT completion and evaluation
`3B_BASE_EVALUATION`	Pretrained base model evaluation
`Phase0_Optimization`	FP8 optimization report

Model Highlights

From-scratch Korean tokenizer: SentencePiece Unigram, 64K vocab, 99.95% Korean character coverage
3-phase training pipeline: Pretrain (57K steps, ~60B tokens) → SFT (25.5K steps, 2.4M samples) → ORPO (10K steps, 630K preference pairs)
B200 FP8 native training: TransformerEngine MXFP8 on NVIDIA B200 — 2× theoretical throughput vs BF16
GGUF deployment ready: Q4_K_M (1.8GB), Q8_0 (3.0GB), F16 (5.7GB) with optimized Ollama Modelfiles

Architecture

Component	Value
Type	Decoder-only Transformer (LLaMA-style)
Hidden size	3,072
Layers	28
Attention heads	24
KV heads	8 (GQA 3:1)
FFN dim	8,192 (SwiGLU)
Vocab size	64,256 (byte-fallback applied)
Context length	4,096 (trained at 2,048)
Position encoding	RoPE (θ=500,000)
Normalization	Pre-norm RMSNorm
Attention impl	FlashAttention-2
Precision	FP8 (MXFP8 via TransformerEngine)
Weight tying	Yes (embedding ↔ lm_head)
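
As a rough cross-check of the table above, the parameter count can be estimated from these dimensions alone. The sketch below assumes a standard LLaMA-style layout that the card does not spell out: bias-free linear layers, a SwiGLU block with gate/up/down projections, two RMSNorm weights per layer plus a final norm, and a head dimension of hidden size divided by attention heads. The function and argument names are illustrative and not taken from the project's code.

```python
def estimate_params(hidden=3072, layers=28, heads=24, kv_heads=8,
                    ffn=8192, vocab=64256, tied=True):
    """Estimate parameters for a bias-free LLaMA-style decoder (assumed layout)."""
    head_dim = hidden // heads                # 128 for the values above
    kv_dim = kv_heads * head_dim              # GQA: K/V projections are narrower than Q
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
    mlp = 3 * hidden * ffn                    # SwiGLU: gate, up, down projections
    norms = 2 * hidden                        # two RMSNorm weight vectors per layer
    embed = vocab * hidden                    # token embedding matrix
    total = layers * (attn + mlp + norms) + hidden + embed  # + final norm
    if not tied:
        total += vocab * hidden               # separate lm_head when weights are untied
    return total


TOTAL = estimate_params()
print(f"{TOTAL / 1e9:.2f}B parameters with tied embeddings")
```

With the table's values this estimate comes to roughly 3.0B with tied embeddings. A gap between it and a headline figure may reflect a different counting convention or implementation details not captured by these assumptions.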

Training Pipeline

Phase 1: Pretraining

Detail	Value
Steps	57,000
Final loss	1.466
Tokens seen	~60B (38.5B unique × ~1.5 epochs)
Duration	~63 hours
Data	CC-100 KO, HPLT KO, C4 KO, NamuWiki, Wikipedia KO, Cosmopedia (EN)
Batch size	5 × 8 GPU × 8 accum × 2,048 seq = ~655K tok/step
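
The batch-size row is a product of four factors. The short calculation below spells it out, reading the leading 5 as the per-GPU micro-batch size (an assumption, since the table does not label the factors); the function name is illustrative.

```python
def tokens_per_step(micro_batch, gpus, grad_accum, seq_len):
    """Tokens consumed by one optimizer step under data parallelism."""
    return micro_batch * gpus * grad_accum * seq_len


# Values from the table: 5 x 8 GPU x 8 accum x 2,048 seq
TOKENS_PER_STEP = tokens_per_step(micro_batch=5, gpus=8, grad_accum=8, seq_len=2048)
print(f"{TOKENS_PER_STEP:,} tokens per step")
```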

Phase 2: Supervised Fine-Tuning (SFT)

Detail	Value
Steps	25,500 (early stop at 77.3%)
Best val_loss	1.8851 (step 23,000)
Duration	~15.5 hours
Data	2,439,397 samples from 24 sources (7.48 GB)
Mix	70% SFT + 30% pretrain replay (catastrophic forgetting prevention)
Knowledge forgetting	0.9% (19 datasets)
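
One way to picture the 70/30 mix is a sampler that draws each training example from the SFT set or from pretraining replay data with fixed probabilities. The card does not say whether the project mixes per sample, per batch, or per token, so the sketch below is only an assumed per-sample version; `mixed_stream` and its arguments are hypothetical names.

```python
import itertools
import random


def mixed_stream(sft_samples, replay_samples, replay_ratio=0.30, seed=0):
    """Yield (source, sample) pairs, picking replay data with probability replay_ratio.

    Both inputs must be non-empty; each one is cycled endlessly.
    """
    rng = random.Random(seed)
    sft = itertools.cycle(sft_samples)
    replay = itertools.cycle(replay_samples)
    while True:
        if rng.random() < replay_ratio:
            yield ("replay", next(replay))
        else:
            yield ("sft", next(sft))


# Draw a few examples from toy data to see the interleaving.
for source, sample in itertools.islice(mixed_stream(["q1", "q2"], ["doc1"]), 5):
    print(source, sample)
```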

Phase 3: ORPO (Odds Ratio Preference Optimization)

Detail	Value
Steps	9,997 (early convergence)
Best eval_loss	1.625
Preference accuracy	76.02%
Reward margin	0.6100
Duration	~7 hours
Data	~630K preference pairs from 7 Korean HF datasets
Hyperparams	beta=0.25, lr=1.2e-5, eff_batch=128
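
For readers unfamiliar with ORPO, the objective adds an odds-ratio penalty to the usual SFT loss on the chosen response. The sketch below follows the formulation in the ORPO paper for a single preference pair, taking length-normalized average token log-probabilities as inputs. It assumes that `beta` in the table is the weight on the odds-ratio term (λ in the paper); it is not the project's training code, and the function names are illustrative.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_logp, rejected_logp, beta=0.25):
    """SFT loss on the chosen response plus beta times the odds-ratio loss."""
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(ratio)) computed as a numerically stable softplus(-ratio)
    or_term = max(-ratio, 0.0) + math.log1p(math.exp(-abs(ratio)))
    nll = -chosen_logp  # negative log-likelihood of the chosen response
    return nll + beta * or_term


# The penalty shrinks as the rejected response becomes less likely than the chosen one.
print(orpo_loss(-0.5, -0.6), orpo_loss(-0.5, -2.0))
```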

Total training time: ~86 hours on 8× B200

Benchmarks

Training Phase Progression (Base → SFT → ORPO)

Benchmark	Base	SFT	ORPO	Δ (Base→ORPO)
KoBEST Avg (0-shot)	43.7%	43.3%	52.8%	+9.1pp
KoBEST COPA	49.3%	48.6%	63.9%	+14.6pp
KoBEST HellaSwag-KO	21.6%	19.8%	38.0%	+16.4pp
KoBEST SentiNeg	48.6%	49.1%	62.5%	+13.9pp
KoBEST BoolQ	50.3%	50.1%	50.6%	+0.3pp
PIQA	52.5%	52.6%	59.9%	+7.3pp
ARC-Easy	25.6%	25.9%	36.0%	+10.4pp
HAE-RAE	19.7%	19.9%	21.8%	+2.1pp
HellaSwag EN	26.2%	26.1%	29.2%	+3.0pp
Greedy 3-gram repetition	61.0%	73.0%	30.9%	-30.1pp
EOS termination rate	0%	60%	67%	+67pp
PPL forgetting	—	0.9%	4.1%	within 15% ✅
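
The card does not define the 3-gram repetition metric. A common choice, assumed here, is the share of 3-grams in a generated sequence that duplicate an earlier 3-gram. The function name is illustrative, and the project's evaluation script may compute the metric differently.

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that repeat an n-gram seen earlier in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n tokens
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = "a b c a b c a b c".split()
print(f"{ngram_repetition_rate(looping):.3f}")
```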

3B-class Model Comparison (Ollama, 35 tests)

Model	Params	Korean NLU	Knowledge	Instruction	Reasoning	Avg Score
Qwen 2.5 3B	3B	100.0	20.8	55.6	62.5	63.4
Phi-4 Mini	3.8B	66.7	29.2	33.3	87.5	60.6
FRANKENSTALLM 3B	3B	100.0	75.0	66.7	50.0	46.7

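FRANKENSTALLM ties Qwen for the top Korean NLU score (100.0) and leads in Korean Knowledge (75.0 vs 20.8/29.2) and Instruction Following (66.7 vs 55.6/33.3), but trails both models in Reasoning (50.0) and in the overall average (46.7).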

Inference Speed (Ollama, Q4_K_M)

Model	Avg TTFT	TPS	Note
FRANKENSTALLM 3B	16.7ms	142.5	Fastest
Phi-4 Mini 3.8B	25.6ms	100.4
Qwen 2.5 3B	28.2ms	93.8

Perplexity Preservation (ORPO Knowledge Retention)

Dataset	Base PPL	ORPO PPL	Forgetting
Korean C4	5.72	5.87	+2.7%
Korean Wiki	11.84	12.21	+3.2%
Max forgetting	—	—	4.1% ✅
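
The forgetting column appears to be the relative increase in perplexity from the base model to the ORPO model. That reading is an assumption, as is the function name below. Because the PPL values in the table are rounded, recomputing from them can differ from the listed percentages by about 0.1pp.

```python
def forgetting_pct(base_ppl, tuned_ppl):
    """Relative perplexity increase, in percent (positive means worse than base)."""
    return (tuned_ppl - base_ppl) / base_ppl * 100.0


# PPL values copied from the table above
for name, base, tuned in [("Korean C4", 5.72, 5.87), ("Korean Wiki", 11.84, 12.21)]:
    print(f"{name}: {forgetting_pct(base, tuned):+.2f}%")
```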

Training Data

Pretraining (~38.5B tokens)

Category	Sources	Est. Tokens
Korean Web Crawl	C4 KO, CC-100 KO, HPLT KO	~17.2B
Korean Encyclopedia	Wikipedia KO, NamuWiki (2 versions)	~2.8B
English Educational	Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan)	~5.7B
English Math/Science	AutoMathText, OpenWebMath, Proof-Pile-2	~8.5B
Code	StarCoder (filtered)	~4.3B

SFT (2.4M samples, 24 sources)

Domain	Share	Key Datasets
Reasoning/CoT	38%	reasoning_r1_1.4m, magpie_reasoning
Korean Instructions	23%	korean_instruction_mix, open_korean_instructions, kullm_v2
English General	16%	openhermes_2.5, ultrachat_200k
Math	12%	NuminaMath-CoT, orca-math-ko
Dialog/Code/Other	11%	smol-koreantalk, Evol-Instruct-Code-80k-ko

ORPO (~630K preference pairs, 7 sources)

Dataset	Size	Domain
nayohan/preference-collection-ko-full	4.9GB	General preference
heegyu/orca-math-korean-preference-cleaned	1.6GB	Math reasoning
kuotient/orca-math-korean-dpo-pairs	750MB	Math DPO
maywell/ko_Ultrafeedback_binarized	394MB	Feedback alignment
tellang/yeji-preference-ko-v1	171MB	General preference
jojo0217/korean_rlhf_dataset	137MB	RLHF pairs
lemon-mint/korean-realqa-reasoning-v01-preference	58MB	QA reasoning

GGUF & Ollama

Available Quantizations

File	Size	Description
`gguf/frankenstallm-3b-v2-Q4_K_M.gguf`	1.8GB	Recommended — best size/quality balance
`gguf/frankenstallm-3b-v2-Q8_0.gguf`	3.0GB	Higher quality
`gguf/frankenstallm-3b-v2-f16.gguf`	5.7GB	Full precision
`model.safetensors`	5.7GB	Transformers native (3B ORPO best, byte-fallback fixed, vocab=64256)

Recommended Sampling Parameters

Parameter	Value	Notes
`temperature`	0.7	Optimal for Korean generation quality
`repeat_penalty`	1.2	Required — without it, greedy repetition is 30.9%
`top_p`	0.9	Nucleus sampling
`top_k`	50	Top-k candidates
`max_tokens`	512	Max generation length
`num_ctx`	4096	Context window (do not exceed)

⚠️ Always use repeat_penalty >= 1.2. With it, repetition drops to 0%. Without it, greedy decoding produces ~31% 3-gram repetition.
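
These settings can also be passed per request when the model is served through Ollama's local REST API. The sketch below assumes a model created under the name `frankenstallm` (as in the Quick Start) and an Ollama server on its default port. Note that Ollama's option for the maximum generation length is `num_predict`, which stands in for the `max_tokens` row above. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt, model="frankenstallm"):
    """Request body carrying the recommended sampling parameters."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
        "options": {
            "temperature": 0.7,
            "repeat_penalty": 1.2,
            "top_p": 0.9,
            "top_k": 50,
            "num_predict": 512,  # Ollama's name for the max generation length
            "num_ctx": 4096,
        },
    }


def generate(prompt, model="frankenstallm"):
    """Send the prompt to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("한국의 전통 음식 중 김치에 대해 설명해주세요.")` requires the server to be running and the model to have been created first.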

Limitations

English performance is limited: MMLU-EN ~23%, HellaSwag-EN ~29% — this is a Korean-focused model
Code generation: Near zero capability (limited code in training data)
Greedy repetition: 30.9% 3-gram repetition without repeat_penalty — always use sampling with repeat_penalty >= 1.2
Safety: Safety alignment data was not included in training; use with appropriate guardrails
Scale gap: Compared to commercial 3B models trained on trillions of tokens, this model was trained on ~60B tokens — expect lower overall benchmark scores

Hardware & Training Environment

Component	Specification
GPU	8× NVIDIA B200 (183GB HBM3e each, ~1.47TB total)
FP8 Compute	2,250 TFLOPS/GPU (18,000 TFLOPS total)
Interconnect	NVLink 5.0, NVSwitch all-to-all mesh
CPU	2× AMD EPYC 9365 (72 cores total, Zen 5)
RAM	2.21 TB DDR5
PyTorch	2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom)
TransformerEngine	2.10.0
FlashAttention	2.7.4
NCCL	2.28.9
CUDA	13.1
Total training	~86 hours (Pretrain 63h + SFT 15.5h + ORPO 7h)

Citation

@misc{frankenstallm2026,
  title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
  author={pathcosmos},
  year={2026},
  url={https://huggingface.co/pathcosmos/frankenstallm},
  note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
}

Links & Contact

GitHub: pathcosmos/FRANKENSTALLM — Full source code, training scripts, and builder's log
HuggingFace: pathcosmos/frankenstallm
Contact: pathcosmos@gmail.com

Related Projects

EVAFRILL-Mo | 🤗 HuggingFace — Hybrid Mamba-2 + Transformer sister project (2.94B params). While FRANKENSTALLM uses a pure Transformer architecture, EVAFRILL-Mo adopts Mamba-2 SSM + sparse Transformer attention. Both share the same tokenizer and training infrastructure.

Acknowledgment

This project was conducted using GPU computing resources provided through the "Advanced GPU Utilization Support Program" (MSIT Notice No. 2025-1068) by the Ministry of Science and ICT (MSIT) of the Republic of Korea.

National AI Computing Resource Support Portal: https://aiinfrahub.kr

Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)

Operated by: Korea Association for ICT Promotion (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B LLM from scratch on 8× NVIDIA B200 GPUs.

Evaluation results (self-reported, 0-shot)

Benchmark	Metric	Score
KoBEST	Average	52.750
KoBEST	COPA	63.900
KoBEST	HellaSwag-KO	38.000
KoBEST	SentiNeg	62.500
KoBEST	BoolQ	50.600
KoBEST	WiC	48.800
HAE-RAE	Average	21.810
PIQA	Accuracy	59.900
ARC-Easy	Accuracy	36.000