Swallow Education Classifier
Model summary
This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:
- Wiki-based classifier: trained on Japanese Wikipedia text in academic categories.
- LLM-based classifier: trained on educational value annotations generated by LLMs.
The Wiki-based classifier is distributed under the CC BY-SA 4.0 license. The LLM-based classifier is distributed under the same license as the LLM used for annotation (the Llama 3.1 Community License Agreement or the Gemma Terms of Use).
These classifiers were employed for quality filtering in the Swallow Corpus Version 2*, which was used to train the Llama 3.1 Swallow series. Our experiments demonstrated that educational quality filtering based on the classifier scores effectively enhanced the LLM's Japanese knowledge under the same computational budget.
NOTE: This classifier is designed to work with Japanese text. Its functionality and quality are not guaranteed for non-Japanese languages, including English.
* A large Japanese web corpus extracted from Common Crawl
How to use
The Wiki-based classifier outputs a probability between 0 and 1 indicating how similar a given document is to Wikipedia content. The LLM-based classifier, in contrast, treats scoring as a 4-class classification problem and predicts one of four labels (0, 1, 2, or 3) for a given document. The expected score, computed as the probability-weighted sum of the class labels (and thus ranging from 0 to 3), can be used as an educational score.
pip install numpy==1.26.4 fasttext huggingface_hub
from huggingface_hub import hf_hub_download
import fasttext
# Example text
text = "Llama 3.1 Swallow\nLlama 3.1 SwallowはLlama 3.1の英語の能力を維持しながら、日本語の能力を強化した大規模言語モデル (8B, 70B) です。"
text = text.replace("\n", " ")
# If you use Wiki-based classifier
model = fasttext.load_model(hf_hub_download("tokyotech-llm/edu-classifier", "wiki.bin"))
res = model.predict(text, k=-1)
## Use the positive prediction probability as the educational score
edu_score = res[1][0] if res[0][0] == "__label__pos" else 1 - res[1][0]
# If you use LLM-based classifier
model = fasttext.load_model(
hf_hub_download("tokyotech-llm/edu-classifier", "llm_llama.bin")
)
res = model.predict(text, k=-1)
## Use the weighted sum of the prediction probabilities as the educational score
edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
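For corpus filtering, the expected score above can be turned into a simple thresholding step. The sketch below is illustrative only: the helper names and the threshold of 1.5 are our assumptions, not part of this repository.

```python
# Hypothetical helpers (not part of this repository) for filtering documents
# by the expected educational score of the LLM-based classifier.

def expected_score(labels, probs):
    # Labels look like "__label__0" ... "__label__3"; the expected score
    # is the probability-weighted sum of the class values (0 to 3).
    return sum(int(label[-1]) * prob for label, prob in zip(labels, probs))

def filter_by_score(predictions, threshold=1.5):
    # predictions: list of (document, labels, probs) triples, where
    # (labels, probs) come from model.predict(text, k=-1)
    return [doc for doc, labels, probs in predictions
            if expected_score(labels, probs) >= threshold]

# Toy predictions with made-up probabilities
preds = [
    ("educational doc", ("__label__3", "__label__2", "__label__1", "__label__0"),
     (0.6, 0.3, 0.08, 0.02)),   # expected score 2.48
    ("spammy doc", ("__label__0", "__label__1", "__label__2", "__label__3"),
     (0.7, 0.2, 0.07, 0.03)),   # expected score 0.43
]
print(filter_by_score(preds))  # keeps only "educational doc"
```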
Best practice
In our research, both classifiers proved effective. However, we recommend the LLM-based classifier if you want to assign appropriate educational scores to a broader range of documents. The Wiki-based classifier, designed to detect content resembling academic Wikipedia articles, assigns scores close to 0 to most documents. In contrast, the LLM-based classifier computes scores based on a more general definition of educational value.
Training
Both classifiers were trained with fastText for 20 epochs on the training data. Character n-grams (n = 2, 3) were used as features; word n-grams were not used, as they did not improve accuracy.
Wiki-based Classifier
We built this classifier by treating Wikipedia articles as positive examples of educational documents. Since not all Wikipedia articles are necessarily "educational" (e.g., articles about individuals), we extracted 37,399 Japanese Wikipedia articles from academic categories as positive training examples. We randomly sampled 37,399 documents from the Swallow Corpus Version 2 as negative examples.
LLM-based Classifier
Inspired by FineWeb-Edu, we constructed the classifier through the following steps:
- Randomly sample 200,000 documents from the Swallow Corpus Version 2 and add 31,059 manually selected web articles.
- Evaluate the educational value of the documents from step 1 using Llama 3.1 70B Instruct (or Gemma 2 27B IT) with a custom prompt. The evaluation is based on three criteria: (1) whether the topic is highly academic; (2) whether the document provides deep insights or discussion; (3) whether it is easy for a general audience to understand. The educational value is scored on a 4-point Likert scale.
- Train a fastText classifier using the automatically scored documents from step 2 as training data. This classifier predicts the probability of each class (0, 1, 2, or 3).
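As a concrete illustration of step 3, fastText supervised training expects one `__label__<class> <text>` line per document. A hypothetical conversion of the LLM-scored documents might look like:

```python
# Hypothetical conversion (not part of this repository) from LLM-scored
# documents to fastText's supervised training format.
scored_docs = [
    (3, "量子力学は物理学の基礎理論であり、現代技術の多くを支えている。"),
    (0, "本日限定セール!\n今すぐクリック!"),
]

def to_fasttext_line(score, text):
    # fastText treats newlines as record separators, so flatten them first
    clean = text.replace("\n", " ")
    return f"__label__{score} {clean}"

lines = [to_fasttext_line(s, t) for s, t in scored_docs]
print(lines[1])  # "__label__0 本日限定セール! 今すぐクリック!"
```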
Acknowledgments
This research is based on results obtained from a project, JPNP18002, commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and the AIST policy-based budget project "R&D on Generative AI Foundation Models for the Physical Domain". In addition, the continual pre-training experiments were supported by the "Support Program for Building Large Language Models" of the AI Bridging Cloud Infrastructure (ABCI), developed and operated by the National Institute of Advanced Industrial Science and Technology (AIST).
Citation
The preprint (in Japanese only) can be downloaded here.
@inproceedings{hattori-2025-swallow-v2,
author = {服部 翔 and 岡崎 直観 and 水木 栄 and 藤井 一喜 and 中村 泰士 and 大井 聖也 and 塩谷 泰平 and 齋藤 幸史郎 and Youmi Ma and 前田 航希 and 岡本 拓己 and 石田 茂樹 and 横田 理央 and 高村 大也},
title = {Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築},
booktitle = {言語処理学会第31回年次大会 (NLP2025)},
year = {2025},
}