HKELECTRA - ELECTRA Pretrained Models for Hong Kong Content
This repository contains ELECTRA models pretrained on Hong Kong Cantonese and Traditional Chinese content, built to study the effects of diglossia on NLP modeling.
The repo includes:
generator/
: Generator model in HuggingFace Transformers format, for masked token prediction.
discriminator/
: Discriminator model in HuggingFace Transformers format, for replaced token detection.
tf_checkpoint/
: Original TensorFlow checkpoint from pretraining (requires TensorFlow to load).
runs/
: TensorBoard logs from pretraining.
Note: Because this repo contains multiple models with different purposes, there is no pipeline_tag. Users should select the appropriate model and pipeline for their use case. The TensorFlow checkpoint requires TensorFlow >= 2.x and must be loaded manually.
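If you only want to inspect the raw TensorFlow checkpoint, the following is a minimal sketch, assuming TensorFlow 2.x and huggingface_hub are installed and that the checkpoint files sit directly under tf_checkpoint/ (the exact file layout and prefix may differ):

from huggingface_hub import snapshot_download
import tensorflow as tf

# Download only the TensorFlow checkpoint files from the repo
local_dir = snapshot_download(repo_id="SolarisCipher/HKELECTRA", allow_patterns=["tf_checkpoint/*"])

# List the variable names and shapes stored in the checkpoint.
# If the directory has no checkpoint state file, point this at the checkpoint prefix instead.
for name, shape in tf.train.list_variables(f"{local_dir}/tf_checkpoint"):
    print(name, shape)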
This model is also available at Zenodo: https://doi.org/10.5281/zenodo.16889492
Model Details
Model Description
Architecture: ELECTRA (small/base/large)
Pretraining: from scratch (no base model)
Languages: Hong Kong Cantonese, Traditional Chinese
Intended Use: Research, feature extraction, masked token prediction
License: cc-by-4.0
Usage Examples
Load Generator (Masked LM)
from transformers import ElectraTokenizer, ElectraForMaskedLM, pipeline
tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA/generator/small")
model = ElectraForMaskedLM.from_pretrained("SolarisCipher/HKELECTRA/generator/small")
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("從中環[MASK]到尖沙咀。")
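The fill-mask pipeline returns the top candidates for the [MASK] position, each with a score, the predicted token string, and the completed sequence.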
Load Discriminator (Feature Extraction / Replaced Token Detection)
from transformers import ElectraTokenizer, ElectraForPreTraining
import torch
# Load the discriminator from its subfolder in the repo (pick small/base/large as needed).
tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
inputs = tokenizer("從中環坐車到[MASK]。", return_tensors="pt")
outputs = model(**inputs)  # per-token logits for replaced token detection
# sigmoid(logit) > 0.5 flags a token as likely replaced
predictions = torch.sigmoid(outputs.logits) > 0.5
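For plain feature extraction (one of the intended uses listed above), a minimal sketch, assuming the same discriminator/small subfolder layout, is to load the bare encoder and read its hidden states; the example sentence is illustrative:

from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
model = ElectraModel.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")
features = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size) token embeddings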
Citation
If you use these models in your work, please cite our datasets and the original research:
Dataset (Upstream SQL Dump)
@dataset{yung_2025_16875235,
author = {Yung, Yiu Cheong},
title = {HK Web Text Corpus (MySQL Dump, raw version)},
month = aug,
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.16875235},
url = {https://doi.org/10.5281/zenodo.16875235},
}
Dataset (Cleaned Corpus)
@dataset{yung_2025_16882351,
author = {Yung, Yiu Cheong},
title = {HK Content Corpus (Cantonese \& Traditional Chinese)},
month = aug,
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.16882351},
url = {https://doi.org/10.5281/zenodo.16882351},
}
Research Paper
@article{10.1145/3744341,
author = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},
title = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},
year = {2025},
issue_date = {July 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {24},
number = {7},
issn = {2375-4699},
url = {https://doi.org/10.1145/3744341},
doi = {10.1145/3744341},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = jul,
articleno = {71},
numpages = {16},
keywords = {Hong Kong, diglossia, ELECTRA, language modeling}
}