Taiwan-ELM-270M / README.md
metadata
library_name: transformers
license: apache-2.0
datasets:
  - liswei/zhtw-news-and-articles-2B
base_model: apple/OpenELM-270M
language:
  - zh

Model Card for Chinese-OpenELM-270M

Continually pre-trained from apple/OpenELM-270M on liswei/zhtw-news-and-articles-2B:

  • Extended vocabulary from 32000 to 61758 tokens with additional Traditional Chinese characters.
    • The tokenizer was trained on liswei/zhtw-news-and-articles-2B and then pruned from 96000 to 61758 tokens while maintaining 95% coverage of the pre-training dataset.
    • Additional token embeddings are initialized with the mean vector of existing embeddings.
  • Traditional Chinese perplexity = 1.6871 on a held-out evaluation dataset.
  • Applied GaLore for efficient training with the following hyperparameters:
    • Rank: 1024
    • Scale: 4.0
    • Update interval: 200
    • Layer-wise training: False
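
Two numerical details from the list above can be illustrated with a small, dependency-free sketch: filling new embedding rows with the mean of the existing rows, and computing perplexity as the exponential of the mean negative log-likelihood. This is not the training code used for this model. The helper names (`mean_vector`, `extend_embeddings`, `perplexity`) and the toy values are hypothetical, and the card does not state how the reported perplexity was normalized, so the per-token, natural-log definition here is an assumption.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length embedding rows."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(dim)]


def extend_embeddings(rows, new_vocab_size):
    """Copy `rows` and append mean-vector rows until there are `new_vocab_size` rows."""
    if new_vocab_size < len(rows):
        raise ValueError("new_vocab_size must be >= current vocabulary size")
    mean = mean_vector(rows)
    extra = new_vocab_size - len(rows)
    # Each new row gets its own copy of the mean so rows can diverge during training.
    return [list(r) for r in rows] + [list(mean) for _ in range(extra)]


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood), natural log (assumed definition)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy table: 3 tokens, 2 dimensions (illustrative values only;
# the real model grows from 32000 to 61758 rows).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
extended = extend_embeddings(toy, 5)
print(extended[3])  # [3.0, 4.0]

# A model that assigns probability 1/2 to every token has perplexity close to 2.
print(perplexity([math.log(2.0)] * 4))
```

Under the same assumed definition, the reported perplexity of 1.6871 would correspond to a mean per-token loss of `math.log(1.6871)` nats.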