---
license: cc-by-sa-3.0
language:
- de
---

# xLSTM Model trained on German Wikipedia

Research & development of an xLSTM model trained on German Wikipedia.

The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).

For pretraining this xLSTM model, we use this [fork](https://github.com/HallerPatrick/helibrunna) (from [Patrick Haller](https://huggingface.co/PatrickHaller)) of the awesome [Helibrunna](https://github.com/AI-Guru/helibrunna) library.

Initially, we integrated xLSTM model training into Flair - for more information about this, please refer to the archived [flair-old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch of this repository.

# Changelog

- 28.08.2024: Model training is now done with the [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
- 10.06.2024: Initial version. The xLSTM model was trained with the Flair library - see the [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.

# Training

The current model was trained with commit `f66cc55` from the [`main` branch](https://github.com/HallerPatrick/helibrunna) of the forked Helibrunna repo.

The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually - also make sure that Ninja is installed (`pip3 install Ninja`).

The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used.

The following training configuration is used:

```yaml
description: "Train a wikipedia xLSTM"

training:
  model_name: "german_wikipedia"
  batch_size: 10
  lr: 6e-4
  lr_warmup_steps: 4584
  lr_decay_until_steps: "auto"
  lr_decay_factor: 0.001
  weight_decay: 0.1
  amp_precision: bfloat16
  weight_precision: float32
  enable_mixed_precision: true
  num_epochs: 1
  output_dir: "./output"
  save_every_step: 2000
  log_every_step: 10
  generate_every_step: 5000
  wandb_project: "xlstm"
  gradient_clipping: "auto"
  # wandb_project: "lovecraftxlstm"

model:
  num_blocks: 24
  embedding_dim: 768
  mlstm_block:
    mlstm:
      num_heads: 4
  slstm_block: {}
  slstm_at: []
  context_length: 512

dataset:
  output_path: "./output/german-wikipedia-dataset"
  hugging_face_id: ["stefan-it/dewiki-20230701"]
  split: "train" # Also subsetting is possible: "train[:100000]"
  shuffle: False
  seed: 42

tokenizer:
  type: "pretrained"
  pretrained_class: "LlamaTokenizer"
  pretrained_id: "meta-llama/Llama-2-7b-hf"
```

# Caveats

Notice: this model integration is heavily under development and we are still in the process of finding good hyper-parameters. Downstream experiments are coming very soon.

Unfortunately, NaNs occur during training:

![Training Loss](training-loss.png)
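For reference, the `model` section of the configuration above roughly corresponds to the following construction with the `xlstm` library. This is only a sketch based on the library's public config classes - it is not the code path Helibrunna actually uses, and field names may differ between library versions.

```python
# Rough sketch only: how the "model" section of the configuration above maps
# onto the config classes of the `xlstm` library (not the Helibrunna code).
from xlstm import (
    mLSTMBlockConfig,
    mLSTMLayerConfig,
    xLSTMLMModel,
    xLSTMLMModelConfig,
)

config = xLSTMLMModelConfig(
    vocab_size=32000,  # vocabulary size of the Llama-2 tokenizer
    num_blocks=24,
    embedding_dim=768,
    context_length=512,
    mlstm_block=mLSTMBlockConfig(mlstm=mLSTMLayerConfig(num_heads=4)),
    slstm_at=[],  # no sLSTM blocks in this configuration
)

model = xLSTMLMModel(config)
print(sum(p.numel() for p in model.parameters()))
```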
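The dataset and tokenizer referenced in the configuration can also be loaded directly with the Hugging Face `datasets` and `transformers` libraries. A minimal sketch, assuming the dataset exposes a plain-text `text` column and that access to the gated Llama-2 repository has been granted:

```python
# Minimal sketch (independent of Helibrunna): load the pretraining corpus and
# the tokenizer referenced in the configuration above.
from datasets import load_dataset
from transformers import LlamaTokenizer

# Assumption: the dataset exposes a plain-text "text" column.
dataset = load_dataset("stefan-it/dewiki-20230701", split="train")

# Requires access to the gated Llama-2 repository on the Hugging Face Hub.
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The model was pretrained with a context length of 512 tokens.
sample = dataset[0]["text"]
input_ids = tokenizer(sample, truncation=True, max_length=512)["input_ids"]
print(len(input_ids))
```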