---
license: cc
library_name: peft
tags:
- generated_from_trainer
base_model: Lambent/cosmoem-4x1b
model-index:
- name: lora-out
  results: []
---

This version takes the opposite approach to learning rate and batch size from the prior version. I stumbled across some writing on MoE training suggesting that, relative to the dense model, the learning rate should be increased and the batch size decreased. This version increases the learning rate by 5x and cuts the batch size in half. We did not fear the intermittent single-step loss spike. Validation loss is lower than in all versions so far.

General capabilities comparison:

This model:

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[CosMoE-AlpacaLight-v0.6](https://huggingface.co/Lambent/CosMoE-AlpacaLight-v0.6)| 23.3| 52.15| 38.57| 29.01| 35.76|

Highest-capability prior MoE version:

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[CosMoE-AlpacaLight-v0.3](https://huggingface.co/Lambent/CosMoE-AlpacaLight-v0.3)| 23.44| 51.93| 39.55| 27.99| 35.73|

Dense model, trained:

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[CosmoAlpacaLight-1b](https://huggingface.co/Lambent/CosmoAlpacaLight-1b)| 24.28| 51.31| 40.33| 29.47| 36.35|

Original model:

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-----------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[cosmo-1b](https://huggingface.co/HuggingFaceTB/cosmo-1b)| 22.97| 52.01| 38.02| 28.73| 35.43|

Overall similar to v0.3. It didn't gain much in capabilities compared to training the dense base model, but it is better than the original model on every overall capability evaluation, which is at least a directional improvement over v0.3. The lower validation loss is also interesting.

[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.0`
```yaml
base_model: Lambent/cosmoem-4x1b
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: vicgalle/alpaca-gpt4
    type: alpaca
dataset_prepared_path: prepared-alpaca
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 2048
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 128
lora_alpha: 16
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: CosMoE-AlpacaLight-v0.61
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.001

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - w1
  - w2
  - w3

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 2.0
loss_watchdog_patience: 3

warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.002
fsdp:
fsdp_config:
special_tokens:
```

</details><br>
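For reference, the LoRA settings in the config above roughly correspond to the PEFT configuration sketched below. This is only an illustrative approximation, not the exact object axolotl builds internally; training itself was launched through axolotl's CLI (typically something like `accelerate launch -m axolotl.cli.train config.yml`).

```python
# Sketch: a PEFT LoraConfig approximating the adapter settings from the axolotl
# config (lora_r, lora_alpha, lora_dropout, lora_target_modules).
# Assumes peft>=0.9.0; axolotl assembles its own config internally, this is for clarity only.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,             # lora_r
    lora_alpha=16,     # lora_alpha
    lora_dropout=0.1,  # lora_dropout
    target_modules=[   # attention projections plus MoE expert MLP weights
        "q_proj", "k_proj", "v_proj", "o_proj",
        "w1", "w2", "w3",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```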

# lora-out

This model is a fine-tuned version of [Lambent/cosmoem-4x1b](https://huggingface.co/Lambent/cosmoem-4x1b) on the alpaca-gpt4 dataset.
It achieves the following results on the evaluation set:
- Loss: 1.0477

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 20
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.1791        | 0.0   | 1    | 1.3304          |
| 1.0505        | 0.25  | 325  | 1.0720          |
| 0.9862        | 0.5   | 650  | 1.0589          |
| 1.057         | 0.75  | 975  | 1.0477          |

### Framework versions

- PEFT 0.9.0
- Transformers 4.40.0.dev0
- Pytorch 2.1.2+cu118
- Datasets 2.18.0
- Tokenizers 0.15.0
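## Loading the adapter (sketch)

A minimal inference sketch with Transformers and PEFT. It assumes the adapter is hosted as `Lambent/CosMoE-AlpacaLight-v0.6` (the repo this card describes); substitute a local `./lora-out` path if loading from disk. The Alpaca-style prompt is a reasonable default since the adapter was trained on alpaca-format data, but it is not an officially prescribed template.

```python
# Sketch: load the base MoE model and apply this LoRA adapter for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Lambent/cosmoem-4x1b"
adapter_id = "Lambent/CosMoE-AlpacaLight-v0.6"  # assumed adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # mirrors trust_remote_code: true in the training config
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

# Alpaca-style prompt, matching the dataset format used for fine-tuning.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain what a mixture-of-experts model is.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```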