---
license: mit
---

# IMPORTANT NOTICE: THIS IS AN INTERMEDIATE CHECKPOINT, NOT THE FINAL MODEL

Both [**YuLan-Mini**](https://huggingface.co/yulan-team/YuLan-Mini) and **YuLan-Mini-Intermediate-4K** were trained starting from this checkpoint.

This version includes the optimizer states, allowing you to resume training using the Hugging Face Trainer and DeepSpeed Universal Checkpoint.

## Continual Training Tutorial

### Step 1: Modify the `trainer_state.json`

Due to the implementation of the Hugging Face Trainer, certain parameters are stored in the checkpoint's `trainer_state.json` file and cannot be overridden through the Trainer's command-line arguments. You therefore need to update these parameters in `trainer_state.json` first, in particular:

- **`save_steps`**: The frequency of saving intermediate checkpoints.
- **`train_batch_size`**: The batch size per GPU (equivalent to `per_device_train_batch_size` in the Trainer). We used a global batch size of 1008 sequences (approximately 4M tokens) during the stable training stage, and maintaining that global batch size is equally important for training effectiveness; see the sketch after this list.
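
For illustration, here is a minimal sketch of keeping the global batch size fixed when the hardware changes. The GPU count below is a made-up assumption; only the global batch size of 1008 comes from this tutorial.

```python
# Sanity-check sketch: preserve the stable-stage global batch size of 1008
# sequences. num_gpus is a hypothetical value, not part of the recipe.
per_device_train_batch_size = 3   # "train_batch_size" in trainer_state.json
num_gpus = 56                     # hypothetical world size
global_batch_size = 1008          # stable-stage global batch size

grad_accum = global_batch_size // (per_device_train_batch_size * num_gpus)
assert per_device_train_batch_size * num_gpus * grad_accum == global_batch_size
print(grad_accum)  # -> 6
```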

Below is an example of a properly configured `trainer_state.json` file:

```json
{
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 0.0,
  "eval_steps": 500,
  "global_step": 0,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [],
  "logging_steps": 3,
  "max_steps": 0,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 0,
  "save_steps": 250,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": true
      },
      "attributes": {}
    }
  },
  "total_flos": 0,
  "train_batch_size": 3,
  "trial_name": null,
  "trial_params": null
}
```
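
If you prefer to script this edit, a minimal sketch along these lines works; the checkpoint path is a placeholder, not an actual location in this repo.

```python
# Patch the two fields discussed above in the checkpoint's trainer_state.json.
import json

state_path = "path/to/checkpoint/trainer_state.json"  # placeholder path

with open(state_path) as f:
    state = json.load(f)

state["save_steps"] = 250      # checkpoint-saving interval
state["train_batch_size"] = 3  # per-GPU batch size

with open(state_path, "w") as f:
    json.dump(state, f, indent=2)
```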

### Step 2: Enable Universal Checkpointing in the DeepSpeed Configuration

To ensure the DeepSpeed integration loads the Universal Checkpoint, you need to enable this feature in the DeepSpeed configuration JSON file.

Here is an example of a ZeRO-2 configuration with Universal Checkpointing enabled:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 8e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 8e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "dump_state": true,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "checkpoint": {
    "load_universal": true
  }
}
```
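
This configuration file is handed to the Trainer through the `deepspeed` field of `TrainingArguments`. Below is a rough sketch; the file name and hyperparameter values are assumptions rather than the official recipe:

```python
# Sketch: wire the ZeRO-2 + Universal Checkpointing config into the Trainer.
# "ds_config_zero2_uc.json" and the hyperparameter values are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config_zero2_uc.json",  # the JSON shown above
    bf16=True,                            # resolves "bf16.enabled": "auto"
    per_device_train_batch_size=3,        # resolves "train_micro_batch_size_per_gpu"
    gradient_accumulation_steps=6,        # assumed; see the Step 1 sketch
    save_steps=250,
    resume_from_checkpoint="path/to/checkpoint",  # placeholder path
)
```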

### Step 3: Resume Training

When calling `trainer.train`, include the `resume_from_checkpoint` argument to load the distributed optimizer state from the Universal Checkpoint and resume training.

```python
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
```
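
Putting the steps together, a minimal resume script might look like the following sketch, where the checkpoint path and dataset are placeholders and `training_args` is the object sketched in Step 2:

```python
# End-to-end sketch of resuming training; the path and dataset are
# placeholders rather than part of the official recipe.
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")

trainer = Trainer(
    model=model,
    args=training_args,           # TrainingArguments from the Step 2 sketch
    train_dataset=train_dataset,  # your tokenized pretraining corpus (not shown)
)

# With "load_universal" enabled, the DeepSpeed integration loads the
# distributed optimizer state from the Universal Checkpoint before training.
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
```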

We provide an internal [training framework](https://github.com/RUC-GSAI/YuLan-Mini/tree/main/pretrain) for your reference, but you are free to choose other frameworks.