---
license: mit
---

# IMPORTANT NOTICE: THIS IS AN INTERMEDIATE CHECKPOINT, NOT THE FINAL MODEL

Both [**YuLan-Mini**](https://huggingface.co/yulan-team/YuLan-Mini) and **YuLan-Mini-Intermediate-4K** were trained starting from this checkpoint.

This version includes the optimizer state, allowing you to resume training with the Hugging Face Trainer and DeepSpeed Universal Checkpoint.

## Continual Training Tutorial

### Step 1: Modify the `trainer_state.json`

Due to the implementation of the Hugging Face Trainer, certain parameters are stored in the checkpoint's `trainer_state.json` file and cannot be modified through the Trainer's command-line arguments when resuming. Therefore, you need to update these parameters in `trainer_state.json` first, particularly:

- **`save_steps`**: The frequency of saving intermediate checkpoints.
- **`train_batch_size`**: The batch size per GPU (equivalent to `per_device_train_batch_size` in the Trainer). We used a global batch size of 1008 sequences (approximately 4M tokens at a 4K context length) during the stable training stage, and keeping the same global batch size is important for training effectiveness (see the sketch after the example below).

Below is an example of a properly configured `trainer_state.json` file:

```json
{
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 0.0,
  "eval_steps": 500,
  "global_step": 0,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [],
  "logging_steps": 3,
  "max_steps": 0,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 0,
  "save_steps": 250,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": true
      },
      "attributes": {}
    }
  },
  "total_flos": 0,
  "train_batch_size": 3,
  "trial_name": null,
  "trial_params": null
}
```
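In the example above, `train_batch_size` is the per-device value. As a quick sanity check, the sketch below shows one way the global batch of 1008 sequences (roughly 4M tokens at a 4K context length) can decompose into per-device batch size, world size, and gradient accumulation; the world size and accumulation values are illustrative assumptions, not our actual training setup.

```python
# Illustrative decomposition of the global batch size; the world size and
# gradient-accumulation values below are assumptions, not the real setup.
per_device_train_batch_size = 3   # "train_batch_size" in the trainer_state.json above
world_size = 48                   # hypothetical number of GPUs
gradient_accumulation_steps = 7   # hypothetical accumulation setting

global_batch_size = per_device_train_batch_size * world_size * gradient_accumulation_steps
tokens_per_step = global_batch_size * 4096  # 4K context length per sequence

print(global_batch_size)  # 1008
print(tokens_per_step)    # 4128768, i.e. roughly 4M tokens
```
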
### Step 2: Enable Universal Checkpointing in the DeepSpeed Configuration

To ensure that the DeepSpeed integration loads the Universal Checkpoint, you need to enable this feature in the DeepSpeed configuration JSON file.

Here is an example of a ZeRO-2 configuration with Universal Checkpointing enabled:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 8e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 8e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "dump_state": true,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "checkpoint": {
    "load_universal": true
  }
}
```
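When this file is passed to the Hugging Face Trainer, the `"auto"` fields are resolved from your `TrainingArguments`. Below is a minimal sketch of that wiring; the file name `ds_zero2_universal.json` and the hyperparameter values are placeholders, not the settings we used.

```python
from transformers import TrainingArguments

# Minimal sketch: wire the DeepSpeed config above into the Trainer.
# The file name and hyperparameter values are placeholders.
training_args = TrainingArguments(
    output_dir="./continual-pretrain",
    deepspeed="ds_zero2_universal.json",  # the ZeRO-2 config with "load_universal": true
    per_device_train_batch_size=3,        # resolves "train_micro_batch_size_per_gpu": "auto"
    gradient_accumulation_steps=7,        # resolves "gradient_accumulation_steps": "auto"
    learning_rate=1e-4,                   # resolves the AdamW "lr": "auto"
    bf16=True,                            # resolves "bf16.enabled": "auto"
    save_steps=250,
)
```
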
### Step 3: Resume Training

When calling `trainer.train`, include the `resume_from_checkpoint` argument to load the distributed optimizer state from the Universal Checkpoint and resume training.

```python
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
```
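For context, here is a minimal sketch of how that call typically sits inside a training script driven by `HfArgumentParser`; the dataset is a placeholder and this is not our actual training code (see the framework linked below for that).

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
)

# Minimal sketch, not our actual training code. Launch with something like:
#   deepspeed train.py --deepspeed ds_zero2_universal.json \
#       --resume_from_checkpoint /path/to/this/checkpoint --output_dir ./continual-pretrain
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

# Load the model and tokenizer weights from the checkpoint directory.
model = AutoModelForCausalLM.from_pretrained(training_args.resume_from_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(training_args.resume_from_checkpoint)

train_dataset = ...  # placeholder: your tokenized continual-pretraining corpus

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Loads the distributed optimizer state from the Universal Checkpoint and continues training.
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
```
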
We provide an internal [training framework](https://github.com/RUC-GSAI/YuLan-Mini/tree/main/pretrain) for your reference, but you are free to choose other frameworks.