Update README.md
README.md CHANGED
@@ -8,7 +8,7 @@ However, the widespread adoption of ReLU-based models in the LLM field remains limited.

## Model Architecture

-
+To push the model's sparsity, we add a ReLU component after the GLU component, called dReLU (double ReLU). So our FFN network works as follows:

```Python
class BambooMLP(nn.Module):
@@ -30,7 +30,7 @@ class BambooMLP(nn.Module):

In this section, we introduce the details of training our model, including the types of data used and the hyperparameters.

-We initialized the model weights to Mistral's model weights and modified the FFN structure to the
+We initialized the model weights to Mistral's model weights and modified the FFN structure to the dReLU structure, then continued pre-training for 200B tokens, divided into two phases:

**First phase**: For the proportion of training corpus, we followed the data mix ratio and sources of the StableLM-3B model ([link](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo)), conducting further pre-training with 150B tokens.
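Since the diff truncates the body of `BambooMLP`, here is a minimal, self-contained sketch of what a dReLU feed-forward block as described above could look like. The class name `DReLUMLP` and the `gate_proj`/`up_proj`/`down_proj`/`hidden_size`/`intermediate_size` names are assumptions borrowed from the common Mistral/Llama convention, not the actual Bamboo source:

```Python
import torch
import torch.nn as nn


class DReLUMLP(nn.Module):
    """Sketch of a dReLU FFN: ReLU is applied to BOTH the gate and the up
    projection ("double ReLU"), replacing the usual SiLU-gated (SwiGLU) form.
    Names and shapes are assumptions, not the exact BambooMLP definition."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The elementwise product is exactly zero wherever either ReLU output
        # is zero, which is what pushes up activation sparsity in the FFN.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
```

The intermediate activation `relu(gate(x)) * relu(up(x))` is zero in every dimension where either projection is negative, so the corresponding rows of `down_proj` can be skipped at inference time; exposing that sparsity is the point of the dReLU change.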