Difference between SparseLLM/relu and SparseLLM/reglu - lack of modeling file?
Hi there,
I'm trying to understand the difference between SparseLLM/relu and SparseLLM/reglu, but their config files look very similar: only `intermediate_size` differs, and `hidden_act` is set to `relu` for both models.
Besides, relu-5b does not seem to work properly. I guess you changed the `modeling_llama.py` file so that the FFN is truly a ReLU (`ReLU(W_in * X)`) rather than a ReGLU. Am I understanding correctly? If so, it would be better if you also open-sourced that modeling file. The difference should probably also be clarified in the paper.
And thanks for the great work in the relu^2-wins paper!
For the relu2/relu models, we do not have both up and gate projections; there is just a gate projection and a down projection.
For the reglu model, we follow the typical gate, up, down projection structure.
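To make the structural difference concrete, here is a minimal NumPy sketch of the two FFN forward passes described above. The function and weight names (`ffn_relu`, `w_gate`, `w_up`, `w_down`) are illustrative, not the repository's actual code; this is an assumption-laden sketch, not the released implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn_relu(x, w_gate, w_down):
    # ReLU FFN: down(ReLU(gate(x))) -- only a gate and a down projection,
    # matching the relu2/relu models described in the reply.
    return relu(x @ w_gate) @ w_down

def ffn_reglu(x, w_gate, w_up, w_down):
    # ReGLU FFN: down(ReLU(gate(x)) * up(x)) -- the typical gated structure
    # with gate, up, and down projections, as in the reglu model.
    return (relu(x @ w_gate) * (x @ w_up)) @ w_down

# Example dimensions: hidden size 4, intermediate size 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w_gate = rng.normal(size=(4, 8))
w_up = rng.normal(size=(4, 8))
w_down = rng.normal(size=(8, 4))

y_relu = ffn_relu(x, w_gate, w_down)
y_reglu = ffn_reglu(x, w_gate, w_up, w_down)
```

Note that the ReGLU variant carries one extra `hidden_size × intermediate_size` weight matrix (`w_up`), which is presumably why the two configs differ only in `intermediate_size` when matching total parameter counts.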