
⚠Warning⚠ This is an experimental weight, and it may not perform well enough for practical use.
To use this weight, you must also manually rewrite or replace the model file.

The model file is available here:
https://github.com/lucidrains/BS-RoFormer

The BS-RoFormer architecture has been updated for the first time in a while.
The 0.5.x update introduced a mechanism called "Value Residual Learning" (https://arxiv.org/abs/2410.17897).
The paper argues that this mechanism reduces attention concentration, where attention over-focuses on a few tokens, and further mitigates the vanishing gradient problem.
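
To give a rough idea of the mechanism, here is a minimal sketch of a value residual in single-head self-attention, written in NumPy rather than taken from the BS-RoFormer code. The function names, the tiny dimensions, and the fixed `mix=0.5` blend are all assumptions for illustration; the actual model file may differ, for example by learning the mixing weight. The idea shown is only that the value vectors of the first layer are cached and blended into the values of every later layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv, first_values=None, mix=0.5):
    """Single-head self-attention with an optional value residual.

    Returns (output, values) so the first layer's values can be cached.
    `mix` is a hypothetical fixed blend weight, not a value from the repo.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    if first_values is not None:
        # Value residual: blend this layer's values with layer 1's values.
        v = mix * v + (1.0 - mix) * first_values
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, v

rng = np.random.default_rng(0)
seq_len, dim, depth = 4, 8, 3
x = rng.standard_normal((seq_len, dim))
layers = [
    tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
    for _ in range(depth)
]

first_values = None
for wq, wk, wv in layers:
    out, v = attention(x, wq, wk, wv, first_values)
    if first_values is None:
        first_values = v  # cache the first layer's values
    x = x + out  # ordinary residual connection around attention
```

With `mix=1.0` the layer falls back to ordinary attention, and with `mix=0.0` it uses only the first layer's values, so the blend weight controls how strongly early value information is carried into deeper layers.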