⚠Warning⚠ This is an experimental weight. It may not deliver practical performance.
Also, the model file must be manually rewritten or replaced to use this weight.
The model file is available here:
https://github.com/lucidrains/BS-RoFormer
The BS-RoFormer architecture has been updated for the first time in a while.
In the 0.5.x update, a mechanism called "Value Residual Learning" was introduced. (https://arxiv.org/abs/2410.17897)
The paper argues that this mechanism can reduce the over-concentration of attention and further mitigate the vanishing gradient problem.
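
As a rough illustration of the idea, the sketch below blends the value vectors of a later attention layer with the value vectors computed in the first layer before attention is applied. It is a minimal standard-library Python sketch, not the code from the repository above: the function names `attend` and `mix_values` and the fixed scalar `lam` are assumptions made for illustration, and an actual implementation may use a learned mixing weight instead.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    # Single-head scaled dot-product attention over lists of vectors.
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def mix_values(v_current, v_first, lam=0.5):
    # Value residual (hypothetical helper): blend this layer's values
    # with the first layer's values. lam = 1.0 disables the residual.
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_c, row_f)]
            for row_c, row_f in zip(v_current, v_first)]

# Toy example: two tokens, two dimensions.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_first = [[1.0, 0.0], [0.0, 1.0]]   # values from the first layer
v_layer = [[0.5, 0.5], [0.5, 0.5]]   # values from a later layer

v_mixed = mix_values(v_layer, v_first, lam=0.5)
out = attend(q, k, v_mixed)
```

With `lam=1.0` the later layer uses only its own values, which corresponds to ordinary attention without the value residual.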