Why did the model size increase when applying PoSE?
I noticed an increase in the model size of LLaMA 2. What is the reason behind it?
What part of the LLaMA architecture is changed that could increase the model size by such a large margin?
Hi @sanjeev-bhandari01, thanks for your interest in this work. However, PoSE does not increase model size. It only changes the position ids and rope_base during the continual pre-training phase. In this repo, pytorch_model-00001/2/3-of-00003.bin add up to approximately 28 GB, which is reasonable for a 7B model, since each parameter takes 4 bytes when torch_dtype in the config file is set to float32. Looking forward to your reply :-)
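As a quick sanity check on the checkpoint size, the arithmetic works out as follows (a minimal sketch; the 7B parameter count is an approximation):

```python
# Back-of-the-envelope checkpoint size for a ~7B-parameter model stored in float32.
num_params = 7e9        # approximate parameter count of LLaMA-2-7B
bytes_per_param = 4     # float32 -> 4 bytes per parameter

size_gib = num_params * bytes_per_param / 1024**3
print(f"~{size_gib:.0f} GiB")  # roughly 26 GiB, in line with the ~28G of shards
```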
Hi @dwzhu, understood. So all the model parameters are in float32.

I'm a bit unsure about the process. Should I load this model directly with AutoModelForCausalLM and perform the usual inference, or should I first modify the config to set the context length to 16k for inference?

To explore this, I attempted to load the model in Colab (free version) using fp4 quantization and performed the usual inference without modifying the config. However, I encountered a CUDA out-of-memory error when trying to run inference on a context of 6300 tokens.
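For context, the attempt described above looked roughly like the sketch below. The repo id is a hypothetical placeholder and the prompt is a stand-in for the long context; only the 4-bit (fp4) quantization setup is taken from the description:

```python
# Hedged sketch of the described attempt: load with 4-bit (fp4) quantization
# and run plain generation without modifying the config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/pose-llama-2-7b"  # hypothetical placeholder, not the real repo id
long_prompt = "..."                   # stand-in for a ~6300-token context

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```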
Hi @sanjeev-bhandari01, since the HF implementation of rope scaling is slightly different now from when this work was done, I don't think loading directly from AutoModelForCausalLM will work. You can find some examples of testing this model here. Basically, it uses pose_modeling_llama.py to define model behaviors, which has integrated xformers to avoid OOM in the self-attention module.
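If it helps, a minimal sketch of that loading path might look like the following. It assumes pose_modeling_llama.py exposes a LlamaForCausalLM class with the rope scaling and xformers attention already wired in; the class name, arguments, and repo id are assumptions for illustration, not verified against the actual file:

```python
# Hedged sketch: load through the repo's custom modeling file instead of the
# stock AutoModelForCausalLM, so the PoSE rope scaling and xformers attention
# are used. Names below are assumptions, not confirmed against the repo.
import torch
from transformers import AutoTokenizer
from pose_modeling_llama import LlamaForCausalLM  # custom modeling file from this repo

model_id = "path/to/pose-llama-2-7b"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory vs. the float32 checkpoint
    device_map="auto",
)

# Generation then proceeds as usual; the xformers-based attention in the custom
# modeling file is what keeps memory manageable on long contexts.
```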