Memory allocation
I want to run mpt-7b-8k for summarization of long texts, to take advantage of its 8k-token context window.
My desktop has 64 GB of RAM and an RTX 3090 with 24 GB of VRAM. I use Python 3.10 and torch 2.1.1 (GPU build).
The model is downloaded to a folder model_dir, and I load it with the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda")
model = AutoModelForCausalLM.from_pretrained(model_dir)
model = model.half()  # fp16 so the 7B weights fit in 24 GB
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
I convert the PDF to text, split the text into a series of chunks of up to 8192 tokens, and then summarize each chunk with the following code:
inputs = tokenizer(chunk, return_tensors='pt').to(device)
torch.cuda.empty_cache()
summary_ids = model.generate(inputs['input_ids'],
                             num_beams=4,
                             max_length=CHUNK_SIZE,
                             early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
torch.cuda.empty_cache()
I get an error message that states:
OutOfMemoryError: CUDA out of memory. Tried to allocate 46.61 GiB. GPU 0 has a total capacty of 24.00 GiB of which 5.71 GiB is free. Of the allocated memory 16.11 GiB is allocated by PyTorch, and 186.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have tried a few things but haven't found a workable solution.
Could you please help?
Thanks in advance
You have roughly 10 GB free on the GPU once the fp16 model weights are loaded. With num_beams=4 you are effectively running four 8k-token sequences at once, and their activations plus the KV cache need far more memory than that, which is why PyTorch tries to allocate 46 GiB. Try top-p sampling instead of beam search; it does not multiply memory by the number of beams and gives similar quality for summarization. You may also need to feed shorter chunks and cap the summary length with max_new_tokens rather than max_length=CHUNK_SIZE (max_length counts the prompt tokens too). Finally, you can experiment with 8-bit weights via bitsandbytes, which leaves more room for activations. Rough sketches of both changes are below.
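For the generation step, something along these lines; a minimal sketch, where top_p, temperature and max_new_tokens are illustrative placeholders you will want to tune:

torch.cuda.empty_cache()
summary_ids = model.generate(
    inputs['input_ids'],
    do_sample=True,        # sampling instead of beam search, so no per-beam memory blow-up
    top_p=0.9,             # nucleus (top-p) sampling
    temperature=0.7,
    max_new_tokens=512,    # cap the generated summary length independently of the 8k prompt
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)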
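And for 8-bit loading, a sketch assuming bitsandbytes and accelerate are installed (MPT checkpoints may also need trust_remote_code=True depending on how your local copy was saved):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",     # accelerate places the quantized weights on the GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# Do not call model.half() or model.to(device) afterwards; the 8-bit model is already on the GPU.

With the weights at roughly 7 GB instead of ~13 GB in fp16, there is correspondingly more headroom for the 8k-token activations.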