Checking memory usage for each `truncate_dim`

#3 opened by kerem0comert

I have the following script, with which I would like to compare the memory usage of each model size on my hardware:

import torch
from sentence_transformers import SentenceTransformer
import gc

def print_memory_usage(stage: str):
    if torch.cuda.is_available():
        allocated_memory: float = torch.cuda.memory_allocated() / (1024 ** 2)
        print(f"Memory usage after {stage}: {allocated_memory:.2f} MB")
    else:
        print(f"CUDA is not available, memory tracking skipped after {stage}.")

def test_model(truncate_dim: int, use_half: bool):
    print(f"Testing with truncate_dim={truncate_dim}, use_half={use_half}")

    model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=truncate_dim)

    print_memory_usage("loading model")

    # model.max_seq_length = 16  # Optionally limit sequence length
    
    sentences: list[str] = [
        'Eine Flagge weht.',
        'Die Flagge bewegte sich in der Luft.',
        'Zwei Personen beobachten das Wasser.',
    ]
    
    # to make memory usage more visible
    sentences = sentences * 1000

    embeddings = model.encode(sentences, convert_to_tensor=True)
    
    if use_half:
        embeddings = embeddings.half()
    
    assert embeddings.shape[0] == len(sentences)
    print(f"Embeddings shape: {embeddings.shape}")
    
    print_memory_usage("encoding sentences")
    
    # Get the similarity scores for the embeddings
    similarities = model.similarity(embeddings, embeddings)
    print_memory_usage("calculating similarities")
    
    del model
    gc.collect()
    print_memory_usage("after deleting model")


truncate_dims: list[int] = [64, 128, 256, 512]
use_half_options: list[bool] = [True, False]

for truncate_dim in truncate_dims:
    for use_half in use_half_options:
        test_model(truncate_dim, use_half)
        print("-" * 50)

But I get roughly the same output for each model size, whether with half precision or not:

Testing with truncate_dim=64, use_half=True
Memory usage after loading model: 5375.36 MB
Embeddings shape: torch.Size([3000, 64])
Memory usage after encoding sentences: 5383.85 MB
Memory usage after calculating similarities: 5401.02 MB
Memory usage after after deleting model: 25.66 MB
--------------------------------------------------
Testing with truncate_dim=64, use_half=False
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 64])
Memory usage after encoding sentences: 5384.22 MB
Memory usage after calculating similarities: 5418.55 MB
Memory usage after after deleting model: 43.19 MB
--------------------------------------------------
Testing with truncate_dim=128, use_half=True
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 128])
Memory usage after encoding sentences: 5384.22 MB
Memory usage after calculating similarities: 5401.39 MB
Memory usage after after deleting model: 26.02 MB
--------------------------------------------------
Testing with truncate_dim=128, use_half=False
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 128])
Memory usage after encoding sentences: 5384.99 MB
Memory usage after calculating similarities: 5419.32 MB
Memory usage after after deleting model: 43.96 MB
--------------------------------------------------
Testing with truncate_dim=256, use_half=True
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 256])
Memory usage after encoding sentences: 5384.99 MB
Memory usage after calculating similarities: 5402.15 MB
Memory usage after after deleting model: 26.79 MB
--------------------------------------------------
Testing with truncate_dim=256, use_half=False
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 256])
Memory usage after encoding sentences: 5386.42 MB
Memory usage after calculating similarities: 5420.75 MB
Memory usage after after deleting model: 45.39 MB
--------------------------------------------------
Testing with truncate_dim=512, use_half=True
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 512])
Memory usage after encoding sentences: 5386.42 MB
Memory usage after calculating similarities: 5403.58 MB
Memory usage after after deleting model: 28.22 MB
--------------------------------------------------
Testing with truncate_dim=512, use_half=False
Memory usage after loading model: 5383.49 MB
Embeddings shape: torch.Size([3000, 512])
Memory usage after encoding sentences: 5389.35 MB
Memory usage after calculating similarities: 5423.68 MB
Memory usage after after deleting model: 48.32 MB
--------------------------------------------------

What exactly is going on here? Is there something wrong with the methodology, and if so, how can this script be fixed to show the correct memory usage?

Hi - the truncation dim and ".half()" refer to the produced embeddings, not the model, so you will only need less storage for the embeddings.
If you want to load the model itself in half precision (16 bits instead of 32), you need to:

from sentence_transformers import SentenceTransformer
import torch

# Load the model in half precision (here bfloat16) using model_kwargs

model = SentenceTransformer("aari1995/German_Semantic_V3", model_kwargs={"torch_dtype": torch.bfloat16})

I'm not sure about the performance impact, though.
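
If you want to sanity-check what that load actually changes, here is a minimal sketch (the helper name param_bytes is just for illustration) that inspects the parameter dtype and sums up the raw weight storage for the default load versus the bfloat16 load:

import torch
from sentence_transformers import SentenceTransformer

def param_bytes(model) -> int:
    # Raw storage of all parameters (weights only, no activations or cache)
    return sum(p.element_size() * p.nelement() for p in model.parameters())

# Note: loading both variants at once needs enough memory; you can also check them one at a time.
model_fp32 = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)
model_bf16 = SentenceTransformer(
    "aari1995/German_Semantic_V3",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

print(next(model_fp32.parameters()).dtype, f"{param_bytes(model_fp32) / 1024**2:.2f} MB")
print(next(model_bf16.parameters()).dtype, f"{param_bytes(model_bf16) / 1024**2:.2f} MB")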

Thanks for your clarification! Does this mean that neither half() nor truncate_dim changes VRAM usage at all, and that both parameters only control how much space the embeddings occupy in storage (assuming the embeddings themselves are not kept in VRAM)?
I guess the load you suggest, model = SentenceTransformer("aari1995/German_Semantic_V3", model_kwargs={"torch_dtype": torch.bfloat16}), might decrease VRAM usage, but without quantization the performance might not be as expected, as you point out.
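
As a rough back-of-envelope check (a sketch using only the numbers from my run above, with 4 bytes per float32 and 2 bytes per float16 value), the embedding tensors are tiny compared to the roughly 5.4 GB of model weights, which would explain why the printed numbers barely change:

# Largest setting: 3000 sentences x 512 dims in float32
print(3000 * 512 * 4 / 1024**2)  # ~5.86 MB
# Smallest setting: 3000 sentences x 64 dims in float16
print(3000 * 64 * 2 / 1024**2)   # ~0.37 MB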

Exactly, it is all about the embeddings, as they are usually what takes up the most space.
You should definitely check and experiment; it could also well be that there is no real difference, as the model was trained in mixed precision.

Generally I decided against quantization, as you would mostly need a GPU to deploy it and not everyone has one, but feel free to quantize!

I'm not sure, but for inference you could probably also do model = model.half().
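
For completeness, here is a minimal sketch of how the original script could be adapted so the difference actually shows up: load the model itself in half precision via model_kwargs (or leave it in float32 for comparison) and track peak VRAM with torch.cuda.max_memory_allocated() instead of the current allocation. The helper names run and peak_vram_mb are just for illustration:

import gc
import torch
from sentence_transformers import SentenceTransformer

sentences = ["Eine Flagge weht."] * 3000

def run(use_half: bool):
    model = SentenceTransformer(
        "aari1995/German_Semantic_V3",
        trust_remote_code=True,
        model_kwargs={"torch_dtype": torch.bfloat16} if use_half else {},
    )
    model.encode(sentences, convert_to_tensor=True)
    del model
    gc.collect()

def peak_vram_mb(fn) -> float:
    # Peak allocated VRAM of a callable, in MB
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    return torch.cuda.max_memory_allocated() / 1024**2

for use_half in (False, True):
    print(f"use_half={use_half}: peak {peak_vram_mb(lambda: run(use_half)):.2f} MB")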
