[`refactor`]: Tab & URL syncing; parameter counts as model size; filtering; search

#89
by tomaarsen HF staff - opened
Massive Text Embedding Benchmark org
edited Mar 29, 2024

Hello!

Pull Request overview

  • Compute model size based on the number of parameters instead of the weight file size.
  • Refactor Gradio initialization: now based on a nested data structure that is looped over to dynamically create Tabs.
  • Tabs & URL syncing for easier sharing, e.g. selecting a tab adds ?task=overall&language=english to the URL, and opening such a URL opens those tabs.
  • Add search bar.
  • Add filtering options: Open vs API and based on model sizes.
  • Show the model size in all tabs.

Details

Most of the changes in this PR are centered around the refactor that allows for dynamically creating Tabs. For example, the Tab & URL syncing requires some code surrounding each gr.Tabs, which was infeasible before this refactor. Due to the size of the PR, perhaps it makes sense to review the commits separately.

Model size based on # of parameters (commit)

I've introduced a utility function that computes the number of parameters by 1) reading safetensors or 2) estimating based on file size (assuming fp32 for all estimated models). I've also added a KNOWN_BYTES_PER_PARAM mapping from model names to the number of bytes per parameter (e.g. 4 for fp32 and 2 for fp16), in case we find a model that 1) does not use safetensors and 2) stores its weights in fp16.
Beyond that, I updated the external model sizes from GB to Million Parameters.
Sadly this does cost another request for each model, so it's a bit slower to refresh.
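
As a rough illustration of the fallback logic described above, here is a minimal sketch. The function names, their signatures, and the example entry in the mapping are assumptions for illustration, not the PR's actual code; only the KNOWN_BYTES_PER_PARAM name and the fp32 default come from the description.

```python
from typing import Optional

# Hypothetical override table: model name -> bytes per parameter.
# The entry below is a made-up example, not a real model.
KNOWN_BYTES_PER_PARAM = {"example-org/fp16-model": 2}


def estimate_num_params(
    model_name: str,
    weight_file_bytes: int,
    safetensors_param_count: Optional[int] = None,
) -> int:
    """Prefer an exact count from safetensors; otherwise estimate from file size."""
    if safetensors_param_count is not None:
        return safetensors_param_count
    # Assume fp32 (4 bytes per parameter) unless the model is listed explicitly.
    bytes_per_param = KNOWN_BYTES_PER_PARAM.get(model_name, 4)
    return weight_file_bytes // bytes_per_param


def to_million_params(num_params: int) -> int:
    """Convert a raw parameter count to millions of parameters, rounded."""
    return round(num_params / 1e6)
```

The estimate is only as good as the bytes-per-parameter assumption, which is why an fp16 checkpoint without safetensors needs an explicit entry in the mapping.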

Refactor Tabs initialization (commit)

This commit cuts ~400 lines from the app.py by using a new data structure:

data = {
    "Overall": {
        "metric": "Various, refer to task tabs",
        "data": [
            {
                "language": "English",
                "description": "**Overall MTEB English leaderboard** 🔮",
                "data": DATA_OVERALL,
                "refresh": get_mteb_average,
            },
        ...

and then looping over this to dynamically create the Tabs. The functionality before & after this commit should be identical.
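
To make the looping idea concrete without depending on Gradio, here is a small sketch that walks a toy structure of the same shape. The refresh functions, the "Retrieval" entry, and the iter_tabs helper are all hypothetical; in the real app the two loop levels would presumably open the outer and inner tabs instead of yielding tuples.

```python
from typing import Any, Dict, Iterator, Tuple


def get_overall_english() -> str:
    """Hypothetical stand-in for a refresh function such as get_mteb_average."""
    return "overall-english"


def get_retrieval_english() -> str:
    """Hypothetical stand-in for another refresh function."""
    return "retrieval-english"


# Toy structure in the same shape as the PR's `data`; every entry is illustrative.
data: Dict[str, Dict[str, Any]] = {
    "Overall": {
        "metric": "Various, refer to task tabs",
        "data": [
            {"language": "English", "description": "Overall (toy)", "refresh": get_overall_english},
        ],
    },
    "Retrieval": {
        "metric": "Hypothetical metric name",
        "data": [
            {"language": "English", "description": "Retrieval (toy)", "refresh": get_retrieval_english},
            {"language": "Chinese", "description": "Retrieval (toy)", "refresh": get_retrieval_english},
        ],
    },
}


def iter_tabs(structure: Dict[str, Dict[str, Any]]) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (task, language, item) for every leaf of the nested structure."""
    for task, task_info in structure.items():
        # In a Gradio app, an outer tab for `task` would be opened here.
        for item in task_info["data"]:
            # ...and an inner tab for item["language"] would be opened here.
            yield task, item["language"], item


tab_names = [(task, language) for task, language, _ in iter_tabs(data)]
```

Adding a new leaderboard then only requires a new entry in the structure rather than another hand-written tab block.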

Tabs <-> URL syncing (commit)

This is fairly hacky I'm afraid, as we have to call a JavaScript function to update the current URL. This is possible with Gradio, but you have to provide e.g. gr.JSON(); you can't just provide a normal Python dictionary. So, we use invisible JSON instances:

    # Store the current task and language for updating the URL. This is a bit hacky, but it works
    # for passing the current task and language to the JavaScript function via Gradio
    current_task_language = gr.JSON(value=dict(), visible=False)
    language_per_task = gr.JSON(value=dict(), visible=False)

Then, every time a tab is selected, we 1) update those gr.JSON instances and 2) call the JS function.

That's the Tabs -> URL step done. To do the opposite, we use the set_tabs_on_load function. Upon loading, it will observe the request URL and set the selected tabs accordingly. This is only run when the leaderboard is loaded fresh for a user.
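
The two directions of the mapping can be sketched in plain Python. This only covers the string handling, not the Gradio or JavaScript wiring; the helper names and the default values are assumptions, while the task and language query parameters come from the PR overview.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlencode, urlsplit


def tabs_to_query(task: str, language: str) -> str:
    """Build the query string that would be appended to the URL for the selected tabs."""
    return "?" + urlencode({"task": task.lower(), "language": language.lower()})


def query_to_tabs(
    url: str,
    default_task: str = "overall",       # assumed default, for illustration
    default_language: str = "english",   # assumed default, for illustration
) -> Tuple[str, str]:
    """Read the task and language from a request URL, falling back to the defaults."""
    params = parse_qs(urlsplit(url).query)
    task = params.get("task", [default_task])[0]
    language = params.get("language", [default_language])[0]
    return task, language
```

A URL without query parameters simply resolves to the defaults, which corresponds to opening the leaderboard fresh.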

Search & Filtering (commit)

This is fairly standard; I've added filtering for Proprietary vs Open models & on the number of parameters. For convenience, you can also filter directly for models that are compatible with Sentence Transformers. I've also made it so that the model size is always shown, on all tabs. I think this is a very important piece of information that should not just be shown in the Overall tab, and it also simplifies the filtering heavily. As a result, the PR looks a bit messy, but it's fairly simple. I add a filter_data function that gets all dataframes as well as all filtering/search options, and returns the filtered dataframes again.
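
A simplified sketch of the filtering idea follows. It uses lists of dictionaries instead of dataframes to stay dependency-free, and the column names, option names, and the choice to keep models of unknown size regardless of the size range are all assumptions rather than the PR's actual behavior.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def filter_table(
    rows: List[Row],
    search: str = "",
    show_open: bool = True,
    show_proprietary: bool = True,
    size_range: Optional[Tuple[float, float]] = None,
) -> List[Row]:
    """Return the rows of one table that match the search and filter options."""
    kept = []
    for row in rows:
        # Case-insensitive substring search on the model name.
        if search and search.lower() not in row["model"].lower():
            continue
        if row["proprietary"] and not show_proprietary:
            continue
        if not row["proprietary"] and not show_open:
            continue
        size = row["size_millions"]
        # Assumption: rows without a known size are never excluded by the size filter.
        if size is not None and size_range is not None:
            low, high = size_range
            if not (low <= size < high):
                continue
        kept.append(row)
    return kept


def filter_data(tables: List[List[Row]], **options: Any) -> List[List[Row]]:
    """Apply the same options to every table and return the filtered tables."""
    return [filter_table(rows, **options) for rows in tables]
```

Because every table carries the model size, the same options can be applied uniformly to all tabs.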

Fix embedding dimensions if Dense module exists (commit)

For models that use Dense layers, such as https://huggingface.co/aspire/acge_text_embedding, the embedding dimension is not computed correctly. This is now fixed by also accounting for \d+_Dense/config.json configuration files.
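
A minimal sketch of how such configuration files could be detected among a repository's file names. The helper names are hypothetical, and the idea that the last Dense module's output size determines the final embedding dimension is an assumption about how the fix works, not something stated in the PR.

```python
import re
from typing import List

# Matches paths such as "2_Dense/config.json", per the pattern named in the PR.
DENSE_CONFIG_PATTERN = re.compile(r"\d+_Dense/config\.json")


def find_dense_configs(filenames: List[str]) -> List[str]:
    """Return the Dense module config paths, ordered by their numeric prefix."""
    matches = [name for name in filenames if DENSE_CONFIG_PATTERN.fullmatch(name)]
    return sorted(matches, key=lambda name: int(name.split("_", 1)[0]))


def final_embedding_dim(base_dim: int, dense_output_dims: List[int]) -> int:
    """Assumption: the last Dense module's output size, if any, overrides the base dimension."""
    return dense_output_dims[-1] if dense_output_dims else base_dim
```

Sorting by the numeric prefix rather than alphabetically keeps a "10_Dense" module after a "2_Dense" module.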

  • Tom Aarsen
tomaarsen changed pull request status to open
Massive Text Embedding Benchmark org

@Muennighoff
Should be ready for review now! Apologies for the sheer size of this one. I'm excited for these changes to come through.

  • Tom Aarsen
Massive Text Embedding Benchmark org

This looks amazing. Can we allow a number of parameters for API models? For voyage they explicitly asked if we could show their number of parameters as it's useful to know for some people and said it would be 1.22B parameters / 2.45GB ?

Massive Text Embedding Benchmark org
edited Mar 30, 2024

Certainly! The PR currently labels models as "Proprietary" only if there is no known # of parameters, but I'll update that to "if there is no known # of parameters OR the model is in a specific list of exceptions" which will initially only contain the one Voyage model. In truth, I thought that perhaps the voyage model size was an error/oversight 😄 Out of curiosity, is it correct that only the model size for voyage-lite-02-instruct should be public? I couldn't find any info on model sizes on https://docs.voyageai.com/docs/embeddings (and voyage-lite-02-instruct is listed as a Deprecated model in https://docs.voyageai.com/docs/pricing ?). cc @voyageai01

Edit: I'll actually determine the proprietary models exclusively through a list of models - otherwise gated models will be listed as proprietary.

I'll probably make the change I mentioned on ~Tuesday.

  • Tom Aarsen
Massive Text Embedding Benchmark org
edited Mar 30, 2024

They only shared voyage-lite-02-instruct sizes with us but maybe they also want to share the other model sizes? cc @Shuang59 @hongliu9903

Also two more notes:

  • When unselecting a parameter range & then reselecting it, proprietary models are gone even though I'd expect it to be back to the start
  • If selecting only Proprietary Models some Open Models like udever remain in the ranking, I guess because we cannot grab their model size

Looks really really cool!

Massive Text Embedding Benchmark org

Thanks for the details & for testing this out! I'll include fixes for these in the coming days.

Massive Text Embedding Benchmark org

@Muennighoff I've addressed all the comments.

  1. 56136076 fixes e.g. udever remaining in the ranking (I can still grab its model size though?) and voyage's model size is now listed again.
  2. 485f27b4 prevents the proprietary models from disappearing when toggling the model size.

Let me know if you need anything else from me here!

  • Tom Aarsen
Massive Text Embedding Benchmark org

I've also incremented the Gradio SDK version. This fixes an issue where the DataFrame header & table will separate when scrolling on Firefox.

  • Tom Aarsen
Massive Text Embedding Benchmark org

Nice though if I deactivate Proprietary the voyage model is still shown even though it is proprietary 🤔

Screenshot 2024-04-02 at 8.12.24 PM.png

Also interesting that e5-mistral & echo-mistral differ by 1 million parameters despite both stemming from the same model

Also do you think it is worth keeping the model size in GB tab in addition? I don't have a strong opinion but maybe it's useful to some people

Massive Text Embedding Benchmark org

> Nice though if I deactivate Proprietary the voyage model is still shown even though it is proprietary 🤔

Apologies, this was an oversight. I based only the Proprietary filter on the PROPRIETARY_MODELS list, whereas the Open filter was still based on whether a model size exists. I fixed this in 2db25dc3
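
The idea behind that fix can be sketched as deriving both categories from the same list, so that a known model size no longer affects the classification. The list contents and the helper name below are illustrative only.

```python
# Illustrative contents; the real PROPRIETARY_MODELS list lives in the leaderboard code.
PROPRIETARY_MODELS = {"voyage-lite-02-instruct"}


def model_type(model_name: str) -> str:
    """Classify a model purely by list membership, ignoring whether its size is known."""
    return "Proprietary" if model_name in PROPRIETARY_MODELS else "Open"
```

With this, a proprietary model that has a published parameter count is still hidden when Proprietary is deselected.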

> Also interesting that e5-mistral & echo-mistral differ by 1 million parameters despite both stemming from the same model

I looked into this: e5-mistral-7b-instruct is listed as an external model, for which I estimated the number of parameters based on the model size. That explains the small difference. I've updated that to match SFR-Embedding-Mistral.
I appreciate the detailed reviews.

  • Tom Aarsen
Massive Text Embedding Benchmark org

LGTM! We could put something nicer than empty space when no model matches (e.g. telling the user that nothing matches) but feel free to merge without if you think it's not needed, really amazing work!

Screenshot 2024-04-02 at 9.46.44 PM.png

Massive Text Embedding Benchmark org

> We could put something nicer than empty space when no model matches

I'll look into that actually!
That also reminds me that I can add n/a as the model size for the proprietary models.

And I forgot to address your comment here:

> Also do you think it is worth keeping the model size in GB tab in addition? I don't have a strong opinion but maybe it's useful to some people

Hmm, I'm not 100% sure what's best here. I think it's somewhat redundant, but I believe the size in GB also roughly indicates the memory required for inference. I'll ask around a bit to get people's thoughts here.

Massive Text Embedding Benchmark org

I've merged the Law & Gecko changes into this PR and I've added the memory usage to all tables:

image.png

I'm considering renaming the column to just Memory Usage (GB). I've verified the correctness using this script:

from sentence_transformers import SentenceTransformer
import torch

# Load the model onto the GPU, then report the peak CUDA memory allocated so far.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct", device="cuda")
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f}GB in use after loading model")

which prints:

2.09GB in use after loading model

and for mixedbread-ai/mxbai-embed-large-v1:

1.25GB in use after loading model

I can't find a convenient way to add a hint/warning when the filtering is too restrictive and results in an empty table, nor does the "n/a" in the model size for proprietary models work (it messes up the column sorting). In other words, I think this might be ready for final review & to be merged. @Muennighoff

  • Tom Aarsen
Massive Text Embedding Benchmark org

Looks amazing, final two points from my side:

  • Cohere-embed-english-v3.0 is also proprietary but still there if unselecting prop
  • Can we add 1200 Million parameters for the Gecko models? We can probably also add the GB based on what they gave us for the current lb (2.29) - I think you just multiplied it by 2 for voyage? I think that's fine & if it's wrong they can open an issue / let us know
Massive Text Embedding Benchmark org

Sorry, added one more model, voyage-2-law 😅

Massive Text Embedding Benchmark org

Haha, all good. Resolved the merge conflict & marked voyage-2-law as a proprietary model. It should now respond correctly to the filtering options. I also marked Cohere-embed-english-v3.0 as proprietary.

For all models I use the memory usage with the weights in fp32, so given a number of parameters I can accurately compute the VRAM usage. I've done that for Gecko and also Voyage.
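
The fp32 computation amounts to four bytes per parameter, converted to GB with the same 1024**3 divisor as the verification script above. The sketch below is an illustration with a hypothetical function name; the parameter counts of roughly 560M and 335M used in the usage lines are approximations I am assuming for the two models measured earlier.

```python
def memory_usage_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Estimate the memory needed for the weights alone, in GB (1 GB = 1024**3 bytes)."""
    return round(num_params * bytes_per_param / 1024**3, 2)


# Assumed approximate parameter counts, for illustration only.
print(memory_usage_gb(560_000_000))  # 2.09
print(memory_usage_gb(335_000_000))  # 1.25
```

Under these assumed counts the estimates line up with the 2.09GB and 1.25GB measured after loading the models on a GPU.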

Massive Text Embedding Benchmark org

Cool feel free to merge!

tomaarsen changed pull request status to merged
Massive Text Embedding Benchmark org

Seems like the URL syncing doesn't work, I'm guessing that is blocked by HF Spaces or something for security reasons.

Massive Text Embedding Benchmark org

I've opened an issue on that here: https://github.com/gradio-app/gradio/issues/7957

Massive Text Embedding Benchmark org

I appreciate it, I'll add a bit of extra context there.
