About model creation

This is a smaller version of intfloat/multilingual-e5-base that keeps only some of the Russian (more generally, Cyrillic) tokens and a reduced set of English tokens, together with their embeddings.

The model was created in a way similar to the one described in this post: https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90

The CulturaX dataset was used to find the required tokens. As a result, only 69,382 of the original model's roughly 250k tokens were kept.
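The token search can be illustrated with a minimal sketch. It is not the author's script: the vocabulary, the whitespace `tokenize` function, and the three-line corpus are hypothetical stand-ins for the real subword tokenizer and for CulturaX. The idea is to tokenize the corpus, record which tokens occur, and keep those plus the special tokens.

```python
from collections import Counter

# Hypothetical toy vocabulary standing in for the original tokenizer (token -> old ID).
old_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "hello": 4, "привет": 5, "你好": 6, "world": 7, "мир": 8, "世界": 9}
special_tokens = ["<s>", "<pad>", "</s>", "<unk>"]

def tokenize(text):
    # Stand-in for the real subword tokenizer: whitespace split with <unk> fallback.
    return [tok if tok in old_vocab else "<unk>" for tok in text.split()]

# Hypothetical corpus sample; the model card used CulturaX for this step.
corpus = ["hello world", "привет мир", "hello мир"]

# Count how often each token occurs in the corpus.
counts = Counter(tok for line in corpus for tok in tokenize(line))

# Keep special tokens unconditionally, then every token observed in the corpus,
# preserving the original ID order.
kept = sorted(set(special_tokens) | set(counts), key=old_vocab.__getitem__)
print(kept)
```

In this toy run the Chinese tokens never occur in the corpus, so they are dropped, while the special tokens survive even if they never occur.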

Was the model trained in any way?

No. Only the tokenizer was modified, and the resulting changes to token identifiers were compensated for by moving the embeddings in the model's word_embeddings module to their new positions. The quality of this model on Cyrillic (and English) text is therefore exactly the same as that of the original.
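The embedding move described above can be sketched as follows. This is a toy illustration with plain Python lists, not the actual conversion code: `old_embeddings` and `kept_old_ids` are hypothetical, and in a real model the rows would be slices of the word_embeddings weight tensor.

```python
# Toy original embedding matrix: one row per old token ID (hypothetical values).
old_embeddings = [[float(i), float(i) + 0.5] for i in range(10)]

# Old IDs of the tokens that survive pruning, in their new order (hypothetical).
kept_old_ids = [0, 1, 2, 3, 4, 5, 7, 8]

# New ID = position in the kept list; build the old -> new ID mapping.
old_to_new = {old_id: new_id for new_id, old_id in enumerate(kept_old_ids)}

# Move each surviving row to its new position; no values are changed.
new_embeddings = [old_embeddings[old_id] for old_id in kept_old_ids]

# Every kept token looks up the same vector before and after the remapping.
for old_id, new_id in old_to_new.items():
    assert new_embeddings[new_id] == old_embeddings[old_id]
```

Because each kept token still maps to the same vector, the rest of the network sees identical inputs for any text made up only of kept tokens, which is why no retraining is needed.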

Why do we need this?

This lets you use significantly less memory during training and also greatly reduces the size of the model.
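A rough estimate of the saving, as a sketch under stated assumptions: the hidden size of 768 is an assumption about the base architecture (it is not stated in this card), and the original vocabulary is rounded to the 250k figure quoted above.

```python
HIDDEN_SIZE = 768      # assumption: hidden size of the base model
OLD_VOCAB = 250_000    # "250k" as stated in this card (approximate)
NEW_VOCAB = 69_382     # tokens kept after pruning
BYTES_PER_PARAM = 2    # FP16 weights

# The word embedding matrix has one row of HIDDEN_SIZE values per token.
old_params = OLD_VOCAB * HIDDEN_SIZE
new_params = NEW_VOCAB * HIDDEN_SIZE
saved_params = old_params - new_params
saved_mib = saved_params * BYTES_PER_PARAM / 2**20

print(f"embedding params: {old_params:,} -> {new_params:,}")
print(f"saved: {saved_params:,} params, about {saved_mib:.0f} MiB in FP16")
```

Under these assumptions the embedding matrix shrinks by roughly 139M parameters. If the embeddings are trained, their gradients and optimizer state shrink by the same proportion.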

Authors

Model size (Safetensors): 139M params, tensor type FP16