toolify-text-embedding-001

This is a fine-tuned version of intfloat/multilingual-e5-small optimized for text embedding tasks, particularly for multilingual scenarios including Indonesian and English text.

Model Details

  • Base Model: intfloat/multilingual-e5-small
  • Model Type: Sentence Transformer / Text Embedding Model
  • Language Support: Multilingual (optimized for Indonesian and English)
  • Fine-tuning: Custom dataset for improved embedding quality
  • Vector Dimension: 384 (inherited from base model)

Intended Use

This model is designed for:

  • Semantic Search: Finding similar documents or texts (see the sketch after this list)
  • Text Similarity: Measuring semantic similarity between texts
  • Information Retrieval: Document ranking and retrieval systems
  • Clustering: Grouping similar texts together
  • Classification: Text classification tasks using embeddings
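
For a concrete picture of the semantic-search use case, the sketch below embeds a query and a tiny corpus and ranks the passages by cosine similarity. The corpus sentences are invented examples, not training data; the ranking helper util.semantic_search is part of the Sentence Transformers library.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Hypothetical mini-corpus; any list of documents works the same way
corpus = [
    "Jakarta adalah ibukota Indonesia",  # "Jakarta is the capital of Indonesia"
    "Sepak bola adalah olahraga populer",  # "Football is a popular sport"
    "The weather today is sunny and warm"
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("What is the capital city of Indonesia?", convert_to_tensor=True)

# Rank the corpus passages by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])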

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Encode sentences
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")

Using Transformers Library

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, using the attention mask to ignore padding tokens
    token_embeddings = model_output[0]  # first element is the last hidden state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

print(f"Embeddings: {embeddings}")

Performance

The model has been fine-tuned on a custom dataset to improve performance on:

  • Indonesian text understanding
  • Cross-lingual similarity tasks
  • Domain-specific text embedding

Training Details

  • Base Model: intfloat/multilingual-e5-small
  • Training Framework: Sentence Transformers
  • Fine-tuning Method: Custom training on domain-specific data
  • Training Environment: Google Colab

Technical Specifications

  • Model Size: ~118M parameters (inherited from base model)
  • Embedding Dimension: 384
  • Max Sequence Length: 512 tokens
  • Architecture: BERT-based encoder
  • Pooling: Mean pooling
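
These numbers can be checked against the loaded checkpoint rather than taken on faith; the attributes below are standard Sentence Transformers APIs, and the expected values are the ones from this list:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('wardydev/toolify-text-embedding-001')
print(model.get_sentence_embedding_dimension())  # expected: 384
print(model.max_seq_length)  # expected: 512; longer inputs are truncated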

Evaluation

Relative to the base intfloat/multilingual-e5-small, the fine-tuned model shows improved performance on:

  • Semantic textual similarity tasks
  • Cross-lingual retrieval
  • Indonesian language understanding
  • Domain-specific embedding quality

Limitations

  • Performance may vary on out-of-domain texts
  • Optimal performance requires proper text preprocessing
  • Limited to 512 token sequences
  • May require specific prompt formatting for best results (see the note after this list)
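
On the last point: the base E5 family was trained with "query: " and "passage: " prefixes, and E5 checkpoints typically retrieve better when inputs carry them. The card does not say whether this fine-tune preserves that convention, so the sketch below is an assumption to validate on your own data:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('wardydev/toolify-text-embedding-001')

# Assumption: the fine-tune keeps the base model's "query: "/"passage: " prefixes
query_emb = model.encode("query: ibukota Indonesia")  # "capital of Indonesia"
passage_emb = model.encode("passage: Jakarta adalah ibukota Indonesia")  # "Jakarta is the capital of Indonesia"
print(cos_sim(query_emb, passage_emb).item())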

License

This model is released under the Apache 2.0 license; see the base model repository (intfloat/multilingual-e5-small) for its own licensing terms.

Citation

If you use this model, please cite:

@misc{toolify-text-embedding-001,
  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
  author={wardydev},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
}

Contact

For questions or issues, please open a discussion in the Hugging Face model repository.


This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.
