toolify-text-embedding-001
This is a fine-tuned version of intfloat/multilingual-e5-small, optimized for text embedding tasks in multilingual settings, particularly Indonesian and English.
Model Details
- Base Model: intfloat/multilingual-e5-small
- Model Type: Sentence Transformer / Text Embedding Model
- Language Support: Multilingual (optimized for Indonesian and English)
- Fine-tuning: Custom dataset for improved embedding quality
- Vector Dimension: 384 (inherited from base model)
Intended Use
This model is designed for:
- Semantic Search: Finding similar documents or texts (see the sketch after this list)
- Text Similarity: Measuring semantic similarity between texts
- Information Retrieval: Document ranking and retrieval systems
- Clustering: Grouping similar texts together
- Classification: Text classification tasks using embeddings
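For example, the semantic-search use case can be sketched with the sentence-transformers utility functions; the corpus and query below are made up for illustration:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('wardydev/toolify-text-embedding-001')
# Hypothetical mixed-language corpus and an English query
corpus = [
    "Cara mengembalikan barang yang rusak",  # Indonesian: "How to return a damaged item"
    "Refund policy for damaged items",
    "Office opening hours and location",
]
query = "How do I return a damaged product?"
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
# Rank the corpus by cosine similarity to the query and keep the top 2 hits
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])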
Usage
Using Sentence Transformers
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('wardydev/toolify-text-embedding-001')
# Encode sentences
sentences = [
    "Ini adalah contoh kalimat dalam bahasa Indonesia",
    "This is an example sentence in English",
    "Model ini dapat memproses teks multibahasa"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Calculate similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item()}")
Using Transformers Library
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wardydev/toolify-text-embedding-001')
model = AutoModel.from_pretrained('wardydev/toolify-text-embedding-001')
def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Encode text
sentences = ["Your text here"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(f"Embeddings: {embeddings}")
Performance
The model has been fine-tuned on a custom dataset to improve performance on:
- Indonesian text understanding
- Cross-lingual similarity tasks
- Domain-specific text embedding
Training Details
- Base Model: intfloat/multilingual-e5-small
- Training Framework: Sentence Transformers
- Fine-tuning Method: Custom training on domain-specific data (see the sketch below)
- Training Environment: Google Colab
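The exact recipe is not published. A typical sentence-transformers fine-tuning loop for an embedding model looks roughly like the sketch below; the loss function, example pairs, and hyperparameters are assumptions, not the actual training configuration:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('intfloat/multilingual-e5-small')
# Hypothetical (anchor, positive) pairs standing in for the real training data
train_examples = [
    InputExample(texts=["how to track my order", "steps for tracking a shipment"]),
    InputExample(texts=["refund for damaged item", "returning a broken product"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives contrastive loss (assumed choice; other losses are possible)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save('toolify-text-embedding-001')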
Technical Specifications
- Model Size: ~118M parameters (inherited from the base model)
- Embedding Dimension: 384
- Max Sequence Length: 512 tokens
- Architecture: BERT-based encoder
- Pooling: Mean pooling
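These properties can be checked directly on the loaded model using standard sentence-transformers attributes:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('wardydev/toolify-text-embedding-001')
print(model.get_sentence_embedding_dimension())  # expected: 384
print(model.max_seq_length)                      # expected: 512
print(model)                                     # lists the transformer and pooling modules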
Evaluation
The model shows improved performance on:
- Semantic textual similarity tasks
- Cross-lingual retrieval
- Indonesian language understanding
- Domain-specific embedding quality
Self-reported results on a custom dataset (a sketch of how such metrics are typically computed follows):
- Cosine Similarity: 0.850
- Spearman Correlation: 0.820
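Scores like the cosine and Spearman figures above are typically produced with the sentence-transformers EmbeddingSimilarityEvaluator; here is a minimal sketch on made-up labelled pairs (not the actual evaluation data):
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
model = SentenceTransformer('wardydev/toolify-text-embedding-001')
# Hypothetical sentence pairs with gold similarity scores in [0, 1]
sentences1 = ["how do I reset my password", "store opening hours"]
sentences2 = ["steps to change my account password", "weather forecast for tomorrow"]
gold_scores = [0.9, 0.05]
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # correlation metrics; return format depends on the installed library version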
Limitations
- Performance may vary on out-of-domain texts
- Optimal performance requires proper text preprocessing
- Limited to 512 token sequences
- May require E5-style input prefixes ("query: " / "passage: ") for best results, since the base model was trained with them
License
This model is released under the Apache 2.0 license, following the base model's licensing terms.
Citation
If you use this model, please cite:
@misc{toolify-text-embedding-001,
  title={toolify-text-embedding-001: Fine-tuned Multilingual Text Embedding Model},
  author={wardydev},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/wardydev/toolify-text-embedding-001}
}
Contact
For questions or issues, please reach out through the Hugging Face model repository.
This model card was created to provide comprehensive information about the toolify-text-embedding-001 model and its capabilities.