bge-m3-ml-tr-specialized

bge-m3-ml-tr-specialized is a Sentence Transformer model optimized for scientific and technical machine learning texts in Turkish. Based on BAAI/bge-m3, the model has been fine-tuned for tasks such as sentence similarity, semantic search, conceptual matching, and meaning-based classification.

🧠 Model Specifications

  • Model Type: Sentence Transformer
  • Base Model: BAAI/bge-m3
  • Use Cases:
    • Sentence-level semantic similarity
    • Conceptual and contextual sentence alignment
    • Information retrieval and semantic search systems
    • Clustering and ranking of scientific documents
  • Language: Turkish (with a focus on technical and scientific text)
  • Maximum Sequence Length: 8192 tokens
  • Output Vector Dimension: 1024
  • Parameters: ~568M (float32)
  • Pooling Strategy: CLS token
  • Similarity Metric: Cosine Similarity
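
These figures can be read straight off the loaded model. A minimal sanity-check sketch (the example sentence is a placeholder):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dogukanvzr/bge-m3-ml-tr-specialized")

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 1024

# encode() returns one 1024-dimensional vector per input
emb = model.encode("A placeholder sentence about machine learning.")
print(emb.shape)  # (1024,)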

🔍 Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True})
  (2): Normalize()
)
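
Because the final Normalize() module L2-normalizes every embedding, cosine similarity reduces to a plain dot product. A short sketch illustrating this, using the built-in model.similarity helper (available since Sentence Transformers v3):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dogukanvzr/bge-m3-ml-tr-specialized")
embs = model.encode([
    "Gradient descent minimizes the loss function iteratively.",
    "The loss is reduced step by step using gradient descent."
])

print(np.linalg.norm(embs[0]))                 # ~1.0: embeddings are unit length
print(np.dot(embs[0], embs[1]))                # dot product of unit vectors...
print(model.similarity(embs[0:1], embs[1:2]))  # ...equals their cosine similarity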

🚀 Quick Start

pip install -U sentence-transformers scikit-learn

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("dogukanvzr/bge-m3-ml-tr-specialized")

sentences = [
    "Accuracy refers to how close a model's predictions are to the actual values.",
    "Model accuracy indicates how well the predictions align with true labels.",
    "Feature engineering plays a critical role in machine learning pipelines."
]

# Encode every sentence into a 1024-dimensional, L2-normalized vector
embeddings = model.encode(sentences)

# Compare the first sentence against the other two
scores = cosine_similarity([embeddings[0]], embeddings[1:])
print(scores)
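
Since the first two sentences paraphrase each other while the third introduces a different topic, the first score should come out noticeably higher than the second.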

🧪 Training Details

  • Dataset: ml-paraphrase-tr
  • Size: 60,000 sentence pairs
  • Structure: sentence_0, sentence_1, label (a float between 0.0 and 1.0 indicating similarity)
  • Loss Function: CosineSimilarityLoss (internally uses MSELoss); see the training sketch after this list
  • Training Epochs: 3
  • Batch Size: 64
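
For reference, a fine-tune with this configuration could be reproduced roughly as follows using the classic fit API. This is a minimal sketch, not the actual training script: the two example pairs and their labels are placeholders, and only the loss, epochs, and batch size come from the details above.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-m3")

# Placeholder pairs; the real dataset is ml-paraphrase-tr (60,000 labeled pairs)
train_examples = [
    InputExample(texts=["Accuracy measures prediction quality.",
                        "Accuracy is how close predictions are to true values."], label=0.9),
    InputExample(texts=["Feature engineering is part of data preparation.",
                        "Overfitting increases generalization error."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# CosineSimilarityLoss regresses cosine(u, v) onto the label via MSE
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)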

📈 Training Log

Epoch   Step   Average Loss
0.5      500   0.0338
1.0     1000   0.0188
1.5     1500   0.0147
2.0     2000   0.0127
2.5     2500   0.0105

📊 Application Areas

This model is particularly well-suited for the following NLP and ML tasks in Turkish:

  • Semantic alignment in technical documents
  • Similarity detection in scientific and academic texts
  • Embedding-based information retrieval (see the semantic-search sketch after this list)
  • Paraphrase identification (detecting meaning-equivalent sentence pairs)
  • Semantic clustering for topic grouping
  • Intent matching in QA and chatbot systems
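
As an illustration of the retrieval use case mentioned above, here is a minimal semantic-search sketch built on sentence_transformers.util.semantic_search; the corpus and query are made-up examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("dogukanvzr/bge-m3-ml-tr-specialized")

corpus = [
    "Overfitting occurs when a model memorizes the training data.",
    "Regularization techniques reduce model complexity.",
    "Data augmentation increases the diversity of training samples."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "How can a model be kept from memorizing its training set?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns the top-k most similar corpus entries for each query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")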

💡 Evaluation Example

from sklearn.metrics.pairwise import cosine_similarity

s1 = "Machine learning algorithms learn from past data to make future predictions."
s2 = "The model performs inference based on learned patterns."
s3 = "The size of the dataset can affect the generalization capacity of the model."

# Compare s1 against s2 and s3 (reuses the model loaded in Quick Start)
embs = model.encode([s1, s2, s3])
sim = cosine_similarity([embs[0]], embs[1:])
print(sim)
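
Building on the scores just computed, paraphrase identification reduces to thresholding the similarity. The 0.8 cutoff below is an assumption for illustration only and should be tuned on labeled validation pairs:

THRESHOLD = 0.8  # assumed cutoff, not from this card; tune on held-out data

# `sim` comes from the snippet above: similarities of s1 vs. s2 and s3
for i, score in enumerate(sim[0], start=2):
    verdict = "paraphrase" if score >= THRESHOLD else "not a paraphrase"
    print(f"s1 vs s{i}: {score:.3f} -> {verdict}")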

⚙️ Development Environment

  • Python: 3.12.7
  • Sentence Transformers: 5.0.0
  • Transformers: 4.56.0.dev0
  • PyTorch: 2.7.1+cu128
  • Accelerate: 1.9.0
  • Datasets: 4.0.0
  • Tokenizers: 0.21.4
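
To approximate this environment, pinned installs along these lines should work; note that Transformers 4.56.0.dev0 is a development build, so it would have to be installed from source rather than from PyPI:

pip install sentence-transformers==5.0.0 torch==2.7.1 accelerate==1.9.0 datasets==4.0.0 tokenizers==0.21.4
pip install git+https://github.com/huggingface/transformers.git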

📚 Citation

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

⚠️ Limitations

  • The model is trained primarily on scientific/technical Turkish text and may underperform on casual, figurative, or conversational language.
  • It might struggle with cultural references, idioms, or sarcasm.
  • Although trained on high-quality paraphrased data, users should still review outputs critically.

📬 Contact & Feedback

For bug reports, suggestions, or contributions, please use the model's Community tab on Hugging Face.
