# bert-topic-classification-turkish

## Model Description
This is a fine-tuned BERT model for topic classification on Turkish text data. The model is trained on a custom dataset, Turkish_Conversations, consisting of Turkish customer support conversations. The model classifies text into the following 5 categories:
- Financial Services (Finansal Hizmetler)
- Account Operations (Hesap İşlemleri)
- Technical Support (Teknik Destek)
- Products and Sales (Ürün ve Satış)
- Returns and Exchanges (İade ve Değişim)
The model achieves an accuracy of 93.51% on the validation dataset.
## Usage
Below is an example of how to use the model for topic classification:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model = AutoModelForSequenceClassification.from_pretrained("GosamaIKU/bert-topic-classification-turkish")

# Example conversation
dataset = [
    {"conversation_id": 1, "speaker": "customer", "text": "Siparişim eksik geldi."},
    {"conversation_id": 1, "speaker": "representative", "text": "Hemen kontrol edip size bilgi vereceğim."},
    {"conversation_id": 1, "speaker": "customer", "text": "Anlayışınız için teşekkür ederim."}
]

# Combine the turns into a single input for topic analysis
combined_text = " ".join(item["text"] for item in dataset)
inputs = tokenizer(combined_text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Access topic classification results
logits = outputs.logits
predicted_class = logits.argmax(dim=1).item()
print(f"Predicted Topic Class ID: {predicted_class}")
```
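The predicted class ID can be mapped back to a human-readable category and a confidence score. The sketch below uses placeholder logits and a hypothetical label order (the authoritative mapping lives in `model.config.id2label`, which this example does not load); softmax converts the raw logits into class probabilities.

```python
import torch

# Hypothetical id→label mapping; the real one is stored in
# model.config.id2label, and this ordering is only an assumption.
id2label = {
    0: "Finansal Hizmetler",
    1: "Hesap İşlemleri",
    2: "Teknik Destek",
    3: "Ürün ve Satış",
    4: "İade ve Değişim",
}

# Placeholder logits standing in for a real model output
logits = torch.tensor([[-1.2, 0.3, 2.1, 0.0, -0.5]])

# Softmax turns raw logits into class probabilities
probs = torch.softmax(logits, dim=1)
predicted_class = probs.argmax(dim=1).item()

print(f"Predicted topic: {id2label[predicted_class]} "
      f"(p={probs[0, predicted_class].item():.2f})")
```

With real model outputs, replace the placeholder `logits` with `outputs.logits` from the example above.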
## Training Details
- Base Model: dbmdz/bert-base-turkish-cased
- Dataset: Turkish_Conversations (Custom dataset for Turkish customer support)
- Epochs: 5
- Batch Size: 8
- Learning Rate: 0.00005
- Accuracy: 93.51%
- Framework: PyTorch
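The hyperparameters above can be expressed as a Hugging Face `TrainingArguments` configuration. This is only a sketch of the setup implied by the listed values; the output directory and everything not listed above (optimizer, scheduler, evaluation strategy) are assumptions, not details from the model card.

```python
from transformers import TrainingArguments

# Configuration sketch based on the hyperparameters listed above.
# output_dir is a hypothetical path, not part of the original training setup.
training_args = TrainingArguments(
    output_dir="./bert-topic-classification-turkish",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)
```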
## Limitations
- The model may not perform well on text significantly different from the training data (e.g., informal or slang language).
- It is designed for topic classification and may not generalize to other NLP tasks like sentiment analysis or intent detection.
- Performance may degrade on very short or ambiguous texts.
## Model Files

This repository contains the following files:

- `config.json`: Model configuration file.
- `model.safetensors`: Model weights.
- `special_tokens_map.json`: Special tokens used by the tokenizer.
- `tokenizer_config.json`: Tokenizer configuration file.
- `vocab.txt`: Vocabulary file for the tokenizer.
## Links and Resources
- Base Model: dbmdz/bert-base-turkish-cased
- Zero-Shot Model (Optional): xlm-roberta-large-xnli
- Fine-Tuned Model: GosamaIKU/bert-topic-classification-turkish
## Dataset
The model was fine-tuned on a custom dataset named Turkish_Conversations, which consists of 2,695 Turkish customer support conversations. The dataset includes text labeled into the following categories:
- Financial Services
- Account Operations
- Technical Support
- Products and Sales
- Returns and Exchanges
The dataset is not currently included in this repository; if you wish to access it, please request it through the repository.
## License
This model is licensed under the MIT License. See the LICENSE file for more details.