bert-topic-classification-turkish

Model Description

This is a BERT model fine-tuned for topic classification on Turkish text. It was trained on Turkish_Conversations, a custom dataset of Turkish customer support conversations, and classifies text into the following 5 categories:

  1. Financial Services (Finansal Hizmetler)
  2. Account Operations (Hesap İşlemleri)
  3. Technical Support (Teknik Destek)
  4. Products and Sales (Ürün ve Satış)
  5. Returns and Exchanges (İade ve Değişim)

The model achieves an accuracy of 93.51% on the validation dataset.


Usage

Below is an example of how to use the model for topic classification:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model = AutoModelForSequenceClassification.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model.eval()

# Example conversation
dataset = [
    {"conversation_id": 1, "speaker": "customer", "text": "Siparişim eksik geldi."},
    {"conversation_id": 1, "speaker": "representative", "text": "Hemen kontrol edip size bilgi vereceğim."},
    {"conversation_id": 1, "speaker": "customer", "text": "Anlayışınız için teşekkür ederim."}
]

# Combine the conversation turns into a single input for topic analysis
combined_text = " ".join(item["text"] for item in dataset)
inputs = tokenizer(combined_text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring topic class
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted Topic Class ID: {predicted_class}")
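The predicted class ID can be mapped back to a human-readable topic. The mapping below follows the category order listed in the Model Description, but the actual ID-to-label assignment is an assumption here; verify it against the id2label field in the model's config.json (or model.config.id2label) before relying on it.

```python
# Hypothetical ID-to-label mapping in the order the categories are listed
# above; confirm against config.json's id2label before production use.
ID2LABEL = {
    0: "Finansal Hizmetler",
    1: "Hesap İşlemleri",
    2: "Teknik Destek",
    3: "Ürün ve Satış",
    4: "İade ve Değişim",
}

def class_id_to_topic(class_id: int) -> str:
    """Map a predicted class ID to its Turkish topic name."""
    return ID2LABEL.get(class_id, "Unknown")
```

For example, `class_id_to_topic(predicted_class)` would turn the integer printed above into a topic name.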

Training Details

  • Base Model: dbmdz/bert-base-turkish-cased
  • Dataset: Turkish_Conversations (Custom dataset for Turkish customer support)
  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 0.00005
  • Accuracy: 93.51%
  • Framework: PyTorch
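The hyperparameters above can be expressed as a Hugging Face TrainingArguments configuration. This is only a sketch: output_dir is a placeholder, and the original run may have used additional settings (warmup, weight decay, evaluation cadence) that are not listed in this card.

```python
from transformers import TrainingArguments

# Fine-tuning configuration matching the hyperparameters listed above.
# output_dir is a placeholder; all other values not shown are assumptions
# left at library defaults.
training_args = TrainingArguments(
    output_dir="bert-topic-classification-turkish",  # placeholder path
    num_train_epochs=5,             # Epochs: 5
    per_device_train_batch_size=8,  # Batch Size: 8
    learning_rate=5e-5,             # Learning Rate: 0.00005
)
```

These arguments would then be passed to a transformers.Trainer together with the tokenized Turkish_Conversations dataset.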

Limitations

  • The model may not perform well on text significantly different from the training data (e.g., informal or slang language).
  • It is designed for topic classification and may not generalize to other NLP tasks like sentiment analysis or intent detection.
  • Performance may degrade on very short or ambiguous texts.

Model Files

This repository contains the following files:

  • config.json: Model configuration file.
  • model.safetensors: Model weights.
  • special_tokens_map.json: Special tokens used in the tokenizer.
  • tokenizer_config.json: Tokenizer configuration file.
  • vocab.txt: Vocabulary file for the tokenizer.

Dataset

The model was fine-tuned on a custom dataset named Turkish_Conversations, which consists of 2,695 Turkish customer support conversations. The dataset includes text labeled into the following categories:

  • Financial Services
  • Account Operations
  • Technical Support
  • Products and Sales
  • Returns and Exchanges

The dataset is not currently included in this repository or published elsewhere.


License

This model is licensed under the MIT License. See the LICENSE file for more details.

Model Size

  • Parameters: ~111M
  • Tensor type: F32 (stored as safetensors)