bert-topic-classification-turkish

Model Description

This is a BERT model fine-tuned for topic classification on Turkish text. It was trained on Turkish_Conversations, a custom dataset of Turkish customer support conversations, and classifies text into the following 5 categories:

  1. Financial Services (Finansal Hizmetler)
  2. Account Operations (Hesap İşlemleri)
  3. Technical Support (Teknik Destek)
  4. Products and Sales (Ürün ve Satış)
  5. Returns and Exchanges (İade ve Değişim)

The model achieves an accuracy of 93.51% on the validation dataset.


Usage

Below is an example of how to use the model for topic classification:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model = AutoModelForSequenceClassification.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model.eval()

# Example conversation
dataset = [
    {"conversation_id": 1, "speaker": "customer", "text": "Siparişim eksik geldi."},
    {"conversation_id": 1, "speaker": "representative", "text": "Hemen kontrol edip size bilgi vereceğim."},
    {"conversation_id": 1, "speaker": "customer", "text": "Anlayışınız için teşekkür ederim."}
]

# Combine the conversation turns into a single input for topic analysis
combined_text = " ".join(item["text"] for item in dataset)
inputs = tokenizer(combined_text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring topic class
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted Topic Class ID: {predicted_class}")
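The predicted class ID can be mapped back to a human-readable topic. The mapping below follows the category order listed in the Model Description, but the actual ID-to-label assignment is an assumption here; verify it against the id2label field in the model's config.json (or model.config.id2label) before relying on it.

```python
# Hypothetical ID-to-label mapping in the order the categories are listed
# above; confirm against config.json's id2label before production use.
ID2LABEL = {
    0: "Finansal Hizmetler",
    1: "Hesap İşlemleri",
    2: "Teknik Destek",
    3: "Ürün ve Satış",
    4: "İade ve Değişim",
}

def class_id_to_topic(class_id: int) -> str:
    """Map a predicted class ID to its Turkish topic name."""
    return ID2LABEL.get(class_id, "Unknown")
```

For example, `class_id_to_topic(predicted_class)` would turn the integer printed above into a topic name.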

Training Details

  • Base Model: dbmdz/bert-base-turkish-cased
  • Dataset: Turkish_Conversations (Custom dataset for Turkish customer support)
  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 0.00005
  • Accuracy: 93.51%
  • Framework: PyTorch
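The hyperparameters above can be expressed as a Hugging Face TrainingArguments configuration. This is only a sketch: output_dir is a placeholder, and the original run may have used additional settings (warmup, weight decay, evaluation cadence) that are not listed in this card.

```python
from transformers import TrainingArguments

# Fine-tuning configuration matching the hyperparameters listed above.
# output_dir is a placeholder; all other values not shown are assumptions
# left at library defaults.
training_args = TrainingArguments(
    output_dir="bert-topic-classification-turkish",  # placeholder path
    num_train_epochs=5,             # Epochs: 5
    per_device_train_batch_size=8,  # Batch Size: 8
    learning_rate=5e-5,             # Learning Rate: 0.00005
)
```

These arguments would then be passed to a transformers.Trainer together with the tokenized Turkish_Conversations dataset.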

Limitations

  • The model may not perform well on text significantly different from the training data (e.g., informal or slang language).
  • It is designed for topic classification and may not generalize to other NLP tasks like sentiment analysis or intent detection.
  • Performance may degrade on very short or ambiguous texts.

Model Files

This repository contains the following files:

  • config.json: Model configuration file.
  • model.safetensors: Model weights.
  • special_tokens_map.json: Special tokens used in the tokenizer.
  • tokenizer_config.json: Tokenizer configuration file.
  • vocab.txt: Vocabulary file for the tokenizer.

Dataset

The model was fine-tuned on a custom dataset named Turkish_Conversations, which consists of 2,695 Turkish customer support conversations. The dataset includes text labeled into the following categories:

  • Financial Services
  • Account Operations
  • Technical Support
  • Products and Sales
  • Returns and Exchanges

The dataset is not currently included in this repository or published elsewhere.


License

This model is licensed under the MIT License. See the LICENSE file for more details.

Model Size

  • Parameters: ~111M
  • Tensor type: F32 (stored as safetensors)