language: tr
tags:
  - text-classification
  - customer-support
  - Turkish
datasets:
  - Turkish_Conversations
license: mit
model_name: bert-topic-classification-turkish
base_model: dbmdz/bert-base-turkish-cased
library_name: transformers
pipeline_tag: text-classification

bert-topic-classification-turkish

Model Description

This is a fine-tuned BERT model for topic classification of Turkish text. It was trained on Turkish_Conversations, a custom dataset of Turkish customer support conversations, and classifies text into the following 5 categories:

  1. Financial Services (Finansal Hizmetler)
  2. Account Operations (Hesap İşlemleri)
  3. Technical Support (Teknik Destek)
  4. Products and Sales (Ürün ve Satış)
  5. Returns and Exchanges (İade ve Değişim)
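If you need the class-ID-to-label mapping in code, it can be expressed as a plain dictionary. The ordering below assumes the labels follow the list above; verify against `model.config.id2label` before relying on it:

```python
# Hypothetical mapping, assuming class IDs follow the order listed in this card.
# Always confirm with model.config.id2label.
id2label = {
    0: "Finansal Hizmetler",   # Financial Services
    1: "Hesap İşlemleri",      # Account Operations
    2: "Teknik Destek",        # Technical Support
    3: "Ürün ve Satış",        # Products and Sales
    4: "İade ve Değişim",      # Returns and Exchanges
}
print(id2label[2])
```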

The model achieves an accuracy of 93.51% on the validation dataset.


Usage

Below is an example of how to use the model for topic classification:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model = AutoModelForSequenceClassification.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model.eval()

# Example conversation
dataset = [
    {"conversation_id": 1, "speaker": "customer", "text": "Siparişim eksik geldi."},
    {"conversation_id": 1, "speaker": "representative", "text": "Hemen kontrol edip size bilgi vereceğim."},
    {"conversation_id": 1, "speaker": "customer", "text": "Anlayışınız için teşekkür ederim."}
]

# Concatenate the turns into a single input for topic classification
combined_text = " ".join(item["text"] for item in dataset)
inputs = tokenizer(combined_text, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and map it to its label
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted topic: {model.config.id2label[predicted_class]} (class ID {predicted_class})")
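The raw logits can be turned into class probabilities with a softmax, which also gives a confidence score for the prediction. The logit values below are made up purely for illustration:

```python
import torch

# Made-up logits for the 5 classes (illustrative only, not real model output)
logits = torch.tensor([[0.1, 2.5, 0.3, 0.2, 0.4]])

probs = torch.softmax(logits, dim=-1)          # convert logits to probabilities
predicted_class = probs.argmax(dim=-1).item()  # index of the top class
confidence = probs[0, predicted_class].item()  # probability of that class
print(predicted_class, round(confidence, 3))
```

A low top-class probability can signal the "very short or ambiguous text" failure mode noted under Limitations.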

Training Details

  • Base Model: dbmdz/bert-base-turkish-cased
  • Dataset: Turkish_Conversations (Custom dataset for Turkish customer support)
  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-5
  • Accuracy: 93.51%
  • Framework: PyTorch
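A fine-tuning setup matching these hyperparameters could be sketched with the transformers Trainer as follows. This is an assumption-laden sketch, not the exact training script: `train_ds` and `eval_ds` are hypothetical tokenized datasets that this repository does not provide.

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Base model from the card, with a 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=5)

# Hyperparameters taken from the Training Details above
args = TrainingArguments(
    output_dir="bert-topic-classification-turkish",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

# train_ds / eval_ds are hypothetical placeholders for the tokenized dataset
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```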

Limitations

  • The model may not perform well on text significantly different from the training data (e.g., informal or slang language).
  • It is designed for topic classification and may not generalize to other NLP tasks like sentiment analysis or intent detection.
  • Performance may degrade on very short or ambiguous texts.

Model Files

This repository contains the following files:

  • config.json: Model configuration file.
  • model.safetensors: Model weights.
  • special_tokens_map.json: Special tokens used in the tokenizer.
  • tokenizer_config.json: Tokenizer configuration file.
  • vocab.txt: Vocabulary file for the tokenizer.

Dataset

The model was fine-tuned on a custom dataset named Turkish_Conversations, which consists of 2,695 Turkish customer support conversations. The dataset includes text labeled into the following categories:

  • Financial Services
  • Account Operations
  • Technical Support
  • Products and Sales
  • Returns and Exchanges

The dataset itself is not currently published in this repository.


License

This model is licensed under the MIT License. See the LICENSE file for more details.