---
language: tr
tags:
- text-classification
- customer-support
- Turkish
datasets:
- Turkish_Conversations
license: mit
model_name: bert-topic-classification-turkish
base_model: dbmdz/bert-base-turkish-cased
library_name: transformers
pipeline_tag: text-classification
---

# bert-topic-classification-turkish

## Model Description

This is a fine-tuned BERT model for topic classification of Turkish text. It was trained on **Turkish_Conversations**, a custom dataset of Turkish customer support conversations, and classifies text into the following 5 categories:

1. **Financial Services** (Finansal Hizmetler)
2. **Account Operations** (Hesap İşlemleri)
3. **Technical Support** (Teknik Destek)
4. **Products and Sales** (Ürün ve Satış)
5. **Returns and Exchanges** (İade ve Değişim)

The model achieves an accuracy of **93.51%** on the validation set.

---

## Usage

Below is an example of how to use the model for topic classification:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("GosamaIKU/bert-topic-classification-turkish")
model = AutoModelForSequenceClassification.from_pretrained("GosamaIKU/bert-topic-classification-turkish")

# Example conversation
dataset = [
    {"conversation_id": 1, "speaker": "customer", "text": "Siparişim eksik geldi."},  # "My order arrived incomplete."
    {"conversation_id": 1, "speaker": "representative", "text": "Hemen kontrol edip size bilgi vereceğim."},  # "I'll check right away and get back to you."
    {"conversation_id": 1, "speaker": "customer", "text": "Anlayışınız için teşekkür ederim."}  # "Thank you for your understanding."
]

# Combine the turns into a single input for topic analysis
combined_text = " ".join(item["text"] for item in dataset)
inputs = tokenizer(combined_text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# The highest-scoring logit is the predicted topic class
predicted_class = outputs.logits.argmax(dim=-1).item()
print(f"Predicted Topic Class ID: {predicted_class}")
```
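If the label names are stored in the checkpoint's `config.json`, the numeric ID can be mapped to a topic name via `model.config.id2label[predicted_class]`; otherwise only generic `LABEL_n` ids are available. For quick experiments, the high-level `pipeline` API should also work with this checkpoint and returns labels with scores directly. A minimal sketch (the example sentence is illustrative):

```python
from transformers import pipeline

# Build a text-classification pipeline from the fine-tuned checkpoint
classifier = pipeline("text-classification", model="GosamaIKU/bert-topic-classification-turkish")

# "Why was my credit card application declined?"
result = classifier("Kredi kartı başvurum neden reddedildi?")
print(result)  # e.g. [{"label": "...", "score": 0.97}]
```

The pipeline handles tokenization and converts logits to label/score pairs internally, so it is usually the simplest entry point.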
---

## Training Details

- **Base Model:** [dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased)
- **Dataset:** **Turkish_Conversations** (custom dataset of Turkish customer support conversations)
- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Validation Accuracy:** 93.51%
- **Framework:** PyTorch

---

## Limitations

- The model may not perform well on text that differs significantly from the training data (e.g., informal or slang-heavy language).
- It is designed for topic classification and may not generalize to other NLP tasks such as sentiment analysis or intent detection.
- Performance may degrade on very short or ambiguous texts.

---

## Model Files

This repository contains the following files:

- `config.json`: Model configuration.
- `model.safetensors`: Model weights.
- `special_tokens_map.json`: Special tokens used by the tokenizer.
- `tokenizer_config.json`: Tokenizer configuration.
- `vocab.txt`: Tokenizer vocabulary.

---

## Links and Resources

- **Base Model:** [dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased)
- **Zero-Shot Model (Optional):** [xlm-roberta-large-xnli](https://huggingface.co/joeddav/xlm-roberta-large-xnli)
- **Fine-Tuned Model:** [GosamaIKU/bert-topic-classification-turkish](https://huggingface.co/GosamaIKU/bert-topic-classification-turkish)

---

## Dataset

The model was fine-tuned on **Turkish_Conversations**, a custom dataset of 2,695 Turkish customer support conversations labeled with the following categories:

- Financial Services
- Account Operations
- Technical Support
- Products and Sales
- Returns and Exchanges

The dataset is not currently distributed with this repository.

---

## License

This model is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
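---

## Fine-Tuning Sketch

The training script itself is not published in this repository. For readers who want to approximate the setup, here is a minimal, hypothetical sketch using the `Trainer` API with the hyperparameters listed under Training Details. The file name `turkish_conversations.csv`, its `text` and `label` columns (with integer labels 0-4), and the 90/10 split are assumptions for illustration only.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical: a local CSV with "text" and integer "label" (0-4) columns
raw = load_dataset("csv", data_files="turkish_conversations.csv")["train"]
splits = raw.train_test_split(test_size=0.1, seed=42)  # assumed 90/10 split

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=5,  # the five topic categories
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = splits.map(tokenize, batched=True)

# Hyperparameters from the Training Details section above
args = TrainingArguments(
    output_dir="bert-topic-classification-turkish",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; accuracy would need a compute_metrics hook
```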