---
license: mit
language:
- km
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentiment
---

**This is a fine-tuned version of the XLM-RoBERTa model for sentiment analysis. It classifies Khmer text into two categories: Positive and Negative.**

**It accepts inputs of up to 512 tokens and reaches about 83% accuracy on a 400-sample Khmer test set.**

- **Task**: Sentiment analysis (binary classification).
- **Languages Supported**: Khmer.
- **Intended Use Cases**:
  - Analyzing customer reviews.
  - Social media sentiment detection.
- **Limitations**:
  - Performance may degrade on languages or domains not present in the training data.
  - Does not handle sarcasm or highly ambiguous inputs well.

The model was evaluated on a test set of 400 samples, achieving the following performance:

- **Test Accuracy**: 83.25%
- **Precision**: 83.55%
- **Recall**: 83.25%
- **F1 Score**: 83.25%

Confusion Matrix:

| Predicted\Actual | Negative | Positive |
|-------------------|----------|----------|
| **Negative**      | 166      | 42       |
| **Positive**      | 25       | 167      |

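The reported metrics can be cross-checked against the confusion matrix above. The sketch below is only an illustration: the card does not say how precision, recall, and F1 were averaged, so it assumes support-weighted averaging over the two classes, which happens to reproduce the reported figures. The variable names are chosen for this example.

```python
# Confusion matrix from the table above.
# Rows are predicted labels, columns are actual labels.
labels = ["Negative", "Positive"]
matrix = [
    [166, 42],   # predicted Negative
    [25, 167],   # predicted Positive
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

# Assumption: metrics are averaged with weights equal to each class's support.
precision = recall = f1 = 0.0
for i, label in enumerate(labels):
    tp = matrix[i][i]
    predicted = sum(matrix[i])               # row sum: times this class was predicted
    actual = sum(row[i] for row in matrix)   # column sum: true samples of this class
    p = tp / predicted
    r = tp / actual
    weight = actual / total
    precision += weight * p
    recall += weight * r
    f1 += weight * (2 * p * r / (p + r))

print(f"Accuracy:  {accuracy:.2%}")   # 83.25%
print(f"Precision: {precision:.2%}")  # 83.55%
print(f"Recall:    {recall:.2%}")     # 83.25%
print(f"F1 Score:  {f1:.2%}")         # 83.25%
```

With weighted averaging, recall always equals accuracy, which matches the identical 83.25% values reported for both.
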
The model supports a maximum sequence length of 512 tokens.

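For texts longer than the 512-token limit, one option besides truncation is to split the token IDs into overlapping windows and classify each window separately. The following is a minimal sketch of that idea, not part of the model or the `transformers` API: `split_into_windows` is a hypothetical helper, the overlap of 32 tokens is an arbitrary choice, and it assumes two positions per window are reserved for the special tokens the XLM-RoBERTa tokenizer adds around a single sequence.

```python
def split_into_windows(token_ids, max_tokens=512, num_special=2, overlap=32):
    """Split a list of token IDs into overlapping windows that fit the model.

    Hypothetical helper: leaves room for `num_special` special tokens
    per window and repeats `overlap` tokens between neighbouring windows.
    """
    window = max_tokens - num_special  # content tokens allowed per window
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the text
    return windows


# Stand-in for real token IDs: 1200 dummy tokens.
ids = list(range(1200))
chunks = split_into_windows(ids)
print([len(c) for c in chunks])  # [510, 510, 244]
```

How the per-window predictions are combined (for example, a majority vote or averaging the logits) is a separate design choice that the model card does not prescribe.
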
## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
model_id = "tykea/khmer-text-sentiment-analysis-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "អគុណCADT"
# Truncate inputs that exceed the 512-token limit.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for inference.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the class with the highest logit and map it to a label.
predictions = outputs.logits.argmax(dim=1)
labels_mapping = {0: 'negative', 1: 'positive'}
print("Predicted Class:", labels_mapping[predictions.item()])