---
license: mit
datasets:
- google-research-datasets/go_emotions
language:
- en
library_name: transformers
tags:
- sentiment
---

# Fine-Tuned MiniLM for GoEmotions Sentiment Analysis

This repository contains a fine-tuned version of Microsoft's MiniLM-v2 model, optimized for sentiment analysis on the GoEmotions dataset. The model is just **90MB**, making it well suited to memory-constrained environments.

The model classifies text into the following emotional/sentiment categories:

* anger
* approval
* confusion
* disappointment
* disapproval
* gratitude
* joy
* sadness
* neutral

These categories cover most of the sentiments commonly expressed in a sentence, which also makes the model useful for validating other sentiment analysis models.

Label mapping when using inference:

```
{
  "LABEL_0": "anger",
  "LABEL_1": "approval",
  "LABEL_2": "confusion",
  "LABEL_3": "disappointment",
  "LABEL_4": "disapproval",
  "LABEL_5": "gratitude",
  "LABEL_6": "joy",
  "LABEL_7": "sadness",
  "LABEL_8": "neutral"
}
```

## Why MiniLM?

MiniLM is a distilled version of larger language models like BERT and RoBERTa. It balances performance and efficiency:

* **Reduced Size:** MiniLM is significantly smaller than its parent models, making it faster to load and deploy, especially in resource-constrained environments.
* **Comparable Performance:** Despite its compact size, MiniLM maintains high accuracy on various natural language processing (NLP) tasks, including sentiment analysis.
* **Distillation Power:** MiniLM's distillation technique transfers the essential knowledge of larger models into the smaller one, making it practical for real-world applications.

## GoEmotions Dataset

Dataset: `google-research-datasets/go_emotions`

The GoEmotions dataset is a valuable resource for sentiment analysis. It consists of thousands of Reddit comments labeled with fine-grained emotion categories; this model uses the nine classes listed above.
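
The card does not document how the nine classes were selected from the dataset's labels. As a minimal sketch, assuming each record carries a `text` string and a `labels` list of emotion names (both field names are hypothetical here), one plausible preprocessing step keeps single-label comments that fall in the nine classes and converts each label to its index from the label mapping above. The helper `to_nine_class` and the toy records are illustrative only and are not the author's actual pipeline.

```python
# The nine classes, in the index order given by the label mapping (LABEL_0 ... LABEL_8).
LABELS = ["anger", "approval", "confusion", "disappointment",
          "disapproval", "gratitude", "joy", "sadness", "neutral"]
LABEL2ID = {name: idx for idx, name in enumerate(LABELS)}


def to_nine_class(records):
    """Keep single-label records whose label is one of the nine classes.

    `records` is an iterable of dicts with the assumed keys `text` (str)
    and `labels` (list of emotion names). Returns (text, class_id) pairs.
    """
    kept = []
    for record in records:
        labels = record["labels"]
        # Skip multi-label comments and emotions outside the nine classes.
        if len(labels) == 1 and labels[0] in LABEL2ID:
            kept.append((record["text"], LABEL2ID[labels[0]]))
    return kept


# Made-up toy records, not real GoEmotions rows.
toy_records = [
    {"text": "Thanks a lot, this helped!", "labels": ["gratitude"]},
    {"text": "Wow, I did not expect that.", "labels": ["surprise"]},
    {"text": "Ugh, this is so unfair.", "labels": ["anger", "disapproval"]},
    {"text": "The meeting is at noon.", "labels": ["neutral"]},
]

print(to_nine_class(toy_records))
```

Dropping multi-label comments is only one option; another would be to keep the first label that belongs to the nine classes. Which rule was used for this model is not stated in the card.
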
This dataset's wide variety of emotional expression makes it a good choice for training a versatile sentiment analysis model.

## Training Procedure

1. **Data Preprocessing:** The GoEmotions dataset was preprocessed to ensure consistency and remove noise.
2. **Tokenizer:** The MiniLM-v2 tokenizer was used to convert text into numerical representations suitable for the model.
3. **Fine-Tuning:** The MiniLM-v2 model was fine-tuned on the GoEmotions dataset using a standard training loop, adjusting the model's parameters to optimize sentiment classification performance.
4. **Evaluation:** The fine-tuned model was evaluated on a held-out test set to measure its accuracy and generalization.

## How to Use This Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Class names in the same order as the model's output indices (LABEL_0 ... LABEL_8)
required_sentiments = ['anger', 'approval', 'confusion', 'disappointment', 'disapproval',
                       'gratitude', 'joy', 'sadness', 'neutral']

# Load the fine-tuned model and its tokenizer from a local directory
model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
tokenizer = AutoTokenizer.from_pretrained('./saved_model')

text = "How can you be so careless"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=128)

# Run inference without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).item()

# Map the predicted class index to its sentiment name
label_mapping = {idx: sentiment for idx, sentiment in enumerate(required_sentiments)}
predicted_sentiment = label_mapping[predictions]

print(f'Text: {text}')
print(f'Predicted Sentiment: {predicted_sentiment}')
```