dpkrm committed on
Commit dd9d7dd · 1 Parent(s): cee7848
Files changed (7)
  1. .gitignore +4 -0
  2. README.md +49 -0
  3. config.json +37 -0
  4. pytorch_model.bin +3 -0
  5. special_tokens_map.json +7 -0
  6. tokenizer_config.json +15 -0
  7. vocab.txt +0 -0
.gitignore ADDED
@@ -0,0 +1,4 @@
+ .DS_Store
+ dataset.png
+ loss_value.png
+ training_accuracy.png
README.md CHANGED
@@ -1,3 +1,52 @@
 ---
 license: apache-2.0
 ---
+
+ # Sentiment Analysis with BERT: Nepali Sentiment Dataset
+
+ This repository contains code for training and evaluating a sentiment analysis model based on BERT (Bidirectional Encoder Representations from Transformers), fine-tuned on the Nepali Sentiment Dataset. The model achieves an accuracy of 99.75% on the test dataset.
+
+ ## Dataset
+
+ The dataset used for training and testing the sentiment analysis model is balanced and stored in CSV format; it is loaded with the `pandas` library. The training set contains 2,084 samples and the test set contains 2,001 samples, both balanced across the sentiment classes.
+ ![Dataset distribution](dataset.png)
+
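+ As a minimal sketch of this loading step (the CSV file names below are assumptions, not paths taken from this repository):
+
+ ```python
+ import pandas as pd
+
+ # Hypothetical file names; substitute the actual dataset paths.
+ train_df = pd.read_csv("train.csv")
+ test_df = pd.read_csv("test.csv")
+
+ print(len(train_df))  # expected: 2084
+ print(len(test_df))   # expected: 2001
+ ```
+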
+ ## Model
+
+ BERT is used for sequence classification and is loaded from the `bert-base-multilingual-uncased` pre-trained checkpoint. The model is initialized with `num_labels=3`, one label for each sentiment class: positive, negative, and neutral.
+
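+ A minimal sketch of this initialization, using the standard `transformers` API:
+
+ ```python
+ from transformers import BertForSequenceClassification, BertTokenizer
+
+ # Three sentiment classes: positive, negative, neutral.
+ model = BertForSequenceClassification.from_pretrained(
+     "bert-base-multilingual-uncased", num_labels=3
+ )
+ tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")
+ ```
+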
+ ## Preprocessing
+
+ The dataset is preprocessed by the `NepaliSentimentDataset` class, which takes the texts, labels, tokenizer, and maximum sequence length as inputs. The texts are cleaned with regular expressions to remove special characters, usernames, and extra whitespace. The `tokenizer` from the Hugging Face `transformers` library then tokenizes the texts and converts them into input IDs and attention masks. Each preprocessed example is returned as a dictionary containing the input IDs, attention mask, and label.
+
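+ A simplified sketch of what such a dataset class might look like (the exact cleaning patterns are assumptions, not this repository's code):
+
+ ```python
+ import re
+
+ import torch
+ from torch.utils.data import Dataset
+
+
+ class NepaliSentimentDataset(Dataset):
+     """Cleans raw text and tokenizes it into BERT inputs."""
+
+     def __init__(self, texts, labels, tokenizer, max_len):
+         self.texts = texts
+         self.labels = labels
+         self.tokenizer = tokenizer
+         self.max_len = max_len
+
+     def _clean(self, text):
+         text = re.sub(r"@\w+", "", text)          # remove usernames
+         text = re.sub(r"[^\w\s]", "", text)       # remove special characters
+         return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
+
+     def __len__(self):
+         return len(self.texts)
+
+     def __getitem__(self, idx):
+         encoding = self.tokenizer(
+             self._clean(self.texts[idx]),
+             max_length=self.max_len,
+             padding="max_length",
+             truncation=True,
+             return_tensors="pt",
+         )
+         return {
+             "input_ids": encoding["input_ids"].squeeze(0),
+             "attention_mask": encoding["attention_mask"].squeeze(0),
+             "labels": torch.tensor(self.labels[idx], dtype=torch.long),
+         }
+ ```
+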
+ ## Training
+
+ The model is trained with the `train_model` function, which takes the model, train dataloader, and test dataloader as inputs. Training runs for up to 10 epochs with an early stopping mechanism, using the AdamW optimizer with a learning rate of 2e-5 and an epsilon of 1e-8. Additional connection layers are inserted before the classification layer of the BERT model. After each epoch, the model is evaluated on the test dataset; a sketch of the loop follows.
+
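+ A rough sketch of such a loop (the early-stopping criterion and `patience` value are assumptions, and the additional connection layers mentioned above are omitted):
+
+ ```python
+ import torch
+ from torch.optim import AdamW
+
+
+ def train_model(model, train_dataloader, test_dataloader, epochs=10, patience=2):
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model.to(device)
+     optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
+
+     best_accuracy, stale_epochs = 0.0, 0
+     for epoch in range(epochs):
+         model.train()
+         for batch in train_dataloader:
+             batch = {k: v.to(device) for k, v in batch.items()}
+             loss = model(**batch).loss  # loss is computed from the `labels` key
+             loss.backward()
+             optimizer.step()
+             optimizer.zero_grad()
+
+         # Evaluate on the test set after each epoch.
+         model.eval()
+         correct = total = 0
+         with torch.no_grad():
+             for batch in test_dataloader:
+                 batch = {k: v.to(device) for k, v in batch.items()}
+                 preds = model(**batch).logits.argmax(dim=-1)
+                 correct += (preds == batch["labels"]).sum().item()
+                 total += batch["labels"].size(0)
+         accuracy = correct / total
+
+         # Early stopping: halt when test accuracy stops improving.
+         if accuracy > best_accuracy:
+             best_accuracy, stale_epochs = accuracy, 0
+         else:
+             stale_epochs += 1
+             if stale_epochs >= patience:
+                 break
+     return best_accuracy
+ ```
+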
+ ## Training Progress and Evaluation Metrics
+
+ This section tracks the training progress of the model through graphs of the loss and accuracy values recorded over the course of training.
+
+ ### Loss Value Graph
+
+ The graph below shows how the loss value varies across epochs, visualizing the convergence of the model during training.
+
+ ![Loss value graph](loss_value.png)
+
+ ### Accuracy Value Graph
+
+ The following graph shows the accuracy achieved by the model during training, giving a clear picture of how its performance improves over time.
+
+ ![Training accuracy graph](training_accuracy.png)
+
+ Together, these graphs give a visual summary of the training progress and performance of the sentiment analysis model.
+
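+ As a sketch of how such graphs could be produced (assuming per-epoch `losses` and `accuracies` lists are collected during training; the output file names match the images referenced above):
+
+ ```python
+ import matplotlib.pyplot as plt
+
+
+ def save_history_plots(losses, accuracies):
+     """Save per-epoch loss and accuracy curves as PNG files."""
+     epochs = range(1, len(losses) + 1)
+
+     plt.figure()
+     plt.plot(epochs, losses, marker="o")
+     plt.xlabel("Epoch")
+     plt.ylabel("Loss")
+     plt.title("Loss Value Graph")
+     plt.savefig("loss_value.png")
+     plt.close()
+
+     plt.figure()
+     plt.plot(epochs, accuracies, marker="o")
+     plt.xlabel("Epoch")
+     plt.ylabel("Accuracy")
+     plt.title("Accuracy Value Graph")
+     plt.savefig("training_accuracy.png")
+     plt.close()
+ ```
+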
+ ## Results
+
+ After training, the model achieves an accuracy of 99.75% on the test dataset.
+
+ ## Saving the Model
+
+ The trained model and tokenizer are saved with the `save_pretrained` method from the Hugging Face `transformers` library, both into the `./model` directory.
+
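+ A minimal sketch of the saving step, along with how the artifacts can be reloaded later:
+
+ ```python
+ # Save the fine-tuned model and tokenizer side by side.
+ model.save_pretrained("./model")
+ tokenizer.save_pretrained("./model")
+
+ # Reload them later from the same directory.
+ from transformers import BertForSequenceClassification, BertTokenizer
+
+ model = BertForSequenceClassification.from_pretrained("./model")
+ tokenizer = BertTokenizer.from_pretrained("./model")
+ ```
+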
+ ---
+
+ **Note:** The code described here is a simplified version for demonstration purposes; additional error handling, logging, and hyperparameter tuning can be added for production use.
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "_name_or_path": "bert-base-uncased",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.30.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00de4aecbec3d003d7d94334a955fa58f25e68ecc2910bd30b98dd9c79c85172
+ size 438010997
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff