Karim-Gamal committed on
Commit df587f8 · 1 Parent(s): 75eb0ed

Update README.md

Files changed (1)
  1. README.md +82 -4
README.md CHANGED
@@ -2,11 +2,89 @@
  license: apache-2.0
  language:
  - en
- - fr
- - it
  - es
+ - it
+ - fr
  metrics:
  - f1
  ---
- Acc : 42.672 %
- Mac-F1 : 30.476 %
+
+ # Federated Learning Based Multilingual Emoji Prediction
+
+ This repository contains code for training and evaluating transformer-based models for uni- and multilingual emoji prediction in clean and attack scenarios using federated learning. This work is described in the paper "Federated Learning-Based Multilingual Emoji Prediction in Clean and Attack Scenarios."
+
+ # Abstract
+
+ Federated learning is a growing field in the machine learning community due to its decentralized and private design. Model training in federated learning is distributed over multiple clients, giving access to large amounts of client data while maintaining privacy. A server then aggregates the training done on these clients without access to their data, which could include the emojis widely used on social media services and instant messaging platforms to express users' sentiments. This paper proposes federated learning-based multilingual emoji prediction in both clean and attack scenarios. Emoji prediction data were crawled from both Twitter and the SemEval emoji datasets. These data are used to train and evaluate transformer models of different sizes, including a sparsely activated transformer, under the assumption of either clean data in all clients or data poisoned via a label flipping attack in some clients. Experimental results on these models show that federated learning in either clean or attacked scenarios performs similarly to centralized training in multilingual emoji prediction on seen and unseen languages under different data sources and distributions. Our trained transformers perform better than other techniques on the SemEval emoji dataset, in addition to offering the privacy and distribution benefits of federated learning.
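As a rough illustration of the two mechanisms the abstract names, server-side aggregation and label-flipping poisoning, here is a minimal sketch only. It assumes a FedAvg-style weighted average (the model name in this README mentions FedAvg, but the actual aggregation code is not shown here). The function names `fed_avg` and `flip_labels`, the toy parameter dictionaries, and the cyclic-shift flipping rule are all hypothetical choices made for the example, not the paper's implementation.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Average per-client parameter dicts, weighted by local sample counts."""
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_params[0]:
        averaged[name] = sum(
            params[name] * (size / total)
            for params, size in zip(client_params, client_sizes)
        )
    return averaged

def flip_labels(labels, num_classes, shift=1):
    """Hypothetical label-flipping rule: shift every label to another class."""
    return (np.asarray(labels) + shift) % num_classes

# Two toy clients; the second holds three times as much data.
clients = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([1.0])},
]
sizes = [100, 300]
global_params = fed_avg(clients, sizes)
print(global_params["w"], global_params["b"])

# A poisoned client would train on flipped labels (num_classes is a toy value).
poisoned = flip_labels([0, 5, 19], num_classes=20)
print(poisoned)
```

The server in this sketch only ever sees parameter arrays and sample counts, never the clients' text, which is the privacy property the abstract describes.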
+
+ # Performance
+ > * Accuracy: 42.672 %
+ > * Macro-F1: 30.476 %
+ > * Also see our [GitHub Repo](https://github.com/kareemgamalmahmoud/FEDERATED-LEARNING-BASED-MULTILINGUAL-EMOJI-PREDICTION-IN-CLEAN-AND-ATTACK-SCENARIOS)
+
+ # Dependencies
+ > * Python 3.6+
+ > * PyTorch 1.7.0+
+ > * Transformers 4.0.0+
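The minimum versions listed above can be checked programmatically. The following is a small sketch under stated assumptions: the helper names `version_tuple` and `meets_minimum` and the `MINIMUMS` table are hypothetical, and the parsing only handles simple dotted version strings with optional suffixes.

```python
import re

# Minimum versions copied from the dependency list above.
MINIMUMS = {"python": "3.6", "torch": "1.7.0", "transformers": "4.0.0"}

def version_tuple(version, width=3):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts = parts[:width]
    # Pad short versions such as '3.6' so tuples compare position by position.
    return tuple(parts + [0] * (width - len(parts)))

def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("1.13.1+cu117", MINIMUMS["torch"]))
print(meets_minimum("3.4.0", MINIMUMS["transformers"]))
```

In a real environment the installed strings would come from `torch.__version__` and `transformers.__version__`.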
+
+ # Usage
+
+ > To use the model, first install the `transformers` package from Hugging Face:
+ ```bash
+ pip install transformers
+ ```
+
+ > Then import the required packages and load the model and tokenizer:
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import numpy as np
+ import urllib.request
+ import csv
+ ```
+
+ ```python
+ MODEL = "Karim-Gamal/MMiniLM-L12-finetuned-emojis-1-client-toxic-FedAvg-non-IID-Fed"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL)
+ ```
+
+ > Once you have the tokenizer and model, you can preprocess your text and pass it to the model for prediction:
+
+ ```python
+ # Preprocess text (username and link placeholders)
+ def preprocess(text):
+     new_text = []
+     for t in text.split(" "):
+         t = '@user' if t.startswith('@') and len(t) > 1 else t
+         t = 'http' if t.startswith('http') else t
+         new_text.append(t)
+     return " ".join(new_text)
+
+ text = "Hello world"
+ text = preprocess(text)
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ # output[0] holds the logits; [0] selects the single example in the batch
+ scores = output[0][0].detach().numpy()
+ ```
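The two steps around the model call can be tried without downloading anything. A sequence classification model returns one raw score (logit) per label, so a softmax is needed if probabilities are wanted. The sketch below is illustrative only: `softmax` is a hand-written NumPy helper, and `fake_logits` holds made-up values, not output from this model.

```python
import numpy as np

def preprocess(text):
    """Replace user mentions and links with the placeholders used above."""
    tokens = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        tokens.append(t)
    return " ".join(tokens)

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# A lone "@" is left untouched because of the len(t) > 1 check.
cleaned = preprocess("@alice look at https://example.com @ wow")
print(cleaned)

fake_logits = np.array([2.0, 1.0, 0.1])  # hypothetical values, not model output
probs = softmax(fake_logits)
print(probs, probs.sum())
```

Because softmax is monotonic, ranking the logits and ranking the probabilities give the same order; only the displayed values differ.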
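+
+ > The `scores` variable contains the model's raw output score (logit) for each possible emoji label, not a normalized probability; applying a softmax would convert the scores to probabilities without changing their ranking. To get the top k predictions, you can use the following code: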
+
+ ```python
+ # download label mapping
+ labels = []
+ mapping_link = "https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emoji/mapping.txt"
+ with urllib.request.urlopen(mapping_link) as f:
+     html = f.read().decode('utf-8').split("\n")
+     csvreader = csv.reader(html, delimiter='\t')
+ labels = [row[1] for row in csvreader if len(row) > 1]
+
+ k = 3  # number of top predictions to show
+ ranking = np.argsort(scores)
+ ranking = ranking[::-1]
+ for i in range(k):
+     l = labels[ranking[i]]
+     s = scores[ranking[i]]
+     print(f"{i+1}) {l} {np.round(float(s), 4)}")
+ ```
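The parsing and ranking logic above can also be exercised offline. This is a sketch under assumptions: `mapping_text` is a hypothetical two-column stand-in for the downloaded mapping file, the `scores` array is made up, and `parse_labels` and `top_k` are illustrative helper names rather than part of any library.

```python
import csv
import numpy as np

# Hypothetical stand-in for the downloaded file: index, tab, label.
mapping_text = "0\tred_heart\n1\theart_eyes\n2\ttears_of_joy\n3\ttwo_hearts\n"

def parse_labels(text):
    """Read tab-separated rows and keep the second column as the label."""
    rows = csv.reader(text.split("\n"), delimiter="\t")
    # The trailing newline yields an empty row, which the length check skips.
    return [row[1] for row in rows if len(row) > 1]

def top_k(scores, labels, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [(labels[i], float(scores[i])) for i in order]

labels = parse_labels(mapping_text)
scores = np.array([0.10, 0.55, 0.05, 0.30])  # made-up scores for illustration
for rank, (label, score) in enumerate(top_k(scores, labels, k=3), start=1):
    print(f"{rank}) {label} {round(score, 4)}")
```

Wrapping the ranking in a function makes it easy to confirm that `len(labels)` matches `len(scores)` before indexing, which guards against a mapping file that does not match the model's label set.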
+
+ ## Note: the source for the code above is [cardiffnlp/twitter-roberta-base-emoji](https://huggingface.co/cardiffnlp/twitter-roberta-base-emoji)