system HF staff committed on
Commit
3b79712
·
1 Parent(s): 067f67b

Update README.md

Files changed (1)
  1. README.md +67 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ language: "en"
+ thumbnail: "https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg"
+ tags:
+ - India
+ - politics
+ - tweets
+ - BJP
+ - Congress
+ - AAP
+ - pytorch
+ - gpt2
+ - lm-head
+ - text-generation
+ license: "Apache"
+ datasets:
+ - Twitter
+ - IndianPolitics
+ ---
+
+ # Model name
+ Indian Political Tweets LM Medium (Based on GPT2-Medium)
+
+ ## Model description
+
+ This is a GPT2 language model with an LM head, fine-tuned on tweets crawled from Twitter handles that predominantly belong to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.
+
+ This model is fine-tuned from GPT2-medium rather than the vanilla GPT2 model. It has more parameters, but it models language slightly better.
+
+ ## Intended uses & limitations
+ This fine-tuned model can be used to generate tweets related to Indian politics.
+ #### How to use
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+
+ # Load the fine-tuned tokenizer and model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
+ model = AutoModelForCausalLM.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
+
+ # Wrap both in a text-generation pipeline
+ text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
+
+ # Prompt that the model will continue
+ init_sentence = "India will always be"
+
+ print(text_generator(init_sentence))
+ ```
+
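The pipeline call above relies on the default generation settings. As a minimal sketch, assuming you want several sampled continuations instead of a single one, the helper below passes standard generation arguments (`do_sample`, `top_k`, `top_p`, `max_length`, `num_return_sequences`) through the pipeline. The function names `sampling_kwargs` and `generate_tweets` and the specific values are illustrative assumptions, not settings recommended by the model author.

```python
def sampling_kwargs(num_tweets=3, max_length=60):
    """Return generation settings for sampled (non-greedy) decoding.

    The values are illustrative defaults, not tuned for this model.
    """
    return {
        "do_sample": True,                    # sample instead of greedy decoding
        "top_k": 50,                          # keep only the 50 most likely tokens
        "top_p": 0.95,                        # nucleus sampling cutoff
        "max_length": max_length,             # prompt plus continuation, in tokens
        "num_return_sequences": num_tweets,   # number of candidate tweets
    }


def generate_tweets(prompt, num_tweets=3, seed=None):
    """Generate several candidate tweets that continue `prompt`."""
    # Imported lazily so sampling_kwargs() works without transformers installed.
    from transformers import pipeline, set_seed

    if seed is not None:
        set_seed(seed)  # make the sampling reproducible
    generator = pipeline(
        "text-generation", model="bagdaebhishek/IndianPoliticalTweetsLM"
    )
    outputs = generator(prompt, **sampling_kwargs(num_tweets))
    return [out["generated_text"] for out in outputs]
```

Calling `generate_tweets("India will always be", seed=42)` downloads the model on first use and returns a list of strings. Lowering `top_p` or `top_k` generally makes the output more conservative.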
+ #### Limitations and bias
+ 1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I've cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the generated text.
+ 2. I took care to remove tweets from Twitter handles that are not very influential, but because the data was not curated by hand, some artefacts such as "-sent via NamoApp" may remain.
+ 3. Like any language model trained on real-world data, this model exhibits some biases that are, unfortunately, part of the political discourse on Twitter. Please keep this in mind when using the output from this model.
+
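Artefacts like the one mentioned in point 2 can be stripped from generated text after the fact. The following is a minimal sketch, assuming a simple regular-expression filter is enough. Only the "-sent via NamoApp" pattern comes from the text above; the names `ARTEFACT_PATTERNS` and `clean_generated_text` are hypothetical, and other artefacts would need their own patterns.

```python
import re

# Known artefact patterns. Only "-sent via NamoApp" is taken from the model card;
# extend this list with whatever else shows up in your own generations.
ARTEFACT_PATTERNS = [
    re.compile(r"\s*-\s*sent via NamoApp\s*", re.IGNORECASE),
]


def clean_generated_text(text):
    """Remove known artefacts and collapse the whitespace they leave behind."""
    for pattern in ARTEFACT_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

This only post-processes the output. It does not change what the model has learned, so the language and bias caveats above still apply.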
+ ## Training data
+ I used the pre-trained gpt2-medium model from the Hugging Face transformers repository and fine-tuned it on a custom data set crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.
+
+ ## Training procedure
+
+ For pre-processing, I removed tweets from handles that are not very influential in their cluster. To do this, I calculated eigenvector centrality on the Twitter graph and pruned handles whose centrality fell below a certain threshold. This threshold was set manually after experimenting with different values.
+
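The pruning step can be illustrated with a small sketch. It is not the code used to build this model: it assumes an undirected graph given as a list of handle pairs, computes eigenvector centrality by plain power iteration, and keeps handles at or above a threshold. The function names, the example threshold, and the choice to scale scores so the top handle gets 1.0 are all assumptions made for illustration.

```python
def eigenvector_centrality(edges, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    `edges` is an iterable of (u, v) pairs, treated as undirected.
    Scores are scaled so that the most central node has score 1.0.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Iterate with (A + I): adding a node's own score keeps the dominant
        # eigenvector of A but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in neighbours[n]) for n in neighbours}
        norm = max(new.values())
        new = {n: s / norm for n, s in new.items()}
        converged = max(abs(new[n] - score[n]) for n in new) < tol
        score = new
        if converged:
            break
    return score


def prune_handles(edges, threshold):
    """Return the handles whose centrality is at least `threshold`."""
    scores = eigenvector_centrality(edges)
    return {handle for handle, s in scores.items() if s >= threshold}
```

On a toy star graph such as `[("hub", "a"), ("hub", "b"), ("hub", "c")]`, the hub scores highest and `prune_handles(edges, threshold=0.8)` keeps only the hub. On real data the threshold has to be tuned by hand, as described above.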
+ I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.
+
+ ### Hardware
+ 1. GPU: GTX 1080 Ti
+ 2. CPU: Ryzen 3900X
+ 3. RAM: 32GB
+
+ This model took roughly 36 hours to fine-tune.
+