---
language: "en"
thumbnail: "https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg"
tags:
- India
- politics
- tweets
- BJP
- Congress
- AAP
- pytorch
- gpt2
- lm-head
- text-generation
license: "Apache"
datasets:
- Twitter
- IndianPolitics
---

# Model name
Indian Political Tweets LM Medium (Based on GPT2-Medium)

## Model description

This is a GPT2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.

This model is fine-tuned from GPT2-medium rather than the base GPT2 model. It has more parameters, but it models language slightly better.

## Intended uses & limitations
This fine-tuned model can be used to generate tweets related to Indian politics.

#### How to use

```python
# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
# Note: AutoModelWithLMHead is deprecated in recent versions of
# transformers; AutoModelForCausalLM is the equivalent for GPT2.
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
model = AutoModelWithLMHead.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

# Wrap the model in a text-generation pipeline.
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate a continuation from a seed sentence.
init_sentence = "India will always be"

print(text_generator(init_sentence))
```

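The pipeline also accepts the usual text-generation arguments; the values below are illustrative, not settings recommended by this card:

```python
# Sample several longer continuations; max_length, do_sample and
# num_return_sequences are standard pipeline arguments (placeholder values).
outputs = text_generator(init_sentence, max_length=60, do_sample=True,
                         num_return_sequences=3)
for out in outputs:
    print(out["generated_text"])
```
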
#### Limitations and bias
1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I have cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the generated text.
2. I have taken care to remove tweets from Twitter handles that are not very influential, but since the data is not curated by hand there may be some artefacts such as "-sent via NamoApp".
3. Like any language model trained on real-world data, this model also exhibits biases which are unfortunately part of the political discourse on Twitter. Please keep this in mind while using the output of this model.

## Training data
I used the pre-trained gpt2-medium model from the Hugging Face transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the pro-BJP and anti-BJP clusters mentioned in the blog.

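The training script itself is not included in this card; the sketch below shows one plausible way to fine-tune gpt2-medium on a file of tweets with the transformers Trainer API. The file name tweets.txt and all hyperparameter values are placeholders, not the settings actually used.

```python
from transformers import (AutoModelWithLMHead, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelWithLMHead.from_pretrained("gpt2-medium")

# One cleaned English tweet per line (placeholder path).
train_dataset = TextDataset(tokenizer=tokenizer, file_path="tweets.txt",
                            block_size=128)
# mlm=False selects the causal-LM objective used by GPT2.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="IndianPoliticalTweetsLM",  # placeholder output directory
    num_train_epochs=1,                    # placeholder value
    per_device_train_batch_size=2,         # placeholder value
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```
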
## Training procedure

For pre-processing, I removed tweets from handles that are not very influential within their cluster. I did this by computing eigenvector centrality on the Twitter graph and pruning handles whose centrality fell below a certain threshold, which was set manually after experimenting with different values.

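A minimal sketch of this centrality-based pruning, assuming the interaction graph is available as a networkx graph; the edges and the threshold value are placeholders:

```python
import networkx as nx

# Toy interaction graph: nodes are Twitter handles, edges are
# follower/retweet relations (placeholder data).
G = nx.DiGraph([
    ("handle_a", "handle_b"),
    ("handle_b", "handle_c"),
    ("handle_c", "handle_a"),
    ("handle_d", "handle_a"),
])

# Eigenvector centrality scores a handle by the influence of the
# handles connected to it.
centrality = nx.eigenvector_centrality(G, max_iter=1000)

# Prune handles whose centrality falls below a manually chosen
# threshold (placeholder value).
THRESHOLD = 0.1
influential_handles = {h for h, c in centrality.items() if c >= THRESHOLD}
```
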
I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.

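The card does not say how the tweets were split by language; one common approach is a language-identification library such as langdetect (an assumption, not necessarily what was used here):

```python
from langdetect import LangDetectException, detect

tweets = [
    "India will always be a democracy",
    "यह एक हिंदी ट्वीट है",  # a Hindi tweet, filtered out below
]

def is_english(tweet: str) -> bool:
    # detect() raises on empty or undecidable strings, so guard the call.
    try:
        return detect(tweet) == "en"
    except LangDetectException:
        return False

english_tweets = [t for t in tweets if is_english(t)]
```
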
### Hardware
1. GPU: GTX 1080Ti
2. CPU: Ryzen 3900x
3. RAM: 32GB

This model took roughly 36 hours to fine-tune.