---
title: Sentimentr
desc: A tool to visualize bias in news headlines about presidential candidates
published: true
date_published: 2020-01-12
tags: nlp
figs:
  linear:
    src: /assets/article-count-raw-and-opinion.png
    alt: Linear Graph of Article Counts
    caption: Bar plots of how many articles had a given candidate's name in it. Top is a raw count of total articles. Bottom separates it by news group.
    full_width: yes
    left: no
  log:
    src: /assets/article-count-log.png
    alt: Logarithmic Graph of Article Counts
    caption: Logarithmic barplots of how many articles had a given candidate's name in it. Blue represents CNN, orange is Fox News, and green is The New York Times.
    full_width: yes
    left: no
  bar:
    src: /assets/news-barplot-average-scores.png
    alt: Sentiment scores over time
    caption: Bar plots of average sentiment scores separated by model and candidate.
    left: no
    full_width: yes
  avg_over_time:
    src: /assets/average-4wk-scores-over-time-top-6.png
    alt: Sentiment scores over time
    caption: Line plots of sentiment scores separated by model and candidate.
    left: no
    full_width: yes
  avg_w_debates:
    src: /assets/average-4wk-scores-over-time-top-6-with-debates.png
    alt: Sentiment scores over time
    caption: Line plots of sentiment scores separated by model and candidate with debates superimposed over.
    left: no
    full_width: yes
---
With the presidential primaries just around the corner, I wanted to see whether news outlets show a consistent bias toward one candidate or another. Could I quantitatively show that Fox News runs more favorable headlines about Trump while CNN does the opposite? The ideal news source is unbiased and does not focus all of its attention on one candidate; however, we live in a time when "fake news" has entered everyone's daily vernacular. Unfortunately, scorn goes both ways between liberals and conservatives, with each side claiming to know the truth and lambasting the other for being deceived and following villainous leaders.

I gathered thousands of headlines from CNN, Fox News, and The New York Times that contain the keywords Trump, Biden, Sanders, Warren, Harris, or Buttigieg. Headlines that name more than one candidate were a problem, because scoring them would require building multiple models, each tailored to a single candidate. Here are a few headlines that mention different candidates, which makes it difficult to measure a single sentiment for each one:

* *Here's why Trump keeps pumping up Bernie Sanders*
* *Buttigieg on Trump: 'Senate is the jury today, but we are the jury tomorrow'*
* *Elizabeth Warren sought to 'raise a concern' with Bernie Sanders in post-debate exchange, Sanders campaign says*

For this reason, I decided to drop all headlines with the names of multiple candidates for this analysis. Thankfully, I still ended up with over 5,000 articles. Take a look at the distribution of articles for each candidate and for each news source.

<|linear|>

<|log|>

Trump is by far the most talked-about candidate, and for good reason: he is the sitting president and the only Republican candidate in this analysis. After Trump comes Biden, then Sanders and Warren at about the same level, and finally Harris and Buttigieg.
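The filtering step described above, keeping only headlines that name exactly one candidate, could look something like the sketch below. This is not the code used for the analysis; the function names are hypothetical, and whole-word matching on surnames is an assumption about how the keywords were applied.

```python
import re

# Candidate keywords listed in the post.
CANDIDATES = ["Trump", "Biden", "Sanders", "Warren", "Harris", "Buttigieg"]


def candidates_in(headline):
    """Return the set of candidate surnames that appear as whole words.

    Hypothetical helper: whole-word matching is an assumption.
    """
    found = set()
    for name in CANDIDATES:
        # \b keeps "Trump" from matching inside a longer word,
        # while still matching possessives such as "Trump's".
        if re.search(r"\b" + re.escape(name) + r"\b", headline):
            found.add(name)
    return found


def single_candidate_headlines(headlines):
    """Keep (candidate, headline) pairs for headlines naming exactly one candidate."""
    kept = []
    for headline in headlines:
        found = candidates_in(headline)
        if len(found) == 1:
            kept.append((next(iter(found)), headline))
    return kept
```

With this sketch, a headline such as *Here's why Trump keeps pumping up Bernie Sanders* matches two names and is dropped, and a headline that matches none of the keywords is dropped as well.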
I was surprised at the sheer volume of CNN articles and at how few came from The New York Times.

# Sentiment Analysis Models

I used three different sentiment analysis models, two of which were pre-made packages. VADER and TextBlob are Python packages that offer sentiment analysis trained on different subjects. VADER is a lexicon-based approach built around social media text, such as tweets that contain emoticons and slang. TextBlob is a Naive Bayes approach trained on IMDB review data.

My model is an LSTM whose base language model is built on the [AWD-LSTM](https://arxiv.org/abs/1708.02182). I first trained its language model on news articles. Following that, I trained it on hand-labeled (by me 😤) article headlines.

Here are the average scores for each candidate.

<|bar|>

And here are the average scores over time.

<|avg_over_time|>

I should also note that these scores have been smoothed with a sliding average over a 4-week window. Without smoothing, the plots look much more chaotic; smoothing hopefully reveals long-term trends. Even with smoothing, it is hard to tell whether there are any consistent trends.

The next thing I tried was superimposing debate dates onto the plots for the Democratic candidates, to see whether a candidate's debate performance shows up in the scores afterward. In some cases there does seem to be a rise or drop in scores after a debate, but whether the two are correlated remains unknown.

<|avg_w_debates|>
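For readers curious about the 4-week sliding average mentioned above, here is a minimal sketch of one way to compute it. It is an illustration, not the code behind the plots. It assumes scores are first averaged per calendar week and then smoothed with a trailing window of four weekly values; the post does not say whether the window was trailing or centered. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_means(scored):
    """Average (date, score) pairs per calendar week, keyed by that week's Monday.

    Weeks with no headlines are simply absent from the result (an assumption).
    """
    buckets = defaultdict(list)
    for day, score in scored:
        monday = day - timedelta(days=day.weekday())
        buckets[monday].append(score)
    return {week: sum(vals) / len(vals) for week, vals in sorted(buckets.items())}


def sliding_average(values, window=4):
    """Trailing mean over up to `window` values.

    The first few points average over however many values are available,
    so the output has the same length as the input.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Passing `list(weekly_means(scored).values())` to `sliding_average` gives one smoothed value per observed week. A wider window flattens the curve further, at the cost of hiding short-lived swings such as a reaction in the week after a debate.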