1. Title: Online News Popularity

2. Source Information
   -- Creators: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com),
                Pedro Vinagre (pedro.vinagre.sousa '@' gmail.com) and
                Pedro Sernadela
   -- Donor: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com)
   -- Date: May, 2015

3. Past Usage:
    1. K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
       Support System for Predicting the Popularity of Online News. Proceedings
       of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
       September, Coimbra, Portugal.

       -- Results: 
          -- Binary classification as popular vs unpopular using a decision
             threshold of 1400 social interactions.
          -- Experiments with different models: Random Forest (best model),
             Adaboost, SVM, KNN and Naïve Bayes.
          -- Recorded 67% accuracy and an AUC of 0.73.
       -- Predicted attribute: online news popularity (boolean)

4. Relevant Information:
   -- The articles were published by Mashable (www.mashable.com), which holds
      the rights to reproduce their content. Hence, this dataset does not
      share the original content, only some statistics associated with it.
      The original content can be publicly accessed and retrieved using the
      provided URLs.
   -- Acquisition date: January 8, 2015
   -- The relative performance values were estimated by the authors using a
      Random Forest classifier with a rolling window as the assessment
      method. See their article for more details on how these values were
      set.
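
   A rolling-window assessment can be sketched as below. This is only an
   illustration of the general technique, not the authors' exact protocol:
   the function name and the window sizes are hypothetical, and the samples
   are assumed to be sorted from oldest to newest beforehand (that is, by
   decreasing timedelta).

```python
def rolling_windows(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) over time-ordered samples.

    Each window trains on `train_size` consecutive samples and tests on the
    `test_size` samples that immediately follow, then slides forward by
    `step` (defaults to `test_size`, so test blocks do not overlap).
    """
    if step is None:
        step = test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Hypothetical sizes, chosen only to keep the printout short.
for train, test in rolling_windows(n_samples=10, train_size=4, test_size=2):
    print(list(train), list(test))
```

   Each test block lies strictly after its training block, so a model is
   never evaluated on articles older than the ones it was fitted on.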

5. Number of Instances: 39797 

6. Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 
                             1 goal field)

7. Attribute Information:
     0. url:                           URL of the article
     1. timedelta:                     Days between the article publication and
                                       the dataset acquisition
     2. n_tokens_title:                Number of words in the title
     3. n_tokens_content:              Number of words in the content
     4. n_unique_tokens:               Rate of unique words in the content
     5. n_non_stop_words:              Rate of non-stop words in the content
     6. n_non_stop_unique_tokens:      Rate of unique non-stop words in the
                                       content
     7. num_hrefs:                     Number of links
     8. num_self_hrefs:                Number of links to other articles
                                       published by Mashable
     9. num_imgs:                      Number of images
    10. num_videos:                    Number of videos
    11. average_token_length:          Average length of the words in the
                                       content
    12. num_keywords:                  Number of keywords in the metadata
    13. data_channel_is_lifestyle:     Is data channel 'Lifestyle'?
    14. data_channel_is_entertainment: Is data channel 'Entertainment'?
    15. data_channel_is_bus:           Is data channel 'Business'?
    16. data_channel_is_socmed:        Is data channel 'Social Media'?
    17. data_channel_is_tech:          Is data channel 'Tech'?
    18. data_channel_is_world:         Is data channel 'World'?
    19. kw_min_min:                    Worst keyword (min. shares)
    20. kw_max_min:                    Worst keyword (max. shares)
    21. kw_avg_min:                    Worst keyword (avg. shares)
    22. kw_min_max:                    Best keyword (min. shares)
    23. kw_max_max:                    Best keyword (max. shares)
    24. kw_avg_max:                    Best keyword (avg. shares)
    25. kw_min_avg:                    Avg. keyword (min. shares)
    26. kw_max_avg:                    Avg. keyword (max. shares)
    27. kw_avg_avg:                    Avg. keyword (avg. shares)
    28. self_reference_min_shares:     Min. shares of referenced articles in
                                       Mashable
    29. self_reference_max_shares:     Max. shares of referenced articles in
                                       Mashable
    30. self_reference_avg_sharess:    Avg. shares of referenced articles in
                                       Mashable
    31. weekday_is_monday:             Was the article published on a Monday?
    32. weekday_is_tuesday:            Was the article published on a Tuesday?
    33. weekday_is_wednesday:          Was the article published on a Wednesday?
    34. weekday_is_thursday:           Was the article published on a Thursday?
    35. weekday_is_friday:             Was the article published on a Friday?
    36. weekday_is_saturday:           Was the article published on a Saturday?
    37. weekday_is_sunday:             Was the article published on a Sunday?
    38. is_weekend:                    Was the article published on the weekend?
    39. LDA_00:                        Closeness to LDA topic 0
    40. LDA_01:                        Closeness to LDA topic 1
    41. LDA_02:                        Closeness to LDA topic 2
    42. LDA_03:                        Closeness to LDA topic 3
    43. LDA_04:                        Closeness to LDA topic 4
    44. global_subjectivity:           Text subjectivity
    45. global_sentiment_polarity:     Text sentiment polarity
    46. global_rate_positive_words:    Rate of positive words in the content
    47. global_rate_negative_words:    Rate of negative words in the content
    48. rate_positive_words:           Rate of positive words among non-neutral
                                       tokens
    49. rate_negative_words:           Rate of negative words among non-neutral
                                       tokens
    50. avg_positive_polarity:         Avg. polarity of positive words
    51. min_positive_polarity:         Min. polarity of positive words
    52. max_positive_polarity:         Max. polarity of positive words
    53. avg_negative_polarity:         Avg. polarity of negative words
    54. min_negative_polarity:         Min. polarity of negative words
    55. max_negative_polarity:         Max. polarity of negative words
    56. title_subjectivity:            Title subjectivity
    57. title_sentiment_polarity:      Title polarity
    58. abs_title_subjectivity:        Absolute subjectivity level
    59. abs_title_sentiment_polarity:  Absolute polarity level
    60. shares:                        Number of shares (target)
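
   The split into 2 non-predictive attributes (url, timedelta), 58
   predictive attributes and 1 goal field (shares) could be applied when
   reading the data roughly as follows. This is a minimal sketch: it
   assumes the data is a comma-separated file with a header row, and it
   strips whitespace around column names as a precaution. The function name
   and the inline sample rows are made up for illustration.

```python
import csv
import io

NON_PREDICTIVE = ("url", "timedelta")  # attributes 0 and 1
TARGET = "shares"                      # attribute 60


def load_rows(handle):
    """Parse CSV text into (features, target) pairs with cleaned names."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    rows = []
    for record in reader:
        # Strip stray whitespace from both column names and values.
        clean = {key.strip(): value.strip() for key, value in record.items()}
        target = float(clean.pop(TARGET))
        for name in NON_PREDICTIVE:
            clean.pop(name, None)
        features = {name: float(value) for name, value in clean.items()}
        rows.append((features, target))
    return rows


# Tiny made-up sample with only a few of the 61 columns.
sample = io.StringIO(
    "url, timedelta, n_tokens_title, num_imgs, shares\n"
    "http://example.com/a/, 731.0, 12.0, 1.0, 593\n"
    "http://example.com/b/, 730.0, 9.0, 3.0, 1500\n"
)
rows = load_rows(sample)
print(rows[0])
```

   With a real file, the same function would take an open file object in
   place of the in-memory sample.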

8. Missing Attribute Values: None

9. Class Distribution: the class value (shares) is continuously valued. We
                       transformed the task into a binary task using a decision
                       threshold of 1400.

   Shares Value Range:   Number of Instances in Range:
   <  1400               18490
   >= 1400               21154
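
   The binarization can be sketched as follows, treating an article with
   exactly 1400 shares as popular, in line with the ">= 1400" row above.
   The function names and the example share counts are hypothetical.

```python
from collections import Counter

THRESHOLD = 1400  # decision threshold stated in this file


def binarize_shares(shares, threshold=THRESHOLD):
    """Return True (popular) when shares >= threshold, else False."""
    return [s >= threshold for s in shares]


def class_distribution(shares, threshold=THRESHOLD):
    """Count instances on each side of the threshold."""
    counts = Counter(binarize_shares(shares, threshold))
    return {
        "< %d" % threshold: counts[False],
        ">= %d" % threshold: counts[True],
    }


# Made-up share counts, for illustration only.
example = [593, 1400, 711, 1500, 1399, 8500]
print(class_distribution(example))
```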


Summary Statistics:
                       Feature       Min          Max         Mean           SD
                     timedelta    8.0000     731.0000     354.5305     214.1611
                n_tokens_title    2.0000      23.0000      10.3987       2.1140
              n_tokens_content    0.0000    8474.0000     546.5147     471.1016
               n_unique_tokens    0.0000     701.0000       0.5482       3.5207
              n_non_stop_words    0.0000    1042.0000       0.9965       5.2312
      n_non_stop_unique_tokens    0.0000     650.0000       0.6892       3.2648
                     num_hrefs    0.0000     304.0000      10.8837      11.3319
                num_self_hrefs    0.0000     116.0000       3.2936       3.8551
                      num_imgs    0.0000     128.0000       4.5441       8.3093
                    num_videos    0.0000      91.0000       1.2499       4.1078
          average_token_length    0.0000       8.0415       4.5482       0.8444
                  num_keywords    1.0000      10.0000       7.2238       1.9091
     data_channel_is_lifestyle    0.0000       1.0000       0.0529       0.2239
 data_channel_is_entertainment    0.0000       1.0000       0.1780       0.3825
           data_channel_is_bus    0.0000       1.0000       0.1579       0.3646
        data_channel_is_socmed    0.0000       1.0000       0.0586       0.2349
          data_channel_is_tech    0.0000       1.0000       0.1853       0.3885
         data_channel_is_world    0.0000       1.0000       0.2126       0.4091
                    kw_min_min   -1.0000     377.0000      26.1068      69.6323
                    kw_max_min    0.0000  298400.0000    1153.9517    3857.9422
                    kw_avg_min   -1.0000   42827.8571     312.3670     620.7761
                    kw_min_max    0.0000  843300.0000   13612.3541   57985.2980
                    kw_max_max    0.0000  843300.0000  752324.0667  214499.4242
                    kw_avg_max    0.0000  843300.0000  259281.9381  135100.5433
                    kw_min_avg   -1.0000    3613.0398    1117.1466    1137.4426
                    kw_max_avg    0.0000  298400.0000    5657.2112    6098.7950
                    kw_avg_avg    0.0000   43567.6599    3135.8586    1318.1338
     self_reference_min_shares    0.0000  843300.0000    3998.7554   19738.4216
     self_reference_max_shares    0.0000  843300.0000   10329.2127   41027.0592
    self_reference_avg_sharess    0.0000  843300.0000    6401.6976   24211.0269
             weekday_is_monday    0.0000       1.0000       0.1680       0.3739
            weekday_is_tuesday    0.0000       1.0000       0.1864       0.3894
          weekday_is_wednesday    0.0000       1.0000       0.1875       0.3903
           weekday_is_thursday    0.0000       1.0000       0.1833       0.3869
             weekday_is_friday    0.0000       1.0000       0.1438       0.3509
           weekday_is_saturday    0.0000       1.0000       0.0619       0.2409
             weekday_is_sunday    0.0000       1.0000       0.0690       0.2535
                    is_weekend    0.0000       1.0000       0.1309       0.3373
                        LDA_00    0.0000       0.9270       0.1846       0.2630
                        LDA_01    0.0000       0.9259       0.1413       0.2197
                        LDA_02    0.0000       0.9200       0.2163       0.2821
                        LDA_03    0.0000       0.9265       0.2238       0.2952
                        LDA_04    0.0000       0.9272       0.2340       0.2892
           global_subjectivity    0.0000       1.0000       0.4434       0.1167
     global_sentiment_polarity   -0.3937       0.7278       0.1193       0.0969
    global_rate_positive_words    0.0000       0.1555       0.0396       0.0174
    global_rate_negative_words    0.0000       0.1849       0.0166       0.0108
           rate_positive_words    0.0000       1.0000       0.6822       0.1902
           rate_negative_words    0.0000       1.0000       0.2879       0.1562
         avg_positive_polarity    0.0000       1.0000       0.3538       0.1045
         min_positive_polarity    0.0000       1.0000       0.0954       0.0713
         max_positive_polarity    0.0000       1.0000       0.7567       0.2478
         avg_negative_polarity   -1.0000       0.0000      -0.2595       0.1277
         min_negative_polarity   -1.0000       0.0000      -0.5219       0.2903
         max_negative_polarity   -1.0000       0.0000      -0.1075       0.0954
            title_subjectivity    0.0000       1.0000       0.2824       0.3242
      title_sentiment_polarity   -1.0000       1.0000       0.0714       0.2654
        abs_title_subjectivity    0.0000       0.5000       0.3418       0.1888
  abs_title_sentiment_polarity    0.0000       1.0000       0.1561       0.2263
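
   Statistics of this kind can be recomputed per column along these lines.
   This is a sketch under assumptions: the file does not say whether SD is
   the sample or the population standard deviation, so the sample version
   is used here, and the helper names and example column are hypothetical.
   For the 0/1 indicator columns the mean is the fraction of ones, p, and
   the population SD is sqrt(p * (1 - p)), which gives a quick sanity check
   against the table.

```python
import math
import statistics


def summarize(values):
    """Return (min, max, mean, sd) for one feature column."""
    return (
        min(values),
        max(values),
        statistics.fmean(values),
        statistics.stdev(values),  # sample SD; an assumption, see above
    )


def indicator_sd(p):
    """Population SD of a 0/1 column whose mean (fraction of ones) is p."""
    return math.sqrt(p * (1.0 - p))


# Made-up column, for illustration only.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lo, hi, mean, sd = summarize(column)
print(f"{lo:10.4f} {hi:12.4f} {mean:12.4f} {sd:12.4f}")

# Sanity check against the is_weekend row (Mean 0.1309, SD 0.3373).
print(round(indicator_sd(0.1309), 4))
```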

   
 Citation Request:
 
 Please include this citation if you plan to use this database: 
 
    K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
    Support System for Predicting the Popularity of Online News. Proceedings
    of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
    September, Coimbra, Portugal.