Commit 2efb912 by KingKazma · 1 parent: c4e15ee

Add BERTopic model

Files changed (4)
  1. README.md +159 -0
  2. config.json +15 -0
  3. topic_embeddings.safetensors +3 -0
  4. topics.json +0 -0
README.md ADDED
@@ -0,0 +1,159 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # cnn_dailymail_108_50000_25000_validation
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("KingKazma/cnn_dailymail_108_50000_25000_validation")
+
+ topic_model.get_topic_info()
+ ```
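Beyond `get_topic_info()`, a loaded model can also be queried one topic at a time. The sketch below is not part of the committed README. It assumes that `get_topic` behaves as in recent BERTopic releases, returning a list of `(word, score)` pairs or a falsy value for an unknown topic ID. The helper names `describe_topic` and `load_model` are hypothetical.

```python
from typing import Any, List

REPO_ID = "KingKazma/cnn_dailymail_108_50000_25000_validation"


def describe_topic(topic_model: Any, topic_id: int, n_words: int = 5) -> List[str]:
    """Return the first n_words keywords of one topic.

    Assumption: topic_model.get_topic(topic_id) yields (word, score) pairs,
    or a falsy value when the topic ID does not exist.
    """
    pairs = topic_model.get_topic(topic_id)
    if not pairs:
        return []
    return [word for word, _score in pairs[:n_words]]


def load_model(repo_id: str = REPO_ID) -> Any:
    """Load the model from the Hub (needs `pip install -U bertopic` and network access)."""
    # Imported lazily so that describe_topic stays usable without BERTopic installed.
    from bertopic import BERTopic

    return BERTopic.load(repo_id)
```

With a model obtained from `load_model()`, `describe_topic(model, 0)` would be expected to list the leading keywords of topic 0, although the exact output depends on the installed BERTopic version.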
+
+ ## Topic overview
+
+ * Number of topics: 92
+ * Number of training documents: 13368
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | said - police - one - year - also | 5 | -1_said_police_one_year |
+ | 0 | league - game - player - goal - season | 4918 | 0_league_game_player_goal |
+ | 1 | isis - syria - islamic - group - iraq | 2700 | 1_isis_syria_islamic_group |
+ | 2 | dog - animal - elephant - bear - cat | 415 | 2_dog_animal_elephant_bear |
+ | 3 | labour - mr - party - election - cameron | 386 | 3_labour_mr_party_election |
+ | 4 | flight - plane - aircraft - pilot - crash | 340 | 4_flight_plane_aircraft_pilot |
+ | 5 | hair - fashion - dress - look - model | 248 | 5_hair_fashion_dress_look |
+ | 6 | car - driver - driving - road - police | 227 | 6_car_driver_driving_road |
+ | 7 | food - cent - sugar - health - per | 221 | 7_food_cent_sugar_health |
+ | 8 | police - officer - shooting - shot - said | 215 | 8_police_officer_shooting_shot |
+ | 9 | clinton - email - obama - president - state | 213 | 9_clinton_email_obama_president |
+ | 10 | cricket - england - cup - world - zealand | 191 | 10_cricket_england_cup_world |
+ | 11 | property - house - home - room - price | 184 | 11_property_house_home_room |
+ | 12 | fight - pacquiao - mayweather - manny - floyd | 171 | 12_fight_pacquiao_mayweather_manny |
+ | 13 | hamilton - mercedes - race - prix - rosberg | 135 | 13_hamilton_mercedes_race_prix |
+ | 14 | baby - hospital - birth - mother - child | 127 | 14_baby_hospital_birth_mother |
+ | 15 | murray - wells - tennis - andy - match | 127 | 15_murray_wells_tennis_andy |
+ | 16 | eclipse - earth - solar - sun - planet | 102 | 16_eclipse_earth_solar_sun |
+ | 17 | police - abuse - sex - sexual - child | 98 | 17_police_abuse_sex_sexual |
+ | 18 | apple - watch - device - user - google | 96 | 18_apple_watch_device_user |
+ | 19 | netanyahu - iran - nuclear - israel - israeli | 83 | 19_netanyahu_iran_nuclear_israel |
+ | 20 | putin - russian - nemtsov - moscow - russia | 82 | 20_putin_russian_nemtsov_moscow |
+ | 21 | weight - fat - diet - size - stone | 81 | 21_weight_fat_diet_size |
+ | 22 | race - armstrong - doping - world - tour | 78 | 22_race_armstrong_doping_world |
+ | 23 | court - fraud - money - bank - mr | 76 | 23_court_fraud_money_bank |
+ | 24 | cheltenham - hurdle - horse - race - jockey | 74 | 24_cheltenham_hurdle_horse_race |
+ | 25 | mcilroy - round - masters - woods - golf | 72 | 25_mcilroy_round_masters_woods |
+ | 26 | prince - charles - royal - duchess - camilla | 72 | 26_prince_charles_royal_duchess |
+ | 27 | fraternity - university - sae - chapter - oklahoma | 68 | 27_fraternity_university_sae_chapter |
+ | 28 | chan - sukumaran - bali - indonesian - mack | 65 | 28_chan_sukumaran_bali_indonesian |
+ | 29 | ebola - sierra - virus - leone - disease | 64 | 29_ebola_sierra_virus_leone |
+ | 30 | school - teacher - student - girl - sexual | 58 | 30_school_teacher_student_girl |
+ | 31 | fire - building - explosion - blaze - firefighter | 52 | 31_fire_building_explosion_blaze |
+ | 32 | nfl - borland - football - 49ers - season | 52 | 32_nfl_borland_football_49ers |
+ | 33 | clarkson - bbc - gear - top - jeremy | 50 | 33_clarkson_bbc_gear_top |
+ | 34 | ski - skier - mountain - avalanche - rock | 47 | 34_ski_skier_mountain_avalanche |
+ | 35 | patient - nhs - ae - cancer - hospital | 46 | 35_patient_nhs_ae_cancer |
+ | 36 | india - rape - documentary - indian - singh | 45 | 36_india_rape_documentary_indian |
+ | 37 | mr - death - court - emery - miss | 43 | 37_mr_death_court_emery |
+ | 38 | show - corden - host - stewart - williams | 42 | 38_show_corden_host_stewart |
+ | 39 | car - vehicle - electric - cars - tesla | 40 | 39_car_vehicle_electric_cars |
+ | 40 | school - child - education - porn - sex | 38 | 40_school_child_education_porn |
+ | 41 | boko - haram - nigeria - nigerian - nigerias | 37 | 41_boko_haram_nigeria_nigerian |
+ | 42 | marijuana - drug - cannabis - colorado - lsd | 34 | 42_marijuana_drug_cannabis_colorado |
+ | 43 | law - indiana - gay - marriage - religious | 33 | 43_law_indiana_gay_marriage |
+ | 44 | ferguson - department - police - justice - report | 32 | 44_ferguson_department_police_justice |
+ | 45 | image - photographer - photography - photograph - photo | 31 | 45_image_photographer_photography_photograph |
+ | 46 | snow - inch - winter - ice - storm | 30 | 46_snow_inch_winter_ice |
+ | 47 | basketball - ncaa - coach - tournament - game | 30 | 47_basketball_ncaa_coach_tournament |
+ | 48 | tsarnaev - boston - dzhokhar - tamerlan - tsarnaevs | 30 | 48_tsarnaev_boston_dzhokhar_tamerlan |
+ | 49 | durst - dursts - berman - orleans - robert | 29 | 49_durst_dursts_berman_orleans |
+ | 50 | jesus - ancient - stone - cave - circle | 29 | 50_jesus_ancient_stone_cave |
+ | 51 | zayn - band - direction - singer - dance | 29 | 51_zayn_band_direction_singer |
+ | 52 | film - movie - vivian - hollywood - script | 23 | 52_film_movie_vivian_hollywood |
+ | 53 | korean - korea - kim - north - lippert | 23 | 53_korean_korea_kim_north |
+ | 54 | weather - rain - temperature - snow - today | 23 | 54_weather_rain_temperature_snow |
+ | 55 | robbery - woodger - store - cash - police | 22 | 55_robbery_woodger_store_cash |
+ | 56 | parade - patricks - st - irish - green | 21 | 56_parade_patricks_st_irish |
+ | 57 | secret - clancy - service - agent - white | 20 | 57_secret_clancy_service_agent |
+ | 58 | hernandez - lloyd - jenkins - hernandezs - lloyds | 20 | 58_hernandez_lloyd_jenkins_hernandezs |
+ | 59 | nazi - anne - nazis - war - camp | 20 | 59_nazi_anne_nazis_war |
+ | 60 | snowden - intelligence - gchq - security - agency | 18 | 60_snowden_intelligence_gchq_security |
+ | 61 | huang - chinese - china - mingxi - chen | 17 | 61_huang_chinese_china_mingxi |
+ | 62 | wedding - married - marlee - platt - woodyard | 17 | 62_wedding_married_marlee_platt |
+ | 63 | drug - cocaine - jailed - cannabis - tobacco | 17 | 63_drug_cocaine_jailed_cannabis |
+ | 64 | cnn - transcript - student - news - roll | 17 | 64_cnn_transcript_student_news |
+ | 65 | pope - francis - vatican - naples - pontiff | 17 | 65_pope_francis_vatican_naples |
+ | 66 | richard - iii - leicester - king - iiis | 17 | 66_richard_iii_leicester_king |
+ | 67 | chinese - tourist - temple - thailand - buddhist | 16 | 67_chinese_tourist_temple_thailand |
+ | 68 | china - chinese - internet - chai - stopera | 16 | 68_china_chinese_internet_chai |
+ | 69 | execution - lethal - gissendaner - injection - drug | 16 | 69_execution_lethal_gissendaner_injection |
+ | 70 | woman - marriage - men - attractive - chalmers | 15 | 70_woman_marriage_men_attractive |
+ | 71 | vanuatu - cyclone - vila - port - pam | 15 | 71_vanuatu_cyclone_vila_port |
+ | 72 | poldark - turner - demelza - aidan - drama | 15 | 72_poldark_turner_demelza_aidan |
+ | 73 | point - rebound - scored - points - harden | 14 | 73_point_rebound_scored_points |
+ | 74 | rail - calais - parking - migrant - dickens | 13 | 74_rail_calais_parking_migrant |
+ | 75 | johnson - student - virginia - charlottesville - uva | 13 | 75_johnson_student_virginia_charlottesville |
+ | 76 | cuba - havana - cuban - rousseff - us | 13 | 76_cuba_havana_cuban_rousseff |
+ | 77 | paris - attack - synagogue - hebdo - charlie | 13 | 77_paris_attack_synagogue_hebdo |
+ | 78 | duckenfield - mr - gate - hillsborough - disaster | 12 | 78_duckenfield_mr_gate_hillsborough |
+ | 79 | gordon - bobbi - kristina - phil - dr | 12 | 79_gordon_bobbi_kristina_phil |
+ | 80 | knox - sollecito - kercher - raffaele - amanda | 12 | 80_knox_sollecito_kercher_raffaele |
+ | 81 | coin - medal - war - auction - cross | 12 | 81_coin_medal_war_auction |
+ | 82 | starbucks - schultz - race - racial - campaign | 12 | 82_starbucks_schultz_race_racial |
+ | 83 | cosby - cosbys - thompson - bill - welles | 11 | 83_cosby_cosbys_thompson_bill |
+ | 84 | jeffs - flds - rivette - compound - speer | 10 | 84_jeffs_flds_rivette_compound |
+ | 85 | selma - alabama - march - bridge - civil | 8 | 85_selma_alabama_march_bridge |
+ | 86 | jobs - naomi - fortune - redballoon - bn | 8 | 86_jobs_naomi_fortune_redballoon |
+ | 87 | brain - object - retina - neuron - word | 8 | 87_brain_object_retina_neuron |
+ | 88 | netflix - tv - content - streaming - screen | 8 | 88_netflix_tv_content_streaming |
+ | 89 | social - user - tweet - twitter - tool | 7 | 89_social_user_tweet_twitter |
+ | 90 | cunard - bird - darshan - ship - liner | 6 | 90_cunard_bird_darshan_ship |
+
+ </details>
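The Label column in the topic table appears to follow a simple pattern: the topic ID followed by the first four keywords, joined with underscores. The sketch below reproduces that pattern and computes the share of the 13368 training documents held by a few rows copied from the table. The function names are hypothetical, and the label rule is inferred from the table rather than taken from BERTopic's source.

```python
from typing import Dict, List, Tuple

N_TRAINING_DOCS = 13368  # "Number of training documents" in the README

# (topic ID, keywords, frequency) for a few rows copied from the table
ROWS: List[Tuple[int, List[str], int]] = [
    (-1, ["said", "police", "one", "year", "also"], 5),
    (0, ["league", "game", "player", "goal", "season"], 4918),
    (1, ["isis", "syria", "islamic", "group", "iraq"], 2700),
]


def make_label(topic_id: int, keywords: List[str], n_words: int = 4) -> str:
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + keywords[:n_words])


def share_percent(frequency: int, total: int = N_TRAINING_DOCS) -> float:
    """Percentage of training documents assigned to a topic, to one decimal."""
    return round(100.0 * frequency / total, 1)


labels: Dict[int, str] = {tid: make_label(tid, words) for tid, words, _ in ROWS}
shares: Dict[int, float] = {tid: share_percent(freq) for tid, _, freq in ROWS}

# Topic IDs run from -1 (the outlier topic) to 90, which gives the 92 topics reported.
n_topics = len(range(-1, 91))
```

On these rows, topics 0 and 1 together account for more than half of the documents, which is worth keeping in mind when reading the long tail of small topics.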
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+
+ ## Framework versions
+
+ * Numpy: 1.22.4
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.3
+ * Pandas: 1.5.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.31.0
+ * Numba: 0.57.1
+ * Plotly: 5.13.1
+ * Python: 3.10.12
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "calculate_probabilities": true,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
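The keys in config.json mirror the hyperparameters listed in the README, with JSON conventions (`null`, `true`, a two-element list) in place of the Python ones. A minimal sketch of reading the file back into Python-style keyword arguments follows. The helper `config_to_kwargs` is hypothetical, and splitting off `embedding_model` is an assumption about how one might use the result, not a description of what `BERTopic.load` does internally.

```python
import json
from typing import Any, Dict, Tuple

# Contents of config.json as added in this commit
CONFIG_JSON = """
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def config_to_kwargs(raw: str) -> Tuple[Dict[str, Any], str]:
    """Parse the config and return (hyperparameters, embedding model name)."""
    cfg = json.loads(raw)  # null -> None, true/false -> True/False
    embedding_model = cfg.pop("embedding_model")
    cfg["n_gram_range"] = tuple(cfg["n_gram_range"])  # JSON list -> (1, 1)
    return cfg, embedding_model


kwargs, embedding_model = config_to_kwargs(CONFIG_JSON)
```

After this call, `kwargs` holds the nine hyperparameters from the README and `embedding_model` names the sentence-transformers model recorded alongside them.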
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a94ced464e6d3ca872297f1100d6f30cea3734b103933e861d74b3df0f23f6ac
+ size 141400
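What the diff shows for topic_embeddings.safetensors is a Git LFS pointer rather than the tensor data itself: three `key value` lines giving the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses such a pointer and checks a byte string against it. The function names are hypothetical. The size arithmetic at the end rests on two assumptions not stated in the commit: that the embeddings are stored as float32, and that each has 384 dimensions (the output size of all-MiniLM-L6-v2).

```python
import hashlib
from typing import Dict

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a94ced464e6d3ca872297f1100d6f30cea3734b103933e861d74b3df0f23f6ac
size 141400
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(data: bytes, pointer: Dict[str, str]) -> bool:
    """True if data has the size and hash recorded in the pointer."""
    algo, _, digest = pointer["oid"].partition(":")
    return (
        len(data) == int(pointer["size"])
        and hashlib.new(algo, data).hexdigest() == digest
    )


pointer = parse_lfs_pointer(POINTER)

# Assumption: 92 topics x 384 dimensions x 4 bytes (float32) of tensor data.
payload_bytes = 92 * 384 * 4
# Whatever remains of the recorded size would then be file header.
header_bytes = int(pointer["size"]) - payload_bytes
```

Under those assumptions the tensor data would take 141312 bytes, leaving 88 bytes of the recorded 141400 for the file header, which is at least consistent with one embedding per topic listed in the README.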
topics.json ADDED
The diff for this file is too large to render.