Add BERTopic model
- README.md +159 -0
- config.json +15 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,159 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# cnn_dailymail_108_50000_25000_validation

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("KingKazma/cnn_dailymail_108_50000_25000_validation")

topic_model.get_topic_info()
```

## Topic overview

* Number of topics: 92
* Number of training documents: 13368

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | said - police - one - year - also | 5 | -1_said_police_one_year |
| 0 | league - game - player - goal - season | 4918 | 0_league_game_player_goal |
| 1 | isis - syria - islamic - group - iraq | 2700 | 1_isis_syria_islamic_group |
| 2 | dog - animal - elephant - bear - cat | 415 | 2_dog_animal_elephant_bear |
| 3 | labour - mr - party - election - cameron | 386 | 3_labour_mr_party_election |
| 4 | flight - plane - aircraft - pilot - crash | 340 | 4_flight_plane_aircraft_pilot |
| 5 | hair - fashion - dress - look - model | 248 | 5_hair_fashion_dress_look |
| 6 | car - driver - driving - road - police | 227 | 6_car_driver_driving_road |
| 7 | food - cent - sugar - health - per | 221 | 7_food_cent_sugar_health |
| 8 | police - officer - shooting - shot - said | 215 | 8_police_officer_shooting_shot |
| 9 | clinton - email - obama - president - state | 213 | 9_clinton_email_obama_president |
| 10 | cricket - england - cup - world - zealand | 191 | 10_cricket_england_cup_world |
| 11 | property - house - home - room - price | 184 | 11_property_house_home_room |
| 12 | fight - pacquiao - mayweather - manny - floyd | 171 | 12_fight_pacquiao_mayweather_manny |
| 13 | hamilton - mercedes - race - prix - rosberg | 135 | 13_hamilton_mercedes_race_prix |
| 14 | baby - hospital - birth - mother - child | 127 | 14_baby_hospital_birth_mother |
| 15 | murray - wells - tennis - andy - match | 127 | 15_murray_wells_tennis_andy |
| 16 | eclipse - earth - solar - sun - planet | 102 | 16_eclipse_earth_solar_sun |
| 17 | police - abuse - sex - sexual - child | 98 | 17_police_abuse_sex_sexual |
| 18 | apple - watch - device - user - google | 96 | 18_apple_watch_device_user |
| 19 | netanyahu - iran - nuclear - israel - israeli | 83 | 19_netanyahu_iran_nuclear_israel |
| 20 | putin - russian - nemtsov - moscow - russia | 82 | 20_putin_russian_nemtsov_moscow |
| 21 | weight - fat - diet - size - stone | 81 | 21_weight_fat_diet_size |
| 22 | race - armstrong - doping - world - tour | 78 | 22_race_armstrong_doping_world |
| 23 | court - fraud - money - bank - mr | 76 | 23_court_fraud_money_bank |
| 24 | cheltenham - hurdle - horse - race - jockey | 74 | 24_cheltenham_hurdle_horse_race |
| 25 | mcilroy - round - masters - woods - golf | 72 | 25_mcilroy_round_masters_woods |
| 26 | prince - charles - royal - duchess - camilla | 72 | 26_prince_charles_royal_duchess |
| 27 | fraternity - university - sae - chapter - oklahoma | 68 | 27_fraternity_university_sae_chapter |
| 28 | chan - sukumaran - bali - indonesian - mack | 65 | 28_chan_sukumaran_bali_indonesian |
| 29 | ebola - sierra - virus - leone - disease | 64 | 29_ebola_sierra_virus_leone |
| 30 | school - teacher - student - girl - sexual | 58 | 30_school_teacher_student_girl |
| 31 | fire - building - explosion - blaze - firefighter | 52 | 31_fire_building_explosion_blaze |
| 32 | nfl - borland - football - 49ers - season | 52 | 32_nfl_borland_football_49ers |
| 33 | clarkson - bbc - gear - top - jeremy | 50 | 33_clarkson_bbc_gear_top |
| 34 | ski - skier - mountain - avalanche - rock | 47 | 34_ski_skier_mountain_avalanche |
| 35 | patient - nhs - ae - cancer - hospital | 46 | 35_patient_nhs_ae_cancer |
| 36 | india - rape - documentary - indian - singh | 45 | 36_india_rape_documentary_indian |
| 37 | mr - death - court - emery - miss | 43 | 37_mr_death_court_emery |
| 38 | show - corden - host - stewart - williams | 42 | 38_show_corden_host_stewart |
| 39 | car - vehicle - electric - cars - tesla | 40 | 39_car_vehicle_electric_cars |
| 40 | school - child - education - porn - sex | 38 | 40_school_child_education_porn |
| 41 | boko - haram - nigeria - nigerian - nigerias | 37 | 41_boko_haram_nigeria_nigerian |
| 42 | marijuana - drug - cannabis - colorado - lsd | 34 | 42_marijuana_drug_cannabis_colorado |
| 43 | law - indiana - gay - marriage - religious | 33 | 43_law_indiana_gay_marriage |
| 44 | ferguson - department - police - justice - report | 32 | 44_ferguson_department_police_justice |
| 45 | image - photographer - photography - photograph - photo | 31 | 45_image_photographer_photography_photograph |
| 46 | snow - inch - winter - ice - storm | 30 | 46_snow_inch_winter_ice |
| 47 | basketball - ncaa - coach - tournament - game | 30 | 47_basketball_ncaa_coach_tournament |
| 48 | tsarnaev - boston - dzhokhar - tamerlan - tsarnaevs | 30 | 48_tsarnaev_boston_dzhokhar_tamerlan |
| 49 | durst - dursts - berman - orleans - robert | 29 | 49_durst_dursts_berman_orleans |
| 50 | jesus - ancient - stone - cave - circle | 29 | 50_jesus_ancient_stone_cave |
| 51 | zayn - band - direction - singer - dance | 29 | 51_zayn_band_direction_singer |
| 52 | film - movie - vivian - hollywood - script | 23 | 52_film_movie_vivian_hollywood |
| 53 | korean - korea - kim - north - lippert | 23 | 53_korean_korea_kim_north |
| 54 | weather - rain - temperature - snow - today | 23 | 54_weather_rain_temperature_snow |
| 55 | robbery - woodger - store - cash - police | 22 | 55_robbery_woodger_store_cash |
| 56 | parade - patricks - st - irish - green | 21 | 56_parade_patricks_st_irish |
| 57 | secret - clancy - service - agent - white | 20 | 57_secret_clancy_service_agent |
| 58 | hernandez - lloyd - jenkins - hernandezs - lloyds | 20 | 58_hernandez_lloyd_jenkins_hernandezs |
| 59 | nazi - anne - nazis - war - camp | 20 | 59_nazi_anne_nazis_war |
| 60 | snowden - intelligence - gchq - security - agency | 18 | 60_snowden_intelligence_gchq_security |
| 61 | huang - chinese - china - mingxi - chen | 17 | 61_huang_chinese_china_mingxi |
| 62 | wedding - married - marlee - platt - woodyard | 17 | 62_wedding_married_marlee_platt |
| 63 | drug - cocaine - jailed - cannabis - tobacco | 17 | 63_drug_cocaine_jailed_cannabis |
| 64 | cnn - transcript - student - news - roll | 17 | 64_cnn_transcript_student_news |
| 65 | pope - francis - vatican - naples - pontiff | 17 | 65_pope_francis_vatican_naples |
| 66 | richard - iii - leicester - king - iiis | 17 | 66_richard_iii_leicester_king |
| 67 | chinese - tourist - temple - thailand - buddhist | 16 | 67_chinese_tourist_temple_thailand |
| 68 | china - chinese - internet - chai - stopera | 16 | 68_china_chinese_internet_chai |
| 69 | execution - lethal - gissendaner - injection - drug | 16 | 69_execution_lethal_gissendaner_injection |
| 70 | woman - marriage - men - attractive - chalmers | 15 | 70_woman_marriage_men_attractive |
| 71 | vanuatu - cyclone - vila - port - pam | 15 | 71_vanuatu_cyclone_vila_port |
| 72 | poldark - turner - demelza - aidan - drama | 15 | 72_poldark_turner_demelza_aidan |
| 73 | point - rebound - scored - points - harden | 14 | 73_point_rebound_scored_points |
| 74 | rail - calais - parking - migrant - dickens | 13 | 74_rail_calais_parking_migrant |
| 75 | johnson - student - virginia - charlottesville - uva | 13 | 75_johnson_student_virginia_charlottesville |
| 76 | cuba - havana - cuban - rousseff - us | 13 | 76_cuba_havana_cuban_rousseff |
| 77 | paris - attack - synagogue - hebdo - charlie | 13 | 77_paris_attack_synagogue_hebdo |
| 78 | duckenfield - mr - gate - hillsborough - disaster | 12 | 78_duckenfield_mr_gate_hillsborough |
| 79 | gordon - bobbi - kristina - phil - dr | 12 | 79_gordon_bobbi_kristina_phil |
| 80 | knox - sollecito - kercher - raffaele - amanda | 12 | 80_knox_sollecito_kercher_raffaele |
| 81 | coin - medal - war - auction - cross | 12 | 81_coin_medal_war_auction |
| 82 | starbucks - schultz - race - racial - campaign | 12 | 82_starbucks_schultz_race_racial |
| 83 | cosby - cosbys - thompson - bill - welles | 11 | 83_cosby_cosbys_thompson_bill |
| 84 | jeffs - flds - rivette - compound - speer | 10 | 84_jeffs_flds_rivette_compound |
| 85 | selma - alabama - march - bridge - civil | 8 | 85_selma_alabama_march_bridge |
| 86 | jobs - naomi - fortune - redballoon - bn | 8 | 86_jobs_naomi_fortune_redballoon |
| 87 | brain - object - retina - neuron - word | 8 | 87_brain_object_retina_neuron |
| 88 | netflix - tv - content - streaming - screen | 8 | 88_netflix_tv_content_streaming |
| 89 | social - user - tweet - twitter - tool | 7 | 89_social_user_tweet_twitter |
| 90 | cunard - bird - darshan - ship - liner | 6 | 90_cunard_bird_darshan_ship |

</details>
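
The Label column in the topic table appears to follow a simple pattern: the topic ID, then the first four keywords, joined by underscores. Here is a minimal sketch of that pattern, assuming it holds for every row. The helper names `make_label` and `parse_row` are hypothetical and are not part of BERTopic.

```python
def make_label(topic_id: int, keywords: list[str], n_words: int = 4) -> str:
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


def parse_row(row: str) -> tuple[int, list[str], int, str]:
    """Split one Markdown table row into (id, keywords, frequency, label)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    topic_id, keywords, frequency, label = cells
    return int(topic_id), keywords.split(" - "), int(frequency), label


# One row copied from the table above.
row = "| 0 | league - game - player - goal - season | 4918 | 0_league_game_player_goal |"
topic_id, keywords, frequency, label = parse_row(row)
assert make_label(topic_id, keywords) == label
```

Topic `-1` is the label BERTopic uses for outlier documents, so its keywords are usually generic rather than thematic.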

## Training hyperparameters

* calculate_probabilities: True
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False

## Framework versions

* Numpy: 1.22.4
* HDBSCAN: 0.8.33
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.31.0
* Numba: 0.57.1
* Plotly: 5.13.1
* Python: 3.10.12
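
To compare a local environment against the framework versions listed above, a check along these lines may help. It is only a sketch: the PyPI distribution names (for example `umap-learn` for UMAP) are assumptions, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions copied from the list above. The distribution names are
# assumptions (UMAP, for instance, is published as "umap-learn").
PINNED = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.31.0",
    "numba": "0.57.1",
    "plotly": "5.13.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    return {
        name: (expected, lookup(name))
        for name, expected in pinned.items()
        if lookup(name) != expected
    }


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one recorded here.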
config.json
ADDED
@@ -0,0 +1,15 @@
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
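
The keys in this file mirror the hyperparameters listed in the README, with `n_gram_range` stored as a JSON list because JSON has no tuple type. The sketch below reads the config and restores the tuple; the function name `config_to_kwargs` is hypothetical.

```python
import json

# The contents of config.json as shown above.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def config_to_kwargs(config_text):
    """Parse the saved config and convert n_gram_range back to a tuple."""
    config = json.loads(config_text)
    # JSON has no tuple type, so the n-gram range is stored as a list.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


kwargs = config_to_kwargs(CONFIG_TEXT)
print(kwargs["n_gram_range"])  # (1, 1)
```

Each key also corresponds to a parameter name of the `BERTopic` constructor, so passing these as keyword arguments should in principle give an unfitted model with the same settings. That step is an assumption about usage and is left out here because it needs the library installed.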
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a94ced464e6d3ca872297f1100d6f30cea3734b103933e861d74b3df0f23f6ac
size 141400
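
These three lines are a Git LFS pointer, not the tensor data itself: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. As a rough sketch, a downloaded copy could be checked against the pointer as follows. The helper names are hypothetical.

```python
import hashlib

# The pointer file contents as shown above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a94ced464e6d3ca872297f1100d6f30cea3734b103933e861d74b3df0f23f6ac
size 141400
"""


def parse_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {"algorithm": algorithm, "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """Check that raw bytes have the size and hash the pointer records."""
    digest = hashlib.new(pointer["algorithm"], data).hexdigest()
    return len(data) == pointer["size"] and digest == pointer["digest"]


pointer = parse_pointer(POINTER)
print(pointer["size"])  # 141400
```

To verify a local copy, read the file's bytes and pass them to `matches_pointer` together with `pointer`. Where that file lives on disk depends on how it was downloaded.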
topics.json
ADDED
The diff for this file is too large to render.