Commit b056319 by ctam8736
1 parent: 14e6218

Add BERTopic model

Files changed (4)
  1. README.md +204 -0
  2. config.json +17 -0
  3. topic_embeddings.safetensors +3 -0
  4. topics.json +0 -0
README.md ADDED
@@ -0,0 +1,204 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # bertopic-20-newsgroups
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("ctam8736/bertopic-20-newsgroups")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 135
+ * Number of training documents: 11314
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | article - information - subject - re - what | 10 | -1_article_information_subject_re |
+ | 0 | scsi - scsi2 - scsi1 - drives - bios | 3737 | 0_scsi_scsi2_scsi1_drives |
+ | 1 | nhl - puck - leafs - flyers - pitching | 976 | 1_nhl_puck_leafs_flyers |
+ | 2 | firearm - firearms - handgun - guns - gun | 918 | 2_firearm_firearms_handgun_guns |
+ | 3 | ford - honda - nissan - bmw - dealer | 409 | 3_ford_honda_nissan_bmw |
+ | 4 | encryption - encrypted - crypto - nsa - chip | 387 | 4_encryption_encrypted_crypto_nsa |
+ | 5 | atheism - atheist - atheists - christianity - belief | 377 | 5_atheism_atheist_atheists_christianity |
+ | 6 | hezbollah - gaza - lebanon - palestinians - lebanese | 342 | 6_hezbollah_gaza_lebanon_palestinians |
+ | 7 | window - x11r5 - openwindows - x11 - x11r4 | 249 | 7_window_x11r5_openwindows_x11 |
+ | 8 | modems - modem - mouse - ports - port | 243 | 8_modems_modem_mouse_ports |
+ | 9 | anonymity - anonymous - mailing - usenet - newsgroups | 151 | 9_anonymity_anonymous_mailing_usenet |
+ | 10 | armenians - armenia - armenian - azerbaijani - azerbaijan | 147 | 10_armenians_armenia_armenian_azerbaijani |
+ | 11 | clinton - stephanopoulos - secretary - president - congress | 135 | 11_clinton_stephanopoulos_secretary_president |
+ | 12 | os - windows - win32 - microsoft - win31 | 133 | 12_os_windows_win32_microsoft |
+ | 13 | diseases - disease - candida - infection - infections | 113 | 13_diseases_disease_candida_infection |
+ | 14 | superstition - msg - sensitivity - glutamate - causes | 100 | 14_superstition_msg_sensitivity_glutamate |
+ | 15 | laserjet - inkjet - printers - bubblejet - bubblejets | 86 | 15_laserjet_inkjet_printers_bubblejet |
+ | 16 | billboard - billboards - nasa - space - advertising | 75 | 16_billboard_billboards_nasa_space |
+ | 17 | radar - detectors - detector - detecting - radarjust | 68 | 17_radar_detectors_detector_detecting |
+ | 18 | speeding - speeds - mph - speed - driving | 64 | 18_speeding_speeds_mph_speed |
+ | 19 | ssto - moonbase - moon - lunar - billion | 63 | 19_ssto_moonbase_moon_lunar |
+ | 20 | station - nasa - redesign - space - shuttle | 61 | 20_station_nasa_redesign_space |
+ | 21 | eternity - afterlife - heaven - hell - judgement | 49 | 21_eternity_afterlife_heaven_hell |
+ | 22 | testament - manuscripts - scripture - bible - hebrew | 47 | 22_testament_manuscripts_scripture_bible |
+ | 23 | homosexuality - heterosexual - homosexual - homosexuals - gays | 45 | 23_homosexuality_heterosexual_homosexual_homosexuals |
+ | 24 | libertarians - libertarian - libertarianism - regulation - governments | 44 | 24_libertarians_libertarian_libertarianism_regulation |
+ | 25 | islamic - muslim - islam - muslims - koran | 44 | 25_islamic_muslim_islam_muslims |
+ | 26 | tax - taxes - vat - deficits - income | 44 | 26_tax_taxes_vat_deficits |
+ | 27 | oil - drain - engine - fuel - dumping | 44 | 27_oil_drain_engine_fuel |
+ | 28 | helmet - helmets - head - protection - gloves | 43 | 28_helmet_helmets_head_protection |
+ | 29 | fonts - font - ttfonts - truetype - printing | 42 | 29_fonts_font_ttfonts_truetype |
+ | 30 | morality - moral - morals - instinctive - immoral | 39 | 30_morality_moral_morals_instinctive |
+ | 31 | colormaps - colourmap - colormap - xalloccolor - cwcolormap | 39 | 31_colormaps_colourmap_colormap_xalloccolor |
+ | 32 | homosexuals - molesters - homosexual - homosexuality - pedophilia | 38 | 32_homosexuals_molesters_homosexual_homosexuality |
+ | 33 | migraine - migraines - headache - headaches - analgesics | 37 | 33_migraine_migraines_headache_headaches |
+ | 34 | resurrection - gospels - tomb - testament - jesuss | 37 | 34_resurrection_gospels_tomb_testament |
+ | 35 | graphics - copyright - images - siggraph - image | 37 | 35_graphics_copyright_images_siggraph |
+ | 36 | mormon - mormons - lds - brigham - utah | 35 | 36_mormon_mormons_lds_brigham |
+ | 37 | scientific - scipsychology - scientist - science - methodology | 34 | 37_scientific_scipsychology_scientist_science |
+ | 38 | tapes - tape - backup - copy - floppy | 34 | 38_tapes_tape_backup_copy |
+ | 39 | drugs - marijuana - drug - legalizing - legalization | 34 | 39_drugs_marijuana_drug_legalizing |
+ | 40 | punishment - punish - murder - penalty - murderer | 34 | 40_punishment_punish_murder_penalty |
+ | 41 | sphere - globe - radius - pointstruct - circle | 34 | 41_sphere_globe_radius_pointstruct |
+ | 42 | surgery - patients - hernia - massager - pain | 33 | 42_surgery_patients_hernia_massager |
+ | 43 | genocide - bosnia - atheism - serbs - christians | 32 | 43_genocide_bosnia_atheism_serbs |
+ | 44 | insurance - liability - insureyear - deductible - accident | 32 | 44_insurance_liability_insureyear_deductible |
+ | 45 | polygon - polygons - triangulation - hexagons - polyn | 30 | 45_polygon_polygons_triangulation_hexagons |
+ | 46 | spacecraft - galileo - galileos - mission - magellan | 29 | 46_spacecraft_galileo_galileos_mission |
+ | 47 | countersteering - countersteeringfaq - countersteer - riding - bikes | 29 | 47_countersteering_countersteeringfaq_countersteer_riding |
+ | 48 | antenna - antennas - transmitters - transmitting - radios | 28 | 48_antenna_antennas_transmitters_transmitting |
+ | 49 | canine - dogs - dog - spaniel - springer | 28 | 49_canine_dogs_dog_spaniel |
+ | 50 | batteries - battery - electrolyte - galvanized - zinc | 28 | 50_batteries_battery_electrolyte_galvanized |
+ | 51 | oscilloscope - scopes - scope - oscilliscopes - digital | 27 | 51_oscilloscope_scopes_scope_oscilliscopes |
+ | 52 | xgrabkey - definekeys - accelerators - accelerator - shiftkeyq | 27 | 52_xgrabkey_definekeys_accelerators_accelerator |
+ | 53 | protoncentaur - centaur - proton - accelerator - nuclear | 27 | 53_protoncentaur_centaur_proton_accelerator |
+ | 54 | telephone - dial - phone - call - lines | 26 | 54_telephone_dial_phone_call |
+ | 55 | marriages - wedding - vows - weddings - marriage | 25 | 55_marriages_wedding_vows_weddings |
+ | 56 | ibm - levels - level - nasa - software | 25 | 56_ibm_levels_level_nasa |
+ | 57 | nasa - aerospace - astronomy - spacecraft - astronomical | 24 | 57_nasa_aerospace_astronomy_spacecraft |
+ | 58 | motif - neosoft - unix - platforms - software | 24 | 58_motif_neosoft_unix_platforms |
+ | 59 | nuclear - cooling - reactor - tower - towers | 23 | 59_nuclear_cooling_reactor_tower |
+ | 60 | injuries - struck - snot - rocks - warningplease | 23 | 60_injuries_struck_snot_rocks |
+ | 61 | transmissions - shifter - automatics - autos - auto | 22 | 61_transmissions_shifter_automatics_autos |
+ | 62 | lzr1260 - printing - mwt9caxaxaxaxaxaxaxaxaxaxaxaxax - m9l0qaxaxaxaxaxaxaxaxaxaxaxaxaxax - mi68qaxaxaxaxaxaxaxaxaxaxaxaxaxax | 22 | 62_lzr1260_printing_mwt9caxaxaxaxaxaxaxaxaxaxaxaxax_m9l0qaxaxaxaxaxaxaxaxaxaxaxaxaxax |
+ | 63 | cview - files - directory - file - tmp | 21 | 63_cview_files_directory_file |
+ | 64 | immaculate - mary - marys - conception - catholics | 21 | 64_immaculate_mary_marys_conception |
+ | 65 | cryptology - cryptanalyst - crypt - cryptanalysis - ciphers | 20 | 65_cryptology_cryptanalyst_crypt_cryptanalysis |
+ | 66 | hotelco - hotels - resorts - hotel - tickets | 20 | 66_hotelco_hotels_resorts_hotel |
+ | 67 | 3dos - 3do - 3ds - 3d - 3dstudio | 20 | 67_3dos_3do_3ds_3d |
+ | 68 | comet - comets - jupiter - asteroids - jovian | 20 | 68_comet_comets_jupiter_asteroids |
+ | 69 | polishing - scratches - paint - rubbing - glaze | 20 | 69_polishing_scratches_paint_rubbing |
+ | 70 | newsgroup - groups - groupsplit - group - split | 20 | 70_newsgroup_groups_groupsplit_group |
+ | 71 | koresh - koreshs - david - sermon - biblical | 20 | 71_koresh_koreshs_david_sermon |
+ | 72 | parking - parked - liability - unsafe - stickers | 20 | 72_parking_parked_liability_unsafe |
+ | 73 | trumpet - tcp - windows - winqvtnet - winsock | 19 | 73_trumpet_tcp_windows_winqvtnet |
+ | 74 | freon - heater - coolant - r12 - vents | 19 | 74_freon_heater_coolant_r12 |
+ | 75 | sabbath - commandments - sunday - worship - church | 19 | 75_sabbath_commandments_sunday_worship |
+ | 76 | geekdom - computer - fourdcom - csws18icsunysbedu - psychnet | 19 | 76_geekdom_computer_fourdcom_csws18icsunysbedu |
+ | 77 | bosnia - serbs - sanctions - somalia - war | 18 | 77_bosnia_serbs_sanctions_somalia |
+ | 78 | soundblaster - midi - midimapper - soundexe - wavfiles | 18 | 78_soundblaster_midi_midimapper_soundexe |
+ | 79 | condo - remodeled - townhome - bedroom - rent | 18 | 79_condo_remodeled_townhome_bedroom |
+ | 80 | odometers - odometer - sensor - mileage - sensors | 18 | 80_odometers_odometer_sensor_mileage |
+ | 81 | joystick - joysticks - joyport - joyread - hardware | 17 | 81_joystick_joysticks_joyport_joyread |
+ | 82 | abortion - abortions - roe - proabortion - fetus | 17 | 82_abortion_abortions_roe_proabortion |
+ | 83 | seizures - seizure - allergies - corn - cereal | 17 | 83_seizures_seizure_allergies_corn |
+ | 84 | sobriety - sober - drinking - drink - drinks | 17 | 84_sobriety_sober_drinking_drink |
+ | 85 | nubus - lciiipowerpc - pds - powerpcs - powerpc | 17 | 85_nubus_lciiipowerpc_pds_powerpcs |
+ | 86 | mining - miners - minerals - miner - mineral | 17 | 86_mining_miners_minerals_miner |
+ | 87 | outlets - outlet - electrical - wiring - grounded | 16 | 87_outlets_outlet_electrical_wiring |
+ | 88 | rosicrucianum - rosicrucian - orders - order - organization | 16 | 88_rosicrucianum_rosicrucian_orders_order |
+ | 89 | tempest - shielding - surveillance - encryption - electromagnetic | 16 | 89_tempest_shielding_surveillance_encryption |
+ | 90 | monitor - monitors - screen - scrolling - display | 16 | 90_monitor_monitors_screen_scrolling |
+ | 91 | krillean - photographs - photography - kirlian - pictures | 16 | 91_krillean_photographs_photography_kirlian |
+ | 92 | scanner - scanners - scanning - scans - scanman | 16 | 92_scanner_scanners_scanning_scans |
+ | 93 | sexism - sexist - extramarital - islamic - marriage | 16 | 93_sexism_sexist_extramarital_islamic |
+ | 94 | noisy - noise - noises - rattled - quiets | 16 | 94_noisy_noise_noises_rattled |
+ | 95 | orion - astronomy - museum - prototype - space | 15 | 95_orion_astronomy_museum_prototype |
+ | 96 | easter - pagan - celebrating - feast - celebration | 15 | 96_easter_pagan_celebrating_feast |
+ | 97 | batf - assault - waco - blasting - blast | 15 | 97_batf_assault_waco_blasting |
+ | 98 | batchfile - ini - updating - file - winfileini | 15 | 98_batchfile_ini_updating_file |
+ | 99 | copyprotect - copying - protected - copy - protection | 15 | 99_copyprotect_copying_protected_copy |
+ | 100 | 42 - tiff - tiff6 - significance - universe | 14 | 100_42_tiff_tiff6_significance |
+ | 101 | stove - stoves - splitfires - splitfire - burns | 14 | 101_stove_stoves_splitfires_splitfire |
+ | 102 | automotive - backing - lights - corvette - reverse | 14 | 102_automotive_backing_lights_corvette |
+ | 103 | dock - docks - minidocks - portable - minidock | 14 | 103_dock_docks_minidocks_portable |
+ | 104 | cdaudio - stereo - audio - soundbase - speakers | 14 | 104_cdaudio_stereo_audio_soundbase |
+ | 105 | uv - flashlight - houselights - fluorescent - lamps | 14 | 105_uv_flashlight_houselights_fluorescent |
+ | 106 | papal - papacy - pope - popes - schism | 14 | 106_papal_papacy_pope_popes |
+ | 107 | scsi - quadra - quadras - quadraspecific - firmware | 14 | 107_scsi_quadra_quadras_quadraspecific |
+ | 108 | crohns - colitis - dietary - gastroenterology - diet | 13 | 108_crohns_colitis_dietary_gastroenterology |
+ | 109 | crashes - powerbook - plugged - corrupted - duos | 13 | 109_crashes_powerbook_plugged_corrupted |
+ | 110 | eyedness - handedness - righteye - righthandedness - eyes | 13 | 110_eyedness_handedness_righteye_righthandedness |
+ | 111 | wrench - pliers - tool - tools - srb | 13 | 111_wrench_pliers_tool_tools |
+ | 112 | scripture - scriptures - prophecy - revelation - revelations | 13 | 112_scripture_scriptures_prophecy_revelation |
+ | 113 | nikon - lens - lenses - olympus - 35mm | 13 | 113_nikon_lens_lenses_olympus |
+ | 114 | prosecution - suspects - encrypted - defendant - incriminate | 13 | 114_prosecution_suspects_encrypted_defendant |
+ | 115 | wheel - shaftdrives - wheelies - wheelie - shaftdrive | 12 | 115_wheel_shaftdrives_wheelies_wheelie |
+ | 116 | obesity - rebound - dieting - diet - metabolism | 12 | 116_obesity_rebound_dieting_diet |
+ | 117 | adl - adls - spying - fbi - investigation | 12 | 117_adl_adls_spying_fbi |
+ | 118 | lunar - moon - exploration - attend - conference | 12 | 118_lunar_moon_exploration_attend |
+ | 119 | draftees - draft - selective - military - abolished | 12 | 119_draftees_draft_selective_military |
+ | 120 | sunrise - sunset - daylight - algorithm - astronomical | 12 | 120_sunrise_sunset_daylight_algorithm |
+ | 121 | octopus - octopuses - octopi - squid - octapus | 12 | 121_octopus_octopuses_octopi_squid |
+ | 122 | gassing - explosion - gas - explode - explosive | 11 | 122_gassing_explosion_gas_explode |
+ | 123 | tutorial - handbook - chemistry - paperback - books | 11 | 123_tutorial_handbook_chemistry_paperback |
+ | 124 | amp - decibels - current - ampere - db | 11 | 124_amp_decibels_current_ampere |
+ | 125 | uniforms - jerseys - uniform - mets - reds | 11 | 125_uniforms_jerseys_uniform_mets |
+ | 126 | eugenics - eugenic - geneticallyengineered - genetic - genetically | 11 | 126_eugenics_eugenic_geneticallyengineered_genetic |
+ | 127 | fractals - fractal - fractally - compression - pascalfractals | 11 | 127_fractals_fractal_fractally_compression |
+ | 128 | sunview - xputimage - pixmap - pixmaps - ximage | 11 | 128_sunview_xputimage_pixmap_pixmaps |
+ | 129 | waving - wave - waves - bikers - bikes | 11 | 129_waving_wave_waves_bikers |
+ | 130 | vocoder - compressionalgorithms - compression - modems - cryptophones | 11 | 130_vocoder_compressionalgorithms_compression_modems |
+ | 131 | mouse - jumpiness - mousecom - mouseits - jumps | 11 | 131_mouse_jumpiness_mousecom_mouseits |
+ | 132 | netware - lan - workgroup - workgroups - w4wg | 10 | 132_netware_lan_workgroup_workgroups |
+ | 133 | timers - timer - ultralong - clock - oscillator | 10 | 133_timers_timer_ultralong_clock |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: auto
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.23.5
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.5
+ * Pandas: 2.2.1
+ * Scikit-Learn: 1.3.1
+ * Sentence-transformers: 2.5.1
+ * Transformers: 4.37.0.dev0
+ * Numba: 0.59.1
+ * Plotly: 5.20.0
+ * Python: 3.10.4
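
The Usage section of the README stops at `get_topic_info()`. As a rough sketch of a next step, the helper below pulls the top keywords for one topic out of a loaded model. It assumes only that the model object exposes `get_topic(topic_id)` returning a list of `(word, score)` pairs, or a falsy value for an unknown topic, as BERTopic does. The helper name `top_keywords` and its `n` parameter are illustrative and not part of this commit.

```python
from typing import Any, List


def top_keywords(topic_model: Any, topic_id: int, n: int = 5) -> List[str]:
    """Return up to `n` keywords for `topic_id`, or [] if the topic is unknown.

    `topic_model` is assumed to behave like a loaded BERTopic model, whose
    `get_topic` returns (word, score) pairs ordered from most to least relevant.
    """
    pairs = topic_model.get_topic(topic_id)
    if not pairs:  # BERTopic returns False for a topic that does not exist
        return []
    # Keep only the words; drop the c-TF-IDF scores.
    return [word for word, _score in pairs[:n]]
```

With the model from this commit loaded via `BERTopic.load("ctam8736/bertopic-20-newsgroups")`, calling `top_keywords(topic_model, 4)` would be expected to return the encryption-related terms listed for topic 4 in the table, though the exact order depends on the stored scores.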
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "calculate_probabilities": false,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": "auto",
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
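
The keys in `config.json` correspond to `BERTopic` constructor arguments, with one wrinkle: JSON has no tuple type, so `n_gram_range` is stored as a two-element list, while the README reports it as `(1, 1)`. A minimal sketch of reading the config back into keyword arguments follows, assuming the goal is to build a fresh, unfitted model with the same settings. The function name `config_to_kwargs` is hypothetical, and the config text is inlined here only to keep the example self-contained.

```python
import json
from typing import Any, Dict

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": "auto",
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def config_to_kwargs(raw: str) -> Dict[str, Any]:
    """Parse the JSON config and restore Python-native types."""
    kwargs = json.loads(raw)
    # JSON arrays decode to lists; the n-gram range is conventionally a tuple.
    if isinstance(kwargs.get("n_gram_range"), list):
        kwargs["n_gram_range"] = tuple(kwargs["n_gram_range"])
    return kwargs


kwargs = config_to_kwargs(CONFIG_JSON)
# Assumption: with bertopic installed, BERTopic(**kwargs) would create an
# unfitted model with these settings; it would not contain the trained topics.
```

Note that this only recovers the hyperparameters. The fitted topics themselves live in `topics.json` and `topic_embeddings.safetensors`, which is why the README loads the model with `BERTopic.load` instead.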
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef0a7c8022a1ea4efebc922cfd6379c0c1fe7c46b32032e2179d648bfe0d8619
+ size 207448
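
`topic_embeddings.safetensors` is stored through Git LFS, so the diff shows only a three-line pointer file and not the tensor data. As an illustration, a pointer in this format can be parsed into its fields. `parse_lfs_pointer` is a hypothetical helper, and the sketch assumes each line is a `key value` pair separated by a single space, as in the pointer above.

```python
from typing import Dict, Union

# The pointer file exactly as added in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ef0a7c8022a1ea4efebc922cfd6379c0c1fe7c46b32032e2179d648bfe0d8619
size 207448
"""


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer into version, hash algorithm, digest, and size."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
```

The reported size is consistent with 135 topic vectors of 384 dimensions each (the output size of `all-MiniLM-L6-v2`): assuming float32 storage, that is 135 × 384 × 4 = 207,360 bytes, leaving 88 bytes for the safetensors header.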
topics.json ADDED
The diff for this file is too large to render. See raw diff