---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# xsum_123_3000_1500_train

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic

# Load the model by its Hugging Face Hub repository ID.
topic_model = BERTopic.load("KingKazma/xsum_123_3000_1500_train")

# Return a table with one row per topic.
topic_model.get_topic_info()
```

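A loaded model can also assign topics to unseen text. The following is a minimal sketch, assuming the model object exposes BERTopic's `transform` method (returning topic IDs and probabilities) and `get_topic` method (returning `(word, score)` pairs, or `False` for an unknown topic ID). The helper name `summarize_assignments` is hypothetical and not part of BERTopic.

```python
from typing import Any, List, Sequence, Tuple


def summarize_assignments(
    model: Any, docs: Sequence[str], top_n: int = 5
) -> List[Tuple[str, int, List[str]]]:
    """Assign each document to a topic and report that topic's top keywords.

    `model` is assumed to behave like a loaded BERTopic model.
    """
    topics, _probabilities = model.transform(list(docs))
    results = []
    for doc, topic_id in zip(docs, topics):
        topic_id = int(topic_id)
        # get_topic returns False when the topic ID is unknown to the model.
        words = model.get_topic(topic_id) or []
        keywords = [word for word, _score in words[:top_n]]
        results.append((doc, topic_id, keywords))
    return results
```

With the model loaded as shown above, `summarize_assignments(topic_model, ["some new article text"])` would return each document with its predicted topic ID and keywords. Predicting on new text assumes the embedding model referenced by the saved topic model can be loaded in your environment.
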
## Topic overview

* Number of topics: 47
* Number of training documents: 3000

<details>
  <summary>Click here for an overview of all topics.</summary>

  | Topic ID | Topic Keywords | Topic Frequency | Label |
  |----------|----------------|-----------------|-------|
  | -1 | said - mr - police - people - would | 5 | -1_said_mr_police_people |
  | 0 | win - game - half - foul - league | 1132 | 0_win_game_half_foul |
  | 1 | eu - labour - party - would - uk | 591 | 1_eu_labour_party_would |
  | 2 | athlete - sport - gold - olympic - medal | 149 | 2_athlete_sport_gold_olympic |
  | 3 | nhs - health - care - patient - hospital | 104 | 3_nhs_health_care_patient |
  | 4 | growth - price - market - sale - economy | 84 | 4_growth_price_market_sale |
  | 5 | president - mr - government - maduro - rousseff | 71 | 5_president_mr_government_maduro |
  | 6 | crash - police - hospital - road - driver | 58 | 6_crash_police_hospital_road |
  | 7 | murray - match - set - tennis - seed | 46 | 7_murray_match_set_tennis |
  | 8 | syrian - us - syria - rebel - force | 45 | 8_syrian_us_syria_rebel |
  | 9 | school - education - pupil - schools - child | 41 | 9_school_education_pupil_schools |
  | 10 | animal - zoo - wildlife - bird - specie | 40 | 10_animal_zoo_wildlife_bird |
  | 11 | film - actor - star - series - drama | 38 | 11_film_actor_star_series |
  | 12 | abuse - court - sexual - police - victim | 38 | 12_abuse_court_sexual_police |
  | 13 | trump - mr - clinton - republican - president | 31 | 13_trump_mr_clinton_republican |
  | 14 | fire - blaze - building - service - firefighters | 31 | 14_fire_blaze_building_service |
  | 15 | suu - party - mr - government - election | 29 | 15_suu_party_mr_government |
  | 16 | china - korea - chinese - south - north | 29 | 16_china_korea_chinese_south |
  | 17 | album - band - song - music - best | 25 | 17_album_band_song_music |
  | 18 | ms - heard - court - death - said | 24 | 18_ms_heard_court_death |
  | 19 | wales - welsh - said - train - government | 23 | 19_wales_welsh_said_train |
  | 20 | road - police - death - seen - found | 23 | 20_road_police_death_seen |
  | 21 | passenger - crew - sea - boat - aircraft | 23 | 21_passenger_crew_sea_boat |
  | 22 | russian - ukraine - russia - mr - ukrainian | 22 | 22_russian_ukraine_russia_mr |
  | 23 | fight - joshua - title - khan - boxing | 22 | 23_fight_joshua_title_khan |
  | 24 | samsung - phone - app - android - user | 20 | 24_samsung_phone_app_android |
  | 25 | earthquake - particle - nepal - building - mars | 19 | 25_earthquake_particle_nepal_building |
  | 26 | highways - traffic - dartford - council - road | 18 | 26_highways_traffic_dartford_council |
  | 27 | vettel - hamilton - lap - race - alonso | 18 | 27_vettel_hamilton_lap_race |
  | 28 | park - building - visitor - festival - visitscotland | 16 | 28_park_building_visitor_festival |
  | 29 | site - council - street - project - plan | 15 | 29_site_council_street_project |
  | 30 | abdeslam - paris - attack - belgian - salah | 15 | 30_abdeslam_paris_attack_belgian |
  | 31 | virus - ebola - disease - hiv - sierra | 14 | 31_virus_ebola_disease_hiv |
  | 32 | security - data - attack - cyber - malware | 14 | 32_security_data_attack_cyber |
  | 33 | dog - dogs - stray - pet - owner | 14 | 33_dog_dogs_stray_pet |
  | 34 | birdie - pga - bogey - woods - open | 13 | 34_birdie_pga_bogey_woods |
  | 35 | man - police - wearing - incident - anyone | 13 | 35_man_police_wearing_incident |
  | 36 | energy - pipeline - waste - renewables - electricity | 13 | 36_energy_pipeline_waste_renewables |
  | 37 | silence - bishop - belfast - people - attended | 11 | 37_silence_bishop_belfast_people |
  | 38 | painting - art - work - artist - exhibition | 11 | 38_painting_art_work_artist |
  | 39 | eyre - gaunt - lyttle - peter - court | 10 | 39_eyre_gaunt_lyttle_peter |
  | 40 | crime - police - force - constable - chief | 9 | 40_crime_police_force_constable |
  | 41 | flood - river - rain - louisiana - flooded | 9 | 41_flood_river_rain_louisiana |
  | 42 | charity - abuse - yentob - porn - batmanghelidjh | 7 | 42_charity_abuse_yentob_porn |
  | 43 | india - nidar - gun - yrf - film | 6 | 43_india_nidar_gun_yrf |
  | 44 | driving - stirling - winn - fraser - road | 6 | 44_driving_stirling_winn_fraser |
  | 45 | boko - haram - shekau - militant - monguno | 5 | 45_boko_haram_shekau_militant |

</details>

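
The topic frequencies in the table add up to the 3000 training documents, so each one can be read as a share of the corpus. Topic -1 is the bucket BERTopic uses for outlier documents. The sketch below works through a few rows copied from the table; the names `TOPIC_FREQUENCIES`, `parse_label`, and `topic_share` are illustrative and not part of BERTopic.

```python
# A few (label -> frequency) rows copied from the topic table.
TOPIC_FREQUENCIES = {
    "-1_said_mr_police_people": 5,
    "0_win_game_half_foul": 1132,
    "1_eu_labour_party_would": 591,
    "2_athlete_sport_gold_olympic": 149,
}
N_TRAINING_DOCS = 3000  # from "Number of training documents"


def parse_label(label: str):
    """Split a label such as '0_win_game_half_foul' into (topic_id, keywords)."""
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords


def topic_share(frequency: int, total: int = N_TRAINING_DOCS) -> float:
    """Fraction of training documents assigned to a topic."""
    return frequency / total


for label, freq in TOPIC_FREQUENCIES.items():
    topic_id, keywords = parse_label(label)
    print(f"topic {topic_id:>2}: {topic_share(freq):6.1%}  {' '.join(keywords)}")
```

Read this way, topics 0 and 1 together cover 1723 of the 3000 documents (about 57%), so the remaining topics are comparatively small.
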
## Training hyperparameters

* calculate_probabilities: True
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False

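
The values listed above correspond to keyword arguments of the `BERTopic` constructor. The sketch below shows how a comparably configured, untrained model could be created. It is not a recipe for reproducing this exact model, since the card does not record the embedding, UMAP, or HDBSCAN settings or the training data selection. The import is guarded so the snippet still runs where BERTopic is not installed, and the variable name `new_model` is illustrative.

```python
# Hyperparameters copied from the list in this section.
hyperparameters = {
    "calculate_probabilities": True,
    "language": "english",
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": False,
}

try:
    from bertopic import BERTopic
except ImportError:  # BERTopic is not installed; skip model creation
    BERTopic = None

if BERTopic is not None:
    # Creates an unfitted model; it still has to be trained on documents.
    new_model = BERTopic(**hyperparameters)
```
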
## Framework versions

* Numpy: 1.22.4
* HDBSCAN: 0.8.33
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.31.0
* Numba: 0.57.1
* Plotly: 5.13.1
* Python: 3.10.12

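
To see how a local environment compares with these versions, the installed packages can be checked against the list. This is a minimal sketch using only the Python standard library. The PyPI distribution names used as keys (for example `umap-learn` for UMAP) are an assumed mapping that the card itself does not state, and `CARD_VERSIONS` and `compare_versions` are illustrative names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above, keyed by assumed PyPI distribution name.
CARD_VERSIONS = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.31.0",
    "numba": "0.57.1",
    "plotly": "5.13.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed or None)} for each listed package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions(CARD_VERSIONS).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: card={wanted} installed={installed} ({status})")
```
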