File size: 19,506 Bytes
b24da6e
 
 
 
 
 
 
 
 
 
 
 
 
a73e7b6
 
 
 
72ee424
 
 
 
a73e7b6
 
 
 
 
 
 
 
 
43dd6fb
 
b24da6e
 
 
 
 
 
a979078
b24da6e
 
 
 
 
 
 
 
 
 
 
a73e7b6
 
 
f267f56
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a73e7b6
 
b24da6e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a73e7b6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b24da6e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304

---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# BERTopic_ArXiv

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model. 
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets. 

This pre-trained model demonstrates the use of several representation models that can be used within BERTopic. This model was trained on ~30000 ArXiv abstracts with the following topic representation methods (`bertopic.representation`):

* POS
* KeyBERTInspired
* MaximalMarginalRelevance
* KeyBERT + MaximalMarginalRelevance
* ChatGPT labels 
* ChatGPT summaries

An example of the default c-TF-IDF representations:

!["multiaspect.png"](visualization_multiaspect.png)

An example of labels generated by ChatGPT (`gpt-3.5-turbo`):

!["multiaspect.png"](visualization_multiaspect_openai.png)

To generate these images, you can follow along with this tutorial: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)

## Usage 

To use this model, please install BERTopic:

```
pip install -U bertopic
pip install -U safetensors
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()
```

To view all different topic representations (keywords, labels, summary, etc.) you can run the following:

```python
>>> topic_model.get_topic(0, full=True)
{'Main': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'POS': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181],
  ['user', 0.008753551043296965]],
 'KeyBERTInspired': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'MMR': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'KeyBERT + MMR': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'OpenAI_Label': [['Challenges and Approaches in Developing Task-oriented Dialogue Systems',
   1]],
 'OpenAI_Summary': [['Task-oriented dialogue systems and their components, such as dialogue policy, natural language understanding, dialogue state tracking, response generation, and end-to-end training using neural networks. These components are crucial in assisting users to complete various activities such as booking tickets and restaurant reservations through spoken language understanding dialogue. The challenge lies in tracking dialogue states of multiple domains and obtaining annotations for training. Effective SLU is achieved by utilizing context from the prior dialogue history.',
   1]]}
```

## Topic overview

* Number of topics: 107
* Number of training documents: 33189

<details>
  <summary>Click here for an overview of all topics.</summary>
  
  | Topic ID | Topic Keywords | Topic Frequency | Label | 
|----------|----------------|-----------------|-------| 
| -1 | language - models - model - data - based | 20 | -1_language_models_model_data | 
| 0 | dialogue - dialog - response - responses - intent | 14247 | 0_dialogue_dialog_response_responses | 
| 1 | speech - asr - speech recognition - recognition - end | 1833 | 1_speech_asr_speech recognition_recognition | 
| 2 | tuning - tasks - prompt - models - language | 1369 | 2_tuning_tasks_prompt_models | 
| 3 | summarization - summaries - summary - abstractive - document | 1109 | 3_summarization_summaries_summary_abstractive | 
| 4 | question - answer - qa - answering - question answering | 893 | 4_question_answer_qa_answering | 
| 5 | sentiment - sentiment analysis - aspect - analysis - opinion | 837 | 5_sentiment_sentiment analysis_aspect_analysis | 
| 6 | clinical - medical - biomedical - notes - patient | 691 | 6_clinical_medical_biomedical_notes | 
| 7 | translation - nmt - machine translation - neural machine - neural machine translation | 586 | 7_translation_nmt_machine translation_neural machine | 
| 8 | generation - text generation - text - language generation - nlg | 558 | 8_generation_text generation_text_language generation | 
| 9 | hate - hate speech - offensive - speech - detection | 484 | 9_hate_hate speech_offensive_speech | 
| 10 | news - fake - fake news - stance - fact | 455 | 10_news_fake_fake news_stance | 
| 11 | relation - relation extraction - extraction - relations - entity | 450 | 11_relation_relation extraction_extraction_relations | 
| 12 | ner - named - named entity - entity - named entity recognition | 376 | 12_ner_named_named entity_entity | 
| 13 | parsing - parser - dependency - treebank - parsers | 370 | 13_parsing_parser_dependency_treebank | 
| 14 | event - temporal - events - event extraction - extraction | 314 | 14_event_temporal_events_event extraction | 
| 15 | emotion - emotions - multimodal - emotion recognition - emotional | 300 | 15_emotion_emotions_multimodal_emotion recognition | 
| 16 | word - embeddings - word embeddings - embedding - words | 292 | 16_word_embeddings_word embeddings_embedding | 
| 17 | explanations - explanation - rationales - rationale - interpretability | 212 | 17_explanations_explanation_rationales_rationale | 
| 18 | morphological - arabic - morphology - languages - inflection | 204 | 18_morphological_arabic_morphology_languages | 
| 19 | topic - topics - topic models - lda - topic modeling | 200 | 19_topic_topics_topic models_lda | 
| 20 | bias - gender - biases - gender bias - debiasing | 195 | 20_bias_gender_biases_gender bias | 
| 21 | law - frequency - zipf - words - length | 185 | 21_law_frequency_zipf_words | 
| 22 | legal - court - law - legal domain - case | 182 | 22_legal_court_law_legal domain | 
| 23 | adversarial - attacks - attack - adversarial examples - robustness | 181 | 23_adversarial_attacks_attack_adversarial examples | 
| 24 | commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning | 180 | 24_commonsense_commonsense knowledge_reasoning_knowledge | 
| 25 | quantum - semantics - calculus - compositional - meaning | 171 | 25_quantum_semantics_calculus_compositional | 
| 26 | correction - error - error correction - grammatical - grammatical error | 161 | 26_correction_error_error correction_grammatical | 
| 27 | argument - arguments - argumentation - argumentative - mining | 160 | 27_argument_arguments_argumentation_argumentative | 
| 28 | sarcasm - humor - sarcastic - detection - humorous | 157 | 28_sarcasm_humor_sarcastic_detection | 
| 29 | coreference - resolution - coreference resolution - mentions - mention | 156 | 29_coreference_resolution_coreference resolution_mentions | 
| 30 | sense - word sense - wsd - word - disambiguation | 153 | 30_sense_word sense_wsd_word | 
| 31 | knowledge - knowledge graph - graph - link prediction - entities | 149 | 31_knowledge_knowledge graph_graph_link prediction | 
| 32 | parsing - semantic parsing - amr - semantic - parser | 146 | 32_parsing_semantic parsing_amr_semantic | 
| 33 | cross lingual - lingual - cross - transfer - languages | 146 | 33_cross lingual_lingual_cross_transfer | 
| 34 | mt - translation - qe - quality - machine translation | 139 | 34_mt_translation_qe_quality | 
| 35 | sql - text sql - queries - spider - schema | 138 | 35_sql_text sql_queries_spider | 
| 36 | classification - text classification - label - text - labels | 136 | 36_classification_text classification_label_text | 
| 37 | style - style transfer - transfer - text style - text style transfer | 136 | 37_style_style transfer_transfer_text style | 
| 38 | question - question generation - questions - answer - generation | 129 | 38_question_question generation_questions_answer | 
| 39 | authorship - authorship attribution - attribution - author - authors | 127 | 39_authorship_authorship attribution_attribution_author | 
| 40 | sentence - sentence embeddings - similarity - sts - sentence embedding | 123 | 40_sentence_sentence embeddings_similarity_sts | 
| 41 | code - identification - switching - cs - code switching | 121 | 41_code_identification_switching_cs | 
| 42 | story - stories - story generation - generation - storytelling | 118 | 42_story_stories_story generation_generation | 
| 43 | discourse - discourse relation - discourse relations - rst - discourse parsing | 117 | 43_discourse_discourse relation_discourse relations_rst | 
| 44 | code - programming - source code - code generation - programming languages | 117 | 44_code_programming_source code_code generation | 
| 45 | paraphrase - paraphrases - paraphrase generation - paraphrasing - generation | 114 | 45_paraphrase_paraphrases_paraphrase generation_paraphrasing | 
| 46 | agent - games - environment - instructions - agents | 111 | 46_agent_games_environment_instructions | 
| 47 | covid - covid 19 - 19 - tweets - pandemic | 108 | 47_covid_covid 19_19_tweets | 
| 48 | linking - entity linking - entity - el - entities | 107 | 48_linking_entity linking_entity_el | 
| 49 | poetry - poems - lyrics - poem - music | 103 | 49_poetry_poems_lyrics_poem | 
| 50 | image - captioning - captions - visual - caption | 100 | 50_image_captioning_captions_visual | 
| 51 | nli - entailment - inference - natural language inference - language inference | 96 | 51_nli_entailment_inference_natural language inference | 
| 52 | keyphrase - keyphrases - extraction - document - phrases | 95 | 52_keyphrase_keyphrases_extraction_document | 
| 53 | simplification - text simplification - ts - sentence - simplified | 95 | 53_simplification_text simplification_ts_sentence | 
| 54 | empathetic - emotion - emotional - empathy - emotions | 95 | 54_empathetic_emotion_emotional_empathy | 
| 55 | depression - mental - health - mental health - social media | 93 | 55_depression_mental_health_mental health | 
| 56 | segmentation - word segmentation - chinese - chinese word segmentation - chinese word | 93 | 56_segmentation_word segmentation_chinese_chinese word segmentation | 
| 57 | citation - scientific - papers - citations - scholarly | 85 | 57_citation_scientific_papers_citations | 
| 58 | agreement - syntactic - verb - grammatical - subject verb | 85 | 58_agreement_syntactic_verb_grammatical | 
| 59 | metaphor - literal - figurative - metaphors - idiomatic | 83 | 59_metaphor_literal_figurative_metaphors | 
| 60 | srl - semantic role - role labeling - semantic role labeling - role | 82 | 60_srl_semantic role_role labeling_semantic role labeling | 
| 61 | privacy - private - federated - privacy preserving - federated learning | 82 | 61_privacy_private_federated_privacy preserving | 
| 62 | change - semantic change - time - semantic - lexical semantic | 82 | 62_change_semantic change_time_semantic | 
| 63 | bilingual - lingual - cross lingual - cross - embeddings | 80 | 63_bilingual_lingual_cross lingual_cross | 
| 64 | political - media - news - bias - articles | 77 | 64_political_media_news_bias | 
| 65 | medical - qa - question - questions - clinical | 75 | 65_medical_qa_question_questions | 
| 66 | math - mathematical - math word - word problems - problems | 73 | 66_math_mathematical_math word_word problems | 
| 67 | financial - stock - market - price - news | 69 | 67_financial_stock_market_price | 
| 68 | table - tables - tabular - reasoning - qa | 69 | 68_table_tables_tabular_reasoning | 
| 69 | readability - complexity - assessment - features - reading | 65 | 69_readability_complexity_assessment_features | 
| 70 | layout - document - documents - document understanding - extraction | 64 | 70_layout_document_documents_document understanding | 
| 71 | brain - cognitive - reading - syntactic - language | 62 | 71_brain_cognitive_reading_syntactic | 
| 72 | sign - gloss - language - signed - language translation | 61 | 72_sign_gloss_language_signed | 
| 73 | vqa - visual - visual question - visual question answering - question | 59 | 73_vqa_visual_visual question_visual question answering | 
| 74 | biased - biases - spurious - nlp - debiasing | 57 | 74_biased_biases_spurious_nlp | 
| 75 | visual - dialogue - multimodal - image - dialog | 55 | 75_visual_dialogue_multimodal_image | 
| 76 | translation - machine translation - machine - smt - statistical | 54 | 76_translation_machine translation_machine_smt | 
| 77 | multimodal - visual - image - translation - machine translation | 52 | 77_multimodal_visual_image_translation | 
| 78 | geographic - location - geolocation - geo - locations | 51 | 78_geographic_location_geolocation_geo | 
| 79 | reasoning - prompting - llms - chain thought - chain | 48 | 79_reasoning_prompting_llms_chain thought | 
| 80 | essay - scoring - aes - essay scoring - essays | 45 | 80_essay_scoring_aes_essay scoring | 
| 81 | crisis - disaster - traffic - tweets - disasters | 45 | 81_crisis_disaster_traffic_tweets | 
| 82 | graph - text classification - text - gcn - classification | 44 | 82_graph_text classification_text_gcn | 
| 83 | annotation - tools - linguistic - resources - xml | 43 | 83_annotation_tools_linguistic_resources | 
| 84 | entity alignment - alignment - kgs - entity - ea | 43 | 84_entity alignment_alignment_kgs_entity | 
| 85 | personality - traits - personality traits - evaluative - text | 42 | 85_personality_traits_personality traits_evaluative | 
| 86 | ad - alzheimer - alzheimer disease - disease - speech | 40 | 86_ad_alzheimer_alzheimer disease_disease | 
| 87 | taxonomy - hypernymy - taxonomies - hypernym - hypernyms | 39 | 87_taxonomy_hypernymy_taxonomies_hypernym | 
| 88 | active learning - active - al - learning - uncertainty | 37 | 88_active learning_active_al_learning | 
| 89 | reviews - summaries - summarization - review - opinion | 36 | 89_reviews_summaries_summarization_review | 
| 90 | emoji - emojis - sentiment - message - anonymous | 35 | 90_emoji_emojis_sentiment_message | 
| 91 | table - table text - tables - table text generation - text generation | 35 | 91_table_table text_tables_table text generation | 
| 92 | domain - domain adaptation - adaptation - domains - source | 35 | 92_domain_domain adaptation_adaptation_domains | 
| 93 | alignment - word alignment - parallel - pairs - alignments | 34 | 93_alignment_word alignment_parallel_pairs | 
| 94 | indo - languages - indo european - names - family | 34 | 94_indo_languages_indo european_names | 
| 95 | patent - claim - claim generation - chemical - technical | 32 | 95_patent_claim_claim generation_chemical | 
| 96 | agents - emergent - communication - referential - games | 32 | 96_agents_emergent_communication_referential | 
| 97 | graph - amr - graph text - graphs - text generation | 31 | 97_graph_amr_graph text_graphs | 
| 98 | moral - ethical - norms - values - social | 29 | 98_moral_ethical_norms_values | 
| 99 | acronym - acronyms - abbreviations - abbreviation - disambiguation | 27 | 99_acronym_acronyms_abbreviations_abbreviation | 
| 100 | typing - entity typing - entity - type - types | 27 | 100_typing_entity typing_entity_type | 
| 101 | coherence - discourse - discourse coherence - coherence modeling - text | 26 | 101_coherence_discourse_discourse coherence_coherence modeling | 
| 102 | pos - taggers - tagging - tagger - pos tagging | 25 | 102_pos_taggers_tagging_tagger | 
| 103 | drug - social - social media - media - health | 25 | 103_drug_social_social media_media | 
| 104 | gender - translation - bias - gender bias - mt | 24 | 104_gender_translation_bias_gender bias | 
| 105 | job - resume - skills - skill - soft | 21 | 105_job_resume_skills_skill |
  
</details>

## Training Procedure

The model was trained as follows:

```python
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)

# Representation models
representation_models = {
    "POS": PartOfSpeech("en_core_web_lg"),
    "KeyBERTInspired": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    "OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
    "OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}

# Fit BERTopic
topic_model= BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_models,
        verbose=True
).fit(docs)
```

## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True

## Framework versions

* Numpy: 1.22.4
* HDBSCAN: 0.8.29
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.13.1
* Python: 3.10.11