Add BERTopic model
- README.md +174 -0
- config.json +15 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,174 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# BERTopic_ArXiv

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()
```
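
Once loaded, the model can also assign topics to unseen abstracts. The following is a minimal sketch, assuming `sentence-transformers` is installed so that the embedding model named in `config.json` can be loaded. `label_documents` and `demo` are hypothetical helper names, not part of BERTopic, and the example sentence is made up for illustration.

```python
def label_documents(topic_model, docs):
    """Pair each document with the topic ID predicted by `transform`."""
    # BERTopic's transform returns a pair: predicted topic IDs and scores.
    topics, _scores = topic_model.transform(docs)
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]


def demo():
    # Not called automatically: it downloads the model from the Hugging Face Hub.
    from bertopic import BERTopic

    topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
    docs = ["We propose a new end-to-end model for speech recognition."]
    for doc, topic in label_documents(topic_model, docs):
        print(topic, doc)
```

Calling `demo()` prints one topic ID per document; an ID of -1 marks a document the model treats as an outlier.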

## Topic overview

* Number of topics: 107
* Number of training documents: 33189

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | language - models - model - data - based | 20 | -1_language_models_model_data |
| 0 | dialogue - dialog - response - responses - intent | 14247 | 0_dialogue_dialog_response_responses |
| 1 | speech - asr - speech recognition - recognition - end | 1833 | 1_speech_asr_speech recognition_recognition |
| 2 | tuning - tasks - prompt - models - language | 1369 | 2_tuning_tasks_prompt_models |
| 3 | summarization - summaries - summary - abstractive - document | 1109 | 3_summarization_summaries_summary_abstractive |
| 4 | question - answer - qa - answering - question answering | 893 | 4_question_answer_qa_answering |
| 5 | sentiment - sentiment analysis - aspect - analysis - opinion | 837 | 5_sentiment_sentiment analysis_aspect_analysis |
| 6 | clinical - medical - biomedical - notes - patient | 691 | 6_clinical_medical_biomedical_notes |
| 7 | translation - nmt - machine translation - neural machine - neural machine translation | 586 | 7_translation_nmt_machine translation_neural machine |
| 8 | generation - text generation - text - language generation - nlg | 558 | 8_generation_text generation_text_language generation |
| 9 | hate - hate speech - offensive - speech - detection | 484 | 9_hate_hate speech_offensive_speech |
| 10 | news - fake - fake news - stance - fact | 455 | 10_news_fake_fake news_stance |
| 11 | relation - relation extraction - extraction - relations - entity | 450 | 11_relation_relation extraction_extraction_relations |
| 12 | ner - named - named entity - entity - named entity recognition | 376 | 12_ner_named_named entity_entity |
| 13 | parsing - parser - dependency - treebank - parsers | 370 | 13_parsing_parser_dependency_treebank |
| 14 | event - temporal - events - event extraction - extraction | 314 | 14_event_temporal_events_event extraction |
| 15 | emotion - emotions - multimodal - emotion recognition - emotional | 300 | 15_emotion_emotions_multimodal_emotion recognition |
| 16 | word - embeddings - word embeddings - embedding - words | 292 | 16_word_embeddings_word embeddings_embedding |
| 17 | explanations - explanation - rationales - rationale - interpretability | 212 | 17_explanations_explanation_rationales_rationale |
| 18 | morphological - arabic - morphology - languages - inflection | 204 | 18_morphological_arabic_morphology_languages |
| 19 | topic - topics - topic models - lda - topic modeling | 200 | 19_topic_topics_topic models_lda |
| 20 | bias - gender - biases - gender bias - debiasing | 195 | 20_bias_gender_biases_gender bias |
| 21 | law - frequency - zipf - words - length | 185 | 21_law_frequency_zipf_words |
| 22 | legal - court - law - legal domain - case | 182 | 22_legal_court_law_legal domain |
| 23 | adversarial - attacks - attack - adversarial examples - robustness | 181 | 23_adversarial_attacks_attack_adversarial examples |
| 24 | commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning | 180 | 24_commonsense_commonsense knowledge_reasoning_knowledge |
| 25 | quantum - semantics - calculus - compositional - meaning | 171 | 25_quantum_semantics_calculus_compositional |
| 26 | correction - error - error correction - grammatical - grammatical error | 161 | 26_correction_error_error correction_grammatical |
| 27 | argument - arguments - argumentation - argumentative - mining | 160 | 27_argument_arguments_argumentation_argumentative |
| 28 | sarcasm - humor - sarcastic - detection - humorous | 157 | 28_sarcasm_humor_sarcastic_detection |
| 29 | coreference - resolution - coreference resolution - mentions - mention | 156 | 29_coreference_resolution_coreference resolution_mentions |
| 30 | sense - word sense - wsd - word - disambiguation | 153 | 30_sense_word sense_wsd_word |
| 31 | knowledge - knowledge graph - graph - link prediction - entities | 149 | 31_knowledge_knowledge graph_graph_link prediction |
| 32 | parsing - semantic parsing - amr - semantic - parser | 146 | 32_parsing_semantic parsing_amr_semantic |
| 33 | cross lingual - lingual - cross - transfer - languages | 146 | 33_cross lingual_lingual_cross_transfer |
| 34 | mt - translation - qe - quality - machine translation | 139 | 34_mt_translation_qe_quality |
| 35 | sql - text sql - queries - spider - schema | 138 | 35_sql_text sql_queries_spider |
| 36 | classification - text classification - label - text - labels | 136 | 36_classification_text classification_label_text |
| 37 | style - style transfer - transfer - text style - text style transfer | 136 | 37_style_style transfer_transfer_text style |
| 38 | question - question generation - questions - answer - generation | 129 | 38_question_question generation_questions_answer |
| 39 | authorship - authorship attribution - attribution - author - authors | 127 | 39_authorship_authorship attribution_attribution_author |
| 40 | sentence - sentence embeddings - similarity - sts - sentence embedding | 123 | 40_sentence_sentence embeddings_similarity_sts |
| 41 | code - identification - switching - cs - code switching | 121 | 41_code_identification_switching_cs |
| 42 | story - stories - story generation - generation - storytelling | 118 | 42_story_stories_story generation_generation |
| 43 | discourse - discourse relation - discourse relations - rst - discourse parsing | 117 | 43_discourse_discourse relation_discourse relations_rst |
| 44 | code - programming - source code - code generation - programming languages | 117 | 44_code_programming_source code_code generation |
| 45 | paraphrase - paraphrases - paraphrase generation - paraphrasing - generation | 114 | 45_paraphrase_paraphrases_paraphrase generation_paraphrasing |
| 46 | agent - games - environment - instructions - agents | 111 | 46_agent_games_environment_instructions |
| 47 | covid - covid 19 - 19 - tweets - pandemic | 108 | 47_covid_covid 19_19_tweets |
| 48 | linking - entity linking - entity - el - entities | 107 | 48_linking_entity linking_entity_el |
| 49 | poetry - poems - lyrics - poem - music | 103 | 49_poetry_poems_lyrics_poem |
| 50 | image - captioning - captions - visual - caption | 100 | 50_image_captioning_captions_visual |
| 51 | nli - entailment - inference - natural language inference - language inference | 96 | 51_nli_entailment_inference_natural language inference |
| 52 | keyphrase - keyphrases - extraction - document - phrases | 95 | 52_keyphrase_keyphrases_extraction_document |
| 53 | simplification - text simplification - ts - sentence - simplified | 95 | 53_simplification_text simplification_ts_sentence |
| 54 | empathetic - emotion - emotional - empathy - emotions | 95 | 54_empathetic_emotion_emotional_empathy |
| 55 | depression - mental - health - mental health - social media | 93 | 55_depression_mental_health_mental health |
| 56 | segmentation - word segmentation - chinese - chinese word segmentation - chinese word | 93 | 56_segmentation_word segmentation_chinese_chinese word segmentation |
| 57 | citation - scientific - papers - citations - scholarly | 85 | 57_citation_scientific_papers_citations |
| 58 | agreement - syntactic - verb - grammatical - subject verb | 85 | 58_agreement_syntactic_verb_grammatical |
| 59 | metaphor - literal - figurative - metaphors - idiomatic | 83 | 59_metaphor_literal_figurative_metaphors |
| 60 | srl - semantic role - role labeling - semantic role labeling - role | 82 | 60_srl_semantic role_role labeling_semantic role labeling |
| 61 | privacy - private - federated - privacy preserving - federated learning | 82 | 61_privacy_private_federated_privacy preserving |
| 62 | change - semantic change - time - semantic - lexical semantic | 82 | 62_change_semantic change_time_semantic |
| 63 | bilingual - lingual - cross lingual - cross - embeddings | 80 | 63_bilingual_lingual_cross lingual_cross |
| 64 | political - media - news - bias - articles | 77 | 64_political_media_news_bias |
| 65 | medical - qa - question - questions - clinical | 75 | 65_medical_qa_question_questions |
| 66 | math - mathematical - math word - word problems - problems | 73 | 66_math_mathematical_math word_word problems |
| 67 | financial - stock - market - price - news | 69 | 67_financial_stock_market_price |
| 68 | table - tables - tabular - reasoning - qa | 69 | 68_table_tables_tabular_reasoning |
| 69 | readability - complexity - assessment - features - reading | 65 | 69_readability_complexity_assessment_features |
| 70 | layout - document - documents - document understanding - extraction | 64 | 70_layout_document_documents_document understanding |
| 71 | brain - cognitive - reading - syntactic - language | 62 | 71_brain_cognitive_reading_syntactic |
| 72 | sign - gloss - language - signed - language translation | 61 | 72_sign_gloss_language_signed |
| 73 | vqa - visual - visual question - visual question answering - question | 59 | 73_vqa_visual_visual question_visual question answering |
| 74 | biased - biases - spurious - nlp - debiasing | 57 | 74_biased_biases_spurious_nlp |
| 75 | visual - dialogue - multimodal - image - dialog | 55 | 75_visual_dialogue_multimodal_image |
| 76 | translation - machine translation - machine - smt - statistical | 54 | 76_translation_machine translation_machine_smt |
| 77 | multimodal - visual - image - translation - machine translation | 52 | 77_multimodal_visual_image_translation |
| 78 | geographic - location - geolocation - geo - locations | 51 | 78_geographic_location_geolocation_geo |
| 79 | reasoning - prompting - llms - chain thought - chain | 48 | 79_reasoning_prompting_llms_chain thought |
| 80 | essay - scoring - aes - essay scoring - essays | 45 | 80_essay_scoring_aes_essay scoring |
| 81 | crisis - disaster - traffic - tweets - disasters | 45 | 81_crisis_disaster_traffic_tweets |
| 82 | graph - text classification - text - gcn - classification | 44 | 82_graph_text classification_text_gcn |
| 83 | annotation - tools - linguistic - resources - xml | 43 | 83_annotation_tools_linguistic_resources |
| 84 | entity alignment - alignment - kgs - entity - ea | 43 | 84_entity alignment_alignment_kgs_entity |
| 85 | personality - traits - personality traits - evaluative - text | 42 | 85_personality_traits_personality traits_evaluative |
| 86 | ad - alzheimer - alzheimer disease - disease - speech | 40 | 86_ad_alzheimer_alzheimer disease_disease |
| 87 | taxonomy - hypernymy - taxonomies - hypernym - hypernyms | 39 | 87_taxonomy_hypernymy_taxonomies_hypernym |
| 88 | active learning - active - al - learning - uncertainty | 37 | 88_active learning_active_al_learning |
| 89 | reviews - summaries - summarization - review - opinion | 36 | 89_reviews_summaries_summarization_review |
| 90 | emoji - emojis - sentiment - message - anonymous | 35 | 90_emoji_emojis_sentiment_message |
| 91 | table - table text - tables - table text generation - text generation | 35 | 91_table_table text_tables_table text generation |
| 92 | domain - domain adaptation - adaptation - domains - source | 35 | 92_domain_domain adaptation_adaptation_domains |
| 93 | alignment - word alignment - parallel - pairs - alignments | 34 | 93_alignment_word alignment_parallel_pairs |
| 94 | indo - languages - indo european - names - family | 34 | 94_indo_languages_indo european_names |
| 95 | patent - claim - claim generation - chemical - technical | 32 | 95_patent_claim_claim generation_chemical |
| 96 | agents - emergent - communication - referential - games | 32 | 96_agents_emergent_communication_referential |
| 97 | graph - amr - graph text - graphs - text generation | 31 | 97_graph_amr_graph text_graphs |
| 98 | moral - ethical - norms - values - social | 29 | 98_moral_ethical_norms_values |
| 99 | acronym - acronyms - abbreviations - abbreviation - disambiguation | 27 | 99_acronym_acronyms_abbreviations_abbreviation |
| 100 | typing - entity typing - entity - type - types | 27 | 100_typing_entity typing_entity_type |
| 101 | coherence - discourse - discourse coherence - coherence modeling - text | 26 | 101_coherence_discourse_discourse coherence_coherence modeling |
| 102 | pos - taggers - tagging - tagger - pos tagging | 25 | 102_pos_taggers_tagging_tagger |
| 103 | drug - social - social media - media - health | 25 | 103_drug_social_social media_media |
| 104 | gender - translation - bias - gender bias - mt | 24 | 104_gender_translation_bias_gender bias |
| 105 | job - resume - skills - skill - soft | 21 | 105_job_resume_skills_skill |

</details>
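
Each entry in the `Label` column appears to be the topic ID joined by underscores with the first four keywords. The sketch below, which uses three rows copied from the table and the hypothetical helpers `parse_row` and `make_label`, checks that reading and reports each topic's frequency as a share of the 33189 training documents.

```python
def parse_row(row):
    """Split one Markdown table row into (topic_id, keywords, frequency, label)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    topic_id, keywords, frequency, label = cells
    return int(topic_id), keywords.split(" - "), int(frequency), label


def make_label(topic_id, keywords, n_words=4):
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + keywords[:n_words])


# Rows copied verbatim from the topic overview table.
ROWS = [
    "| -1 | language - models - model - data - based | 20 | -1_language_models_model_data |",
    "| 1 | speech - asr - speech recognition - recognition - end | 1833 | 1_speech_asr_speech recognition_recognition |",
    "| 105 | job - resume - skills - skill - soft | 21 | 105_job_resume_skills_skill |",
]
N_TRAINING_DOCS = 33189

for row in ROWS:
    topic_id, keywords, frequency, label = parse_row(row)
    # The rebuilt label should match the one stored in the table.
    assert make_label(topic_id, keywords) == label
    print(f"{label}: {frequency / N_TRAINING_DOCS:.2%} of training documents")
```

Multi-word keywords such as `speech recognition` keep their internal space, which is why the rows are split on `" - "` rather than on whitespace.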

## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True

## Framework versions

* Numpy: 1.22.4
* HDBSCAN: 0.8.29
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.13.1
* Python: 3.10.11
config.json
ADDED
@@ -0,0 +1,15 @@
{
  "calculate_probabilities": false,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "embedding_model": "sentence-transformers/all-mpnet-base-v2"
}
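
As a sketch of how these settings might be fed back into BERTopic, the snippet below turns the JSON into constructor keyword arguments. It assumes the keys map one-to-one onto `BERTopic` constructor parameters, converts `n_gram_range` from a JSON list back to a tuple, and keeps `embedding_model` separate because it names a sentence-transformers model rather than a plain hyperparameter. `config_to_kwargs` is a hypothetical helper, not part of BERTopic.

```python
import json

# Contents of config.json, copied from the diff above.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "embedding_model": "sentence-transformers/all-mpnet-base-v2"
}
"""


def config_to_kwargs(text):
    """Return (hyperparameter kwargs, embedding model name) from config JSON."""
    config = json.loads(text)
    embedding_model = config.pop("embedding_model", None)
    # JSON has no tuple type, so the n-gram range is stored as a list.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config, embedding_model


kwargs, embedding_model = config_to_kwargs(CONFIG_JSON)
print(kwargs["n_gram_range"], embedding_model)

# With BERTopic installed, one could then try (assumption, not run here):
#   from bertopic import BERTopic
#   topic_model = BERTopic(embedding_model=embedding_model, **kwargs)
```

This builds an untrained model with the same settings; the fitted topics themselves live in the other files of this commit.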
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:547360497eca2939ce70813b95f0234945004863f67c0e21ced47b26ca3e4dca
size 9317840
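
These three lines are a Git LFS pointer rather than the tensor data itself: each line is a `key value` pair giving the spec version, the content hash, and the file size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, and the pointer text is copied from the diff above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a small dictionary."""
    # Every non-empty line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:547360497eca2939ce70813b95f0234945004863f67c0e21ced47b26ca3e4dca
size 9317840
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

The size field shows that the stored c-TF-IDF matrix is roughly 9.3 MB.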
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0c96563fb1306ff6b742112a98e27f85ec911c27c11d3b381e1cb849a1a72088
size 328792
topics.json
ADDED
The diff for this file is too large to render.