metadata
tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

BERTopic_ArXiv

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model demonstrates several of the representation models available in BERTopic. It was trained on ~30,000 ArXiv abstracts using the following topic representation methods from bertopic.representation:

  • POS
  • KeyBERTInspired
  • MaximalMarginalRelevance
  • KeyBERT + MaximalMarginalRelevance
  • ChatGPT labels
  • ChatGPT summaries
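Of the methods listed above, MaximalMarginalRelevance re-ranks candidate keywords so that the selected words are relevant to the topic but not near-duplicates of each other. The following is a minimal sketch of that idea in pure Python, not BERTopic's implementation; the `mmr` and `cosine` helpers and the toy 2-D vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, top_n=3, diversity=0.3):
    """Greedy maximal marginal relevance over a {word: vector} mapping."""
    # Relevance of each candidate word to the topic embedding.
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):

        def score(w):
            # Penalize similarity to the closest already-selected word.
            redundancy = max(cosine(word_vecs[w], word_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy

        remaining = [w for w in word_vecs if w not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Toy embeddings (made up): "speeches" is almost identical to "speech".
topic_vec = [1.0, 0.0]
word_vecs = {
    "speech": [1.0, 0.0],
    "speeches": [1.0, 0.05],
    "asr": [0.8, 0.6],
}
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.0))
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.9))
```

With `diversity=0.0` the ranking is purely by relevance, so the near-duplicate is kept; with a high diversity the less similar word is preferred instead.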

An example of the default c-TF-IDF representations:

"multiaspect.png"
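For context, c-TF-IDF treats all documents in a topic as a single document and weights each term by its frequency inside the topic against its frequency across all topics. The sketch below follows the commonly documented form, tf(t, c) * log(1 + A / f(t)), where A is the average number of words per topic and f(t) is the total count of term t. It is a simplified illustration with whitespace tokenization, not the library's code, and `c_tf_idf` and `toy_topics` are hypothetical names.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Map {topic_id: [doc, ...]} to {topic_id: {term: weight}}."""
    # Join the documents of each topic and count its terms.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    totals = {topic: sum(c.values()) for topic, c in counts.items()}
    avg_words = sum(totals.values()) / len(totals)

    # Frequency of each term across all topics.
    term_freq = Counter()
    for c in counts.values():
        term_freq.update(c)

    return {
        topic: {
            term: (n / totals[topic]) * math.log(1 + avg_words / term_freq[term])
            for term, n in c.items()
        }
        for topic, c in counts.items()
    }


# Made-up documents, grouped by (hypothetical) topic ID.
toy_topics = {
    0: ["speech recognition model", "speech model"],
    1: ["translation model", "machine translation model"],
}
weights = c_tf_idf(toy_topics)
for topic, terms in weights.items():
    print(topic, sorted(terms, key=terms.get, reverse=True))
```

The shared word "model" is down-weighted because it occurs in both topics, while topic-specific words rise to the top.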

An example of labels generated by ChatGPT (gpt-3.5-turbo):

"multiaspect.png"

Usage

To use this model, please install BERTopic:

pip install -U bertopic
pip install -U safetensors

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()

To view all different topic representations (keywords, labels, summary, etc.) you can run the following:

topic_model.get_topic(1, full=True)
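The result of the call above can be condensed for quick inspection. This sketch assumes the return value is a dictionary that maps each representation name to a list of (word, score) pairs; `top_words` is a hypothetical helper, and the sample dictionary, its scores, and the label placeholder are made up rather than real model output.

```python
def top_words(representations, k=3):
    """Keep the k highest-scoring, non-empty entries per representation."""
    return {
        name: [
            word
            for word, _ in sorted(pairs, key=lambda p: p[1], reverse=True)
            if word
        ][:k]
        for name, pairs in representations.items()
    }


# Illustrative stand-in for the output of get_topic(1, full=True).
sample = {
    "Main": [("asr", 0.04), ("speech", 0.05), ("speech recognition", 0.03)],
    "OpenAI_Label": [("<label text>", 1), ("", 0)],
}
summary = top_words(sample, k=2)
print(summary)
```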

Topic overview

  • Number of topics: 107
  • Number of training documents: 33189
Overview of all topics:
| Topic ID | Topic Keywords | Topic Frequency | Label |
|---|---|---|---|
| -1 | language - models - model - data - based | 20 | -1_language_models_model_data |
| 0 | dialogue - dialog - response - responses - intent | 14247 | 0_dialogue_dialog_response_responses |
| 1 | speech - asr - speech recognition - recognition - end | 1833 | 1_speech_asr_speech recognition_recognition |
| 2 | tuning - tasks - prompt - models - language | 1369 | 2_tuning_tasks_prompt_models |
| 3 | summarization - summaries - summary - abstractive - document | 1109 | 3_summarization_summaries_summary_abstractive |
| 4 | question - answer - qa - answering - question answering | 893 | 4_question_answer_qa_answering |
| 5 | sentiment - sentiment analysis - aspect - analysis - opinion | 837 | 5_sentiment_sentiment analysis_aspect_analysis |
| 6 | clinical - medical - biomedical - notes - patient | 691 | 6_clinical_medical_biomedical_notes |
| 7 | translation - nmt - machine translation - neural machine - neural machine translation | 586 | 7_translation_nmt_machine translation_neural machine |
| 8 | generation - text generation - text - language generation - nlg | 558 | 8_generation_text generation_text_language generation |
| 9 | hate - hate speech - offensive - speech - detection | 484 | 9_hate_hate speech_offensive_speech |
| 10 | news - fake - fake news - stance - fact | 455 | 10_news_fake_fake news_stance |
| 11 | relation - relation extraction - extraction - relations - entity | 450 | 11_relation_relation extraction_extraction_relations |
| 12 | ner - named - named entity - entity - named entity recognition | 376 | 12_ner_named_named entity_entity |
| 13 | parsing - parser - dependency - treebank - parsers | 370 | 13_parsing_parser_dependency_treebank |
| 14 | event - temporal - events - event extraction - extraction | 314 | 14_event_temporal_events_event extraction |
| 15 | emotion - emotions - multimodal - emotion recognition - emotional | 300 | 15_emotion_emotions_multimodal_emotion recognition |
| 16 | word - embeddings - word embeddings - embedding - words | 292 | 16_word_embeddings_word embeddings_embedding |
| 17 | explanations - explanation - rationales - rationale - interpretability | 212 | 17_explanations_explanation_rationales_rationale |
| 18 | morphological - arabic - morphology - languages - inflection | 204 | 18_morphological_arabic_morphology_languages |
| 19 | topic - topics - topic models - lda - topic modeling | 200 | 19_topic_topics_topic models_lda |
| 20 | bias - gender - biases - gender bias - debiasing | 195 | 20_bias_gender_biases_gender bias |
| 21 | law - frequency - zipf - words - length | 185 | 21_law_frequency_zipf_words |
| 22 | legal - court - law - legal domain - case | 182 | 22_legal_court_law_legal domain |
| 23 | adversarial - attacks - attack - adversarial examples - robustness | 181 | 23_adversarial_attacks_attack_adversarial examples |
| 24 | commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning | 180 | 24_commonsense_commonsense knowledge_reasoning_knowledge |
| 25 | quantum - semantics - calculus - compositional - meaning | 171 | 25_quantum_semantics_calculus_compositional |
| 26 | correction - error - error correction - grammatical - grammatical error | 161 | 26_correction_error_error correction_grammatical |
| 27 | argument - arguments - argumentation - argumentative - mining | 160 | 27_argument_arguments_argumentation_argumentative |
| 28 | sarcasm - humor - sarcastic - detection - humorous | 157 | 28_sarcasm_humor_sarcastic_detection |
| 29 | coreference - resolution - coreference resolution - mentions - mention | 156 | 29_coreference_resolution_coreference resolution_mentions |
| 30 | sense - word sense - wsd - word - disambiguation | 153 | 30_sense_word sense_wsd_word |
| 31 | knowledge - knowledge graph - graph - link prediction - entities | 149 | 31_knowledge_knowledge graph_graph_link prediction |
| 32 | parsing - semantic parsing - amr - semantic - parser | 146 | 32_parsing_semantic parsing_amr_semantic |
| 33 | cross lingual - lingual - cross - transfer - languages | 146 | 33_cross lingual_lingual_cross_transfer |
| 34 | mt - translation - qe - quality - machine translation | 139 | 34_mt_translation_qe_quality |
| 35 | sql - text sql - queries - spider - schema | 138 | 35_sql_text sql_queries_spider |
| 36 | classification - text classification - label - text - labels | 136 | 36_classification_text classification_label_text |
| 37 | style - style transfer - transfer - text style - text style transfer | 136 | 37_style_style transfer_transfer_text style |
| 38 | question - question generation - questions - answer - generation | 129 | 38_question_question generation_questions_answer |
| 39 | authorship - authorship attribution - attribution - author - authors | 127 | 39_authorship_authorship attribution_attribution_author |
| 40 | sentence - sentence embeddings - similarity - sts - sentence embedding | 123 | 40_sentence_sentence embeddings_similarity_sts |
| 41 | code - identification - switching - cs - code switching | 121 | 41_code_identification_switching_cs |
| 42 | story - stories - story generation - generation - storytelling | 118 | 42_story_stories_story generation_generation |
| 43 | discourse - discourse relation - discourse relations - rst - discourse parsing | 117 | 43_discourse_discourse relation_discourse relations_rst |
| 44 | code - programming - source code - code generation - programming languages | 117 | 44_code_programming_source code_code generation |
| 45 | paraphrase - paraphrases - paraphrase generation - paraphrasing - generation | 114 | 45_paraphrase_paraphrases_paraphrase generation_paraphrasing |
| 46 | agent - games - environment - instructions - agents | 111 | 46_agent_games_environment_instructions |
| 47 | covid - covid 19 - 19 - tweets - pandemic | 108 | 47_covid_covid 19_19_tweets |
| 48 | linking - entity linking - entity - el - entities | 107 | 48_linking_entity linking_entity_el |
| 49 | poetry - poems - lyrics - poem - music | 103 | 49_poetry_poems_lyrics_poem |
| 50 | image - captioning - captions - visual - caption | 100 | 50_image_captioning_captions_visual |
| 51 | nli - entailment - inference - natural language inference - language inference | 96 | 51_nli_entailment_inference_natural language inference |
| 52 | keyphrase - keyphrases - extraction - document - phrases | 95 | 52_keyphrase_keyphrases_extraction_document |
| 53 | simplification - text simplification - ts - sentence - simplified | 95 | 53_simplification_text simplification_ts_sentence |
| 54 | empathetic - emotion - emotional - empathy - emotions | 95 | 54_empathetic_emotion_emotional_empathy |
| 55 | depression - mental - health - mental health - social media | 93 | 55_depression_mental_health_mental health |
| 56 | segmentation - word segmentation - chinese - chinese word segmentation - chinese word | 93 | 56_segmentation_word segmentation_chinese_chinese word segmentation |
| 57 | citation - scientific - papers - citations - scholarly | 85 | 57_citation_scientific_papers_citations |
| 58 | agreement - syntactic - verb - grammatical - subject verb | 85 | 58_agreement_syntactic_verb_grammatical |
| 59 | metaphor - literal - figurative - metaphors - idiomatic | 83 | 59_metaphor_literal_figurative_metaphors |
| 60 | srl - semantic role - role labeling - semantic role labeling - role | 82 | 60_srl_semantic role_role labeling_semantic role labeling |
| 61 | privacy - private - federated - privacy preserving - federated learning | 82 | 61_privacy_private_federated_privacy preserving |
| 62 | change - semantic change - time - semantic - lexical semantic | 82 | 62_change_semantic change_time_semantic |
| 63 | bilingual - lingual - cross lingual - cross - embeddings | 80 | 63_bilingual_lingual_cross lingual_cross |
| 64 | political - media - news - bias - articles | 77 | 64_political_media_news_bias |
| 65 | medical - qa - question - questions - clinical | 75 | 65_medical_qa_question_questions |
| 66 | math - mathematical - math word - word problems - problems | 73 | 66_math_mathematical_math word_word problems |
| 67 | financial - stock - market - price - news | 69 | 67_financial_stock_market_price |
| 68 | table - tables - tabular - reasoning - qa | 69 | 68_table_tables_tabular_reasoning |
| 69 | readability - complexity - assessment - features - reading | 65 | 69_readability_complexity_assessment_features |
| 70 | layout - document - documents - document understanding - extraction | 64 | 70_layout_document_documents_document understanding |
| 71 | brain - cognitive - reading - syntactic - language | 62 | 71_brain_cognitive_reading_syntactic |
| 72 | sign - gloss - language - signed - language translation | 61 | 72_sign_gloss_language_signed |
| 73 | vqa - visual - visual question - visual question answering - question | 59 | 73_vqa_visual_visual question_visual question answering |
| 74 | biased - biases - spurious - nlp - debiasing | 57 | 74_biased_biases_spurious_nlp |
| 75 | visual - dialogue - multimodal - image - dialog | 55 | 75_visual_dialogue_multimodal_image |
| 76 | translation - machine translation - machine - smt - statistical | 54 | 76_translation_machine translation_machine_smt |
| 77 | multimodal - visual - image - translation - machine translation | 52 | 77_multimodal_visual_image_translation |
| 78 | geographic - location - geolocation - geo - locations | 51 | 78_geographic_location_geolocation_geo |
| 79 | reasoning - prompting - llms - chain thought - chain | 48 | 79_reasoning_prompting_llms_chain thought |
| 80 | essay - scoring - aes - essay scoring - essays | 45 | 80_essay_scoring_aes_essay scoring |
| 81 | crisis - disaster - traffic - tweets - disasters | 45 | 81_crisis_disaster_traffic_tweets |
| 82 | graph - text classification - text - gcn - classification | 44 | 82_graph_text classification_text_gcn |
| 83 | annotation - tools - linguistic - resources - xml | 43 | 83_annotation_tools_linguistic_resources |
| 84 | entity alignment - alignment - kgs - entity - ea | 43 | 84_entity alignment_alignment_kgs_entity |
| 85 | personality - traits - personality traits - evaluative - text | 42 | 85_personality_traits_personality traits_evaluative |
| 86 | ad - alzheimer - alzheimer disease - disease - speech | 40 | 86_ad_alzheimer_alzheimer disease_disease |
| 87 | taxonomy - hypernymy - taxonomies - hypernym - hypernyms | 39 | 87_taxonomy_hypernymy_taxonomies_hypernym |
| 88 | active learning - active - al - learning - uncertainty | 37 | 88_active learning_active_al_learning |
| 89 | reviews - summaries - summarization - review - opinion | 36 | 89_reviews_summaries_summarization_review |
| 90 | emoji - emojis - sentiment - message - anonymous | 35 | 90_emoji_emojis_sentiment_message |
| 91 | table - table text - tables - table text generation - text generation | 35 | 91_table_table text_tables_table text generation |
| 92 | domain - domain adaptation - adaptation - domains - source | 35 | 92_domain_domain adaptation_adaptation_domains |
| 93 | alignment - word alignment - parallel - pairs - alignments | 34 | 93_alignment_word alignment_parallel_pairs |
| 94 | indo - languages - indo european - names - family | 34 | 94_indo_languages_indo european_names |
| 95 | patent - claim - claim generation - chemical - technical | 32 | 95_patent_claim_claim generation_chemical |
| 96 | agents - emergent - communication - referential - games | 32 | 96_agents_emergent_communication_referential |
| 97 | graph - amr - graph text - graphs - text generation | 31 | 97_graph_amr_graph text_graphs |
| 98 | moral - ethical - norms - values - social | 29 | 98_moral_ethical_norms_values |
| 99 | acronym - acronyms - abbreviations - abbreviation - disambiguation | 27 | 99_acronym_acronyms_abbreviations_abbreviation |
| 100 | typing - entity typing - entity - type - types | 27 | 100_typing_entity typing_entity_type |
| 101 | coherence - discourse - discourse coherence - coherence modeling - text | 26 | 101_coherence_discourse_discourse coherence_coherence modeling |
| 102 | pos - taggers - tagging - tagger - pos tagging | 25 | 102_pos_taggers_tagging_tagger |
| 103 | drug - social - social media - media - health | 25 | 103_drug_social_social media_media |
| 104 | gender - translation - bias - gender bias - mt | 24 | 104_gender_translation_bias_gender bias |
| 105 | job - resume - skills - skill - soft | 21 | 105_job_resume_skills_skill |
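The labels in the overview follow a visible pattern: the topic ID, then the first four keywords, joined by underscores. A small sketch of splitting such a label back into its parts is shown here; `parse_label` is a hypothetical helper, and it assumes keywords never contain an underscore themselves, which holds for the labels listed above.

```python
def parse_label(label):
    """Split '<id>_<kw>_<kw>...' into (topic_id, keywords)."""
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords


# Multi-word keywords keep their internal spaces.
print(parse_label("1_speech_asr_speech recognition_recognition"))
# The outlier topic uses the ID -1.
print(parse_label("-1_language_models_model_data"))
```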

Training Procedure

The model was trained as follows:

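# cuML provides GPU-accelerated UMAP and HDBSCAN (requires a RAPIDS installation)
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI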

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)

# Representation models
representation_models = {
    "POS": PartOfSpeech("en_core_web_lg"),
    "KeyBERTInspired": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    "OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
    "OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}

# Fit BERTopic; `docs` is the list of ArXiv abstracts (one string per document)
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    verbose=True,
).fit(docs)
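The summarization prompt in the training code contains two placeholders, [KEYWORDS] and [DOCUMENTS], which are substituted per topic before the request is sent. As a rough illustration of that substitution only (no API call is made), the sketch below fills a shortened template using plain string replacement. `fill_prompt` is a hypothetical helper, and the exact formatting the library applies to keywords and documents is an assumption here.

```python
def fill_prompt(template, keywords, documents):
    """Substitute topic keywords and representative documents into a prompt."""
    keyword_text = ", ".join(keywords)
    # One bullet per representative document (formatting is an assumption).
    document_text = "".join(f"- {doc}\n" for doc in documents)
    return template.replace("[KEYWORDS]", keyword_text).replace(
        "[DOCUMENTS]", document_text
    )


# Shortened version of the prompt used in the training code.
template = (
    "I have a topic that is described by the following keywords: [KEYWORDS]\n"
    "In this topic, the following documents are a small but representative subset:\n"
    "[DOCUMENTS]\n"
    "topic: <description>\n"
)
prompt = fill_prompt(
    template,
    keywords=["speech", "asr", "speech recognition"],
    documents=["placeholder abstract one", "placeholder abstract two"],
)
print(prompt)
```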

Training hyperparameters

  • calculate_probabilities: False
  • language: None
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: None
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: True

Framework versions

  • Numpy: 1.22.4
  • HDBSCAN: 0.8.29
  • UMAP: 0.5.3
  • Pandas: 1.5.3
  • Scikit-Learn: 1.2.2
  • Sentence-transformers: 2.2.2
  • Transformers: 4.29.2
  • Numba: 0.56.4
  • Plotly: 5.13.1
  • Python: 3.10.11