metadata

tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

BERTopic_ArXiv

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model demonstrates the use of several representation models that can be used within BERTopic. This model was trained on ~30000 ArXiv abstracts with the following topic representation methods (bertopic.representation):

POS
KeyBERTInspired
MaximalMarginalRelevance
KeyBERT + MaximalMarginalRelevance
ChatGPT labels
ChatGPT summaries

An example of the default c-TF-IDF representations:

An example of labels generated by ChatGPT (gpt-3.5-turbo):

Usage

To use this model, please install BERTopic:

pip install -U bertopic
pip install -U safetensors

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()

To view all different topic representations (keywords, labels, summary, etc.) you can run the following:

topic_model.get_topic(1, full=True)

Topic overview

Number of topics: 107
Number of training documents: 33189

Click here for an overview of all topics.

Topic ID	Topic Keywords	Topic Frequency	Label
-1	language - models - model - data - based	20	-1_language_models_model_data
0	dialogue - dialog - response - responses - intent	14247	0_dialogue_dialog_response_responses
1	speech - asr - speech recognition - recognition - end	1833	1_speech_asr_speech recognition_recognition
2	tuning - tasks - prompt - models - language	1369	2_tuning_tasks_prompt_models
3	summarization - summaries - summary - abstractive - document	1109	3_summarization_summaries_summary_abstractive
4	question - answer - qa - answering - question answering	893	4_question_answer_qa_answering
5	sentiment - sentiment analysis - aspect - analysis - opinion	837	5_sentiment_sentiment analysis_aspect_analysis
6	clinical - medical - biomedical - notes - patient	691	6_clinical_medical_biomedical_notes
7	translation - nmt - machine translation - neural machine - neural machine translation	586	7_translation_nmt_machine translation_neural machine
8	generation - text generation - text - language generation - nlg	558	8_generation_text generation_text_language generation
9	hate - hate speech - offensive - speech - detection	484	9_hate_hate speech_offensive_speech
10	news - fake - fake news - stance - fact	455	10_news_fake_fake news_stance
11	relation - relation extraction - extraction - relations - entity	450	11_relation_relation extraction_extraction_relations
12	ner - named - named entity - entity - named entity recognition	376	12_ner_named_named entity_entity
13	parsing - parser - dependency - treebank - parsers	370	13_parsing_parser_dependency_treebank
14	event - temporal - events - event extraction - extraction	314	14_event_temporal_events_event extraction
15	emotion - emotions - multimodal - emotion recognition - emotional	300	15_emotion_emotions_multimodal_emotion recognition
16	word - embeddings - word embeddings - embedding - words	292	16_word_embeddings_word embeddings_embedding
17	explanations - explanation - rationales - rationale - interpretability	212	17_explanations_explanation_rationales_rationale
18	morphological - arabic - morphology - languages - inflection	204	18_morphological_arabic_morphology_languages
19	topic - topics - topic models - lda - topic modeling	200	19_topic_topics_topic models_lda
20	bias - gender - biases - gender bias - debiasing	195	20_bias_gender_biases_gender bias
21	law - frequency - zipf - words - length	185	21_law_frequency_zipf_words
22	legal - court - law - legal domain - case	182	22_legal_court_law_legal domain
23	adversarial - attacks - attack - adversarial examples - robustness	181	23_adversarial_attacks_attack_adversarial examples
24	commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning	180	24_commonsense_commonsense knowledge_reasoning_knowledge
25	quantum - semantics - calculus - compositional - meaning	171	25_quantum_semantics_calculus_compositional
26	correction - error - error correction - grammatical - grammatical error	161	26_correction_error_error correction_grammatical
27	argument - arguments - argumentation - argumentative - mining	160	27_argument_arguments_argumentation_argumentative
28	sarcasm - humor - sarcastic - detection - humorous	157	28_sarcasm_humor_sarcastic_detection
29	coreference - resolution - coreference resolution - mentions - mention	156	29_coreference_resolution_coreference resolution_mentions
30	sense - word sense - wsd - word - disambiguation	153	30_sense_word sense_wsd_word
31	knowledge - knowledge graph - graph - link prediction - entities	149	31_knowledge_knowledge graph_graph_link prediction
32	parsing - semantic parsing - amr - semantic - parser	146	32_parsing_semantic parsing_amr_semantic
33	cross lingual - lingual - cross - transfer - languages	146	33_cross lingual_lingual_cross_transfer
34	mt - translation - qe - quality - machine translation	139	34_mt_translation_qe_quality
35	sql - text sql - queries - spider - schema	138	35_sql_text sql_queries_spider
36	classification - text classification - label - text - labels	136	36_classification_text classification_label_text
37	style - style transfer - transfer - text style - text style transfer	136	37_style_style transfer_transfer_text style
38	question - question generation - questions - answer - generation	129	38_question_question generation_questions_answer
39	authorship - authorship attribution - attribution - author - authors	127	39_authorship_authorship attribution_attribution_author
40	sentence - sentence embeddings - similarity - sts - sentence embedding	123	40_sentence_sentence embeddings_similarity_sts
41	code - identification - switching - cs - code switching	121	41_code_identification_switching_cs
42	story - stories - story generation - generation - storytelling	118	42_story_stories_story generation_generation
43	discourse - discourse relation - discourse relations - rst - discourse parsing	117	43_discourse_discourse relation_discourse relations_rst
44	code - programming - source code - code generation - programming languages	117	44_code_programming_source code_code generation
45	paraphrase - paraphrases - paraphrase generation - paraphrasing - generation	114	45_paraphrase_paraphrases_paraphrase generation_paraphrasing
46	agent - games - environment - instructions - agents	111	46_agent_games_environment_instructions
47	covid - covid 19 - 19 - tweets - pandemic	108	47_covid_covid 19_19_tweets
48	linking - entity linking - entity - el - entities	107	48_linking_entity linking_entity_el
49	poetry - poems - lyrics - poem - music	103	49_poetry_poems_lyrics_poem
50	image - captioning - captions - visual - caption	100	50_image_captioning_captions_visual
51	nli - entailment - inference - natural language inference - language inference	96	51_nli_entailment_inference_natural language inference
52	keyphrase - keyphrases - extraction - document - phrases	95	52_keyphrase_keyphrases_extraction_document
53	simplification - text simplification - ts - sentence - simplified	95	53_simplification_text simplification_ts_sentence
54	empathetic - emotion - emotional - empathy - emotions	95	54_empathetic_emotion_emotional_empathy
55	depression - mental - health - mental health - social media	93	55_depression_mental_health_mental health
56	segmentation - word segmentation - chinese - chinese word segmentation - chinese word	93	56_segmentation_word segmentation_chinese_chinese word segmentation
57	citation - scientific - papers - citations - scholarly	85	57_citation_scientific_papers_citations
58	agreement - syntactic - verb - grammatical - subject verb	85	58_agreement_syntactic_verb_grammatical
59	metaphor - literal - figurative - metaphors - idiomatic	83	59_metaphor_literal_figurative_metaphors
60	srl - semantic role - role labeling - semantic role labeling - role	82	60_srl_semantic role_role labeling_semantic role labeling
61	privacy - private - federated - privacy preserving - federated learning	82	61_privacy_private_federated_privacy preserving
62	change - semantic change - time - semantic - lexical semantic	82	62_change_semantic change_time_semantic
63	bilingual - lingual - cross lingual - cross - embeddings	80	63_bilingual_lingual_cross lingual_cross
64	political - media - news - bias - articles	77	64_political_media_news_bias
65	medical - qa - question - questions - clinical	75	65_medical_qa_question_questions
66	math - mathematical - math word - word problems - problems	73	66_math_mathematical_math word_word problems
67	financial - stock - market - price - news	69	67_financial_stock_market_price
68	table - tables - tabular - reasoning - qa	69	68_table_tables_tabular_reasoning
69	readability - complexity - assessment - features - reading	65	69_readability_complexity_assessment_features
70	layout - document - documents - document understanding - extraction	64	70_layout_document_documents_document understanding
71	brain - cognitive - reading - syntactic - language	62	71_brain_cognitive_reading_syntactic
72	sign - gloss - language - signed - language translation	61	72_sign_gloss_language_signed
73	vqa - visual - visual question - visual question answering - question	59	73_vqa_visual_visual question_visual question answering
74	biased - biases - spurious - nlp - debiasing	57	74_biased_biases_spurious_nlp
75	visual - dialogue - multimodal - image - dialog	55	75_visual_dialogue_multimodal_image
76	translation - machine translation - machine - smt - statistical	54	76_translation_machine translation_machine_smt
77	multimodal - visual - image - translation - machine translation	52	77_multimodal_visual_image_translation
78	geographic - location - geolocation - geo - locations	51	78_geographic_location_geolocation_geo
79	reasoning - prompting - llms - chain thought - chain	48	79_reasoning_prompting_llms_chain thought
80	essay - scoring - aes - essay scoring - essays	45	80_essay_scoring_aes_essay scoring
81	crisis - disaster - traffic - tweets - disasters	45	81_crisis_disaster_traffic_tweets
82	graph - text classification - text - gcn - classification	44	82_graph_text classification_text_gcn
83	annotation - tools - linguistic - resources - xml	43	83_annotation_tools_linguistic_resources
84	entity alignment - alignment - kgs - entity - ea	43	84_entity alignment_alignment_kgs_entity
85	personality - traits - personality traits - evaluative - text	42	85_personality_traits_personality traits_evaluative
86	ad - alzheimer - alzheimer disease - disease - speech	40	86_ad_alzheimer_alzheimer disease_disease
87	taxonomy - hypernymy - taxonomies - hypernym - hypernyms	39	87_taxonomy_hypernymy_taxonomies_hypernym
88	active learning - active - al - learning - uncertainty	37	88_active learning_active_al_learning
89	reviews - summaries - summarization - review - opinion	36	89_reviews_summaries_summarization_review
90	emoji - emojis - sentiment - message - anonymous	35	90_emoji_emojis_sentiment_message
91	table - table text - tables - table text generation - text generation	35	91_table_table text_tables_table text generation
92	domain - domain adaptation - adaptation - domains - source	35	92_domain_domain adaptation_adaptation_domains
93	alignment - word alignment - parallel - pairs - alignments	34	93_alignment_word alignment_parallel_pairs
94	indo - languages - indo european - names - family	34	94_indo_languages_indo european_names
95	patent - claim - claim generation - chemical - technical	32	95_patent_claim_claim generation_chemical
96	agents - emergent - communication - referential - games	32	96_agents_emergent_communication_referential
97	graph - amr - graph text - graphs - text generation	31	97_graph_amr_graph text_graphs
98	moral - ethical - norms - values - social	29	98_moral_ethical_norms_values
99	acronym - acronyms - abbreviations - abbreviation - disambiguation	27	99_acronym_acronyms_abbreviations_abbreviation
100	typing - entity typing - entity - type - types	27	100_typing_entity typing_entity_type
101	coherence - discourse - discourse coherence - coherence modeling - text	26	101_coherence_discourse_discourse coherence_coherence modeling
102	pos - taggers - tagging - tagger - pos tagging	25	102_pos_taggers_tagging_tagger
103	drug - social - social media - media - health	25	103_drug_social_social media_media
104	gender - translation - bias - gender bias - mt	24	104_gender_translation_bias_gender bias
105	job - resume - skills - skill - soft	21	105_job_resume_skills_skill

Training Procedure

The model was trained as follows:

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)

# Representation models
representation_models = {
    "POS": PartOfSpeech("en_core_web_lg"),
    "KeyBERTInspired": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    "OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
    "OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}

# Fit BERTopic
topic_model= BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_models,
        verbose=True
).fit(docs)

Training hyperparameters

calculate_probabilities: False
language: None
low_memory: False
min_topic_size: 10
n_gram_range: (1, 1)
nr_topics: None
seed_topic_list: None
top_n_words: 10
verbose: True

Framework versions

Numpy: 1.22.4
HDBSCAN: 0.8.29
UMAP: 0.5.3
Pandas: 1.5.3
Scikit-Learn: 1.2.2
Sentence-transformers: 2.2.2
Transformers: 4.29.2
Numba: 0.56.4
Plotly: 5.13.1
Python: 3.10.11