---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# bertopic_github_dataset_viewer_issues

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")

topic_model.get_topic_info()
```

## Topic overview

* Number of topics: 78
* Number of training documents: 3066

Overview of all topics:

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | jobs - datasets - cache - fix - pandas | 11 | -1_jobs_datasets_cache_fix |
| 0 | issue - viewer - dataset - for - bigsciencep3 | 534 | 0_issue_viewer_dataset_for |
| 1 | parquet - files - metadata - parquetanddatasetinfo - configparquetandinfo | 144 | 1_parquet_files_metadata_parquetanddatasetinfo |
| 2 | vulnerability - cryptography - dependencies - 4106 - update | 132 | 2_vulnerability_cryptography_dependencies_4106 |
| 3 | docs - doc - page - add - md | 109 | 3_docs_doc_page_add |
| 4 | rows - firstrows - row - truncated - response | 90 | 4_rows_firstrows_row_truncated |
| 5 | duckdb - index - splitduckdbindex - fts - try | 78 | 5_duckdb_index_splitduckdbindex_fts |
| 6 | hub - hubcache - timeout - datasethubcache - tags | 75 | 6_hub_hubcache_timeout_datasethubcache |
| 7 | audio - opus - extension - torchaudio - torch | 59 | 7_audio_opus_extension_torchaudio |
| 8 | filter - endpoint - isvalid - column - parameters | 54 | 8_filter_endpoint_isvalid_column |
| 9 | datasets - update - upgrade - dependency - to | 54 | 9_datasets_update_upgrade_dependency |
| 10 | docker - images - build - image - compose | 53 | 10_docker_images_build_image |
| 11 | cache - refresh - entries - entry - warm | 51 | 11_cache_refresh_entries_entry |
| 12 | mongo - mongodb - indexes - atlas - index | 48 | 12_mongo_mongodb_indexes_atlas |
| 13 | image - images - modality - support - pdf2image | 47 | 13_image_images_modality_support |
| 14 | unblock - block - blocked - blocklist - datasets | 46 | 14_unblock_block_blocked_blocklist |
| 15 | error - expected - xerrorcode - messages - catch | 44 | 15_error_expected_xerrorcode_messages |
| 16 | backfill - cron - job - time - move | 44 | 16_backfill_cron_job_time |
| 17 | jobs - waiting - job - finishedat - started | 44 | 17_jobs_waiting_job_finishedat |
| 18 | env - config - configs - vars - default | 41 | 18_env_config_configs_vars |
| 19 | gitpython - 3137 - 3141 - github - builddepsdev | 41 | 19_gitpython_3137_3141_github |
| 20 | assets - s3 - cachedassets - cached - fsspec | 40 | 20_assets_s3_cachedassets_cached |
| 21 | splitnamesfromstreaming - split - streaming - rename - names | 39 | 21_splitnamesfromstreaming_split_streaming_rename |
| 22 | statistics - stats - descriptive - splitdescriptivestatistics - class | 38 | 22_statistics_stats_descriptive_splitdescriptivestatistics |
| 23 | private - gated - datasets - public - gatedauto | 35 | 23_private_gated_datasets_public |
| 24 | metrics - healthcheck - port - adminmetrics - admin | 33 | 24_metrics_healthcheck_port_adminmetrics |
| 25 | steps - processing - step - triggers - graph | 32 | 25_steps_processing_step_triggers |
| 26 | ci - codecov - pr - fork - invalid | 31 | 26_ci_codecov_pr_fork |
| 27 | splits - split - list - configs - returned | 31 | 27_splits_split_list_configs |
| 28 | openapi - openapijson - spec - publish - spectral | 31 | 28_openapi_openapijson_spec_publish |
| 29 | queue - incremental - based - field - jobs | 31 | 29_queue_incremental_based_field |
| 30 | error - datasetwithscriptnotsupportederror - exist - no - datasetgenerationerror | 31 | 30_error_datasetwithscriptnotsupportederror_exist_no |
| 31 | ram - 5gb - heavy - reduce - overcommitment | 31 | 31_ram_5gb_heavy_reduce |
| 32 | workers - number - reduce - increase - heavy | 30 | 32_workers_number_reduce_increase |
| 33 | admin - ui - app - difficulty - prefix | 30 | 33_admin_ui_app_difficulty |
| 34 | chart - fixchart - helm - alb - featchart | 28 | 34_chart_fixchart_helm_alb |
| 35 | aiohttp - 386 - bump - 392 - 391 | 27 | 35_aiohttp_386_bump_392 |
| 36 | e2e - tests - test - ci - testmetrics | 27 | 36_e2e_tests_test_ci |
| 37 | huggingfacehub - upgrade - 0151 - version - branch | 27 | 37_huggingfacehub_upgrade_0151_version |
| 38 | test - tests - unit - pytestmemray - fixtures | 26 | 38_test_tests_unit_pytestmemray |
| 39 | webhook - webhooks - payload - visibility - hub | 26 | 39_webhook_webhooks_payload_visibility |
| 40 | migration - migrations - database - scripts - databases | 26 | 40_migration_migrations_database_scripts |
| 41 | refactor - dead - code - remove - abstractions | 25 | 41_refactor_dead_code_remove |
| 42 | retry - retryable - codes - every - createcommiterror | 25 | 42_retry_retryable_codes_every |
| 43 | log - logs - debug - level - crashes | 25 | 43_log_logs_debug_level |
| 44 | croissant - jsonld - fields - either - recordset | 25 | 44_croissant_jsonld_fields_either |
| 45 | pods - pod - number - scale - reverseproxy | 24 | 45_pods_pod_number_scale |
| 46 | scan - urls - spawning - presidio - optinouturls | 24 | 46_scan_urls_spawning_presidio |
| 47 | resources - feat - reduce - increase - production | 22 | 47_resources_feat_reduce_increase |
| 48 | download - manual - require - enum - extracted | 21 | 48_download_manual_require_enum |
| 49 | comment - issues - close - fix - tag | 20 | 49_comment_issues_close_fix |
| 50 | cache - entries - clean - hf - blocked | 19 | 50_cache_entries_clean_hf |
| 51 | worker - generic - workerjobtypesblocked - treccartools - dependencies | 19 | 51_worker_generic_workerjobtypesblocked_treccartools |
| 52 | datasetviewer - rename - datasetsserver - domain - server | 18 | 52_datasetviewer_rename_datasetsserver_domain |
| 53 | across - group - pip - directories - bump | 18 | 53_across_group_pip_directories |
| 54 | runner - runners - validation - job - parent | 18 | 54_runner_runners_validation_job |
| 55 | upgrade - datasets - feat - 221 - 1162dev0 | 18 | 55_upgrade_datasets_feat_221 |
| 56 | jwt - array - authorization - cookies - bypass | 18 | 56_jwt_array_authorization_cookies |
| 57 | allow - script - scriptbased - scripts - redpajamadata1t | 17 | 57_allow_script_scriptbased_scripts |
| 58 | unique - metrics - metric - cache - cron | 16 | 58_unique_metrics_metric_cache |
| 59 | aiohttp - libslibcommon - libslibapi - 386 - 385 | 16 | 59_aiohttp_libslibcommon_libslibapi_386 |
| 60 | pillow - 1001 - 1020 - bump - from | 16 | 60_pillow_1001_1020_bump |
| 61 | storage - disk - storageclient - storageadmin - client | 15 | 61_storage_disk_storageclient_storageadmin |
| 62 | resources - increase - 108010 - reduce - 2468 | 15 | 62_resources_increase_108010_reduce |
| 63 | poetry - dependabot - align - version - 20 | 14 | 63_poetry_dependabot_align_version |
| 64 | upgrade - datasets - 188 - pufanyimimicit - meaning | 14 | 64_upgrade_datasets_188_pufanyimimicit |
| 65 | auth - authentication - asynchronous - authcheck - 307 | 14 | 65_auth_authentication_asynchronous_authcheck |
| 66 | lock - locks - finishing - release - ttl | 14 | 66_lock_locks_finishing_release |
| 67 | nginx - proxy - reverse - reverseproxy - 1253 | 14 | 67_nginx_proxy_reverse_reverseproxy |
| 68 | orjson - 3915 - 390 - bump - from | 13 | 68_orjson_3915_390_bump |
| 69 | gradio - 3340 - 4110 - frontadminui - upgrade | 13 | 69_gradio_3340_4110_frontadminui |
| 70 | starlette - 0280 - 0362 - bump - 0231 | 13 | 70_starlette_0280_0362_bump |
| 71 | secrets - fixs3 - correct - secret - name | 13 | 71_secrets_fixs3_correct_secret |
| 72 | search - elastic - functionality - times - currently | 13 | 72_search_elastic_functionality_times |
| 73 | token - hftoken - app - secret - hf | 12 | 73_token_hftoken_app_secret |
| 74 | efs - nfs - mount - parquetmetadata - storage | 12 | 74_efs_nfs_mount_parquetmetadata |
| 75 | ruff - vscode - 045 - settings - ruffcache | 12 | 75_ruff_vscode_045_settings |
| 76 | kubernetes - kube - infrastructure - pdb - disruption | 12 | 76_kubernetes_kube_infrastructure_pdb |
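
In the table, each Label appears to be the Topic ID followed by the first four Topic Keywords, joined with underscores; every row shown is consistent with that pattern. Topic `-1` is the ID BERTopic uses for outlier documents. The sketch below reproduces the pattern for two rows copied from the table. The helper names `parse_keywords` and `make_label` are hypothetical and are not part of the BERTopic API.

```python
def parse_keywords(cell):
    """Split a 'Topic Keywords' cell such as 'a - b - c' into a list of words."""
    return [word.strip() for word in cell.split(" - ")]


def make_label(topic_id, keywords, n_words=4):
    """Join the topic ID and its first n_words keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


# Two rows copied from the table: (Topic ID, Topic Keywords, Topic Frequency).
rows = [
    (-1, "jobs - datasets - cache - fix - pandas", 11),
    (5, "duckdb - index - splitduckdbindex - fts - try", 78),
]

labels = [make_label(topic_id, parse_keywords(cell)) for topic_id, cell, _ in rows]
print(labels)
```

Going the other way, splitting a label on the first underscore recovers the topic ID, which can be handy when filtering the output of `topic_model.get_topic_info()`.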

## Training hyperparameters

* calculate_probabilities: False
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None

## Framework versions

* Numpy: 1.26.4
* HDBSCAN: 0.8.38.post1
* UMAP: 0.5.6
* Pandas: 2.1.4
* Scikit-Learn: 1.5.2
* Sentence-transformers: 3.1.1
* Transformers: 4.44.2
* Numba: 0.60.0
* Plotly: 5.24.1
* Python: 3.10.12
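
If loading the model behaves unexpectedly, it may help to compare the local environment against the framework versions listed in this card. The following is a minimal sketch using only the standard library. The distribution names in `EXPECTED` are assumptions about how each package is published on PyPI (for example, UMAP as `umap-learn`), and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from this card; the distribution names (keys) are assumed.
EXPECTED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.38.post1",
    "umap-learn": "0.5.6",
    "pandas": "2.1.4",
    "scikit-learn": "1.5.2",
    "sentence-transformers": "3.1.1",
    "transformers": "4.44.2",
    "numba": "0.60.0",
    "plotly": "5.24.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def find_mismatches(expected):
    """Map each package whose installed version differs to (listed, found)."""
    mismatches = {}
    for name, listed in expected.items():
        found = installed_version(name)
        if found != listed:
            mismatches[name] = (listed, found)
    return mismatches


for name, (listed, found) in find_mismatches(EXPECTED).items():
    print(f"{name}: card lists {listed}, environment has {found or 'nothing installed'}")
```

A mismatch does not necessarily mean the model will fail to load; the report is only a starting point for debugging. The Python version itself (3.10.12 in this card) is not covered by the sketch and can be checked separately with `sys.version_info`.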