Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
smol_llama 220M fine-tunes we did
- BEE-spoke-data/smol_llama-220M-openhermes
  Text Generation • 0.2B • Updated • 1.01k • 5
- BEE-spoke-data/smol_llama-220M-open_instruct
  Text Generation • 0.2B • Updated • 6 • 2
- BEE-spoke-data/beecoder-220M-python
  Text Generation • 0.2B • Updated • 5 • 3
- BEE-spoke-data/zephyr-220m-sft-full
  Text Generation • 0.2B • Updated • 623 • 1
models fine-tuned to be knowledgeable about apiary practice
- BEE-spoke-data/TinyLlama-3T-1.1bee
  Text Generation • 1B • Updated • 7 • 2
- BEE-spoke-data/TinyLlama-1.1bee
  Text Generation • 1B • Updated • 5 • 1
- BEE-spoke-data/Meta-Llama-3-8Bee
  Text Generation • 8B • Updated • 7
- BEE-spoke-data/phi-1bee5
  Text Generation • 1B • Updated • 4 • 1
trained and adapted tokenizers - various
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA
  Text Generation • 0.1B • Updated • 1.66k • 29
- BEE-spoke-data/smol_llama-81M-tied
  Text Generation • 0.1B • Updated • 1.22k • 9
- BEE-spoke-data/smol_llama-220M-GQA
  Text Generation • 0.2B • Updated • 2.06k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2
  Text Generation • 0.1B • Updated • 1.02k • 4
Pretrained encoder (fill-mask) models we made
text classification models for book genres
- BEE-spoke-data/albert-xxlarge-v2-description2genre
  Text Classification • 0.2B • Updated • 2 • 2
- BEE-spoke-data/mobilebert-uncased-title2genre
  Text Classification • 0.0B • Updated • 1 • 1
- BEE-spoke-data/roberta-large-title2genre
  Text Classification • 0.4B • Updated • 1 • 1
- BEE-spoke-data/roberta-base-description2genre
  Text Classification • 0.1B • Updated • 2
concept datasets extracted from fineweb