Stratified LLM Subsets: Balanced Training Data at 100K-1M Scale
Released three training datasets that use embedding-based k-means clustering to create balanced subsets of large-scale corpora.
Interactive cluster visualization:
https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/
Pre-Training (FineWeb-Edu + Proof-Pile-2)
AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M
Instruction-Following (Tulu-3 + Orca AgentInstruct)
AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
Reasoning (Llama-Nemotron with sqrt balancing)
AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M
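The post does not spell out what "sqrt balancing" means. One common reading, assumed here, is that each category's sampling quota is made proportional to the square root of its raw size, which shrinks the share of dominant categories without discarding small ones. A minimal Python sketch under that assumption (the function name, category names, and counts are all hypothetical and not taken from the released datasets):

```python
import math


def sqrt_balanced_quotas(category_counts, target_total):
    """Allocate per-category quotas proportional to sqrt(category size).

    Assumption: this is one plausible interpretation of "sqrt balancing",
    not necessarily the procedure used to build the released datasets.
    """
    weights = {cat: math.sqrt(n) for cat, n in category_counts.items()}
    total_weight = sum(weights.values())
    quotas = {}
    for cat, n in category_counts.items():
        raw = target_total * weights[cat] / total_weight
        # A category cannot contribute more samples than it contains.
        quotas[cat] = min(n, int(round(raw)))
    return quotas


# Hypothetical category sizes for illustration only.
counts = {"math": 10000, "code": 2500, "chat": 100}
quotas = sqrt_balanced_quotas(counts, target_total=1600)
print(quotas)  # {'math': 1000, 'code': 500, 'chat': 100}
```

In this toy example "math" holds about 79% of the raw data but only 62.5% of the balanced subset, while the smallest category is kept in full.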
Methodology: k-means clustering (100 iterations) on Snowflake Arctic-embed-xs embeddings, with cluster centroids selected as representatives. Balancing is applied to imbalanced datasets to reduce category dominance.
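The selection step described above can be sketched roughly as follows. This is an illustration, not the released pipeline: random vectors stand in for the Arctic-embed-xs embeddings, the k-means loop is a plain NumPy implementation of Lloyd's algorithm, and the helper names are hypothetical. Since a centroid is a mean vector rather than a document, the sketch assumes "selecting cluster centroids" means picking the cluster member closest to each centroid.

```python
import numpy as np


def _sq_dists(points, centroids):
    # Squared Euclidean distance from every point to every centroid.
    return ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)


def kmeans(embeddings, k, n_iters=100, seed=0):
    """Plain Lloyd's k-means; stops early if the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct randomly chosen points.
    init_idx = rng.choice(len(embeddings), size=k, replace=False)
    centroids = embeddings[init_idx].copy()
    for _ in range(n_iters):
        labels = _sq_dists(embeddings, centroids).argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels so they match the final centroids.
    labels = _sq_dists(embeddings, centroids).argmin(axis=1)
    return centroids, labels


def select_representatives(embeddings, centroids, labels):
    """Assumption: the representative is the member nearest its centroid."""
    representatives = []
    for j, centroid in enumerate(centroids):
        member_idx = np.flatnonzero(labels == j)
        if member_idx.size == 0:
            continue  # an empty cluster contributes no representative
        d = ((embeddings[member_idx] - centroid) ** 2).sum(axis=1)
        representatives.append(int(member_idx[d.argmin()]))
    return representatives


# Toy data: three blobs of random vectors standing in for real embeddings.
rng = np.random.default_rng(42)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
embeddings = np.concatenate([c + rng.normal(size=(100, 8)) for c in centers])

centroids, labels = kmeans(embeddings, k=3, n_iters=100)
representatives = select_representatives(embeddings, centroids, labels)
print(representatives)  # row indices of the selected samples
```

Under this reading, a subset of N samples would come from running k-means with k = N and keeping one representative per cluster; at the scales in the post a mini-batch or GPU k-means would likely be needed in place of this dense distance matrix.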
Available at 50k, 100k, 250k, 500k, and 1M scales.