Open to Collab

Aman Priyanshu PRO

AmanPriyanshu

https://amanpriyanshu.github.io/

AI & ML interests

PPML, Ethical-AI, Responsible-AI, Privacy-Preserving-Machine-Learning, AI-Safety, AI-Security

Recent Activity

liked a dataset about 4 hours ago

SupritiVijay/dr-tulu-sft-deep-research-agent-data-cleaned-rectified

liked a dataset about 6 hours ago

trendmicro-ailab/Primus-FineWeb

liked a dataset about 17 hours ago

nyuuzyou/google-code-archive

Posts 2

Stratified LLM Subsets: Balanced Training Data at 100K-1M Scale

Released three training datasets using embedding-based k-means clustering to create balanced subsets from large-scale corpora:

Interactive cluster visualization:
https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/

Pre-Training (FineWeb-Edu + Proof-Pile-2)
AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M

Instruction-Following (Tulu-3 + Orca AgentInstruct)
AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M

Reasoning (Llama-Nemotron with sqrt balancing)
AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M

Methodology: k-means clustering (100 iterations) on Snowflake Arctic-embed-xs embeddings, with cluster centroids selected as representatives. Balancing is applied to imbalanced datasets to reduce category dominance.
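
A minimal sketch of the selection step described above, under stated assumptions: it uses plain NumPy rather than whatever library the released datasets were built with, it interprets "centroids as representatives" as picking the sample nearest to each centroid (a centroid itself is generally not a real data point), and the function name `kmeans_representatives` is hypothetical. Embeddings are taken as an already-computed array; the embedding model is not called here.

```python
import numpy as np

def kmeans_representatives(embeddings, k, n_iter=100, seed=0):
    """Return, for each of k clusters, the index of the sample nearest its centroid.

    Illustrative only: the full (n, k, d) distance tensor is materialised,
    so this is suitable for small arrays, not million-row corpora.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=np.float64)
    # Initialise centroids from k distinct random samples.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged before the iteration cap
        centroids = new_centroids
    # Nearest real sample to each final centroid (duplicates are possible
    # if two centroids share the same nearest sample).
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=0)
```

Calling `kmeans_representatives(embeddings, k=100_000)` would, in principle, yield one representative index per cluster, so the subset size is set by `k`; at that scale a batched or approximate k-means implementation would be needed instead of this dense version.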

Available at 50k, 100k, 250k, 500k, and 1M scales.
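
The post does not spell out what "sqrt balancing" means for the reasoning subset. One common reading, assumed here, is that each category receives a share of the sampling budget proportional to the square root of its size, which dampens the dominance of very large categories. The helper name `sqrt_quotas` and the example category names are hypothetical.

```python
import math

def sqrt_quotas(category_counts, budget):
    """Split `budget` across categories in proportion to sqrt(category size).

    Assumption: this is one plausible interpretation of "sqrt balancing",
    not a confirmed description of how the released dataset was built.
    """
    weights = {c: math.sqrt(n) for c, n in category_counts.items()}
    total = sum(weights.values())
    # Round down, and never request more samples than a category holds.
    return {
        c: min(category_counts[c], int(budget * w / total))
        for c, w in weights.items()
    }

counts = {"math": 10_000, "code": 100}
quotas = sqrt_quotas(counts, budget=1_100)
```

With these toy counts, purely proportional sampling would give "code" only about 1% of the budget, whereas the square-root weighting gives it roughly 9%.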

Articles 1

Dynamic Topic Modeling with RedPajama: A New Approach to Hierarchical Content Understanding

Collections 10

Papers 5

arxiv:2601.21051

arxiv:2508.01059

arxiv:2504.21039

arxiv:2408.16163

Models 236

AmanPriyanshu/gpt-oss-20.9b-specialized-harmful-pruned-moe-only-32-experts

Text Generation • 21B • Updated Aug 13, 2025 • 4 • 1

AmanPriyanshu/gpt-oss-20.3b-specialized-harmful-pruned-moe-only-31-experts

Text Generation • 20B • Updated Aug 13, 2025 • 3 • 1

AmanPriyanshu/gpt-oss-19.7b-specialized-harmful-pruned-moe-only-30-experts

Text Generation • 20B • Updated Aug 13, 2025 • 2 • 1

AmanPriyanshu/gpt-oss-19.1b-specialized-harmful-pruned-moe-only-29-experts

Text Generation • 19B • Updated Aug 13, 2025 • 6 • 1

AmanPriyanshu/gpt-oss-18.5b-specialized-harmful-pruned-moe-only-28-experts

Text Generation • 19B • Updated Aug 13, 2025 • 1 • 1

AmanPriyanshu/gpt-oss-17.9b-specialized-harmful-pruned-moe-only-27-experts

Text Generation • 18B • Updated Aug 13, 2025 • 2 • 1

AmanPriyanshu/gpt-oss-17.3b-specialized-harmful-pruned-moe-only-26-experts

Text Generation • 17B • Updated Aug 13, 2025 • 1 • 1

AmanPriyanshu/gpt-oss-20.9b-specialized-instruction_following-pruned-moe-only-32-experts

Text Generation • 21B • Updated Aug 13, 2025 • 1 • 1

AmanPriyanshu/gpt-oss-16.7b-specialized-harmful-pruned-moe-only-25-experts

Text Generation • 17B • Updated Aug 13, 2025 • 3 • 1

AmanPriyanshu/gpt-oss-20.3b-specialized-instruction_following-pruned-moe-only-31-experts

Text Generation • 20B • Updated Aug 13, 2025 • 1 • 1

Datasets 12

AmanPriyanshu/rlvr-guru-raw-data-extended

Viewer • Updated Oct 20, 2025 • 226k • 70

AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M

Viewer • Updated Oct 4, 2025 • 1.9M • 29

AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M

Viewer • Updated Oct 4, 2025 • 2.22M • 259 • 1

AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M

Viewer • Updated Oct 4, 2025 • 1.9M • 19

AmanPriyanshu/Random-Code-Prompts

Viewer • Updated Aug 27, 2025 • 10k • 21

AmanPriyanshu/GPT-OSS-20B-benchmark-rollouts-512-tokens

Viewer • Updated Aug 12, 2025 • 36k • 18

AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations

Preview • Updated Aug 8, 2025 • 29 • 7

AmanPriyanshu/Tiny-Maze-Mock-GRPO

Viewer • Updated Aug 7, 2025 • 100k • 20

AmanPriyanshu/GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens

Viewer • Updated Jan 31, 2025 • 100k • 6

AmanPriyanshu/clone-of-gretel-financial-risk-analysis-v1

Viewer • Updated Dec 17, 2024 • 1.03k • 49
