DataComp

non-profit

https://www.datacomp.ai/dclm/index.html#home

dclm's activity

greglindahl

authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 4 days ago • 39

thomwolf

authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 4 days ago • 39

Lewis-Lau

authored 2 papers 2 days ago

T-Rex: Text-assisted Retrosynthesis Prediction

Paper • 2401.14637 • Published Jan 26, 2024

Tensor Product Attention Is All You Need

Paper • 2501.06425 • Published 7 days ago • 66

lx865712528

authored a paper 8 days ago

EpiCoder: Encompassing Diversity and Complexity in Code Generation

Paper • 2501.04694 • Published 10 days ago • 9

lx865712528

authored a paper 19 days ago

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Paper • 2412.15797 • Published 29 days ago • 17

orionweller

authored 8 papers 30 days ago

NevIR: Negation in Neural Information Retrieval

Paper • 2305.07614 • Published May 12, 2023 • 1

Learning from Task Descriptions

Paper • 2011.08115 • Published Nov 16, 2020

MegaWika: Millions of reports and their sources across 50 diverse languages

Paper • 2307.07049 • Published Jul 13, 2023

Defending Against Poisoning Attacks in Open-Domain Question Answering

Paper • 2212.10002 • Published Dec 20, 2022

Learning to Reason via Program Generation, Emulation, and Search

Paper • 2405.16337 • Published May 25, 2024

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

Paper • 2406.17186 • Published Jun 24, 2024 • 1

Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Paper • 2409.11136 • Published Sep 17, 2024 • 22

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published about 1 month ago • 123

AmeyaPrabhu

authored a paper about 1 month ago

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Paper • 2412.06745 • Published Dec 9, 2024 • 6

ranpox

authored a paper about 1 month ago

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Paper • 2412.09605 • Published Dec 12, 2024 • 28

hlzhang109

authored 4 papers about 1 month ago

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Paper • 2304.03279 • Published Apr 6, 2023 • 1

CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

Paper • 2406.10670 • Published Jun 15, 2024 • 4

DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published Jun 17, 2024 • 51

Eliminating Position Bias of Language Models: A Mechanistic Approach

Paper • 2407.01100 • Published Jul 1, 2024 • 6