- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 143
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 11
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 50
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 44
Collections including paper arxiv:2404.03543

- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 21
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 6
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  Paper • 2102.04664 • Published • 1

- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
  Paper • 2405.07990 • Published • 16
- Large Language Models as Planning Domain Generators
  Paper • 2405.06650 • Published • 9
- AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
  Paper • 2404.12753 • Published • 41
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  Paper • 2404.07972 • Published • 44

- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  Paper • 2406.11931 • Published • 57
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 31
- Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
  Paper • 2408.07060 • Published • 40

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 183
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 24
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 33

- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
  Paper • 2404.05961 • Published • 64
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
  Paper • 2404.05719 • Published • 80
- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- Are large language models superhuman chemists?
  Paper • 2404.01475 • Published • 16

- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- McEval: Massively Multilingual Code Evaluation
  Paper • 2406.07436 • Published • 39
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
  Paper • 2406.15877 • Published • 45
- Qwen2 Technical Report
  Paper • 2407.10671 • Published • 155

- Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
  Paper • 2403.09029 • Published • 54
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
  Paper • 2403.12968 • Published • 24
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 67
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
  Paper • 2403.09629 • Published • 72