Papers - an Imotech Collection

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8, 2024 • 26

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Paper • 2406.13897 • Published May 30, 2024 • 12

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Paper • 2407.13759 • Published Jul 18, 2024 • 18

POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Paper • 2407.14931 • Published Jul 20, 2024 • 22

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23, 2024 • 29

DistilDIRE: A Small, Fast, Cheap and Lightweight Diffusion Synthesized Deepfake Detection

Paper • 2406.00856 • Published Jun 2, 2024 • 12

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Paper • 2407.16741 • Published Jul 23, 2024 • 73

3D Question Answering for City Scene Understanding

Paper • 2407.17398 • Published Jul 24, 2024 • 22

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Paper • 2407.20229 • Published Jul 29, 2024 • 7

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 116

RelBench: A Benchmark for Deep Learning on Relational Databases

Paper • 2407.20060 • Published Jul 29, 2024 • 10

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Paper • 2408.02545 • Published Aug 5, 2024 • 38

MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

Paper • 2408.02555 • Published Aug 5, 2024 • 32

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Paper • 2408.03256 • Published Aug 6, 2024 • 10

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6, 2024 • 60

Transformer Explainer: Interactive Learning of Text-Generative Models

Paper • 2408.04619 • Published Aug 8, 2024 • 172

FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework

Paper • 2408.06190 • Published Aug 12, 2024 • 18

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Paper • 2408.07060 • Published Aug 13, 2024 • 42

Imagen 3

Paper • 2408.07009 • Published Aug 13, 2024 • 62

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Paper • 2408.13252 • Published Aug 23, 2024 • 26

MuCodec: Ultra Low-Bitrate Music Codec

Paper • 2409.13216 • Published Sep 20, 2024 • 23

Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19, 2024 • 140

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Paper • 2409.12961 • Published Sep 19, 2024 • 25

FlexiTex: Enhancing Texture Generation with Visual Guidance

Paper • 2409.12431 • Published Sep 19, 2024 • 13

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Paper • 2409.12892 • Published Sep 19, 2024 • 5

SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending

Paper • 2409.13926 • Published Sep 20, 2024 • 6

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Paper • 2409.15278 • Published Sep 23, 2024 • 25

Improvements to SDXL in NovelAI Diffusion V3

Paper • 2409.15997 • Published Sep 24, 2024 • 13

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Paper • 2409.17115 • Published Sep 25, 2024 • 63

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Paper • 2409.18125 • Published Sep 26, 2024 • 34

Game4Loc: A UAV Geo-Localization Benchmark from Game Data

Paper • 2409.16925 • Published Sep 25, 2024 • 8

DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Paper • 2409.20563 • Published Sep 30, 2024 • 9

Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

Paper • 2410.00418 • Published Oct 1, 2024 • 10

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

Paper • 2410.00337 • Published Oct 1, 2024 • 11

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Paper • 2410.00890 • Published Oct 1, 2024 • 20

Law of the Weakest Link: Cross Capabilities of Large Language Models

Paper • 2409.19951 • Published Sep 30, 2024 • 54

Illustrious: an Open Advanced Illustration Model

Paper • 2409.19946 • Published Sep 30, 2024 • 16

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Paper • 2410.01215 • Published Oct 2, 2024 • 40

3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection

Paper • 2410.01647 • Published Oct 2, 2024 • 31

Addition is All You Need for Energy-efficient Language Models

Paper • 2410.00907 • Published Oct 1, 2024 • 151

MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction

Paper • 2410.02241 • Published Oct 3, 2024 • 8

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Paper • 2410.01273 • Published Oct 2, 2024 • 11

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Paper • 2410.01912 • Published Oct 2, 2024 • 14

DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Paper • 2410.03864 • Published Oct 4, 2024 • 12

Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach

Paper • 2410.06949 • Published Oct 9, 2024 • 6

Data Selection via Optimal Control for Language Models

Paper • 2410.07064 • Published Oct 9, 2024 • 9

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Paper • 2410.07171 • Published Oct 9, 2024 • 43

Does Spatial Cognition Emerge in Frontier Models?

Paper • 2410.06468 • Published Oct 9, 2024 • 2

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Paper • 2410.03450 • Published Oct 4, 2024 • 36

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10, 2024 • 26

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Paper • 2410.05265 • Published Oct 7, 2024 • 32

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Paper • 2410.09732 • Published Oct 13, 2024 • 55

Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Paper • 2410.09584 • Published Oct 12, 2024 • 48

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Paper • 2410.10306 • Published Oct 14, 2024 • 57

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Paper • 2410.10563 • Published Oct 14, 2024 • 38

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Paper • 2410.10792 • Published Oct 14, 2024 • 31

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14, 2024 • 29

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Paper • 2410.09335 • Published Oct 12, 2024 • 17

Baichuan-Omni Technical Report

Paper • 2410.08565 • Published Oct 11, 2024 • 87

Mentor-KD: Making Small Language Models Better Multi-step Reasoners

Paper • 2410.09037 • Published Oct 11, 2024 • 4

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Paper • 2410.09008 • Published Oct 11, 2024 • 17

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Paper • 2410.08102 • Published Oct 10, 2024 • 21

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

Paper • 2410.08815 • Published Oct 11, 2024 • 47

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9, 2024 • 37

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Paper • 2410.08261 • Published Oct 10, 2024 • 52

FlatQuant: Flatness Matters for LLM Quantization

Paper • 2410.09426 • Published Oct 12, 2024 • 16

Harnessing Webpage UIs for Text-Rich Visual Understanding

Paper • 2410.13824 • Published Oct 17, 2024 • 31

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Paper • 2410.13757 • Published Oct 17, 2024 • 33

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Paper • 2410.13360 • Published Oct 17, 2024 • 9

AERO: Softmax-Only LLMs for Efficient Private Inference

Paper • 2410.13060 • Published Oct 16, 2024 • 4

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Paper • 2410.13754 • Published Oct 17, 2024 • 75

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Paper • 2410.13370 • Published Oct 17, 2024 • 37

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

Paper • 2410.13674 • Published Oct 17, 2024 • 17

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Paper • 2410.13726 • Published Oct 17, 2024 • 12

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Paper • 2410.10812 • Published Oct 14, 2024 • 18

AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21, 2024 • 59

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 56

SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation

Paper • 2410.14745 • Published Oct 17, 2024 • 47

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Paper • 2410.16153 • Published Oct 21, 2024 • 44

Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Paper • 2410.12788 • Published Oct 16, 2024 • 23

DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes

Paper • 2410.18084 • Published Oct 23, 2024 • 14

Lightweight Neural App Control

Paper • 2410.17883 • Published Oct 23, 2024 • 10

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Paper • 2410.13924 • Published Oct 17, 2024 • 8

LOGO -- Long cOntext aliGnment via efficient preference Optimization

Paper • 2410.18533 • Published Oct 24, 2024 • 43

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Paper • 2410.18693 • Published Oct 24, 2024 • 42

Framer: Interactive Frame Interpolation

Paper • 2410.18978 • Published Oct 24, 2024 • 37

Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24, 2024 • 37

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Paper • 2410.18798 • Published Oct 24, 2024 • 21

WAFFLE: Multi-Modal Model for Automated Front-End Development

Paper • 2410.18362 • Published Oct 24, 2024 • 13

mistralai/Pixtral-12B-Base-2409

Updated Jul 28 • 18 • 104

Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning

Paper • 2410.19290 • Published Oct 25, 2024 • 10

Continuous Speech Synthesis using per-token Latent Diffusion

Paper • 2410.16048 • Published Oct 21, 2024 • 29

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Paper • 2410.19168 • Published Oct 24, 2024 • 22

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Paper • 2404.16710 • Published Apr 25, 2024 • 80

GPT-4o System Card

Paper • 2410.21276 • Published Oct 25, 2024 • 84

Neural Fields in Robotics: A Survey

Paper • 2410.20220 • Published Oct 26, 2024 • 5

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Paper • 2410.21220 • Published Oct 28, 2024 • 11

AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

Paper • 2410.18603 • Published Oct 24, 2024 • 32

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

Paper • 2410.18666 • Published Oct 24, 2024 • 19

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Paper • 2410.21465 • Published Oct 28, 2024 • 11

CLEAR: Character Unlearning in Textual and Visual Modalities

Paper • 2410.18057 • Published Oct 23, 2024 • 209

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Paper • 2410.20424 • Published Oct 27, 2024 • 40

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Paper • 2410.23168 • Published Oct 30, 2024 • 24

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Paper • 2410.20650 • Published Oct 28, 2024 • 17

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Paper • 2410.22366 • Published Oct 28, 2024 • 83

Language Models can Self-Lengthen to Generate Long Texts

Paper • 2410.23933 • Published Oct 31, 2024 • 18

SelfCodeAlign: Self-Alignment for Code Generation

Paper • 2410.24198 • Published Oct 31, 2024 • 24

Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

Paper • 2410.24032 • Published Oct 31, 2024 • 10

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Paper • 2410.21157 • Published Oct 28, 2024 • 6

Face Anonymization Made Simple

Paper • 2411.00762 • Published Nov 1, 2024 • 8

HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

Paper • 2410.22901 • Published Oct 30, 2024 • 8

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

Paper • 2411.00771 • Published Nov 1, 2024 • 9

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Paper • 2410.24024 • Published Oct 31, 2024 • 49

Training-free Regional Prompting for Diffusion Transformers

Paper • 2411.02395 • Published Nov 4, 2024 • 25

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Paper • 2411.02355 • Published Nov 4, 2024 • 51

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

Paper • 2411.02336 • Published Nov 4, 2024 • 24

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

Paper • 2411.03047 • Published Nov 5, 2024 • 9

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Paper • 2411.02959 • Published Nov 5, 2024 • 70

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published Nov 5, 2024 • 26

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7, 2024 • 127

RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval

Paper • 2411.04752 • Published Nov 7, 2024 • 17

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Paper • 2411.05007 • Published Nov 7, 2024 • 22

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Paper • 2411.04952 • Published Nov 7, 2024 • 30

BitNet a4.8: 4-bit Activations for 1-bit LLMs

Paper • 2411.04965 • Published Nov 7, 2024 • 69

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Paper • 2411.06176 • Published Nov 9, 2024 • 45

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Paper • 2411.07126 • Published Nov 11, 2024 • 30

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Paper • 2411.07199 • Published Nov 11, 2024 • 50

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Paper • 2411.04954 • Published Nov 7, 2024 • 10

PramaLLC/BEN

Image Segmentation • Updated Jan 26 • 20 • 91

SAMPart3D: Segment Any Part in 3D Objects

Paper • 2411.07184 • Published Nov 11, 2024 • 28

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

Paper • 2411.09595 • Published Nov 14, 2024 • 77

MagicQuill: An Intelligent Interactive Image Editing System

Paper • 2411.09703 • Published Nov 14, 2024 • 78

Large Language Models Can Self-Improve in Long-context Reasoning

Paper • 2411.08147 • Published Nov 12, 2024 • 66

GenXD: Generating Any 3D and 4D Scenes

Paper • 2411.02319 • Published Nov 4, 2024 • 20

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 129

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Paper • 2411.10510 • Published Nov 15, 2024 • 8

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Paper • 2411.10669 • Published Nov 16, 2024 • 10

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Paper • 2411.09944 • Published Nov 15, 2024 • 12

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Paper • 2411.10640 • Published Nov 16, 2024 • 46

Continuous Speculative Decoding for Autoregressive Image Generation

Paper • 2411.11925 • Published Nov 18, 2024 • 16

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 56

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Paper • 2411.12044 • Published Nov 18, 2024 • 14

Building Trust: Foundations of Security, Safety and Transparency in AI

Paper • 2411.12275 • Published Nov 19, 2024 • 11

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

Paper • 2411.10161 • Published Nov 15, 2024 • 9

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Paper • 2411.10958 • Published Nov 17, 2024 • 56

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Paper • 2411.11922 • Published Nov 18, 2024 • 19

Ultra-Sparse Memory Network

Paper • 2411.12364 • Published Nov 19, 2024 • 23

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Paper • 2411.14199 • Published Nov 21, 2024 • 31

Natural Language Reinforcement Learning

Paper • 2411.14251 • Published Nov 21, 2024 • 30

Patience Is The Key to Large Language Model Reasoning

Paper • 2411.13082 • Published Nov 20, 2024 • 7

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Paper • 2411.14347 • Published Nov 21, 2024 • 15

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Paper • 2411.13807 • Published Nov 21, 2024 • 11

Hymba: A Hybrid-head Architecture for Small Language Models

Paper • 2411.13676 • Published Nov 20, 2024 • 45

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Paper • 2411.14405 • Published Nov 21, 2024 • 61

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

Paper • 2411.15124 • Published Nov 22, 2024 • 66

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Paper • 2411.14982 • Published Nov 22, 2024 • 19

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Paper • 2411.14794 • Published Nov 22, 2024 • 13

MyTimeMachine: Personalized Facial Age Transformation

Paper • 2411.14521 • Published Nov 21, 2024 • 22

Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images

Paper • 2411.13127 • Published Nov 20, 2024 • 4

Material Anything: Generating Materials for Any 3D Object via Diffusion

Paper • 2411.15138 • Published Nov 22, 2024 • 50

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 88

Learning 3D Representations from Procedural 3D Programs

Paper • 2411.17467 • Published Nov 25, 2024 • 9

TEXGen: a Generative Diffusion Model for Mesh Textures

Paper • 2411.14740 • Published Nov 22, 2024 • 18

ROICtrl: Boosting Instance Control for Visual Generation

Paper • 2411.17949 • Published Nov 27, 2024 • 87

DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Paper • 2411.17786 • Published Nov 26, 2024 • 12

Adaptive Blind All-in-One Image Restoration

Paper • 2411.18412 • Published Nov 27, 2024 • 4

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Paper • 2411.18478 • Published Nov 27, 2024 • 37

GRAPE: Generalizing Robot Policy via Preference Alignment

Paper • 2411.19309 • Published Nov 28, 2024 • 47

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Paper • 2411.18552 • Published Nov 27, 2024 • 18

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Paper • 2411.19146 • Published Nov 28, 2024 • 17

MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications

Paper • 2411.18915 • Published Nov 28, 2024 • 8

Reverse Thinking Makes LLMs Stronger Reasoners

Paper • 2411.19865 • Published Nov 29, 2024 • 23

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Paper • 2411.19638 • Published Nov 29, 2024 • 6

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Paper • 2411.19842 • Published Nov 29, 2024 • 12

TinyFusion: Diffusion Transformers Learned Shallow

Paper • 2412.01199 • Published Dec 2, 2024 • 14

o1-Coder: an o1 Replication for Coding

Paper • 2412.00154 • Published Nov 29, 2024 • 44

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Paper • 2412.00174 • Published Nov 29, 2024 • 23

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

Paper • 2412.00568 • Published Nov 30, 2024 • 21

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Paper • 2412.01822 • Published Dec 2, 2024 • 15

Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Paper • 2412.00176 • Published Nov 29, 2024 • 9

HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving

Paper • 2412.01718 • Published Dec 2, 2024 • 3

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Paper • 2412.01292 • Published Dec 2, 2024 • 13

SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Paper • 2412.02687 • Published Dec 3, 2024 • 113

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 133

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Paper • 2412.03515 • Published Dec 4, 2024 • 27

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Paper • 2412.02030 • Published Dec 2, 2024 • 19

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Paper • 2412.03558 • Published Dec 4, 2024 • 20

CleanDIFT: Diffusion Features without Noise

Paper • 2412.03439 • Published Dec 4, 2024 • 13

Mimir: Improving Video Diffusion Models for Precise Text Understanding

Paper • 2412.03085 • Published Dec 4, 2024 • 12

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Paper • 2412.04455 • Published Dec 5, 2024 • 38

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Paper • 2412.06781 • Published Dec 9, 2024 • 24

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Paper • 2412.07744 • Published Dec 10, 2024 • 20

Are Your LLMs Capable of Stable Reasoning?

Paper • 2412.13147 • Published Dec 17, 2024 • 94

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Paper • 2412.11863 • Published Dec 16, 2024 • 4

TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning

Paper • 2412.10447 • Published Dec 11, 2024 • 5

The Open Source Advantage in Large Language Models (LLMs)

Paper • 2412.12004 • Published Dec 16, 2024 • 10

Smaller Language Models Are Better Instruction Evolvers

Paper • 2412.11231 • Published Dec 15, 2024 • 28

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Paper • 2412.11919 • Published Dec 16, 2024 • 36

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published Dec 13, 2024 • 108

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

Paper • 2412.12606 • Published Dec 17, 2024 • 41

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Paper • 2412.13171 • Published Dec 17, 2024 • 36

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Paper • 2412.10704 • Published Dec 14, 2024 • 16

Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

Paper • 2412.11713 • Published Dec 16, 2024 • 6

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

Paper • 2412.13746 • Published Dec 18, 2024 • 9

Phi-4 Technical Report

Paper • 2412.08905 • Published Dec 12, 2024 • 120

Evaluating and Aligning CodeLLMs on Human Preference

Paper • 2412.05210 • Published Dec 6, 2024 • 50

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Paper • 2412.09501 • Published Dec 12, 2024 • 48

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Paper • 2412.08737 • Published Dec 11, 2024 • 54

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Paper • 2412.09626 • Published Dec 12, 2024 • 21

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Paper • 2409.11242 • Published Sep 17, 2024 • 7

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 98

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published Dec 18, 2024 • 156

AniDoc: Animation Creation Made Easier

Paper • 2412.14173 • Published Dec 18, 2024 • 56

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Paper • 2412.14171 • Published Dec 18, 2024 • 24

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Paper • 2412.13795 • Published Dec 18, 2024 • 20

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Paper • 2412.15204 • Published Dec 19, 2024 • 37

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published Dec 19, 2024 • 55

Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published Dec 19, 2024 • 73

Qwen2.5 Technical Report

Paper • 2412.15115 • Published Dec 19, 2024 • 376

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Paper • 2412.14042 • Published Dec 18, 2024 • 6

Predicting the Original Appearance of Damaged Historical Documents

Paper • 2412.11634 • Published Dec 16, 2024 • 4

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities

Paper • 2412.14123 • Published Dec 18, 2024 • 11

Parallelized Autoregressive Visual Generation

Paper • 2412.15119 • Published Dec 19, 2024 • 53

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Paper • 2412.13649 • Published Dec 18, 2024 • 20

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Paper • 2412.14590 • Published Dec 19, 2024 • 14

Multi-LLM Text Summarization

Paper • 2412.15487 • Published Dec 20, 2024 • 6

IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Paper • 2412.14963 • Published Dec 19, 2024 • 6

DepthLab: From Partial to Complete

Paper • 2412.18153 • Published Dec 24, 2024 • 36

SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval

Paper • 2412.15443 • Published Dec 19, 2024 • 10

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Paper • 2412.14922 • Published Dec 19, 2024 • 88

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published Dec 24, 2024 • 39

YuLan-Mini: An Open Data-efficient Language Model

Paper • 2412.17743 • Published Dec 23, 2024 • 65

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published Dec 24, 2024 • 20

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

Paper • 2412.18605 • Published Dec 24, 2024 • 22

Automating the Enterprise with Foundation Models

Paper • 2405.03710 • Published May 3, 2024 • 1

1.58-bit FLUX

Paper • 2412.18653 • Published Dec 24, 2024 • 84

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published Dec 24, 2024 • 75

Bringing Objects to Life: 4D generation from 3D objects

Paper • 2412.20422 • Published Dec 29, 2024 • 40

Edicho: Consistent Image Editing in the Wild

Paper • 2412.21079 • Published Dec 30, 2024 • 23

Slow Perception: Let's Perceive Geometric Figures Step-by-step

Paper • 2412.20631 • Published Dec 30, 2024 • 15

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

Paper • 2412.20005 • Published Dec 28, 2024 • 17

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

Paper • 2412.04862 • Published Dec 6, 2024 • 50

ProgCo: Program Helps Self-Correction of Large Language Models

Paper • 2501.01264 • Published Jan 2 • 26

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1 • 107

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Paper • 2501.00316 • Published Dec 31, 2024 • 23

MapQaTor: A System for Efficient Annotation of Map Query Datasets

Paper • 2412.21015 • Published Dec 30, 2024 • 10

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Paper • 2412.19723 • Published Dec 27, 2024 • 87

AutoPresent: Designing Structured Visuals from Scratch

Paper • 2501.00912 • Published Jan 1 • 8

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Paper • 2501.03226 • Published Jan 6 • 44

Test-time Computing: from System-1 Thinking to System-2 Thinking

Paper • 2501.02497 • Published Jan 5 • 46

Personalized Graph-Based Retrieval for Large Language Models

Paper • 2501.02157 • Published Jan 4 • 31

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

Paper • 2501.02790 • Published Jan 6 • 9

Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

Paper • 2501.03916 • Published Jan 7 • 16

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Paper • 2501.03262 • Published Jan 4 • 102

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7 • 46

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7 • 52

Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published Jan 7 • 81

GeAR: Generation Augmented Retrieval

Paper • 2501.02772 • Published Jan 6 • 22

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Paper • 2501.04689 • Published Jan 8 • 17

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Paper • 2501.04519 • Published Jan 8 • 285

LLM4SR: A Survey on Large Language Models for Scientific Research

Paper • 2501.04306 • Published Jan 8 • 36

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

Paper • 2501.04682 • Published Jan 8 • 99

Multi-task retriever fine-tuning for domain-specific and efficient RAG

Paper • 2501.04652 • Published Jan 8 • 10

Enhancing Human-Like Responses in Large Language Models

Paper • 2501.05032 • Published Jan 9 • 57

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10 • 65

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Paper • 2501.05707 • Published Jan 10 • 20

Demystifying Domain-adaptive Post-training for Financial LLMs

Paper • 2501.04961 • Published Jan 9 • 12

Tensor Product Attention Is All You Need

Paper • 2501.06425 • Published Jan 11 • 88

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Paper • 2501.06282 • Published Jan 10 • 53

UnCommon Objects in 3D

Paper • 2501.07574 • Published Jan 13 • 13

MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper • 2501.08313 • Published Jan 14 • 297

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Paper • 2501.08828 • Published Jan 15 • 31

CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Paper • 2501.08983 • Published Jan 15 • 20

SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces

Paper • 2501.09756 • Published Jan 16 • 19

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Paper • 2501.09751 • Published Jan 16 • 48

Evolving Deeper LLM Thinking

Paper • 2501.09891 • Published Jan 17 • 115

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Paper • 2501.10120 • Published Jan 17 • 51

X-Dyna: Expressive Dynamic Human Image Animation

Paper • 2501.10021 • Published Jan 17 • 15

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Paper • 2501.11733 • Published Jan 20 • 28

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

Paper • 2501.12273 • Published Jan 21 • 14

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published Jan 22 • 418

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Paper • 2501.10893 • Published Jan 18 • 26

GSTAR: Gaussian Surface Tracking and Reconstruction

Paper • 2501.10283 • Published Jan 17 • 5

Chain-of-Retrieval Augmented Generation

Paper • 2501.14342 • Published Jan 24 • 58

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Paper • 2411.19458 • Published Nov 29, 2024 • 6

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Paper • 2501.18512 • Published Jan 30 • 30

Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Paper • 2501.18804 • Published Jan 30 • 5

Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

Paper • 2502.03544 • Published Feb 5 • 43

PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback

Paper • 2502.00988 • Published Feb 3 • 6

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 242

Demystifying Long Chain-of-Thought Reasoning in LLMs

Paper • 2502.03373 • Published Feb 5 • 58

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Paper • 2502.01105 • Published Feb 3 • 20

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Paper • 2502.01506 • Published Feb 3 • 38

Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation

Paper • 2502.00306 • Published Feb 1 • 5

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Paper • 2502.02589 • Published Feb 4 • 10

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 38

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Paper • 2502.00299 • Published Feb 1 • 2

FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Paper • 2502.01068 • Published Feb 3 • 18

Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Paper • 2501.19054 • Published Jan 31 • 10

SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model

Paper • 2501.18636 • Published Jan 28 • 32

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

Paper • 2502.01142 • Published Feb 3 • 24

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Paper • 2502.05003 • Published Feb 7 • 43

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference

Paper • 2502.04416 • Published Feb 6 • 12

Continuous 3D Perception Model with Persistent State

Paper • 2501.12387 • Published Jan 21 • 4

CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Paper • 2502.04350 • Published Feb 4 • 11

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Paper • 2502.06703 • Published Feb 10 • 153

CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing

Paper • 2502.03997 • Published Feb 6 • 9

LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!

Paper • 2502.07374 • Published Feb 11 • 40

Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey

Paper • 2502.06872 • Published Feb 8 • 8

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Paper • 2502.08047 • Published Feb 12 • 28

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Paper • 2502.07346 • Published Feb 11 • 54

Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Paper • 2502.08127 • Published Feb 12 • 58

NatureLM: Deciphering the Language of Nature for Scientific Discovery

Paper • 2502.07527 • Published Feb 11 • 20

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published Feb 11 • 29

Retrieval-augmented Large Language Models for Financial Time Series Forecasting

Paper • 2502.05878 • Published Feb 9 • 42

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection

Paper • 2502.05761 • Published Feb 9 • 7

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Paper • 2502.08946 • Published Feb 13 • 194

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Paper • 2502.06608 • Published Feb 10 • 40

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Paper • 2502.08910 • Published Feb 13 • 148

RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Paper • 2502.12513 • Published Feb 18 • 16

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Paper • 2502.14499 • Published Feb 20 • 192

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published Feb 20 • 149

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Paper • 2502.14739 • Published Feb 20 • 104

Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 206

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Paper • 2502.13128 • Published Feb 18 • 41

NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation

Paper • 2502.12638 • Published Feb 18 • 8

Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval

Paper • 2502.13369 • Published Feb 19 • 2

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Paper • 2502.14768 • Published Feb 20 • 47

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Paper • 2502.14834 • Published Feb 20 • 24

S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Paper • 2502.12853 • Published Feb 18 • 29

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Paper • 2502.14846 • Published Feb 20 • 14

NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Paper • 2502.14638 • Published Feb 20 • 11

Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework

Paper • 2502.13759 • Published Feb 19 • 4

InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

Paper • 2502.11573 • Published Feb 17 • 9

Thus Spake Long-Context Large Language Model

Paper • 2502.17129 • Published Feb 24 • 72

Slamming: Training a Speech Language Model on One GPU in a Day

Paper • 2502.15814 • Published Feb 19 • 69

Audio-FLAN: A Preliminary Release

Paper • 2502.16584 • Published Feb 23 • 36

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Paper • 2502.15007 • Published Feb 20 • 174

Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Paper • 2502.19204 • Published Feb 26 • 11

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

Paper • 2502.16776 • Published Feb 24 • 6

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Paper • 2502.18137 • Published Feb 25 • 57

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Paper • 2502.17055 • Published Feb 24 • 20

Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge

Paper • 2502.16457 • Published Feb 23 • 11

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Paper • 2502.17262 • Published Feb 24 • 22

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Paper • 2502.18017 • Published Feb 25 • 21

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Paper • 2502.20969 • Published Feb 28 • 11

Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

Paper • 2503.00865 • Published Mar 2 • 64

MPO: Boosting LLM Agents with Meta Plan Optimization

Paper • 2503.02682 • Published Mar 4 • 27

Wikipedia in the Era of LLMs: Evolution and Risks

Paper • 2503.02879 • Published Mar 4 • 22

EgoLife: Towards Egocentric Life Assistant

Paper • 2503.03803 • Published Mar 5 • 44

Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7 • 122

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Paper • 2503.03601 • Published Mar 5 • 231

Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Paper • 2503.04973 • Published Mar 6 • 25

EuroBERT: Scaling Multilingual Encoders for European Languages

Paper • 2503.05500 • Published Mar 7 • 79

Transformers without Normalization

Paper • 2503.10622 • Published Mar 13 • 169

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Paper • 2503.09573 • Published Mar 12 • 73

Motion Anything: Any to Motion Generation

Paper • 2503.06955 • Published Mar 10 • 35

Monte Carlo Diffusion for Generalizable Learning-Based RANSAC

Paper • 2503.09410 • Published Mar 12 • 1

AI-native Memory 2.0: Second Me

Paper • 2503.08102 • Published Mar 11 • 13

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

Paper • 2503.05860 • Published Mar 7 • 11

LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Paper • 2503.08619 • Published Mar 11 • 20

Gemini Embedding: Generalizable Embeddings from Gemini

Paper • 2503.07891 • Published Mar 10 • 43

SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models

Paper • 2503.07605 • Published Mar 10 • 68

TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

Paper • 2503.04872 • Published Mar 6 • 15

RWKV-7 "Goose" with Expressive Dynamic State Evolution

Paper • 2503.14456 • Published Mar 18 • 152

Impossible Videos

Paper • 2503.14378 • Published Mar 18 • 61

Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation

Paper • 2503.13424 • Published Mar 17 • 30

FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis

Paper • 2503.13265 • Published Mar 17 • 15

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Paper • 2503.13111 • Published Mar 17 • 7

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Paper • 2503.13434 • Published Mar 17 • 27

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Paper • 2503.12937 • Published Mar 17 • 30

Personalize Anything for Free with Diffusion Transformer

Paper • 2503.12590 • Published Mar 16 • 44

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Paper • 2503.12885 • Published Mar 17 • 43

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Paper • 2503.11647 • Published Mar 14 • 145

PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Paper • 2503.07677 • Published Mar 10 • 86

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Paper • 2503.11576 • Published Mar 14 • 117

VGGT: Visual Geometry Grounded Transformer

Paper • 2503.11651 • Published Mar 14 • 32

TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing

Paper • 2503.11629 • Published Mar 14 • 6

MaRI: Material Retrieval Integration across Domains

Paper • 2503.08111 • Published Mar 11 • 7

SynCity: Training-Free Generation of 3D Worlds

Paper • 2503.16420 • Published Mar 20 • 27

M3: 3D-Spatial MultiModal Memory

Paper • 2503.16413 • Published Mar 20 • 15

Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Paper • 2503.16252 • Published Mar 20 • 29

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Paper • 2503.16375 • Published Mar 20 • 10

Sonata: Self-Supervised Learning of Reliable Point Representations

Paper • 2503.16429 • Published Mar 20 • 12

V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Paper • 2503.17422 • Published Mar 21 • 6

Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering

Paper • 2503.15879 • Published Mar 20 • 6

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Paper • 2503.21460 • Published Mar 27 • 83

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

Paper • 2503.21749 • Published Mar 27 • 26

ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Paper • 2503.21729 • Published Mar 27 • 29

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Paper • 2503.21758 • Published Mar 27 • 22

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Paper • 2503.20672 • Published Mar 26 • 14

MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

Paper • 2503.20757 • Published Mar 26 • 11

OpenCity3D: What do Vision-Language Models know about Urban Environments?

Paper • 2503.16776 • Published Mar 21 • 3

Aether: Geometric-Aware Unified World Modeling

Paper • 2503.18945 • Published Mar 24 • 28

Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models

Paper • 2503.17811 • Published Mar 22 • 13

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

Paper • 2503.21732 • Published Mar 27 • 9

Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Paper • 2503.24377 • Published Mar 31 • 18

TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Paper • 2503.23461 • Published Mar 30 • 94

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Paper • 2503.22236 • Published Mar 28 • 11

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Paper • 2503.20785 • Published Mar 26 • 22

AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Paper • 2503.19693 • Published Mar 25 • 76

MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs

Paper • 2503.23022 • Published Mar 29 • 6

Query and Conquer: Execution-Guided SQL Generation

Paper • 2503.24364 • Published Mar 31 • 17

Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL

Paper • 2503.23157 • Published Mar 29 • 10

Scene-Centric Unsupervised Panoptic Segmentation

Paper • 2504.01955 • Published Apr 2 • 4

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31 • 300

OmniSVG: A Unified Scalable Vector Graphics Generation Model

Paper • 2504.06263 • Published Apr 8 • 179

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Paper • 2504.06261 • Published Apr 8 • 110

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Paper • 2504.05897 • Published Apr 8 • 19

URECA: Unique Region Caption Anything

Paper • 2504.05305 • Published Apr 7 • 35

Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Paper • 2504.03193 • Published Apr 4 • 4

Agentic Knowledgeable Self-awareness

Paper • 2504.03553 • Published Apr 4 • 27

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published Apr 4 • 18

agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11 • 1.57k • 679

SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

Paper • 2504.08600 • Published Apr 11 • 31

Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Paper • 2504.05262 • Published Apr 7 • 11

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Paper • 2504.08727 • Published Apr 11 • 11

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Paper • 2504.07615 • Published Apr 10 • 33

Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10 • 132

Towards Learning to Complete Anything in Lidar

Paper • 2504.12264 • Published Apr 16 • 9

BlockGaussian: Efficient Large-Scale Scene Novel View Synthesis via Adaptive Block-Based Gaussian Splatting

Paper • 2504.09048 • Published Apr 12 • 7

DataDecide: How to Predict Best Pretraining Data with Small Experiments

Paper • 2504.11393 • Published Apr 15 • 18

Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion

Paper • 2504.11447 • Published Apr 15 • 4

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference

Paper • 2504.10326 • Published Apr 14 • 25

A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

Paper • 2504.12322 • Published Apr 11 • 28

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Paper • 2504.13180 • Published Apr 17 • 18

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Paper • 2504.11651 • Published Apr 15 • 31

Retrieval-Augmented Generation with Conflicting Evidence

Paper • 2504.13079 • Published Apr 17 • 6

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

Paper • 2504.13835 • Published Apr 18 • 38

NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes

Paper • 2504.11544 • Published Apr 15 • 43

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Paper • 2504.13157 • Published Apr 17 • 20

IberBench: LLM Evaluation on Iberian Languages

Paper • 2504.16921 • Published Apr 23 • 8

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Paper • 2504.17192 • Published Apr 24 • 120

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Paper • 2504.17432 • Published Apr 24 • 39

Large Language Models can accomplish Business Process Management Tasks

Paper • 2307.09923 • Published Jul 19, 2023

DataLab: A Unified Platform for LLM-Powered Business Intelligence

Paper • 2412.02205 • Published Dec 3, 2024

Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

Paper • 2407.13517 • Published Jul 18, 2024

Layered Image Vectorization via Semantic Simplification

Paper • 2406.05404 • Published Jun 8, 2024 • 3

VectorMapNet: End-to-end Vectorized HD Map Learning

Paper • 2206.08920 • Published Jun 17, 2022

RETVec: Resilient and Efficient Text Vectorizer

Paper • 2302.09207 • Published Feb 18, 2023 • 3

MARVEL: Raster Manga Vectorization via Primitive-wise Deep Reinforcement Learning

Paper • 2110.04830 • Published Oct 10, 2021

Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use

Paper • 2405.20245 • Published May 30, 2024 • 1

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Paper • 2504.20734 • Published Apr 29 • 61

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Paper • 2505.04601 • Published May 7 • 28

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Paper • 2505.03318 • Published May 6 • 93

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

Paper • 2505.02922 • Published May 5 • 28

Geospatial Mechanistic Interpretability of Large Language Models

Paper • 2505.03368 • Published May 6 • 10

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Paper • 2505.01658 • Published May 3 • 39

Improving Editability in Image Generation with Layer-wise Memory

Paper • 2505.01079 • Published May 2 • 29

Real-World Gaps in AI Governance Research

Paper • 2505.00174 • Published Apr 30 • 12

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Paper • 2505.05288 • Published May 8 • 14

LiftFeat: 3D Geometry-Aware Local Feature Matching

Paper • 2505.03422 • Published May 6 • 8

Generating Physically Stable and Buildable LEGO Designs from Text

Paper • 2505.05469 • Published May 8 • 28

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Paper • 2505.07747 • Published May 12 • 60

Seed1.5-VL Technical Report

Paper • 2505.07062 • Published May 11 • 150

LLM-Independent Adaptive RAG: Let the Question Speak for Itself

Paper • 2505.04253 • Published May 7 • 13

Depth Anything with Any Prior

Paper • 2505.10565 • Published May 15 • 12

AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

Paper • 2505.09926 • Published May 15 • 6

MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning

Paper • 2505.09265 • Published May 14 • 4

NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

Paper • 2505.16938 • Published May 22 • 120

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Paper • 2505.15277 • Published May 21 • 104

MMaDA: Multimodal Large Diffusion Language Models

Paper • 2505.15809 • Published May 21 • 95

VeriThinker: Learning to Verify Makes Reasoning Model Efficient

Paper • 2505.17941 • Published May 23 • 25

Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Paper • 2505.18675 • Published May 24 • 25

ARM: Adaptive Reasoning Model

Paper • 2505.20258 • Published May 26 • 45

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Paper • 2505.17412 • Published May 23 • 20

FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

Paper • 2505.17399 • Published May 23 • 14

Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Paper • 2505.15692 • Published May 21 • 14

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Paper • 2505.15389 • Published May 21 • 8

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Paper • 2505.19147 • Published May 25 • 144

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Paper • 2505.23747 • Published May 29 • 68

Table-R1: Inference-Time Scaling for Table Reasoning

Paper • 2505.23621 • Published May 29 • 94

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Paper • 2505.23762 • Published May 29 • 45

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Paper • 2505.23604 • Published May 29 • 23

AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views

Paper • 2505.23716 • Published May 29 • 31

cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Paper • 2505.22914 • Published May 28 • 36

LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Paper • 2505.23758 • Published May 29 • 22

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Paper • 2505.22618 • Published May 28 • 42

ChartLens: Fine-grained Visual Attribution in Charts

Paper • 2505.19360 • Published May 25 • 3

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Paper • 2505.23764 • Published May 29 • 3

Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Paper • 2404.14351 • Published Apr 22, 2024 • 6

SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Paper • 2403.13064 • Published Mar 19, 2024 • 31

DreamText: High Fidelity Scene Text Synthesis

Paper • 2405.14701 • Published May 23, 2024 • 1

Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immersive Exploration

Paper • 2504.00387 • Published Apr 1

WonderWorld: Interactive 3D Scene Generation from a Single Image

Paper • 2406.09394 • Published Jun 13, 2024 • 3

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Paper • 2311.01767 • Published Nov 3, 2023 • 21

Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

Paper • 2505.24782 • Published May 30 • 1

EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Paper • 2505.24417 • Published May 30 • 13

UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

Paper • 2505.24521 • Published May 30 • 15

ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

Paper • 2505.24862 • Published May 30 • 30

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Paper • 2506.01853 • Published Jun 2 • 31

EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

Paper • 2506.01667 • Published Jun 2 • 21

CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing

Paper • 2506.00530 • Published May 31 • 2

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper • 2506.03147 • Published Jun 3 • 58

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Paper • 2506.03135 • Published Jun 3 • 38

SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL

Paper • 2506.00391 • Published May 31 • 9

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Paper • 2505.16968 • Published May 22 • 41

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Paper • 2506.04180 • Published Jun 4 • 33

OpenThoughts: Data Recipes for Reasoning Models

Paper • 2506.04178 • Published Jun 4 • 48

VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

Paper • 2506.03930 • Published Jun 4 • 26

SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Paper • 2506.03139 • Published Jun 3 • 17

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Paper • 2506.05176 • Published Jun 5 • 74

BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

Paper • 2506.02587 • Published Jun 3 • 3

ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Paper • 2506.05010 • Published Jun 5 • 79

Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor

Paper • 2506.07932 • Published Jun 9 • 12

ECoRAG: Evidentiality-guided Compression for Long Context RAG

Paper • 2506.05167 • Published Jun 5 • 9

UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting

Paper • 2506.09952 • Published Jun 11 • 6

CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation

Paper • 2506.10890 • Published Jun 12 • 9

Text-Aware Image Restoration with Diffusion Models

Paper • 2506.09993 • Published Jun 11 • 41

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Paper • 2506.13585 • Published Jun 16 • 267

DoTA-RAG: Dynamic of Thought Aggregation RAG

Paper • 2506.12571 • Published Jun 14 • 50

TaskCraft: Automated Generation of Agentic Tasks

Paper • 2506.10055 • Published Jun 11 • 31

3D Arena: An Open Platform for Generative 3D Evaluation

Paper • 2506.18787 • Published Jun 23 • 12

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Paper • 2506.18841 • Published Jun 23 • 56

Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights

Paper • 2506.16406 • Published Jun 19 • 126

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Paper • 2506.19794 • Published Jun 24 • 8

ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Paper • 2506.18095 • Published Jun 22 • 65

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Paper • 2506.19290 • Published Jun 24 • 52

SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Paper • 2506.18951 • Published Jun 23 • 21

GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding

Paper • 2310.14478 • Published Oct 23, 2023 • 2

Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

Paper • 2503.12843 • Published Mar 17

Foundation Models for Generalist Geospatial Artificial Intelligence

Paper • 2310.18660 • Published Oct 28, 2023 • 11

From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios

Paper • 2506.20279 • Published Jun 25 • 19

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Paper • 2506.21656 • Published Jun 26 • 14

SAM4D: Segment Anything in Camera and LiDAR Streams

Paper • 2506.21547 • Published Jun 26 • 17

Depth Anything at Any Condition

Paper • 2507.01634 • Published Jul 2 • 52

MMSearch-R1: Incentivizing LMMs to Search

Paper • 2506.20670 • Published Jun 25 • 64

SIO-Mapper: A Framework for Lane-Level HD Map Construction Using Satellite Images and OpenStreetMap with No On-Site Visits

Paper • 2504.09882 • Published Apr 14

OneRestore: A Universal Restoration Framework for Composite Degradation

Paper • 2407.04621 • Published Jul 5, 2024 • 1

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Paper • 2507.01352 • Published Jul 2 • 54

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Paper • 2507.02694 • Published Jul 3 • 19

Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

Paper • 2506.22813 • Published Jun 28 • 7

PyVision: Agentic Vision with Dynamic Tooling

Paper • 2507.07998 • Published Jul 10 • 31

MemOS: A Memory OS for AI System

Paper • 2507.03724 • Published Jul 4 • 153

OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Paper • 2507.06165 • Published Jul 8 • 57

4KAgent: Agentic Any Image to 4K Super-Resolution

Paper • 2507.07105 • Published Jul 9 • 104

Perception-Aware Policy Optimization for Multimodal Reasoning

Paper • 2507.06448 • Published Jul 8 • 47

Rethinking Verification for LLM Code Generation: From Generation to Testing

Paper • 2507.06920 • Published Jul 9 • 28

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Paper • 2507.05964 • Published Jul 8 • 118

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

Paper • 2507.11407 • Published Jul 15 • 57

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Paper • 2507.04127 • Published Jul 5 • 7

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Paper • 2507.09862 • Published Jul 14 • 49

LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

Paper • 2507.04404 • Published Jul 6 • 21

Subject-Consistent and Pose-Diverse Text-to-Image Generation

Paper • 2507.08396 • Published Jul 11 • 15

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Paper • 2507.05255 • Published Jul 7 • 74

KV Cache Steering for Inducing Reasoning in Small Language Models

Paper • 2507.08799 • Published Jul 11 • 40

π^3: Scalable Permutation-Equivariant Visual Geometry Learning

Paper • 2507.13347 • Published Jul 17 • 64

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Paper • 2507.13348 • Published Jul 17 • 75

Voxtral

Paper • 2507.13264 • Published Jul 17 • 28

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Paper • 2507.11527 • Published Jul 15 • 31

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Paper • 2507.09477 • Published Jul 13 • 84

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Paper • 2507.12806 • Published Jul 17 • 20

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Paper • 2507.16815 • Published Jul 22 • 39

Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model

Paper • 2507.11465 • Published Jul 15 • 16

DesignLab: Designing Slides Through Iterative Detection and Correction

Paper • 2507.17202 • Published Jul 23 • 50

EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

Paper • 2507.16535 • Published Jul 22 • 20

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Paper • 2507.01955 • Published Jul 2 • 35

Indoor Scene Generation from a Collection of Semantic-Segmented Depth Images

Paper • 2108.09022 • Published Aug 20, 2021

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Paper • 2507.22827 • Published Jul 30 • 98

Dens3R: A Foundation Model for 3D Geometry Prediction

Paper • 2507.16290 • Published Jul 22 • 8

Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models

Paper • 2508.02120 • Published Aug 4 • 19

DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Paper • 2508.05405 • Published Aug 7 • 64

Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

Paper • 2508.03644 • Published Aug 5 • 25

Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

Paper • 2508.03404 • Published Aug 5 • 3

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Paper • 2508.07981 • Published Aug 11 • 58

VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Paper • 2508.07493 • Published Aug 10 • 8

VertexRegen: Mesh Generation with Continuous Level of Detail

Paper • 2508.09062 • Published Aug 12 • 36

HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches

Paper • 2508.08088 • Published Aug 11 • 29

Matrix-3D: Omnidirectional Explorable 3D World Generation

Paper • 2508.08086 • Published Aug 11 • 74

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Paper • 2508.05748 • Published Aug 7 • 135

TexVerse: A Universe of 3D Objects with High-Resolution Textures

Paper • 2508.10868 • Published Aug 14 • 16

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Paper • 2508.10419 • Published Aug 14 • 73

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

Paper • 2508.11383 • Published Aug 15 • 39

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

Paper • 2508.14879 • Published Aug 20 • 65

MV-RAG: Retrieval Augmented Multiview Diffusion

Paper • 2508.16577 • Published Aug 22 • 37

SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Paper • 2508.15769 • Published Aug 21 • 19

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22 • 150

End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Paper • 2508.15746 • Published Aug 21 • 13

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Paper • 2509.03867 • Published Sep 4 • 207

Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings

Paper • 2508.18733 • Published Aug 26 • 8

NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings

Paper • 2509.04011 • Published Sep 4 • 28

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Paper • 2509.06888 • Published Sep 8 • 12

HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

Paper • 2509.09713 • Published Sep 8 • 24

GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction

Paper • 2509.18090 • Published 21 days ago • 2

AutoIntent: AutoML for Text Classification

Paper • 2509.21138 • Published 18 days ago • 29

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Paper • 2509.20414 • Published 19 days ago • 8

CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization

Paper • 2509.21150 • Published 18 days ago • 3

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Paper • 2509.22414 • Published 17 days ago • 21

Democratizing AI scientists using ToolUniverse

Paper • 2509.23426 • Published 16 days ago • 38

SparseD: Sparse Attention for Diffusion Language Models

Paper • 2509.24014 • Published 15 days ago • 30

VGGT-X: When VGGT Meets Dense Novel View Synthesis

Paper • 2509.25191 • Published 14 days ago • 18

LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals

Paper • 2509.21875 • Published 18 days ago • 9

The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

Paper • 2509.26507 • Published 13 days ago • 458

Reinforcement Learning on Pre-Training Data

Paper • 2509.19249 • Published 20 days ago • 65

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Paper • 2510.05592 • Published 7 days ago • 82

HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

Paper • 2510.00880 • Published 12 days ago

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Paper • 2510.06308 • Published 6 days ago • 47

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Paper • 2510.03215 • Published 10 days ago • 84

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Paper • 2510.06590 • Published 6 days ago • 67

Artificial Hippocampus Networks for Efficient Long-Context Modeling

Paper • 2510.07318 • Published 5 days ago • 23

When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Paper • 2510.07499 • Published 5 days ago • 42

Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

Paper • 2508.09755 • Published Aug 13