Evals & Monitoring
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Paper • 2303.16634 • Published
miracl/miracl-corpus
Dataset • Viewer
Note: https://github.com/project-miracl/miracl?tab=readme-ov-file • MTEB: https://github.com/embeddings-benchmark/mteb
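A minimal sketch of running an MTEB evaluation over an embedding model; the model name and task are illustrative, and recent MTEB releases also include MIRACL-based multilingual retrieval tasks:

```python
# Minimal sketch: evaluating an embedding model with MTEB.
# The model and task names are illustrative, not prescriptive.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results")
```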
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Paper • 2306.05685 • Published
How is ChatGPT's behavior changing over time?
Paper • 2307.09009 • Published
Evaluating Large Language Models: A Comprehensive Survey
Paper • 2310.19736 • Published
Instruction-Following Evaluation for Large Language Models
Paper • 2311.07911 • Published
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Paper • 2303.08896 • Published
Landmark Attention: Random-Access Infinite Context Length for Transformers
Paper • 2305.16300 • Published
Note: Original "needle in a haystack" test for long-context input (passkey retrieval). More: https://arxiv.org/pdf/2402.13753.pdf
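A minimal sketch of a passkey-retrieval probe in the spirit of this test: bury a random passkey at a chosen depth in filler text and check whether the model can recall it. The filler text, prompt wording, and `query_model` call are illustrative assumptions:

```python
# Minimal sketch of a passkey-retrieval ("needle in a haystack") probe.
# The model call is a hypothetical stand-in; plug in any LLM client.
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_prompt(n_filler: int, depth: float, passkey: int) -> str:
    """Bury the passkey at a relative depth (0.0 = start, 1.0 = end)."""
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    chunks = [FILLER] * n_filler
    chunks.insert(int(depth * n_filler), needle)
    return "".join(chunks) + "\nWhat is the pass key? The pass key is"

def passkey_retrieved(model_output: str, passkey: int) -> bool:
    return str(passkey) in model_output

if __name__ == "__main__":
    passkey = random.randint(10000, 99999)
    prompt = build_passkey_prompt(n_filler=400, depth=0.5, passkey=passkey)
    # response = query_model(prompt)                 # hypothetical LLM call
    # print(passkey_retrieved(response, passkey))
    print(f"Prompt length: {len(prompt)} chars; passkey = {passkey}")
```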
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
Paper • 2402.03744 • Published
Note: The decoder hyperparameters (temperature, top-k, top-p) determine the diversity of the generations. The paper's sensitivity analysis (Figure 4) shows that performance is strongly influenced by temperature but shows little sensitivity to top-k; the consistency-based methods (EigenScore and Lexical Similarity) drop significantly when the temperature is greater than 1.
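A minimal sketch of the Lexical-Similarity-style consistency baseline the note mentions (not INSIDE's EigenScore, which operates on internal hidden states): sample the same prompt several times, then measure lexical agreement across samples; low agreement suggests hallucination. The sampling call is a hypothetical stand-in:

```python
# Minimal sketch of a consistency-based hallucination score: mean pairwise
# lexical similarity across repeated samples of the same prompt.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def lexical_consistency(samples: list[str]) -> float:
    """Mean pairwise Jaccard similarity across sampled generations."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / max(1, len(pairs))

# samples = [query_model(prompt, temperature=0.7) for _ in range(5)]  # hypothetical
samples = ["Paris is the capital of France.",
           "The capital of France is Paris.",
           "France's capital city is Lyon."]
print(f"consistency = {lexical_consistency(samples):.2f}")  # lower -> more suspect
```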
Chainpoll: A high efficacy method for LLM hallucination detection
Paper • 2310.18344 • Published
Note: As a substitute for G-Eval.
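ChainPoll polls a judge LLM several times with a chain-of-thought prompt and averages the boolean verdicts. A minimal sketch; the prompt wording and `query_judge` call are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of ChainPoll-style scoring: poll a judge LLM k times and
# take the fraction of polls that flag a hallucination.
JUDGE_PROMPT = """Does the following answer contain hallucinations, i.e.
claims not supported by the context? Think step by step, then finish with
a single line: VERDICT: yes or VERDICT: no.

Context: {context}
Answer: {answer}"""

def parse_verdict(judge_output: str) -> bool:
    return "verdict: yes" in judge_output.lower()

def chainpoll_score(context: str, answer: str, k: int = 5) -> float:
    """Fraction of k judge polls that flag a hallucination."""
    votes = []
    for _ in range(k):
        # out = query_judge(JUDGE_PROMPT.format(context=context, answer=answer),
        #                   temperature=0.7)   # hypothetical judge LLM call
        out = "VERDICT: no"                    # stub so the sketch runs
        votes.append(parse_verdict(out))
    return sum(votes) / k

print(chainpoll_score("Paris is in France.", "Paris is in Germany."))  # stub -> 0.0
```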
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
Paper • 2305.13711 • Published
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Paper • 2012.07421 • Published
Extending the WILDS Benchmark for Unsupervised Adaptation
Paper • 2112.05090 • Published
vectara/hallucination_evaluation_model
Model • Text Classification
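A minimal sketch of scoring factual consistency with this model, assuming it loads as a standard cross-encoder text-classification checkpoint; newer HHEM revisions may require `trust_remote_code` and a different calling convention, so check the model card:

```python
# Minimal sketch: scoring source/summary factual consistency with
# Vectara's HHEM model, assuming standard cross-encoder behavior.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")
pairs = [
    ("The plane landed safely in Denver.",             # source
     "The plane landed safely in Denver, Colorado."),  # summary to check
]
scores = model.predict(pairs)  # ~1.0 = consistent, ~0.0 = hallucinated
print(scores)
```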
MMMU/MMMU
Dataset • Viewer
HuggingFaceH4/mt_bench_prompts
Dataset • Viewer
HHEM Leaderboard
Space
Transparency Self Assessment (FMTI)
Space
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published
TRUE: Re-evaluating Factual Consistency Evaluation
Paper • 2204.04991 • Published
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper • 2402.17753 • Published
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper • 2405.01535 • Published
TIGER-Lab/MMLU-Pro
Dataset • Viewer
patched-codes/static-analysis-eval
Dataset • Viewer
nvidia/ChatRAG-Bench
Dataset • Viewer
AI2 WildBench Leaderboard (V2)
Space
allenai/WildBench
Dataset • Viewer