Evals & Monitoring
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Paper • 2303.16634 • Published
miracl/miracl-corpus
Dataset • Viewer
Note: https://github.com/project-miracl/miracl?tab=readme-ov-file • MTEB: https://github.com/embeddings-benchmark/mteb
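A minimal sketch of running an MTEB evaluation over an embedding model; the model name and task are illustrative, and recent MTEB releases also include MIRACL-based multilingual retrieval tasks:

```python
# Minimal sketch: evaluating an embedding model with MTEB.
# The model and task names are illustrative, not prescriptive.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results")
```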
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Paper • 2306.05685 • Published
How is ChatGPT's behavior changing over time?
Paper • 2307.09009 • Published
Evaluating Large Language Models: A Comprehensive Survey
Paper • 2310.19736 • Published
Instruction-Following Evaluation for Large Language Models
Paper • 2311.07911 • Published
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Paper • 2303.08896 • Published
Landmark Attention: Random-Access Infinite Context Length for Transformers
Paper • 2305.16300 • Published
Note: Original "needle in a haystack" test for long-context input (passkey retrieval). More: https://arxiv.org/pdf/2402.13753.pdf
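A minimal sketch of a passkey-retrieval probe in the spirit of this test: bury a random passkey at a chosen depth in filler text and check whether the model can recall it. The filler text, prompt wording, and `query_model` call are illustrative assumptions:

```python
# Minimal sketch of a passkey-retrieval ("needle in a haystack") probe.
# The model call is a hypothetical stand-in; plug in any LLM client.
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_prompt(n_filler: int, depth: float, passkey: int) -> str:
    """Bury the passkey at a relative depth (0.0 = start, 1.0 = end)."""
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    chunks = [FILLER] * n_filler
    chunks.insert(int(depth * n_filler), needle)
    return "".join(chunks) + "\nWhat is the pass key? The pass key is"

def passkey_retrieved(model_output: str, passkey: int) -> bool:
    return str(passkey) in model_output

if __name__ == "__main__":
    passkey = random.randint(10000, 99999)
    prompt = build_passkey_prompt(n_filler=400, depth=0.5, passkey=passkey)
    # response = query_model(prompt)                 # hypothetical LLM call
    # print(passkey_retrieved(response, passkey))
    print(f"Prompt length: {len(prompt)} chars; passkey = {passkey}")
```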
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
Paper • 2402.03744 • Published
Note: The decoder hyperparameters (temperature, top-k, top-p) determine the diversity of the generations. The paper's sensitivity analysis (Figure 4) shows that performance is strongly influenced by temperature but shows little sensitivity to top-k; the consistency-based methods (EigenScore and Lexical Similarity) drop significantly when the temperature is greater than 1.
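A minimal sketch of the Lexical-Similarity-style consistency baseline the note mentions (not INSIDE's EigenScore, which operates on internal hidden states): sample the same prompt several times, then measure lexical agreement across samples; low agreement suggests hallucination. The sampling call is a hypothetical stand-in:

```python
# Minimal sketch of a consistency-based hallucination score: mean pairwise
# lexical similarity across repeated samples of the same prompt.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def lexical_consistency(samples: list[str]) -> float:
    """Mean pairwise Jaccard similarity across sampled generations."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / max(1, len(pairs))

# samples = [query_model(prompt, temperature=0.7) for _ in range(5)]  # hypothetical
samples = ["Paris is the capital of France.",
           "The capital of France is Paris.",
           "France's capital city is Lyon."]
print(f"consistency = {lexical_consistency(samples):.2f}")  # lower -> more suspect
```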
Chainpoll: A high efficacy method for LLM hallucination detection
Paper • 2310.18344 • Published
Note: As a substitute for G-Eval.
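ChainPoll polls a judge LLM several times with a chain-of-thought prompt and averages the boolean verdicts. A minimal sketch; the prompt wording and `query_judge` call are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of ChainPoll-style scoring: poll a judge LLM k times and
# take the fraction of polls that flag a hallucination.
JUDGE_PROMPT = """Does the following answer contain hallucinations, i.e.
claims not supported by the context? Think step by step, then finish with
a single line: VERDICT: yes or VERDICT: no.

Context: {context}
Answer: {answer}"""

def parse_verdict(judge_output: str) -> bool:
    return "verdict: yes" in judge_output.lower()

def chainpoll_score(context: str, answer: str, k: int = 5) -> float:
    """Fraction of k judge polls that flag a hallucination."""
    votes = []
    for _ in range(k):
        # out = query_judge(JUDGE_PROMPT.format(context=context, answer=answer),
        #                   temperature=0.7)   # hypothetical judge LLM call
        out = "VERDICT: no"                    # stub so the sketch runs
        votes.append(parse_verdict(out))
    return sum(votes) / k

print(chainpoll_score("Paris is in France.", "Paris is in Germany."))  # stub -> 0.0
```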
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
Paper • 2305.13711 • Published
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Paper • 2012.07421 • Published
Extending the WILDS Benchmark for Unsupervised Adaptation
Paper • 2112.05090 • Published
vectara/hallucination_evaluation_model
Model • Text Classification
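A minimal sketch of scoring factual consistency with this model, assuming it loads as a standard cross-encoder text-classification checkpoint; newer HHEM revisions may require `trust_remote_code` and a different calling convention, so check the model card:

```python
# Minimal sketch: scoring source/summary factual consistency with
# Vectara's HHEM model, assuming standard cross-encoder behavior.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")
pairs = [
    ("The plane landed safely in Denver.",             # source
     "The plane landed safely in Denver, Colorado."),  # summary to check
]
scores = model.predict(pairs)  # ~1.0 = consistent, ~0.0 = hallucinated
print(scores)
```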
MMMU/MMMU
Dataset • Viewer
HuggingFaceH4/mt_bench_prompts
Dataset • Viewer
HHEM Leaderboard
Space
Transparency Self Assessment (FMTI)
Space
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published
TRUE: Re-evaluating Factual Consistency Evaluation
Paper • 2204.04991 • Published
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper • 2402.17753 • Published
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper • 2405.01535 • Published
TIGER-Lab/MMLU-Pro
Dataset • Viewer
patched-codes/static-analysis-eval
Dataset • Viewer
nvidia/ChatRAG-Bench
Dataset • Viewer
AI2 WildBench Leaderboard (V2)
Space
allenai/WildBench
Dataset • Viewer