MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders Paper • 2410.06845 • Published Oct 9 • 5
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist Paper • 2407.08733 • Published Jul 11 • 20
PromptBench: A Unified Library for Evaluation of Large Language Models Paper • 2312.07910 • Published Dec 13, 2023 • 15
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation Paper • 2312.07424 • Published Dec 12, 2023 • 7
A Survey on Evaluation of Large Language Models Paper • 2307.03109 • Published Jul 6, 2023 • 42
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization Paper • 2306.05087 • Published Jun 8, 2023 • 6
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts Paper • 2306.04528 • Published Jun 7, 2023 • 3