A collection of benchmarks and datasets for evaluating LLMs.

Knowledge and Language Understanding

Massive Multitask Language Understanding (MMLU)

Evaluates general knowledge across 57 diverse subjects, from STEM to the social sciences.
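
A minimal sketch of how a single MMLU item can be rendered as a multiple-choice prompt and scored; the Hugging Face `datasets` library and the `cais/mmlu` hub id are assumptions about where the data is fetched from, not part of this list.

```python
# MMLU multiple-choice scoring sketch (assumes the Hugging Face `datasets`
# library and the `cais/mmlu` dataset id; adapt to whichever MMLU copy you use).
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_prompt(item) -> str:
    """Render one MMLU item as a zero-shot multiple-choice question."""
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"]))
    return f"{item['question']}\n{options}\nAnswer with a single letter (A, B, C, or D)."

def is_correct(item, model_output: str) -> bool:
    """Compare the first letter of the model's reply against the gold answer index."""
    predicted = model_output.strip()[:1].upper()
    return predicted == LETTERS[item["answer"]]

if __name__ == "__main__":
    test_set = load_dataset("cais/mmlu", "all", split="test")
    print(format_prompt(test_set[0]))  # inspect one prompt before running a model
```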

AI2 Reasoning Challenge (ARC)

General Language Understanding Evaluation (GLUE)

Natural Questions

Language Modelling Broadened to Account for Discourse Aspects (LAMBADA)

HellaSwag

Multi-Genre Natural Language Inference (MultiNLI)

SuperGLUE

TriviaQA

WinoGrande

SciQ

Reasoning Capabilities

GSM8K

  • Description: A set of 8.5K grade-school math word problems requiring basic to intermediate arithmetic operations.
  • Purpose: Tests an LLM's ability to solve multi-step mathematical problems (a minimal scoring sketch follows below).
  • Relevance: Useful for assessing AI's ability to solve basic mathematical problems, which is especially valuable in educational settings.
  • Paper: Training Verifiers to Solve Math Word Problems
  • Resources:
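
As referenced above, a minimal scoring sketch for GSM8K: gold solutions end with a `#### <answer>` marker, so exact-match scoring compares that value against the last number in the model's output. The `gsm8k`/`main` Hugging Face dataset id is an assumption about where the data is fetched from.

```python
# GSM8K exact-match scoring sketch (assumes the Hugging Face `datasets` library
# and the `gsm8k` dataset id with its `main` config).
import re
from datasets import load_dataset

def gold_answer(answer_field: str) -> str:
    """GSM8K reference solutions end with '#### <final answer>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> str:
    """Take the last number in the model's output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return numbers[-1] if numbers else ""

if __name__ == "__main__":
    sample = load_dataset("gsm8k", "main", split="test")[0]
    print(sample["question"])
    print("gold:", gold_answer(sample["answer"]))
```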

Discrete Reasoning Over Paragraphs (DROP)

Counterfactual Reasoning Assessment (CRASS)

Large-scale ReAding Comprehension Dataset From Examinations (RACE)

Big-Bench Hard (BBH)

AGIEval

BoolQ

Multi-Turn Open-Ended Dialogue

MT-bench

Question Answering in Context (QuAC)

  • Description: Contains 14,000 dialogues and 100,000 question-answer pairs, simulating student-teacher interactions.
  • Purpose: Challenges LLMs with context-dependent, sometimes unanswerable questions (a prompt-construction sketch follows below).
  • Relevance: Useful for conversational AI, educational software, and context-aware information systems.
  • Paper: QuAC: Question Answering in Context
  • Resources:
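
As referenced above, a minimal sketch of how a QuAC-style turn can be presented to a model: the passage plus the running dialogue history, with an explicit instruction for the unanswerable case (QuAC marks these with `CANNOTANSWER`). The passage and questions below are hypothetical placeholders, not items from the dataset.

```python
# QuAC-style prompt construction sketch. Real QuAC items supply a Wikipedia
# section, prior question/answer turns, and "CANNOTANSWER" for unanswerable
# questions; the passage and dialogue below are hypothetical placeholders.
def build_quac_prompt(passage: str, history: list[tuple[str, str]], question: str) -> str:
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return (
        "Answer the question using only the passage. "
        "If the passage does not contain the answer, reply CANNOTANSWER.\n\n"
        f"Passage:\n{passage}\n\n"
        f"Dialogue so far:\n{turns}\n\n"
        f"Q: {question}\nA:"
    )

if __name__ == "__main__":
    passage = "Example Band formed in 1990 and released its debut album in 1992."
    history = [("When did the band form?", "1990")]
    print(build_quac_prompt(passage, history, "Who produced the debut album?"))
```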

Grounding and Abstractive Summarization

Ambient Clinical Intelligence Benchmark (ACI-BENCH)

MAchine Reading COmprehension Dataset (MS-MARCO)

Query-based Multi-domain Meeting Summarization (QMSum)

Physical Interaction: Question Answering (PIQA)

Content Moderation and Narrative Control

ToxiGen

Helpfulness, Honesty, Harmlessness (HHH)

TruthfulQA

Responsible AI (RAI)

Coding Capabilities

CodeXGLUE

HumanEval

Mostly Basic Python Programming (MBPP)

LLM-Based Evaluation

LLM Judge

  • Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  • Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at this https URL.
  • Insights:
    • Use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses. A minimal judge-prompt sketch follows below.
  • Resources:
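
A simplified pairwise judge-prompt sketch in the spirit of the paper; the exact MT-bench judge prompts ship with the authors' FastChat code, and the `openai` client call and GPT-4 model name here are assumptions about your setup.

```python
# Simplified pairwise LLM-as-a-judge sketch (not the verbatim MT-bench judge
# prompt). Assumes the `openai` Python package and a GPT-4-class judge model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two assistant answers
to the user question below, considering helpfulness, relevance, accuracy, depth,
and level of detail. Reply with exactly one of: "A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic verdicts keep comparisons reproducible
    )
    return response.choices[0].message.content.strip()
```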

LLM-Eval

  • Paper: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
  • Abstract: We propose LLM-Eval, a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs). Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods. Our analysis also highlights the importance of choosing suitable LLMs and decoding strategies for accurate evaluation results. LLM-Eval offers a versatile and robust solution for evaluating open-domain conversation systems, streamlining the evaluation process and providing consistent performance across diverse scenarios.
  • Insights:
    • Top-shelf LLMs (e.g., GPT-4, Claude) correlate better with human scores than metric-based evaluation measures. A single-prompt, multi-dimensional scoring sketch follows below.
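
As referenced above, a sketch of the single-call, multi-dimensional scoring idea: one prompt asks for all quality dimensions at once and the reply is parsed as JSON. The dimension names, 0-5 scale, and prompt wording are assumptions in the spirit of the paper, not its verbatim schema.

```python
# LLM-Eval-style single-prompt, multi-dimensional scoring sketch. The schema
# (dimensions, 0-5 scale) and wording are assumptions, not the paper's exact prompt.
import json

DIMENSIONS = ["appropriateness", "content", "grammar", "relevance"]

def build_eval_prompt(context: str, response: str) -> str:
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "Score the response to the dialogue context on each dimension from 0 to 5. "
        f"Return only a JSON object with the keys {keys}.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )

def parse_scores(judge_reply: str) -> dict[str, float]:
    """A single model call yields all dimension scores at once."""
    scores = json.loads(judge_reply)
    return {d: float(scores[d]) for d in DIMENSIONS}
```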

JudgeLM

  • Paper: JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  • Abstract: Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge’s performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
  • Insights:
    • Relatively small models (e.g., 7B models) can be fine-tuned to be reliable judges of other models. A swap-augmentation sketch for the position bias noted in the abstract follows below.
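
In the paper, swap augmentation is applied while fine-tuning the judge; the same idea can also be used at inference time as a consistency check against the position bias noted in the abstract. A minimal sketch, where `judge` is any pairwise judge callable (for example, the hypothetical GPT-4 judge sketched under "LLM Judge" above):

```python
# Swap-consistency sketch: query the judge with both answer orders and keep a
# verdict only when the two calls agree, filtering out position-bias artifacts.
# `judge(question, answer_a, answer_b)` is any pairwise judge returning "A",
# "B", or "tie".
from typing import Callable

def swap_consistent_verdict(
    judge: Callable[[str, str, str], str],
    question: str,
    answer_1: str,
    answer_2: str,
) -> str:
    first = judge(question, answer_1, answer_2)   # answer_1 presented first
    second = judge(question, answer_2, answer_1)  # answer_1 presented second

    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # disagreement between orders is treated as a tie
```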

Prometheus

  • Paper: Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  • Abstract: Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus’s capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model.
  • Insights:
    • A scoring rubric and a reference answer vastly improve correlation with human scores. A rubric-plus-reference prompt sketch follows below.
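
A sketch of a rubric-plus-reference evaluation prompt in the spirit of the Feedback Collection format; the template wording and the example rubric below are assumptions, not the exact format the 13B evaluator was trained on.

```python
# Rubric-and-reference evaluation prompt sketch in the spirit of Prometheus.
# Template wording and the example rubric are assumptions, not the trained format.
PROMETHEUS_STYLE_TEMPLATE = """Evaluate the response against the score rubric.
First write brief feedback, then end with "[RESULT] <score 1-5>".

[Instruction]
{instruction}

[Response to evaluate]
{response}

[Reference answer (score 5)]
{reference_answer}

[Score rubric]
{rubric}"""

def build_rubric_prompt(instruction: str, response: str,
                        reference_answer: str, rubric: str) -> str:
    return PROMETHEUS_STYLE_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )

EXAMPLE_RUBRIC = (  # hypothetical custom criterion, echoing the paper's child-readability example
    "Child-readability: 1 = requires expert background knowledge, "
    "5 = understandable by a ten-year-old."
)
```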

Industry Resources

  • Latent Space - Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
    • Summary:
      • The OpenLLM Leaderboard, maintained by Clémentine Fourrier, is a standardized and reproducible way to evaluate language models’ performance.
      • The leaderboard initially gained popularity in summer 2023 and has had over 2 million unique visitors and 300,000 active community members.
      • The recent update to the leaderboard (v2) includes six benchmarks to address model overfitting and to provide more room for improved performance.
      • LLMs are not recommended as judges due to issues like mode collapse and positional bias.
      • If LLMs must be used as judges, open LLMs like Prometheus or JudgeLM are suggested for reproducibility.
      • The LMSys Arena is another platform for AI engineers, but its rankings are not reproducible and may not accurately reflect model capabilities.

ref: https://github.com/leobeeson/llm_benchmarks/blob/master/README.md