A collection of benchmarks and datasets for evaluating LLMs.

Knowledge and Language Understanding

Massive Multitask Language Understanding (MMLU)

Evaluates general knowledge across 57 diverse subjects, from STEM to the social sciences.
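
A minimal sketch of how a single MMLU item can be rendered as a multiple-choice prompt and scored; the Hugging Face `datasets` library and the `cais/mmlu` hub id are assumptions about where the data is fetched from, not part of this list.

```python
# MMLU multiple-choice scoring sketch (assumes the Hugging Face `datasets`
# library and the `cais/mmlu` dataset id; adapt to whichever MMLU copy you use).
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_prompt(item) -> str:
    """Render one MMLU item as a zero-shot multiple-choice question."""
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"]))
    return f"{item['question']}\n{options}\nAnswer with a single letter (A, B, C, or D)."

def is_correct(item, model_output: str) -> bool:
    """Compare the first letter of the model's reply against the gold answer index."""
    predicted = model_output.strip()[:1].upper()
    return predicted == LETTERS[item["answer"]]

if __name__ == "__main__":
    test_set = load_dataset("cais/mmlu", "all", split="test")
    print(format_prompt(test_set[0]))  # inspect one prompt before running a model
```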

AI2 Reasoning Challenge (ARC)

General Language Understanding Evaluation (GLUE)

Natural Questions

Language Modelling Broadened to Account for Discourse Aspects (LAMBADA)

HellaSwag

Multi-Genre Natural Language Inference (MultiNLI)

SuperGLUE

TriviaQA

WinoGrande

SciQ

Reasoning Capabilities

GSM8K

  • Description: A set of 8.5K grade-school math word problems requiring basic to intermediate arithmetic operations.
  • Purpose: Tests an LLM's ability to solve multi-step mathematical problems (a minimal scoring sketch follows below).
  • Relevance: Useful for assessing AI's ability to solve basic mathematical problems, which is especially valuable in educational settings.
  • Paper: Training Verifiers to Solve Math Word Problems
  • Resources:
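
As referenced above, a minimal scoring sketch for GSM8K: gold solutions end with a `#### <answer>` marker, so exact-match scoring compares that value against the last number in the model's output. The `gsm8k`/`main` Hugging Face dataset id is an assumption about where the data is fetched from.

```python
# GSM8K exact-match scoring sketch (assumes the Hugging Face `datasets` library
# and the `gsm8k` dataset id with its `main` config).
import re
from datasets import load_dataset

def gold_answer(answer_field: str) -> str:
    """GSM8K reference solutions end with '#### <final answer>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> str:
    """Take the last number in the model's output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return numbers[-1] if numbers else ""

if __name__ == "__main__":
    sample = load_dataset("gsm8k", "main", split="test")[0]
    print(sample["question"])
    print("gold:", gold_answer(sample["answer"]))
```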

Discrete Reasoning Over Paragraphs (DROP)

Counterfactual Reasoning Assessment (CRASS)

Large-scale ReAding Comprehension Dataset From Examinations (RACE)

Big-Bench Hard (BBH)

AGIEval

BoolQ

Multi-Turn Open-Ended Dialogue

MT-bench

Question Answering in Context (QuAC)

  • Description: Contains 14,000 dialogues and 100,000 question-answer pairs, simulating student-teacher interactions.
  • Purpose: Challenges LLMs with context-dependent, sometimes unanswerable questions (a prompt-construction sketch follows below).
  • Relevance: Useful for conversational AI, educational software, and context-aware information systems.
  • Paper: QuAC: Question Answering in Context
  • Resources:
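
As referenced above, a minimal sketch of how a QuAC-style turn can be presented to a model: the passage plus the running dialogue history, with an explicit instruction for the unanswerable case (QuAC marks these with `CANNOTANSWER`). The passage and questions below are hypothetical placeholders, not items from the dataset.

```python
# QuAC-style prompt construction sketch. Real QuAC items supply a Wikipedia
# section, prior question/answer turns, and "CANNOTANSWER" for unanswerable
# questions; the passage and dialogue below are hypothetical placeholders.
def build_quac_prompt(passage: str, history: list[tuple[str, str]], question: str) -> str:
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return (
        "Answer the question using only the passage. "
        "If the passage does not contain the answer, reply CANNOTANSWER.\n\n"
        f"Passage:\n{passage}\n\n"
        f"Dialogue so far:\n{turns}\n\n"
        f"Q: {question}\nA:"
    )

if __name__ == "__main__":
    passage = "Example Band formed in 1990 and released its debut album in 1992."
    history = [("When did the band form?", "1990")]
    print(build_quac_prompt(passage, history, "Who produced the debut album?"))
```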

Grounding and Abstractive Summarization

Ambient Clinical Intelligence Benchmark (ACI-BENCH)

MAchine Reading COmprehension Dataset (MS-MARCO)

Query-based Multi-domain Meeting Summarization (QMSum)

Physical Interaction: Question Answering (PIQA)

Content Moderation and Narrative Control

ToxiGen

Helpfulness, Honesty, Harmlessness (HHH)

TruthfulQA

Responsible AI (RAI)

Coding Capabilities

CodeXGLUE

HumanEval

Mostly Basic Python Programming (MBPP)

LLM-Based Evaluation

LLM Judge

  • Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  • Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at this https URL.
  • Insights:
    • Use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses. A minimal judge-prompt sketch follows below.
  • Resources:
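
A simplified pairwise judge-prompt sketch in the spirit of the paper; the exact MT-bench judge prompts ship with the authors' FastChat code, and the `openai` client call and GPT-4 model name here are assumptions about your setup.

```python
# Simplified pairwise LLM-as-a-judge sketch (not the verbatim MT-bench judge
# prompt). Assumes the `openai` Python package and a GPT-4-class judge model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two assistant answers
to the user question below, considering helpfulness, relevance, accuracy, depth,
and level of detail. Reply with exactly one of: "A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic verdicts keep comparisons reproducible
    )
    return response.choices[0].message.content.strip()
```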

LLM-Eval

  • Paper: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
  • Abstract: We propose LLM-Eval, a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs). Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods. Our analysis also highlights the importance of choosing suitable LLMs and decoding strategies for accurate evaluation results. LLM-Eval offers a versatile and robust solution for evaluating open-domain conversation systems, streamlining the evaluation process and providing consistent performance across diverse scenarios.
  • Insights:
    • Top-shelf LLMs (e.g., GPT-4, Claude) correlate better with human scores than metric-based evaluation measures. A single-prompt, multi-dimensional scoring sketch follows below.
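
As referenced above, a sketch of the single-call, multi-dimensional scoring idea: one prompt asks for all quality dimensions at once and the reply is parsed as JSON. The dimension names, 0-5 scale, and prompt wording are assumptions in the spirit of the paper, not its verbatim schema.

```python
# LLM-Eval-style single-prompt, multi-dimensional scoring sketch. The schema
# (dimensions, 0-5 scale) and wording are assumptions, not the paper's exact prompt.
import json

DIMENSIONS = ["appropriateness", "content", "grammar", "relevance"]

def build_eval_prompt(context: str, response: str) -> str:
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "Score the response to the dialogue context on each dimension from 0 to 5. "
        f"Return only a JSON object with the keys {keys}.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )

def parse_scores(judge_reply: str) -> dict[str, float]:
    """A single model call yields all dimension scores at once."""
    scores = json.loads(judge_reply)
    return {d: float(scores[d]) for d in DIMENSIONS}
```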

JudgeLM

  • Paper: JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  • Abstract: Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge’s performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
  • Insights:
    • Relatively small models (e.g., 7B models) can be fine-tuned to be reliable judges of other models. A swap-augmentation sketch for the position bias noted in the abstract follows below.
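
In the paper, swap augmentation is applied while fine-tuning the judge; the same idea can also be used at inference time as a consistency check against the position bias noted in the abstract. A minimal sketch, where `judge` is any pairwise judge callable (for example, the hypothetical GPT-4 judge sketched under "LLM Judge" above):

```python
# Swap-consistency sketch: query the judge with both answer orders and keep a
# verdict only when the two calls agree, filtering out position-bias artifacts.
# `judge(question, answer_a, answer_b)` is any pairwise judge returning "A",
# "B", or "tie".
from typing import Callable

def swap_consistent_verdict(
    judge: Callable[[str, str, str], str],
    question: str,
    answer_1: str,
    answer_2: str,
) -> str:
    first = judge(question, answer_1, answer_2)   # answer_1 presented first
    second = judge(question, answer_2, answer_1)  # answer_1 presented second

    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # disagreement between orders is treated as a tie
```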

Prometheus

  • Paper: Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  • Abstract: Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus’s capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model.
  • Insights:
    • A scoring rubric and a reference answer vastly improve correlation with human scores. A rubric-plus-reference prompt sketch follows below.
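
A sketch of a rubric-plus-reference evaluation prompt in the spirit of the Feedback Collection format; the template wording and the example rubric below are assumptions, not the exact format the 13B evaluator was trained on.

```python
# Rubric-and-reference evaluation prompt sketch in the spirit of Prometheus.
# Template wording and the example rubric are assumptions, not the trained format.
PROMETHEUS_STYLE_TEMPLATE = """Evaluate the response against the score rubric.
First write brief feedback, then end with "[RESULT] <score 1-5>".

[Instruction]
{instruction}

[Response to evaluate]
{response}

[Reference answer (score 5)]
{reference_answer}

[Score rubric]
{rubric}"""

def build_rubric_prompt(instruction: str, response: str,
                        reference_answer: str, rubric: str) -> str:
    return PROMETHEUS_STYLE_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )

EXAMPLE_RUBRIC = (  # hypothetical custom criterion, echoing the paper's child-readability example
    "Child-readability: 1 = requires expert background knowledge, "
    "5 = understandable by a ten-year-old."
)
```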

Industry Resources

  • Latent Space - Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
    • Summary:
      • The OpenLLM Leaderboard, maintained by Clémentine Fourrier, is a standardized and reproducible way to evaluate language models’ performance.
      • The leaderboard initially gained popularity in summer 2023 and has had over 2 million unique visitors and 300,000 active community members.
      • The recent update to the leaderboard (v2) includes six benchmarks to address model overfitting and to provide more room for improved performance.
      • LLMs are not recommended as judges due to issues like mode collapse and positional bias.
      • If LLMs must be used as judges, open LLMs like Prometheus or JudgeLM are suggested for reproducibility.
      • The LMSys Arena is another platform for AI engineers, but its rankings are not reproducible and may not accurately reflect model capabilities.

ref: https://github.com/leobeeson/llm_benchmarks/blob/master/README.md