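| Framework | Released | Strengths | Weaknesses |
|---|---|---|---|
| BLEU | 2002 | Simple and easy to use | Relies on exact n-gram overlap, so it misses paraphrases and other nuances |
| ROUGE | 2004 | Effective for text summarisation tasks | Less effective for other tasks |
| GLUE | 2018 | Comprehensive suite of language-understanding tasks | May lack domain-specific tasks |
| SuperGLUE | 2019 | Harder tasks than GLUE for more discriminating evaluation | Complexity can be a drawback |
| BBH (BIG-Bench Hard) | 2022 | Compares models against human-rater baselines on 23 challenging BIG-Bench tasks | Narrow task set; human baselines depend on the raters |
| BIG-bench | 2022 | Large collaborative benchmark (200+ tasks) for large-scale LLMs | Scalability challenges: the full suite is costly to run |
| MMLU | 2020 | Broad assessment across 57 subjects | Aggregate scores can be hard to interpret |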
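BLEU and ROUGE, the two reference-based metrics in the table, both score a candidate text by its n-gram overlap with reference text. BLEU is precision-oriented (clipped n-gram precision plus a brevity penalty), and ROUGE-N is recall-oriented. The sketch below is a minimal illustration under simplifying assumptions: whitespace tokenisation, sentence-level rather than corpus-level BLEU, uniform n-gram weights and no smoothing. The function names `ngrams`, `sentence_bleu` and `rouge_n_recall` are hypothetical helpers, not the API of any particular library.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n grams.

    Simplified sketch: real BLEU is usually computed at corpus level.
    """
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any one reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty against the reference length closest to the candidate
    # length (ties broken towards the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams also found in the candidate."""
    ref_counts = ngrams(reference, n)
    cand_counts = ngrams(candidate, n)
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


if __name__ == "__main__":
    ref = "the cat is on the mat".split()
    cand = "the cat sat on the mat".split()
    print(sentence_bleu(cand, [ref], max_n=2))  # sqrt(5/6 * 3/5), about 0.7071
    print(sentence_bleu(cand, [ref]))           # 0.0: no matching 4-gram
    print(rouge_n_recall(cand, ref, n=1))       # 5/6, about 0.8333
```

The example pair shows the BLEU weakness noted in the table: a one-word substitution leaves no matching 4-gram, so unsmoothed BLEU-4 collapses to zero even though unigram ROUGE recall stays high. For reported scores, a maintained implementation such as sacreBLEU is generally preferable to a hand-rolled version like this one.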