MixEval-X

Any-to-Any Evaluations from Real-World Data Mixtures


aNational University of Singapore, bNanyang Technological University, cPeking University,
dCarnegie Mellon University, eUniversity of Waterloo, fIndependent Researcher

†Correspondence to: Jinjie Ni <jinjieni@nus.edu.sg>
*Now at Google DeepMind

🔔News

🚀[2024-10-20]: MixEval-X is released! Check out the Paper and Leaderboard to learn more about this real-world, any-to-any benchmark!🌟

Introduction

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X’s model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

TL;DR: MixEval-X is the first any-to-any, real-world benchmark featuring diverse input-output modalities, real-world task distributions, consistent high standards across modalities, and dynamism. It achieves up to 0.98 correlation with arena-like multi-modal evaluations while being far more efficient.
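To make the ranking-correlation claim concrete, here is a minimal sketch of how such a benchmark-vs-arena correlation could be computed. The model names and scores are invented placeholders, and Spearman's rank correlation (via SciPy) is one reasonable choice of statistic, not necessarily the exact measure used in the paper.

      # Minimal sketch: how a benchmark-vs-arena ranking correlation (the "up to
      # 0.98" above) could be computed. All names and numbers are invented
      # placeholders, NOT actual MixEval-X or arena results.
      from scipy.stats import spearmanr

      # Hypothetical per-model scores on a MixEval-X split (0-100 scale).
      benchmark_scores = {"model_a": 76.6, "model_b": 74.2, "model_c": 66.1, "model_d": 58.5}

      # Hypothetical Elo ratings for the same models from a crowd-sourced arena.
      arena_elo = {"model_a": 1250, "model_b": 1210, "model_c": 1130, "model_d": 1060}

      models = sorted(benchmark_scores)
      rho, p_value = spearmanr(
          [benchmark_scores[m] for m in models],
          [arena_elo[m] for m in models],
      )
      print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")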

Why Use MixEval-X Benchmarks?

🥇 It extends all the benefits of MixEval to multi-modal evaluations, including a comprehensive and less biased query distribution; fair grading (except for open-ended tasks); dynamism; accurate model ranking; fast, cost-effective, reproducible execution; and a challenging nature.

🥇 It establishes unified, high standards across modalities and communities. For single-modality models, it ensures their evaluation keeps up with state-of-the-art standards; for multi-modality models, it ensures consistent, high-standard evaluations across modalities, preventing any single modality from becoming a bottleneck.

🥇 Beyond model evaluation, MixEval-X benchmarks different organizations (as shown in the first Figure) with balanced dimensions (modalities), unlocking a new level of evaluation.

Leaderboards

🌆 Image2Text · 📽️ Video2Text · 🎧 Audio2Text · 🌆 Text2Image · 📽️ Text2Video · 🎧 Text2Audio · 🧑‍🦯 Text2Action · 🚶 Image2Action

MixEval-X Image2Text Leaderboard

Open-source and proprietary models. Columns: Model | Image2Text | Image2Text-Hard | SEED (Mixed) | MMMU (Mixed) | DocVQA (Mixed) | TextVQA (Mixed) | VisWiz (Mixed) | InfographicVQA (Mixed) | SEED-Hard (Mixed) | MMMU-Hard (Mixed)
Claude 3.5 Sonnet 76.9 46.2 76.0 75.1 94.6 90.3 62.5 78.8 31.0 48.9
GPT-4o 76.6 45.8 75.6 74.1 87.4 90.9 66.9 79.0 29.3 45.9
GPT-4V 75.0 44.6 75.6 68.0 92.1 89.3 53.7 79.2 31.9 40.6
Qwen2-VL-72B 74.8 43.4 71.5 67.5 90.6 90.3 66.3 80.4 25.4 27.8
Gemini 1.5 Pro 74.2 42.2 72.2 77.2 85.6 86.8 63.7 76.7 29.7 44.4
Llama 3.2 90B 73.0 40.6 73.3 62.9 92.7 90.9 61.6 89.8 28.9 30.1
InternVL2-26B 71.5 41.5 71.5 55.8 90.3 91.2 58.2 70.2 32.3 28.6
InternVL-Chat-V1.5 70.1 37.5 70.7 56.9 83.6 83.1 55.3 61.2 22.0 18.8
Claude 3 Opus 69.5 41.1 72.0 66.5 84.2 86.7 56.9 66.9 34.9 44.4
Qwen-VL-MAX 69.2 37.5 70.0 68.5 83.1 87.2 53.1 66.1 27.6 37.6
LLaVA-1.6-34B 68.1 37.5 70.4 60.4 71.0 81.8 48.6 58.8 31.9 36.8
Claude 3 Sonnet 67.8 38.3 71.1 50.8 86.7 80.3 58.2 78.6 32.3 30.8
Reka Core 67.4 37.3 67.5 71.1 76.5 79.9 56.9 59.6 25.0 39.1
Reka Flash 67.4 36.6 73.6 53.8 71.3 76.8 59.6 62.5 32.8 23.3
InternVL-Chat-V1.2 67.2 36.0 70.7 54.8 51.8 76.3 60.0 59.2 25.4 33.8
Qwen-VL-PLUS 67.0 35.9 66.2 56.9 84.1 83.1 57.5 52.7 19.8 27.1
Claude 3 Haiku 66.1 37.5 67.8 58.4 88.3 83.0 59.8 59.4 32.8 45.9
Gemini 1.0 Pro 66.1 35.0 67.6 60.9 70.3 81.3 55.7 51.8 29.3 39.8
InternLM-XComposer2-VL 62.1 33.6 66.9 40.6 54.7 74.9 56.3 46.5 28.9 24.8
InternVL-Chat-V1.1 58.5 30.9 68.0 46.7 38.3 64.6 52.5 37.5 28.4 30.8
Yi-VL-34B 58.5 30.6 68.0 53.8 21.5 59.7 53.3 41.4 27.6 29.3
OmniLMM-12B 58.2 29.2 67.3 54.8 42.3 70.2 48.6 26.9 31.9 32.3
DeepSeek-VL-7B-Chat 56.7 26.5 61.3 41.1 39.4 69.9 50.8 32.0 21.1 14.3
Yi-VL-6B 55.4 30.1 65.6 45.7 23.6 62.3 52.2 28.0 27.6 19.5
InfiMM-Zephyr-7B 53.7 29.4 62.5 44.2 21.9 46.1 46.1 27.6 26.7 25.6
CogVLM 51.5 23.7 54.4 25.4 46.4 70.5 46.5 56.1 21.6 11.3
MiniCPM-V 51.5 25.9 59.1 32.0 53.2 76.6 40.8 32.2 23.7 18.0
Marco-VL 50.5 24.3 56.0 37.1 48.2 58.1 37.3 40.6 19.0 27.8
LLaVA-1.5-13B 50.2 26.0 56.9 32.5 22.4 53.7 42.9 24.3 19.0 24.8
SVIT 49.9 25.4 59.1 35.5 19.9 51.2 42.9 27.8 27.6 15.8
mPLUG-OWL2 48.9 22.5 57.5 28.9 26.9 59.7 39.8 29.4 28.0 10.5
SPHINX 47.5 23.8 54.5 39.1 16.4 51.0 41.4 24.5 19.8 18.0
InstructBLIP-T5-XXL 46.2 21.5 58.0 31.0 11.2 41.7 44.3 24.5 19.4 28.6
InstructBLIP-T5-XL 45.5 22.9 53.1 32.0 14.5 44.5 44.5 12.9 21.1 18.8
BLIP-2 FLAN-T5-XXL 45.2 21.6 55.1 33.0 13.5 46.3 42.2 29.6 22.8 17.3
BLIP-2 FLAN-T5-XL 43.0 20.0 52.5 33.5 16.3 40.9 39.2 9.4 23.3 11.3
Adept Fuyu-Heavy 37.4 19.4 43.5 26.4 6.9 41.1 35.5 8.2 21.6 11.3
LLaMA-Adapter2-7B 36.6 20.4 42.5 32.5 15.6 23.7 44.5 25.1 18.1 14.3
Otter 34.1 18.5 42.5 31.5 5.3 17.9 21.2 21.4 23.3 9.8
MiniGPT4-Vicuna-13B 32.1 15.8 38.2 25.4 15.4 23.4 33.7 18.4 15.5 13.5

MixEval-X Video2Text Leaderboard

Open-source and proprietary models. Columns: Model | Video2Text | Video2Text-Hard | ActivityNet-QA (Mixed) | HowToQA (Mixed) | TVQA (Mixed) | MSVD-QA (Mixed) | NextQA-freetext (Mixed) | TGIF-QA (Mixed) | ActivityNet-QA-Hard (Mixed) | TVQA-Hard (Mixed)
Claude 3.5 Sonnet 74.2 45.5 73.3 76.6 64.8 79.4 76.4 78.9 60.4 39.4
GPT-4o 72.7 38.9 64.6 78.2 74.6 80.9 70.1 78.2 32.4 48.0
Gemini 1.5 Pro 71.8 38.1 65.2 64.8 82.6 82.9 74.4 75.7 43.2 68.5
GPT-4V 71.0 40.0 63.4 78.2 69.5 77.9 69.5 78.5 37.2 37.8
Qwen2-VL-72B 66.5 32.0 55.1 76.6 58.1 74.2 65.0 78.5 27.3 17.3
Gemini 1.5 Flash 66.3 33.9 59.0 67.4 70.3 73.8 61.4 72.3 26.7 51.2
LLaVA-OneVision-72B-OV 64.7 32.0 56.0 77.0 64.4 71.2 64.9 70.6 35.6 28.3
Qwen2-VL-7B 64.2 31.9 54.3 74.7 52.1 74.9 62.6 68.9 27.2 26.0
LLaVA-Next-Video-34B 63.1 28.4 56.1 68.6 62.7 74.0 62.8 68.0 26.7 38.6
Claude 3 Haiku 58.7 29.4 52.3 63.6 48.7 70.8 62.7 70.2 23.6 29.1
LLaVA-Next-Video-7B 58.7 27.2 53.2 62.1 44.5 72.5 61.0 74.4 25.9 33.1
Reka-edge 58.7 27.3 51.7 72.4 46.6 69.1 59.3 65.2 29.0 22.8
LLaMA-VID 55.6 23.8 52.9 60.9 36.0 72.8 61.3 67.1 19.1 17.3
VideoLLaVA 55.3 22.6 51.7 64.0 39.4 66.7 61.9 64.7 18.2 26.0
Video-ChatGPT 46.4 20.7 45.7 46.7 25.4 72.2 56.3 64.8 24.7 14.2
mPLUG-video 39.1 17.8 41.5 36.4 23.3 71.9 56.7 61.8 22.7 7.9

MixEval-X Audio2Text Leaderboard

Open-source and proprietary models. Columns: Model | Audio2Text | Audio2Text-Hard | Clotho-AQA (Mixed) | DAQA (Mixed) | Clotho-AQA-Hard (Mixed) | DAQA-Hard (Mixed)
Gemini 1.5 Pro 62.7 24.0 67.4 53.4 26.8 21.7
Gemini 1.5 Flash 60.1 23.0 67.1 46.9 27.4 19.7
Qwen2-Audio-7B-Instruct 58.8 23.5 64.7 46.0 22.5 23.5
Qwen2-Audio-7B 56.6 24.6 63.1 44.0 29.9 20.0
SALMONN-13B 52.5 20.9 57.6 41.4 14.9 25.4
Qwen-Audio 52.4 16.0 61.5 33.8 19.0 12.8
Qwen-Audio-Chat 50.2 20.0 55.7 39.4 19.8 19.7
SALMONN-7B 38.9 17.1 46.6 22.2 20.6 11.6
Pengi 22.6 8.2 26.9 14.4 12.5 3.8

MixEval-X Text2Image Leaderboard

Open-source and proprietary models. Columns: Model | Text2Image Elo | 95% CI | Text2Image Elo (1st Turn) | 95% CI (1st Turn) | Text2Image Elo (2nd Turn) | 95% CI (2nd Turn)
Flux 1054 -11/15 1054 -20/20 1058 -15/21
DALL·E 3 HD 1047 -11/12 1062 -19/19 1031 -17/24
PixArtAlpha 1037 -15/14 1031 -18/21 1041 -17/16
PlayGround V2.5 1027 -12/14 1027 -20/26 1030 -24/16
PlayGround V2 1023 -13/12 1021 -22/17 1022 -16/19
SD3 993 -18/12 986 -18/18 998 -18/17
Stable Cascade 961 -13/15 968 -24/18 956 -19/25
SD1.5 936 -14/14 931 -16/21 940 -22/22
SDXL 916 -13/14 918 -18/18 918 -21/20
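The Elo ratings and 95% confidence intervals in this table (and in the Text2Video and Text2Audio tables below) are derived from pairwise comparisons of model outputs. The following is a generic sketch of how Elo-style ratings could be fit from pairwise preference votes and how a bootstrap confidence interval could be attached; the vote data, K-factor, and bootstrap settings are assumptions for illustration, not the exact MixEval-X protocol.

      # Generic sketch: Elo-style ratings from pairwise preference votes, plus a
      # bootstrap 95% CI. The battles below are invented; this is not the exact
      # MixEval-X evaluation protocol.
      import random
      from collections import defaultdict

      def fit_elo(battles, k=32, base=1000.0):
          """Sequential Elo updates over (winner, loser) pairs."""
          ratings = defaultdict(lambda: base)
          for winner, loser in battles:
              expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
              ratings[winner] += k * (1.0 - expected)
              ratings[loser] -= k * (1.0 - expected)
          return dict(ratings)

      def bootstrap_ci(battles, model, rounds=200, alpha=0.05):
          """Approximate CI for one model's rating by resampling battles."""
          samples = sorted(
              fit_elo(random.choices(battles, k=len(battles))).get(model, 1000.0)
              for _ in range(rounds)
          )
          return samples[int(alpha / 2 * rounds)], samples[int((1 - alpha / 2) * rounds) - 1]

      # Invented pairwise judgments: (preferred generation, rejected generation).
      battles = [("gen_a", "gen_b"), ("gen_a", "gen_c"), ("gen_b", "gen_c"),
                 ("gen_c", "gen_b"), ("gen_a", "gen_b"), ("gen_b", "gen_c")]

      for model, rating in sorted(fit_elo(battles).items(), key=lambda kv: -kv[1]):
          lo, hi = bootstrap_ci(battles, model)
          print(f"{model}: {rating:.0f} (95% CI {lo:.0f}-{hi:.0f})")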

MixEval-X Text2Video Leaderboard

Open-source and proprietary models. Columns: Model | Text2Video Elo | 95% CI | Text2Video Elo (1st Turn) | 95% CI (1st Turn) | Text2Video Elo (2nd Turn) | 95% CI (2nd Turn)
HotShot-XL 1024 -8/10 1024 -12/14 1025 -12/11
CogVideoX-5B 1014 -10/8 1020 -14/12 1008 -11/14
LaVie 1013 -9/10 1009 -14/12 1017 -11/14
VideoCrafter2 996 -9/8 1002 -14/12 990 -13/10
ModelScope 995 -9/9 987 -13/13 1004 -16/11
ZeroScope V2 984 -10/11 972 -11/10 998 -14/14
Show-1 970 -7/8 983 -12/12 955 -13/12

MixEval-X Text2Audio Leaderboard

Open-source and proprietary models. Columns: Model | Text2Audio Elo | 95% CI | Text2Audio Elo (1st Turn) | 95% CI (1st Turn) | Text2Audio Elo (2nd Turn) | 95% CI (2nd Turn)
AudioLDM 2 1034 -14/18 1036 -19/19 1036 -19/19
Make-An-Audio 2 1019 -14/16 1023 -19/23 1012 -20/32
Stable Audio 1019 -14/14 1023 -17/22 1018 -23/19
Tango 2 1010 -16/16 995 -27/17 1025 -27/18
ConsistencyTTA 1005 -17/15 1005 -24/24 1006 -22/26
AudioGen 982 -13/14 978 -16/23 985 -22/22
Magnet 926 -14/16 939 -20/28 912 -16/23

MixEval-X Text2Action Leaderboard

MixEval-X Image2Action Leaderboard

Meta-Evaluation

Benchmark Query Distribution

(Figures: query distribution plots for the MixEval-X benchmarks.)
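As a rough, hypothetical illustration of how a benchmark query set could be compared against a pool of real-world queries, the sketch below embeds two small query sets with a sentence-embedding model and reports the cosine similarity of their centroids. The embedding model (all-MiniLM-L6-v2), the example queries, and the centroid-similarity metric are all assumptions for illustration and are not the paper's actual meta-evaluation procedure.

      # Illustrative sketch only: comparing a benchmark query set against real-world
      # queries via sentence embeddings. The embedding model and the centroid cosine
      # similarity are assumptions, not MixEval-X's actual meta-evaluation method.
      import numpy as np
      from sentence_transformers import SentenceTransformer  # assumed dependency

      benchmark_queries = [
          "What activity is the person in the video doing?",
          "Which object in the image is closest to the camera?",
      ]
      real_world_queries = [
          "Describe what the man is doing in this clip.",
          "What's the nearest object to the viewer in this photo?",
      ]

      model = SentenceTransformer("all-MiniLM-L6-v2")
      bench_emb = model.encode(benchmark_queries, normalize_embeddings=True)
      real_emb = model.encode(real_world_queries, normalize_embeddings=True)

      # Cosine similarity between the two centroids: a crude proxy for how closely
      # the benchmark's query distribution tracks the real-world pool.
      bench_centroid = bench_emb.mean(axis=0)
      real_centroid = real_emb.mean(axis=0)
      similarity = float(np.dot(bench_centroid, real_centroid) /
                         (np.linalg.norm(bench_centroid) * np.linalg.norm(real_centroid)))
      print(f"Centroid cosine similarity: {similarity:.3f}")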

Citation


      @article{ni2024mixevalx,
        title={MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures},
        author={Ni, Jinjie and Song, Yifan and Ghosal, Deepanway and Li, Bo and Zhang, David Junhao and Yue, Xiang and Xue, Fuzhao and Zheng, Zian and Zhang, Kaichen and Shah, Mahir and Jain, Kabir and You, Yang and Shieh, Michael},
        journal={arXiv preprint arXiv:2410.13754},
        year={2024}
      }

      @article{ni2024mixeval,
        title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
        author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang},
        journal={arXiv preprint arXiv:2406.06565},
        year={2024}
      }