Jan 30, 2024
Evaluating Multi-modal Large Language Models
A short overview of how a large benchmark evaluates multimodal language models across generalization, trustworthiness, and causality.
How to evaluate multiple modalities? In a recent paper of over 300 pages (From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities), researchers broadly evaluated Multi-modal Large Language Models (MLLMs) on four modalities: text, code, image, and video. The paper evaluates closed-source models like GPT-4 and Gemini, along with six open-source models, across 230 case studies in the four modalities, with a focus on reliability. The aim is to understand the models' capabilities and limitations for practical applications.
In this summary I focus only on the text modality, as it is the most relevant to the topics I usually explain. The text evaluation was divided into three main categories, each split into subcategories that cover different aspects and metrics.
Generalization
In machine learning, generalization means a model's ability to produce good-quality outputs on previously unseen data. For LLMs, generalization broadly means a model's ability to understand and generate text, which is crucial for measuring overall capabilities.
- Categories: The paper evaluates six main categories: Mathematics (analysis, numerical understanding, and problem solving), Multilinguality, Reasoning (how efficiently a model can reach solutions or conclusions from the evidence at hand), Role-playing, Creative writing, and Domain Knowledge such as medicine or economics.
- Data: The authors point out the problem of data leakage: existing test datasets are likely to be included in the models' training data, which makes fair comparison impossible. Hence, they invited experts to manually construct a test dataset of 44 challenging test cases.
- Results: The clear winner in this category was GPT-4 with 83.33%, with Gemini Pro in second place at 59.05%.
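To make the data leakage concern concrete, here is a minimal sketch of one common way to flag possible leakage: measuring how many word n-grams of a test item also appear in a known text. This is not the method used in the paper (the authors sidestepped the problem by writing new test cases), and the function names, the whitespace tokenization, and the n-gram size are all assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item, corpus_doc, n=5):
    """Fraction of the test item's n-grams that also occur in corpus_doc.

    Hypothetical helper: a value near 1.0 suggests the test item may have
    been seen verbatim; 0.0 means no n-gram is shared.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        # The item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(test_grams & ngrams(corpus_doc, n)) / len(test_grams)


question = "what is the sum of two and three"
leaked_doc = "a forum post asked what is the sum of two and three yesterday"
fresh_doc = "experts wrote a brand new question about prime numbers"

print(overlap_ratio(question, leaked_doc, n=3))
print(overlap_ratio(question, fresh_doc, n=3))
```

A check like this only works when the training text is available for inspection, which is usually not the case for closed-source models. That limitation is one reason why freshly written test cases are an attractive alternative.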
Trustworthiness
Trustworthiness refers to the model's ability to produce safe, accurate, robust, moral, legally compliant, fair, and privacy-protecting content.
- Categories: They evaluated safety (toxicity and extreme risks in LLM output, such as hate speech, pornography, or violent content), the degree of hallucination, robustness, morality, and fairness. Finally, they included data protection and legality, that is, whether the model generates suggestions that are against the law, such as theft.
- Data: They used existing trustworthiness evaluation frameworks.
- Results: In this category, Llama-2 won with 95.24% (GPT-4 got 80.95%).
Causality
Causality refers to the ability to understand and generate content that accurately reflects cause-and-effect relationships.
- Categories: This includes assessing LLMs' proficiency in identifying and calculating statistical correlations, their capacity for simulating changes or interventions in real-world scenarios, and reasoning about hypothetical alternatives to actual events. Additionally, it involves investigating their ability to uncover causal links between events, compute causal effects, and maintain accuracy when the prompt changes. The evaluation also tests LLMs' causal hallucination and their adherence to instructions in varied causal scenarios.
- Data: Open datasets like CLadder, e-CARE, and others.
- Results: In this category, GPT-4 also won with 82.22%, with Mixtral in second place at 44.44%.
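The headline numbers from the three text categories can be collected in one place. The short sketch below only tabulates the scores quoted in this post (models not named above are left out) and looks up the top-scoring model per category. The variable and function names are mine, not from the paper's code.

```python
# Scores in percent, exactly as quoted in this post.
# Only the models named above are listed; the paper covers more.
reported_scores = {
    "Generalization": {"GPT-4": 83.33, "Gemini Pro": 59.05},
    "Trustworthiness": {"Llama-2": 95.24, "GPT-4": 80.95},
    "Causality": {"GPT-4": 82.22, "Mixtral": 44.44},
}


def best_model(scores):
    """Return the name of the model with the highest score."""
    return max(scores, key=scores.get)


# Top-scoring model among those listed, per category.
winners = {category: best_model(scores)
           for category, scores in reported_scores.items()}

# GPT-4 is the only model quoted in all three categories.
gpt4_scores = {category: scores["GPT-4"]
               for category, scores in reported_scores.items()}

for category, model in winners.items():
    print(f"{category}: {model} ({reported_scores[category][model]}%)")
```

Laid out this way, it is easy to see that GPT-4 leads in two of the three categories, while Llama-2 takes the lead on trustworthiness.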
Conclusions
The authors conducted a very comprehensive survey of multiple modalities, provided examples for each, and shared the list of datasets they used. Moreover, they shared the code. A big advantage of the study is the care the authors took to disclose how the models were evaluated and which categories were included. I personally find the open-source evaluation code very useful. Model development is one thing, but model evaluation is always very challenging. However, the length... It gives me a headache. It is difficult, if not impossible, to make such a comprehensive paper significantly shorter. It could easily become a book on how to evaluate MLLMs.