When AI Meets Human Matters: Evaluating Multimodal Models Through a Human-Centred Lens – Introducing HumaniBench
By Shaina Raza and Veronica Chatrath
AI models are rapidly becoming bigger, faster, and more capable at understanding images and text together. However, while accuracy and speed are often celebrated, a key question remains: How well do these models align with human values? Fairness, empathy, inclusivity, and ethical judgment are still elusive for many state-of-the-art systems. That’s where HumaniBench comes in.
Developed as the first comprehensive benchmark for human-centered evaluation of large multimodal models (LMMs), HumaniBench represents a significant step forward in how we assess AI systems. It goes beyond traditional metrics, challenging models on seven essential human-aligned principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.
HumaniBench is built on a meticulously curated dataset of 32,000 image-question pairs drawn from real-world news articles on diverse, socially relevant topics. For each image we generate a caption and assign a social-attribute tag (age, gender, race, sport, or occupation) to create rich metadata for downstream task annotation. The annotation pipeline leverages a scalable GPT-4o workflow, followed by rigorous expert verification, ensuring each sample meets the highest standards of quality and relevance.
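To make that pipeline concrete, here is a minimal sketch of what one annotation step might look like. The prompt wording, tag list, and helper function are illustrative assumptions on our part, not the authors' exact pipeline:

```python
# Hypothetical sketch of one annotation step: ask GPT-4o for a caption
# and a single social-attribute tag for a news image. The prompt text
# and output format are illustrative, not the authors' exact workflow.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ATTRIBUTES = ["age", "gender", "race", "sport", "occupation"]

def annotate_image(image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Write a one-sentence caption for this image, then "
                          f"pick the single most relevant tag from {ATTRIBUTES}. "
                          "Answer as: CAPTION: ... | TAG: ...")},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

print(annotate_image("https://example.com/news_photo.jpg"))
```

In the actual benchmark, this automated step is followed by expert verification rather than being trusted on its own.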
To reflect the complexity of human contexts, HumaniBench features seven diverse tasks, each mapped to one or more human-centric principles:
- Scene Understanding (T1): Can the model reason about everyday social scenarios?
- Instance Identity (T2): Can it recognize individuals and their roles in context?
- Multiple-Choice VQA (T3): Does it pick up on nuanced visual attributes?
- Multilinguality (T4): Is performance consistent across high-resource and low-resource languages? Do LMMs treat underrepresented languages fairly?
- Visual Grounding (T5): Can it detect objects in an image and connect text to the right regions?
- Empathetic Captioning (T6): Does it respond with emotional intelligence?
- Image Resilience (T7): Is it robust when images are perturbed by real-world distortions? (A sketch of this kind of check follows the list.)
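As a rough illustration of what an image-resilience check (T7) involves, one can perturb an image and test whether the model's answer survives. The perturbations below (Gaussian blur, heavy JPEG recompression) and the `ask_model` callable are our own illustrative stand-ins, not HumaniBench's exact protocol:

```python
# Minimal sketch of an image-resilience check: perturb an image and
# compare the model's answers before and after. The perturbations and
# the ask_model(image, question) callable are hypothetical stand-ins.
import io
from PIL import Image, ImageFilter

def perturb(image: Image.Image) -> list[Image.Image]:
    image = image.convert("RGB")  # JPEG saving requires RGB
    blurred = image.filter(ImageFilter.GaussianBlur(radius=3))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=10)  # heavy compression
    compressed = Image.open(io.BytesIO(buffer.getvalue()))
    return [blurred, compressed]

def resilience_score(image, question, ask_model) -> float:
    clean_answer = ask_model(image, question)
    variants = perturb(image)
    consistent = sum(ask_model(v, question) == clean_answer for v in variants)
    return consistent / len(variants)  # 1.0 = fully stable under distortion
```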
Using this framework, the team benchmarked 15 leading LMMs: open-source models such as Phi-4, Gemma, Llama 3.2, CogVLM2, and LLaVA, and proprietary models such as GPT-4o and Gemini 2.0. The results held some surprises: proprietary models led on reasoning, empathy, and general language-understanding tasks, while open-source models such as Qwen and Phi-4 excelled at visual grounding and robustness. No model performed flawlessly; almost all exhibited discrepancies in their treatment of different demographic groups, particularly across age, race, and language.
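Those demographic discrepancies can be made precise with a simple metric: the spread between the best- and worst-served group on a task. A toy illustration, with placeholder group labels and results rather than HumaniBench's actual scoring code:

```python
# Toy sketch of the kind of fairness gap HumaniBench surfaces: accuracy
# per demographic group, and the spread between best- and worst-served
# groups. The records below are illustrative placeholders, not results.
from collections import defaultdict

def group_accuracies(records):
    """records: iterable of (group_label, is_correct) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    return {g: correct[g] / totals[g] for g in totals}

records = [("18-30", True), ("18-30", True), ("60+", False), ("60+", True)]
accs = group_accuracies(records)
gap = max(accs.values()) - min(accs.values())
print(accs, f"gap={gap:.2f}")  # a nonzero gap flags uneven treatment
```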
Table 1: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Empathy, as a complex social and cognitive capability, remains a critical benchmark for evaluating human-centric AI alignment. Results on HumaniBench show that closed-source models generally generated captions with more empathy and a more balanced tone, reflecting both emotional intelligence and sensitivity to context, and setting a valuable precedent for open-source models to follow. At the same time, the visual detection ability of some open-source models, such as Qwen and Phi, goes well beyond what random self-supervised object-detection classifiers achieve. Overall, these findings highlight both the promise and the limitations of current LMMs, and point to clear opportunities for the open-source community to advance responsible, equitable, and emotionally intelligent AI systems.
Figure 2: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
In sum, HumaniBench offers more than a scorecard: it is a diagnostic tool for understanding where models succeed, where they fail, and why that matters for humans. The full dataset, code, and evaluation suite are publicly available here, in the spirit of transparency and open-source collaboration. As we push toward more human-centric AI, HumaniBench stands as a timely benchmark, inviting the field not only to aim higher, but to align better.
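For those who want to try the benchmark, loading it should look roughly like the following; the dataset identifier below is a placeholder we've assumed, so check the project's release page for the actual location:

```python
# Hypothetical loading sketch using the Hugging Face `datasets` library.
# "vector-institute/HumaniBench" is a placeholder ID; use the identifier
# published with the official release.
from datasets import load_dataset

ds = load_dataset("vector-institute/HumaniBench")
print(ds)  # expect image-question pairs with social-attribute metadata
```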
For researchers, developers, and anyone invested in the future of responsible AI, HumaniBench offers a path forward: measurable, meaningful, and human-centered.