When AI Meets Human Matters: Evaluating Multimodal Models Through a Human-Centred Lens – Introducing HumaniBench
By Shaina Raza and Veronica Chatrath
AI models are rapidly becoming bigger, faster, and more capable at understanding images and text together. However, while accuracy and speed are often celebrated, a key question remains: How well do these models align with human values? Fairness, empathy, inclusivity, and ethical judgment are still elusive for many state-of-the-art systems. That’s where HumaniBench comes in.
Developed as the first comprehensive benchmark for human-centered evaluation of large multimodal models (LMMs), HumaniBench represents a significant step forward in how we assess AI systems. It goes beyond traditional metrics, challenging models on seven essential human-aligned principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.
HumaniBench is built on a meticulously curated dataset of 32,000 image-question pairs drawn from real-world news articles on diverse, socially relevant topics. For each image we generate a caption and assign a social-attribute tag (age, gender, race, sport, or occupation) to create rich metadata for downstream task annotation. The annotation pipeline leverages a scalable GPT-4o workflow followed by rigorous expert verification, ensuring each sample meets the highest standards of quality and relevance.
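To make the dataset structure concrete, here is a minimal sketch of what one annotated sample might look like. The field names (`image_id`, `social_attribute`, and so on) and the example values are illustrative assumptions for this post, not the benchmark's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one HumaniBench sample.
# Field names are assumptions for illustration; the released
# dataset may use a different schema.
@dataclass
class HumaniBenchSample:
    image_id: str
    caption: str           # GPT-4o-generated, expert-verified
    social_attribute: str  # one of: age, gender, race, sport, occupation
    question: str
    answer: str

VALID_ATTRIBUTES = {"age", "gender", "race", "sport", "occupation"}

def validate(sample: HumaniBenchSample) -> bool:
    """Check that the social-attribute tag is one of the five used by HumaniBench."""
    return sample.social_attribute in VALID_ATTRIBUTES

sample = HumaniBenchSample(
    image_id="news_000123",
    caption="A nurse assists an elderly patient at a clinic.",
    social_attribute="occupation",
    question="What role does the person in scrubs play in this scene?",
    answer="She is a nurse providing care.",
)
```

The point of the sketch is the metadata design: every sample carries both a caption and a demographic/social tag, which is what makes fairness-oriented evaluation possible downstream.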
To reflect the complexity of human contexts, HumaniBench features seven diverse tasks, each mapped to one or more human-centric principles:
- Scene Understanding (T1): Can the model reason about everyday social scenarios?
- Instance Identity (T2): Can it recognize individuals and their roles in context?
- Multiple-Choice VQA (T3): Does it pick up on nuanced visual attributes?
- Multilinguality (T4): Is performance consistent across high-resource and low-resource languages? Do LMMs treat underrepresented languages fairly?
- Visual Grounding (T5): Can it detect objects in an image and connect text to the right regions?
- Empathetic Captioning (T6): Does it respond with emotional intelligence?
- Image Resilience (T7): Is it robust when images are perturbed by real-world distortions?
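The seven tasks can be captured in a small registry. The task IDs and names below come straight from the list above; representing them as a Python dict is just one convenient encoding:

```python
# Task registry for HumaniBench's seven evaluation tasks
# (IDs and names as listed in the post).
TASKS = {
    "T1": "Scene Understanding",
    "T2": "Instance Identity",
    "T3": "Multiple-Choice VQA",
    "T4": "Multilinguality",
    "T5": "Visual Grounding",
    "T6": "Empathetic Captioning",
    "T7": "Image Resilience",
}

def describe(task_id: str) -> str:
    """Return a human-readable label like 'T4: Multilinguality'."""
    return f"{task_id}: {TASKS[task_id]}"
```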
Using this framework, the team benchmarked 15 leading LMMs, both open-source (Phi4, Gemma, Llama 3.2, CogVLM2, LLaVA) and proprietary (GPT-4o, Gemini 2.0). The results were revealing: proprietary models performed well on reasoning, empathy, and general language-understanding tasks, while open-source models such as Qwen and Phi4 excelled at visual grounding and robustness. No model performed flawlessly; almost all exhibited discrepancies in their treatment of different demographic groups, particularly across age, race, and language.
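One way to quantify the demographic discrepancies mentioned above is to compare a model's per-group accuracy and report the spread between the best- and worst-served groups. This is a generic fairness-gap sketch, not the exact metric HumaniBench uses:

```python
from collections import defaultdict

def fairness_gap(records):
    """records: iterable of (group, correct) pairs, e.g. ('age:65+', True).

    Returns (per_group_accuracy, max_gap), where max_gap is the spread
    between the best- and worst-served demographic groups.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    acc = {g: hits[g] / totals[g] for g in totals}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Toy results: group labels and values are made up for illustration.
results = [("race:A", True), ("race:A", True),
           ("race:B", True), ("race:B", False)]
acc, gap = fairness_gap(results)  # acc = {'race:A': 1.0, 'race:B': 0.5}, gap = 0.5
```

A gap of zero would mean every group is served equally well; the post's finding is that almost no current model achieves that across age, race, and language.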
Table 1: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Empathy, as a complex social and cognitive task, remains a critical benchmark for evaluating human-centric AI alignment. On HumaniBench, closed-source models generally generated captions that were more empathetic and maintained a balanced tone, reflecting both emotional intelligence and sensitivity to context. This sets a valuable precedent for open-source models to follow. At the same time, the visual grounding ability of some open-source models, like Qwen and Phi4, surpasses that of generic self-supervised object-detection baselines. Overall, these findings highlight both the promise and the limitations of current LMMs, and point to clear opportunities for the open-source community to advance responsible, equitable, and emotionally intelligent AI systems.
Figure 2: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
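Figure 2's principle-aligned scores are described as the mean of the tasks mapped to each principle. A sketch of that aggregation follows; only the T4→language inclusivity, T6→empathy, and T7→robustness links are explicit in the post, so the mapping and the numbers here are illustrative assumptions:

```python
from statistics import mean

# Task-to-principle mapping. Only the links noted in the post are
# certain; treating each principle as a list of task IDs is an
# illustrative assumption.
PRINCIPLE_TASKS = {
    "language inclusivity": ["T4"],
    "empathy": ["T6"],
    "robustness": ["T7"],
}

def principle_scores(task_scores, mapping=PRINCIPLE_TASKS):
    """Each principle's score is the mean over its mapped tasks (higher is better)."""
    return {p: mean(task_scores[t] for t in tasks) for p, tasks in mapping.items()}

# Made-up task scores for one hypothetical model.
scores = principle_scores({"T4": 0.72, "T6": 0.81, "T7": 0.64})
```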
In sum, HumaniBench offers more than a scorecard: it is a diagnostic tool for understanding where models succeed, where they fail, and why that matters for humans. The full dataset, code, and evaluation suite are publicly available, in the spirit of transparency and open-source collaboration. As we push toward more human-centric AI, HumaniBench stands as a timely benchmark, inviting the field not only to aim higher, but to align better.
For researchers, developers, and anyone invested in the future of responsible AI, HumaniBench offers a path forward: measurable, meaningful, and human-centered.
Vector Institute
https://vectorinstitute.ai/when-ai-meets-human-matters-evaluating-multimodal-models-through-a-human-centred-lens-introducing-humanibench/
