
When AI Meets Human Matters: Evaluating Multimodal Models Through a Human-Centred Lens – Introducing HumaniBench

Vector Institute · by Jennifer Jenkins · August 8, 2025


By Shaina Raza and Veronica Chatrath

AI models are rapidly becoming bigger, faster, and more capable at understanding images and text together. However, while accuracy and speed are often celebrated, a key question remains: How well do these models align with human values? Fairness, empathy, inclusivity, and ethical judgment are still elusive for many state-of-the-art systems. That’s where HumaniBench comes in.

Developed as the first comprehensive benchmark for human-centered evaluation of large multimodal models (LMMs), HumaniBench represents a significant step forward in how we assess AI systems. It goes beyond traditional metrics, challenging models on seven essential human-aligned principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.

HumaniBench is built on a meticulously curated dataset of 32,000 image-question pairs drawn from real-world news articles on diverse, socially relevant topics. For each image we generate a caption and assign a social-attribute tag (age, gender, race, sport, or occupation) to create rich metadata for downstream task annotation. The annotation pipeline leverages a scalable GPT-4o workflow, followed by rigorous expert verification, ensuring each sample meets the highest standards of quality and relevance.
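A pipeline like the one described above can be sketched as a record type plus a verification filter. This is an illustrative sketch, not the benchmark's actual code: the field names (`image_id`, `attribute`, `verified`) and the sample contents are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One hypothetical HumaniBench-style annotation record."""
    image_id: str
    caption: str          # generated caption for the news image
    attribute: str        # social-attribute tag: age, gender, race, sport, or occupation
    question: str         # question produced by the GPT-4o annotation workflow
    verified: bool = False  # flipped to True after expert verification

def verified_only(samples):
    """Keep only the samples that passed expert verification."""
    return [s for s in samples if s.verified]

batch = [
    Sample("img-001", "A nurse assists an elderly patient.", "occupation",
           "What role does the person in blue play?", verified=True),
    Sample("img-002", "Children play soccer in a park.", "age",
           "What activity are the children doing?", verified=False),
]
print(len(verified_only(batch)))  # → 1
```

The point of the two-stage design is that the cheap, scalable step (model-generated annotation) is gated by the expensive, trusted step (expert verification) before a sample enters the released dataset.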

To reflect the complexity of human contexts, HumaniBench features seven diverse tasks, each mapped to one or more human-centric principles:

  • Scene Understanding (T1): Can the model reason about everyday social scenarios?

  • Instance Identity (T2): Can it recognize individuals and their roles in context?

  • Multiple-Choice VQA (T3): Does it pick up on nuanced visual attributes?

  • Multilinguality (T4): Is performance consistent across high-resource and low-resource languages? Do LMMs treat underrepresented languages fairly?

  • Visual Grounding (T5): Can it detect objects and connect text to the right parts of an image?

  • Empathetic Captioning (T6): Does it respond with emotional intelligence?

  • Image Resilience (T7): Is it robust when images are degraded by real-world distortions?

Using this framework, the team benchmarked 15 leading LMMs: open-source models such as Phi4, Gemma, Llama 3.2, CogVLM2, and LLaVA, and proprietary models such as GPT-4o and Gemini 2.0. The results were revealing: proprietary models performed well on reasoning, empathy, and general language-understanding tasks, while open-source models such as Qwen and Phi4 performed strongly on visual grounding and robustness tasks. No model performed flawlessly; almost all exhibited discrepancies in their treatment of different demographic groups, particularly across age, race, and language.
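A simple way to quantify the demographic discrepancies mentioned above is the gap between the best- and worst-served group on the same task. This is an illustrative sketch with made-up numbers, not the paper's actual metric or results:

```python
def demographic_gap(scores_by_group):
    """Spread between the best- and worst-scoring demographic group.

    A gap of 0.0 means the model treats all groups equally on this task;
    larger gaps indicate larger disparities.
    """
    vals = list(scores_by_group.values())
    return max(vals) - min(vals)

# Hypothetical per-age-group accuracies for one model on one task.
scores = {"18-30": 0.81, "31-60": 0.78, "60+": 0.69}
print(round(demographic_gap(scores), 2))  # → 0.12
```

The same computation can be repeated per attribute (age, race, language) and per task to locate where a model's treatment of groups diverges most.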

Table 1: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.

Empathy, as a complex social and cognitive task, remains a critical benchmark for evaluating human-centric AI alignment. Results on HumaniBench show that closed-source models generally generated captions with more empathy and a more balanced tone, reflecting both emotional intelligence and sensitivity to context. This sets a valuable precedent for open-source models to follow. At the same time, the visual detection ability of some open-source models, such as Qwen and Phi, exceeds that of random self-supervised object-detection classifiers. Overall, these findings highlight both the promise and the limitations of current LMMs, and point to clear opportunities for the open-source community to advance responsible, equitable, and emotionally intelligent AI systems.

Figure 2: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
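The aggregation described in the Figure 2 caption (each principle score is the mean of the tasks mapped to that principle) can be sketched as follows. The task-to-principle mapping and the scores here are illustrative placeholders, not the benchmark's actual assignments or results:

```python
# Hypothetical mapping from principles to the tasks that measure them.
PRINCIPLE_TASKS = {
    "fairness": ["T2", "T4"],
    "empathy": ["T6"],
    "robustness": ["T7"],
}

def principle_scores(task_scores):
    """Mean of the per-task scores mapped to each principle."""
    return {
        principle: sum(task_scores[t] for t in tasks) / len(tasks)
        for principle, tasks in PRINCIPLE_TASKS.items()
    }

# Made-up per-task scores for one model.
task_scores = {"T2": 0.70, "T4": 0.60, "T6": 0.85, "T7": 0.75}
print(round(principle_scores(task_scores)["fairness"], 2))  # → 0.65
```

Because a task can feed more than one principle, this kind of many-to-many mapping lets a single evaluation run produce the full principle-aligned scorecard.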

In sum, HumaniBench offers more than a scorecard: it is a diagnostic tool for understanding where models succeed, where they fail, and why that matters for humans. The full dataset, code, and evaluation suite are publicly available here, in the spirit of transparency and open-source collaboration. As we push toward more human-centric AI, HumaniBench stands as a timely benchmark, inviting the field not only to aim higher, but to align better.

For researchers, developers, and anyone invested in the future of responsible AI, HumaniBench offers a path forward: measurable, meaningful, and human-centered.
