New multimodal dataset will help in the development of ethical AI systems

Vector Institute | By Ian Gormely | October 23, 2024

By Shaina Raza and Deval Pandya

The Vector Institute’s AI Engineering team has developed Newsmediabias-plus (NMB+), a new multimodal dataset. It includes full-text articles alongside comprehensive publication details. It also features extensive bias categorization, addressing critical issues such as gender and racial biases, and specific topics including ideological leanings and framing, gender discrimination, and environmental concerns.

NMB+ is designed for academic researchers, NGOs, and socially focused groups. This aligns with Vector’s goal of addressing both near- and long-term risks by providing practical tools for safe AI systems. Potential uses include:

Ensuring AI adheres to Vector’s AI trust and safety principles
Analyzing media trends and reporting styles across different outlets
Training AI to fairly detect and address disinformation in texts and images

Developed by Shaina Raza, Vector Institute Applied Machine Learning Scientist, Responsible AI, the dataset builds on the previously released UnBIAS work by incorporating images alongside text.

Dataset features

The dataset includes around 90,000 news articles published between May 2023 and September 2024, curated from a broad spectrum of reliable sources, including major news outlets from around the globe. These articles were gathered through open data sources using Google RSS, adhering to research ethics guidelines [1, 2].

Various machine learning models were built to evaluate the dataset’s effectiveness in detecting biases and fake content, demonstrating its versatility and utility. This benchmarking process shows how the dataset performs across different modalities, including text and images, highlighting its potential for training advanced AI models designed to combat disinformation.

Each entry in the dataset includes the full article text, publication details (date, outlet, URL), bias assessments for both text and images, topic categorizations, and image descriptions and analyses.

A commitment to ethical AI governance requires designing transparent AI systems that can be understood and audited, holding developers accountable for the content their AI tools generate, and establishing clear ethical standards for the development and deployment of AI technologies. It also requires continuously adapting AI tools to counter evolving disinformation tactics. Developers and researchers should focus on building robust and transparent algorithms, integrating ethical considerations and personal information protection in data, and collaborating with experts across disciplines to enhance disinformation detection techniques.
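As a rough illustration of the per-entry fields the article lists (article text, publication details, bias assessments for text and images, topics, and image descriptions and analyses), a single record might be modeled as follows. This is a minimal sketch only: the class name, field names, and the "biased"/"unbiased" label values are assumptions made for illustration and are not taken from the published NMB+ schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class NMBEntry:
    """Hypothetical record layout; names are illustrative, not the real schema."""
    article_text: str            # full article text
    published: date              # publication date
    outlet: str                  # news outlet name
    url: str                     # article URL
    text_bias: str               # bias assessment of the text (assumed label)
    image_bias: str              # bias assessment of the image (assumed label)
    topics: list[str] = field(default_factory=list)  # topic categorizations
    image_description: str = ""  # description of the accompanying image
    image_analysis: str = ""     # analysis of the accompanying image


def is_flagged(entry: NMBEntry, biased_label: str = "biased") -> bool:
    """Return True if either modality carries the (assumed) biased label."""
    return biased_label in (entry.text_bias, entry.image_bias)


# Placeholder values only; this is not a real record from the dataset.
example = NMBEntry(
    article_text="Placeholder article body.",
    published=date(2024, 1, 15),
    outlet="Example Outlet",
    url="https://example.com/article",
    text_bias="unbiased",
    image_bias="biased",
    topics=["environment"],
)
print(is_flagged(example))
```

Keeping the text and image assessments as separate fields mirrors the article's point that bias is assessed per modality, so an entry can be flagged on its image even when its text is not.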

NMB+’s development and use are governed by strict ethical standards to align regulatory requirements with technical work. Comprehensive human reviews have been implemented to ensure the accuracy and reliability of the data and its labels. The dataset underwent extensive audits to validate the data collection and labeling methodologies. These audits involved independent reviewers who assessed the dataset for adherence to ethical standards and accuracy. They examined the data sources, collection procedures, and labeling criteria to ensure that all elements meet established research integrity and reliability guidelines. This thorough review helps confirm that the dataset is both robust and trustworthy for use in training and evaluating AI systems.

Researchers, technologists, and the general public are invited to explore the NMB+ dataset and delve into the findings. The dataset is accessible on Vector’s Hugging Face page under a non-commercial license. Details can be found on the News Media Bias Plus page.

References

[1] Does my data collection activity require ethics review? | Research | University of Waterloo

[2] What Can Open Data be Used For?

Original source

Vector Institute

https://vectorinstitute.ai/new-multimodal-dataset-will-help-in-the-development-of-ethical-ai-systems/
