New multimodal dataset will help in the development of ethical AI systems
By Shaina Raza and Deval Pandya
The Vector Institute’s AI Engineering team has developed Newsmediabias-plus (NMB+), a new multimodal dataset. It includes full-text articles alongside comprehensive publication details. It also features extensive bias categorization, addressing critical issues such as gender and racial biases, and specific topics including ideological leanings and framing, gender discrimination, and environmental concerns.
NMB+ is designed for academic researchers, NGOs, and socially focused groups. It supports Vector’s goal of addressing both near- and long-term risks by providing practical tools for safe AI systems. Potential uses include:
- Ensuring AI adheres to Vector’s AI trust and safety principles
- Analyzing media trends and reporting styles across different outlets
- Training AI to fairly detect and address disinformation in texts and images
Developed by Shaina Raza, Vector Institute Applied Machine Learning Scientist, Responsible AI, the dataset builds on the previously released UnBIAS work by incorporating images alongside text.
Dataset features
The dataset includes around 90,000 news articles spanning May 2023 to September 2024, curated from a broad spectrum of reliable sources, including major news outlets from around the globe. These articles were gathered through open data sources using Google RSS, adhering to research ethics guidelines [1, 2].
Various machine learning models were built to evaluate the dataset’s effectiveness in detecting biases and fake content, demonstrating its versatility and utility. This benchmarking process shows how the dataset performs across different modalities, including text and images, highlighting its potential for training advanced AI models designed to combat disinformation.
Each entry in the dataset features the full article text, publication details (date, outlet, URL), bias assessments for both text and images, topic categorizations, and image descriptions and analyses.
A commitment to ethical AI governance requires designing transparent AI systems that can be understood and audited, holding developers accountable for the content their AI tools generate, and establishing clear ethical standards for the development and deployment of AI technologies. It also requires continuously adapting AI tools to counter evolving disinformation tactics. Developers and researchers should focus on building robust and transparent algorithms, integrating ethical considerations and the protection of personal information into their data, and collaborating with experts across disciplines to enhance disinformation detection techniques.
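To make the per-entry fields listed in this section concrete, here is a minimal Python sketch of what a single record might look like and how it could be used for simple outlet-level analysis. The class name, field names, label values ("biased" / "unbiased"), and sample rows are all illustrative assumptions; they are not the published NMB+ schema, which should be checked on the dataset page.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsEntry:
    """Hypothetical record layout; the real NMB+ column names may differ."""
    article_text: str
    published: date
    outlet: str
    url: str
    text_bias: str          # assumed label set: "biased" / "unbiased"
    image_bias: str         # assumed label set: "biased" / "unbiased"
    topics: tuple           # topic categorizations
    image_description: str


def text_bias_by_outlet(entries):
    """Count text-bias labels per outlet."""
    counts = {}
    for entry in entries:
        counts.setdefault(entry.outlet, Counter())[entry.text_bias] += 1
    return counts


def modality_disagreements(entries):
    """Return entries whose text and image bias labels differ."""
    return [e for e in entries if e.text_bias != e.image_bias]


# Made-up placeholder rows for illustration, not real dataset content.
sample = [
    NewsEntry("Placeholder article text A1.", date(2023, 6, 1), "Outlet A",
              "https://example.com/a1", "biased", "unbiased",
              ("environment",), "Placeholder image description."),
    NewsEntry("Placeholder article text A2.", date(2024, 2, 15), "Outlet A",
              "https://example.com/a2", "unbiased", "unbiased",
              ("politics",), "Placeholder image description."),
    NewsEntry("Placeholder article text B1.", date(2024, 9, 3), "Outlet B",
              "https://example.com/b1", "biased", "biased",
              ("gender",), "Placeholder image description."),
]

counts = text_bias_by_outlet(sample)
disagreeing = modality_disagreements(sample)
```

Keeping the text and image labels as separate fields makes it easy to compare outlets, and to surface articles where the two modalities are labeled differently, which is the kind of case a multimodal dataset is meant to expose.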
NMB+’s development and use are governed by strict ethical standards to align regulatory requirements with technical work. Comprehensive human reviews have been implemented to ensure the accuracy and reliability of the data and its labels. The dataset also underwent extensive audits to validate the data collection and labeling methodologies. In these audits, independent reviewers assessed the dataset for accuracy and adherence to ethical standards, examining the data sources, collection procedures, and labeling criteria to ensure that all elements meet established research integrity and reliability guidelines. This thorough review helps to confirm that the dataset is both robust and trustworthy for use in training and evaluating AI systems.
Researchers, technologists, and the general public are invited to explore the NMB+ dataset and delve into the findings. The dataset is accessible on Vector’s Hugging Face page under a non-commercial license. Details can be found on the News Media Bias Plus page.
References
[1] Does my data collection activity require ethics review? | Research | University of Waterloo
[2] What Can Open Data be Used For?
Vector Institute
https://vectorinstitute.ai/new-multimodal-dataset-will-help-in-the-development-of-ethical-ai-systems/
