Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification

JMLR | by Aleksi Avela, Pauliina Ilmonen | January 1, 2026

Aleksi Avela, Pauliina Ilmonen; 27(18):1−28, 2026.

Abstract

Text classification is the task of automatically assigning text documents correct labels from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes, which is known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain-based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that it produces highly competitive results in several real data examples, especially when the imbalance is severe.
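
To make the idea in the abstract concrete, here is a minimal sketch of Markov chain text oversampling in which transition probabilities come mostly from the minority class and partly from the majority class. It is not the authors' algorithm: the first-order word chain, the fixed mixing weight `majority_weight`, the toy documents, and all function names are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical document boundary markers


def transition_counts(docs):
    """Count first-order word-to-word transitions, including boundaries."""
    counts = defaultdict(Counter)
    for doc in docs:
        tokens = [START] + doc.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def normalize(counter):
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {word: c / total for word, c in counter.items()}


def mixed_transitions(minority_docs, majority_docs, majority_weight=0.2):
    """Blend per-state transition distributions of the two classes.

    Assumption: a fixed convex combination when a state occurs in both
    classes; otherwise the distribution of the class that contains it.
    """
    p_min = {s: normalize(c) for s, c in transition_counts(minority_docs).items()}
    p_maj = {s: normalize(c) for s, c in transition_counts(majority_docs).items()}
    probs = {}
    for state in set(p_min) | set(p_maj):
        if state in p_min and state in p_maj:
            mixed = defaultdict(float)
            for word, p in p_min[state].items():
                mixed[word] += (1 - majority_weight) * p
            for word, p in p_maj[state].items():
                mixed[word] += majority_weight * p
            probs[state] = dict(mixed)
        else:
            probs[state] = p_min.get(state) or p_maj[state]
    return probs


def generate(probs, rng, max_len=30):
    """Walk the chain from START until END or max_len words."""
    state, words = START, []
    while len(words) < max_len:
        candidates = list(probs[state])
        weights = [probs[state][w] for w in candidates]
        state = rng.choices(candidates, weights=weights, k=1)[0]
        if state == END:
            break
        words.append(state)
    return " ".join(words)


# Toy data, invented for this example.
minority = ["refund not received", "refund request denied"]
majority = ["order received today", "request for delivery update",
            "delivery not received yet"]

probs = mixed_transitions(minority, majority, majority_weight=0.3)
rng = random.Random(0)
synthetic = [generate(probs, rng) for _ in range(5)]
print(synthetic)
```

With `majority_weight=0.0` the walk never leaves the minority vocabulary; a positive weight lets synthetic minority documents pick up words seen only in the majority class, which is the feature-space expansion the abstract refers to.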
