AI News Hub by Eigenvector

Baby Scale: Investigating Models Trained on Individual Children's Language Input

arXiv cs.CL · Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank · April 1, 2026

Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

Comments: Code and data at this https URL

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.29522 [cs.CL]

(or arXiv:2603.29522v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.29522

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Steven Y. Feng — [v1] Tue, 31 Mar 2026 10:06:24 UTC (1,087 KB)
