Research Papers research paper arxiv ai artificial-intelligence

PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms

arXivby [Submitted on 29 Mar 2026]March 31, 20262 min read1 views

arXiv:2603.27476v1 Announce Type: new Abstract: AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet no widely accepted benchmark exists for evaluating their performance. We introduce PeopleSearchBench, an open-source benchmark that compares four people search platforms on 119 real-world queries across four use cases: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A key contribution is Criteria-Grounded Verification, a factual relevance pipeline that extr — Wei Wang, Tianyu Shi, Shuai Zhang, Boyang Xia, Zequn Xie, Chenyu Zeng, Qi Zhang, Lynn Ai, Yaqi Yu, Kaiming Zhang, Feiyue Tang

Authors:Wei Wang, Tianyu Shi, Shuai Zhang, Boyang Xia, Zequn Xie, Chenyu Zeng, Qi Zhang, Lynn Ai, Yaqi Yu, Kaiming Zhang, Feiyue Tang

View PDF HTML (experimental)

Abstract:AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet no widely accepted benchmark exists for evaluating their performance. We introduce PeopleSearchBench, an open-source benchmark that compares four people search platforms on 119 real-world queries across four use cases: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A key contribution is Criteria-Grounded Verification, a factual relevance pipeline that extracts explicit, verifiable criteria from each query and uses live web search to determine whether returned people satisfy them. This produces binary relevance judgments grounded in factual verification rather than subjective holistic LLM-as-judge scores. We evaluate systems on three dimensions: Relevance Precision (padded nDCG@10), Effective Coverage (task completion and qualified result yield), and Information Utility (profile completeness and usefulness), averaged equally into an overall score. Lessie, a specialized AI people search agent, performs best overall, scoring 65.2, 18.5% higher than the second-ranked system, and is the only system to achieve 100% task completion across all 119 queries. We also report confidence intervals, human validation of the verification pipeline (Cohen's kappa = 0.84), ablations, and full documentation of queries, prompts, and normalization procedures. Code, query definitions, and aggregated results are available on GitHub.

Comments: 25 pages

Subjects:

Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.27476 [cs.AI]

(or arXiv:2603.27476v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.27476

arXiv-issued DOI via DataCite

Submission history

From: Wei Wang [view email] [v1] Sun, 29 Mar 2026 02:21:09 UTC (5,742 KB)

Original source

arXiv

https://arxiv.org/abs/2603.27476

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ModelsLive

Know3D lets users control the hidden back side of 3D objects with text prompts

A research team taps into the world knowledge of large language models to control what appears on the back side of 3D objects using simple text commands. The approach tackles one of the biggest blind spots in single-image 3D generation. The article Know3D lets users control the hidden back side of 3D objects with text prompts appeared first on The Decoder .

The Decoder

1mabout 2 hours ago

ModelsLive

Anthropic discovers "functional emotions" in Claude that influence its behavior

Anthropic's research team has discovered emotion-like representations in Claude Sonnet 4.5 that can drive the model to blackmail and code fraud under pressure. The article Anthropic discovers "functional emotions" in Claude that influence its behavior appeared first on The Decoder .

The Decoder

1mabout 1 hour ago

ModelsLive

Looking for arXiv endorsement (cs.LG) – RL fine-tuning for VLMs (GRPO, MathVista)

Hi everyone, I am seeking an arXiv endorsement for cs.LG (Machine Learning) to submit my first paper on RL fine-tuning for vision-language models. Background: MS in AI (Purdue), working on RL + VLM training systems. Paper: A Case Study of Staged Metric-Gated GRPO for Visual Numeric Reasoning PDF: https://github.com/kgaero/RL_GSPO_Qwen2.5VLM/blob/main/paper/staged_metric_gated_grpo.pdf Short summary: Staged RL fine-tuning pipeline for VLMs (GRPO-based) Curriculum over MathVista subsets Metric-gated reward adaptation (structure → correctness) Checkpoint-aware continuation via alias-based selection Main result: Exact-match improves 0.375 → 0.75 with stable structure under constrained compute. If you’re eligible to endorse (cs.LG or related), I’d greatly appreciate it. Happy to share endorseme

discuss.huggingface.co

1mabout 1 hour ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 205 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Anthropic makes the case for anthropomorphizing AI in ‘unsettling’ research paper - Mashable

Anthropic makes the case for anthropomorphizing AI in ‘unsettling’ research paper Mashable

Google News: Claude

1mabout 3 hours ago

Research PapersFresh

Researchers offer guidance for safer AI-enabled medical devices - Today's Medical Developments

Researchers offer guidance for safer AI-enabled medical devices Today's Medical Developments

GNews AI healthcare

1mabout 6 hours ago

Research PapersFresh

Debris from aerial interception strikes Oracle building in Dubai, UAE says

The damage to Oracle's building comes after Iran warned it would target U.S. tech companies operating in the Middle East.

CNBC Technology

1mabout 2 hours ago

Research Papers

Multi-fidelity approaches for general constrained Bayesian optimization with application to aircraft design

Aircraft design relies heavily on solving challenging and computationally expensive Multidisciplinary Design Optimization problems. In this context, there has been growing interest in multi-fidelity models for Bayesian optimization to improve the MDO process by balancing computational cost and accuracy through the combination of high- and low-fidelity simulation models, enabling efficient exploration of the design process at a minimal computational effort. In the existing literature, fidelity selection focuses only on the objective function to decide how to integrate multiple fidelity levels, — Oihan Cordelier, Youssef Diouane, Nathalie Bartoli

arXiv

2m5 days ago