Streamlined Open-Vocabulary Human-Object Interaction Detection
arXiv:2603.27500v1 Announce Type: new
Authors: Chang Sun, Dongliang Liao, Changxing Ding
Abstract: Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at this https URL.
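As a rough illustration of what "open-vocabulary interaction classification" with a text-aligned head can look like, the sketch below scores one interaction embedding against text embeddings of candidate HOI labels by cosine similarity and picks the best match. This is an assumption for illustration only: the abstract does not state the scoring rule SL-HOI uses, and the function names, labels, and 3-dimensional toy vectors are all hypothetical, not DINOv3 outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_interaction(query_embedding, text_embeddings):
    """Score one interaction embedding against every label's text embedding.

    Hypothetical helper: assumes both embeddings live in the same
    text-aligned space, so a new label only needs a new text embedding.
    """
    scores = {
        label: cosine(query_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for text embeddings of HOI labels (hypothetical values).
text_embeddings = {
    "ride bicycle": [1.0, 0.0, 0.0],
    "hold cup": [0.0, 1.0, 0.0],
    "feed horse": [0.0, 0.0, 1.0],
}

# Toy stand-in for one interaction query after the vision head.
query_embedding = [0.1, 0.9, 0.2]

best_label, scores = classify_interaction(query_embedding, text_embeddings)
print(best_label)
```

Because the label set enters only through `text_embeddings`, adding a category unseen during training amounts to adding one more entry to that dictionary, which is the property the open-vocabulary setting relies on.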
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27500 [cs.CV]
(or arXiv:2603.27500v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27500
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Chang Sun
[v1] Sun, 29 Mar 2026 03:31:56 UTC (4,647 KB)