
GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

arXiv · March 31, 2026

Authors: Soutrik Mukherjee, Sangwhan Cha


Abstract: This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
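The hybrid precision idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' TensorRT pipeline: it is a toy in pure Python that rounds every intermediate value to FP16 or FP32 with the standard `struct` module. The helper names (`make_quantizer`, `softmax`, `cosine_similarity`) and the example logits are hypothetical, and the softmax is deliberately left unstabilized (no max subtraction) so that FP16 overflow is easy to trigger; production kernels usually subtract the maximum first.

```python
import math
import struct

def make_quantizer(fmt):
    """Return a function that rounds a float to a struct format ('<e' = FP16, '<f' = FP32)."""
    def quantize(x):
        try:
            return struct.unpack(fmt, struct.pack(fmt, x))[0]
        except OverflowError:
            # struct rejects finite values outside the format's range;
            # model that as overflow to +/-inf, as floating-point hardware would.
            return math.copysign(math.inf, x)
    return quantize

fp16 = make_quantizer("<e")
fp32 = make_quantizer("<f")

def softmax(logits, q):
    """Unstabilized softmax (an assumption of this toy) with every intermediate rounded by q."""
    exps = [q(math.exp(v)) for v in logits]
    total = 0.0
    for e in exps:
        total = q(total + e)
    return [q(e / total) for e in exps]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical logits, as an FP16 linear layer might emit them: finite and FP16-representable.
logits = [fp16(v) for v in (12.0, 3.5, -1.25, 0.5)]

reference = softmax(logits, lambda x: x)           # Python float (FP64) reference
full_fp16 = softmax(logits, fp16)                  # softmax computed entirely in FP16
hybrid = [fp16(p) for p in softmax(logits, fp32)]  # softmax in FP32, result cast back to FP16

nan_count = sum(math.isnan(p) for p in full_fp16)
similarity = cosine_similarity(hybrid, reference)

print("NaNs in full-FP16 softmax:", nan_count)
print("hybrid vs. reference cosine similarity:", similarity)
```

In this toy, `exp(12.0)` exceeds the largest finite FP16 value (65504), so the full-FP16 path produces `inf / inf`, which is NaN, while the FP32 softmax stays finite and its FP16-cast output remains close to the reference. The cosine similarity here plays the same role as the fidelity metric quoted in the abstract, but the numbers it produces say nothing about the paper's measured results.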

Comments: 10 pages, 8 figures, 15 tables

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Cite as: arXiv:2603.28708 [cs.LG]

(or arXiv:2603.28708v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.28708

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sangwhan Cha
[v1] Mon, 30 Mar 2026 17:27:33 UTC (713 KB)
