Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning
arXiv:2511.21075v2 Announce Type: replace-cross Abstract: Aligning Large Language Models (LLMs) with biomedical knowledge requires understanding both concepts and causal mechanisms in scientific reports. Supervised Fine-Tuning (SFT) often fails to capture these logical structures, while Reinforcement Learning (RL) is limited by sparse reward signals. We propose Balanced Fine-Tuning (BFT), a dual-scale post-training method that stabilizes training via confidence-weighted token-level optimization and adaptively emphasizes knowledge-dense hard samples using minimum group confidence. Experiments o — Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Yaokun Li, Jiehui Huang, Dawei Huang, Zhi Song, Jianhua Yao
Authors:Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Yaokun Li, Jiehui Huang, Dawei Huang, Zhi Song, Jianhua Yao
View PDF HTML (experimental)
Abstract:Aligning Large Language Models (LLMs) with biomedical knowledge requires understanding both concepts and causal mechanisms in scientific reports. Supervised Fine-Tuning (SFT) often fails to capture these logical structures, while Reinforcement Learning (RL) is limited by sparse reward signals. We propose Balanced Fine-Tuning (BFT), a dual-scale post-training method that stabilizes training via confidence-weighted token-level optimization and adaptively emphasizes knowledge-dense hard samples using minimum group confidence. Experiments on medical and biological reasoning benchmarks show that BFT consistently outperforms SFT and achieves competitive or superior performance to specialized systems such as GeneAgent. Beyond improving generative accuracy, BFT enhances the fidelity of LLM-generated biomedical entity descriptions, such that their embeddings produced by standard encoders outperform those from domain-specific biological foundation models. This enables a single post-trained LLM to support both reasoning generation and representation-based biological analysis. Overall, BFT provides a concise and effective framework for aligning LLMs with biomedical knowledge while bridging generative and representational capabilities.
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2511.21075 [cs.LG]
(or arXiv:2511.21075v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2511.21075
arXiv-issued DOI via DataCite
Submission history
From: Zhenchao Tang [view email] [v1] Wed, 26 Nov 2025 05:34:26 UTC (5,636 KB) [v2] Fri, 27 Mar 2026 03:36:42 UTC (4,721 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv![[R] Looking for arXiv cs.LG endorser, inference monitoring using information geometry](https://d2xsxph8kpxj0f.cloudfront.net/310419663032563854/konzwo8nGf8Z4uZsMefwMr/default-img-earth-satellite-QfbitDhCB2KjTsjtXRYcf9.webp)
[R] Looking for arXiv cs.LG endorser, inference monitoring using information geometry
Hi r/MachineLearning , I’m looking for an arXiv endorser in cs.LG for a paper on inference-time distribution shift detection for deployed LLMs. The core idea: instead of monitoring input embeddings (which is what existing tools do), we monitor the statistical manifold of the model’s output distributions using Fisher-Rao geodesic distance. We then run adaptive CUSUM (Page-Hinkley) on the resulting z-score stream to catch slow drift that per-request spike detection misses entirely. The methodology is grounded in published work on information geometry (Figshare, DOIs available). We’ve validated the signal on real OpenAI API logprobs, CUSUM caught gradual domain drift in 7 steps with zero false alarms during warmup, while spike detection missed it entirely. If anyone with cs.LG endorsement is

How can Beijing attract top-tier Chinese AI professionals based abroad?
Beijing should shift its strategy and improve ways to attract and retain top Chinese AI professionals as America’s accelerating integration of artificial intelligence into military and national security systems puts such talent in a bind. As geopolitical tensions rise, many highly skilled Chinese researchers working at US tech and research institutions are confronting a painful dilemma, according to Dai Mingjie, a researcher at the Institute of Public Policy at the Guangzhou-based South China...
Anthropic says Claude subscriptions will no longer support OpenClaw because it puts an 'outsized strain' on systems
Why It Matters The decision by Anthropic to stop supporting OpenClaw for Claude subscriptions is significant because it highlights the challenges of integrating third-party tools with AI systems. According to a report from Business Insider, Anthropic cited the "outsized strain" that tools like OpenClaw put on their systems as the reason for this move. This strain is likely due to the additional computational resources required to support these tools, which can impact the overall performance and reliability of the AI system. The impact of this decision will be felt by users who rely on OpenClaw to enhance their experience with Claude subscriptions. OpenClaw's founder has already expressed disappointment, stating that cutting support would be "a loss." This reaction is understandable, given
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
![[R] Looking for arXiv cs.LG endorser, inference monitoring using information geometry](https://d2xsxph8kpxj0f.cloudfront.net/310419663032563854/konzwo8nGf8Z4uZsMefwMr/default-img-earth-satellite-QfbitDhCB2KjTsjtXRYcf9.webp)
[R] Looking for arXiv cs.LG endorser, inference monitoring using information geometry
Hi r/MachineLearning , I’m looking for an arXiv endorser in cs.LG for a paper on inference-time distribution shift detection for deployed LLMs. The core idea: instead of monitoring input embeddings (which is what existing tools do), we monitor the statistical manifold of the model’s output distributions using Fisher-Rao geodesic distance. We then run adaptive CUSUM (Page-Hinkley) on the resulting z-score stream to catch slow drift that per-request spike detection misses entirely. The methodology is grounded in published work on information geometry (Figshare, DOIs available). We’ve validated the signal on real OpenAI API logprobs, CUSUM caught gradual domain drift in 7 steps with zero false alarms during warmup, while spike detection missed it entirely. If anyone with cs.LG endorsement is
![[D] KDD Review Discussion](https://d2xsxph8kpxj0f.cloudfront.net/310419663032563854/konzwo8nGf8Z4uZsMefwMr/default-img-wave-pattern-4YWNKzoeu65vYpqRKWMiWf.webp)
[D] KDD Review Discussion
KDD 2026 (Feb Cycle) reviews will release today (4-April AoE), This thread is open to discuss about reviews and importantly celebrate successful reviews. Let us all remember that review system is noisy and we all suffer from it and this doesn't define our research impact. Let's all prioritise reviews which enhance our papers. Feel free to discuss your experiences submitted by /u/BomsDrag [link] [comments]




Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!