SERSEM: Selective Entropy-Weighted Scoring for Membership Inference in Code Language Models
arXiv:2604.01147v1 Announce Type: new Abstract: As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactical boilerplate to amplify specific memorization signals. SERSEM utilizes a dual-signal methodology: first, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and calibrate token-level
View PDF HTML (experimental)
Abstract:As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactical boilerplate to amplify specific memorization signals. SERSEM utilizes a dual-signal methodology: first, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and calibrate token-level Z-scores from the output logits. Evaluated on a 25,000-sample balanced dataset, SERSEM achieves a global AUC-ROC of 0.7913 on the StarCoder2-3B model and 0.7867 on the StarCoder2-7B model, consistently outperforming the implemented probability-based baselines Loss, Min-K% Prob, and PAC. Our findings demonstrate that focusing on human-centric coding anomalies provides a significantly more robust indicator of verbatim memorization than sequence-level probability averages.
Comments: Accepted to the FSE 2026 Poisoned Chalice Competition
Subjects:
Software Engineering (cs.SE); Cryptography and Security (cs.CR)
Cite as: arXiv:2604.01147 [cs.SE]
(or arXiv:2604.01147v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2604.01147
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Semih Çağlar [view email] [v1] Wed, 1 Apr 2026 17:03:58 UTC (915 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modellanguage modeltransformer
Speculative decoding works great for Gemma 4 31B in llama.cpp
I get a ~11% speed up with Gemma 3 270B as the draft model. Try it by adding: --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 Testing with (on a 3090): ./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 --jinja --temp 1.0 --top-p 0.95 --top-k 64 -ngl 1000 -st -f prompt.txt --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 Gave me: [ Prompt: 607.3 t/s | Generation: 36.6 t/s ] draft acceptance rate = 0.44015 ( 820 accepted / 1863 generated) vs. [ Prompt: 613.8 t/s | Generation: 32.9 t/s ] submitted by /u/Leopold_Boom [link] [comments]

Gemma 4 - 4B vs Qwen 3.5 - 9B ?
Hello! anyone tried the 4B Gemma 4 model and the Qwen 3.5 9B model and can tell us their feedback? On the benchmark Qwen seems to be doing better, but I would appreciate any personal experience on the matter Thanks! submitted by /u/No-Mud-1902 [link] [comments]

Speed difference on Gemma 4 26B-A4B between Bartowski Q4_K_M and Unsloth Q4_K_XL
I've noticed this on Qwen3.5 35B before as well, there is a noticeable speed difference between Unsloth's Q4_K_XL and Bartowski's Q4_K_M on the same model, but Gemma 4 seems particularly harsh in this regard: Bartowski gets 38 tk/s, Unsloth gets 28 tk/s... everything else is the same, settings wise. This is with the latest Unsloth quant update and latest llama.cpp version. Their size is only ~100 MB apart. Anyone have any idea why this speed difference is there? Btw, on Qwen3.5 35B I noticed that Unsloth's own Q4_K_M was also a bit faster than the Q4_K_XL, but there it was more like 39 vs 42 tk/s. submitted by /u/BelgianDramaLlama86 [link] [comments]
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Speed difference on Gemma 4 26B-A4B between Bartowski Q4_K_M and Unsloth Q4_K_XL
I've noticed this on Qwen3.5 35B before as well, there is a noticeable speed difference between Unsloth's Q4_K_XL and Bartowski's Q4_K_M on the same model, but Gemma 4 seems particularly harsh in this regard: Bartowski gets 38 tk/s, Unsloth gets 28 tk/s... everything else is the same, settings wise. This is with the latest Unsloth quant update and latest llama.cpp version. Their size is only ~100 MB apart. Anyone have any idea why this speed difference is there? Btw, on Qwen3.5 35B I noticed that Unsloth's own Q4_K_M was also a bit faster than the Q4_K_XL, but there it was more like 39 vs 42 tk/s. submitted by /u/BelgianDramaLlama86 [link] [comments]

Gemma 4 - 4B vs Qwen 3.5 - 9B ?
Hello! anyone tried the 4B Gemma 4 model and the Qwen 3.5 9B model and can tell us their feedback? On the benchmark Qwen seems to be doing better, but I would appreciate any personal experience on the matter Thanks! submitted by /u/No-Mud-1902 [link] [comments]

Kokoro TTS running on-device, CPU-only, 20x realtime!!!
I wanted a reading app where you could read, read and listen or just listen to books with word-by-word highlighting synced to TTS and i wanted the voice to actually sound good. This turned out to be a really hard challenge with Kokoro on iOS, here's what I ran into: Using MLX Swift is great but uses Metal. iOS kills Metal access the moment you background the app. If your use case needs background audio, this is a dead end. ONNX Runtime on CPU fixes the background problem, but the monolithic Kokoro model only runs at 2-3x realtime. After 30 minutes of sustained generation my phone was scorching hot. What actually worked: I split the monolithic model into a multi-stage pipeline and replaced part of the synthesis with native code on Apple's Accelerate framework. That got it to 20x realtime on



Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!