A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Learning from Incorrectly Labeled Data
Section 3.2 of Ilyas et al. (2019) shows that training a model on only adversarial errors leads to non-trivial generalization on the original test set. We show that these experiments are a specific case of learning from errors.

We start with a counterintuitive result: we take a completely mislabeled training set (without modifying the inputs) and use it to train a model that generalizes to the original test set. We then show that this result, and the results of Ilyas et al. (2019), are a special case of model distillation. In particular, since the incorrect labels are generated using a trained model, information about the trained model is "leaked" into the dataset.

We begin with the following question: what if we took the images in the training set (without any adversarial perturbations) and mislabeled them? Since the inputs are unmodified and the labels are incorrect, intuition says that a model trained on this dataset should not generalize to the correctly labeled test set. Nevertheless, we show that this intuition fails: such a model can generalize.

We first train a ResNet-18 on the CIFAR-10 training set for two epochs. The model reaches a training accuracy of 62.5% and a test accuracy of 63.1%. Next, we run the model on all 50,000 training data points and relabel them according to the model's predictions. We then filter out all the correct predictions, leaving an incorrectly labeled training set of size 18,768. We show four examples on the left of the Figure below:
[Figure 1: examples from the incorrectly labeled training set]
We then randomly initialize a new ResNet-18 and train it only on this mislabeled dataset. We train for 50 epochs and reach an accuracy of 49.7% on the original test set. The new model has only ever seen incorrectly labeled, unperturbed images but can still non-trivially generalize.
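The dataset construction above can be sketched in a few lines. The snippet below is a toy stand-in, not the post's actual pipeline: synthetic 2-D points and a hand-written `weak_model` play the roles of CIFAR-10 and the two-epoch ResNet-18. The "keep only the errors, labeled by the model's predictions" step is the part that mirrors the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for CIFAR-10: 200 points in [0,1]^2 whose true label is
# decided by the first coordinate. Both the data and the model below are
# illustrative assumptions, not the post's ResNet-18 setup.
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)

def weak_model(x):
    # A deliberately imperfect classifier (the analogue of the two-epoch
    # ResNet-18): it mixes in the irrelevant second coordinate, so it errs
    # on a noticeable fraction of points.
    return (0.6 * x[:, 0] + 0.4 * x[:, 1] > 0.5).astype(int)

# Relabel every training point with the model's prediction...
preds = weak_model(X)

# ...then keep ONLY the points the model got wrong, labeled with those
# (incorrect) predictions. This is the error-only training set.
wrong = preds != y
X_mis, y_mis = X[wrong], preds[wrong]

print(f"error-only dataset: {wrong.sum()} of {len(X)} points")
```

A new model trained from scratch on `(X_mis, y_mis)` sees no correctly labeled example, which is exactly the situation of the ResNet-18 experiment above.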
This is Model Distillation Using Incorrect Predictions
How can this model and the models in Ilyas et al. (2019) generalize without seeing any correctly labeled data? Here, we show that since the incorrect labels are generated using a trained model, information about that trained model is being “leaked” into the mislabeled examples. In particular, this is an indirect form of model distillation — training on this dataset allows a new model to somewhat recover the features of the original model.
We first illustrate this distillation phenomenon using a two-dimensional problem. Then, we explore other peculiar forms of distillation for neural networks — we transfer knowledge even when the inputs come from another task.
Two-dimensional Illustration of Model Distillation
We construct a dataset of adversarial examples using a two-dimensional binary classification problem. We generate 32 random two-dimensional data points in $[0,1]^2$ and assign each point a random binary label. We then train a small feed-forward neural network on these examples, predicting 32/32 of the examples correctly (panel (a) in the Figure below).
[Figure 2: two-dimensional illustration of model distillation, panels (a)–(d)]
Next, we create adversarial examples for the original model using an $\ell_\infty$ ball of radius $\epsilon=0.12$. In panel (a) of the Figure above, we display the $\epsilon$-ball around each training point. In panel (b), we show the adversarial examples that cause the model to change its prediction (from correct to incorrect). We train a new feed-forward neural network on this dataset, resulting in the model in panel (c).
Although this new model has never seen a correctly labeled example, it is able to perform non-trivially on the original dataset, predicting 23/32 of the inputs correctly (panel (d) in the Figure). The new model’s decision boundary loosely matches the original model’s decision boundary, i.e., the original model has been somewhat distilled after training on its adversarial examples. This two-dimensional problem presents an illustrative version of the intriguing result that distillation can be performed using incorrect predictions.
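A minimal sketch of this construction is below. It swaps in assumptions to stay self-contained: a logistic-regression model stands in for the small feed-forward network, the labels follow a linear rule (rather than the post's random labels) so a linear model can fit them, and the attack is a single FGSM-style $\epsilon$-step rather than a full search of the $\ell_\infty$ ball.

```python
import numpy as np

rng = np.random.default_rng(1)

# 32 points in [0,1]^2 with linearly structured labels (an assumption; the
# post uses random labels and an overparameterized net).
X = rng.uniform(0, 1, size=(32, 2))
y = (X.sum(axis=1) > 1).astype(int)

def fit_logreg(X, y, steps=5000, lr=0.5):
    # Plain gradient descent on the logistic loss.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)
        p = 1 / (1 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(X)
        b -= lr * (p - y).mean()
    return w, b

w, b = fit_logreg(X, y)
predict = lambda x: ((x @ w + b) > 0).astype(int)

# FGSM-style l_inf attack of radius eps: one eps-sized step per coordinate
# in the loss-increasing direction for each point's label.
eps = 0.12
step = np.where(y[:, None] == 1, -np.sign(w), np.sign(w))
X_adv = np.clip(X + eps * step, 0, 1)

# Keep only the perturbations that flip a correct prediction to an
# incorrect one, labeled with the model's new (wrong) predictions.
flipped = (predict(X) == y) & (predict(X_adv) != y)
X_new, y_new = X_adv[flipped], predict(X_adv[flipped])

# Train a fresh model on nothing but these mislabeled adversarial examples,
# then check how often it agrees with the original model.
w2, b2 = fit_logreg(X_new, y_new)
agree = (((X @ w2 + b2) > 0).astype(int) == predict(X)).mean()
print(f"{flipped.sum()} flipped examples; agreement with original: {agree:.0%}")
```

The mislabeled points sit just across the original boundary with labels that are consistent with that boundary, which is why the fresh model can loosely recover it.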
Other Peculiar Forms of Distillation
Our experiments show that we can distill models using mislabeled examples. In what other peculiar ways can we learn about the original model? Can we use only out-of-domain data?
We train a simple CNN model on MNIST, reaching 99.1% accuracy. We then run this model on the FashionMNIST training set and save its argmax predictions. The resulting dataset is nonsensical to humans — a “dress” is labeled as an “8”.
[Figure 3: FashionMNIST images labeled with the MNIST model’s predictions]
We then initialize a new CNN model and train it on this mislabeled FashionMNIST data. The resulting model reaches 91.04% accuracy on the MNIST test set. Furthermore, if we normalize the FashionMNIST images using the mean and variance statistics for MNIST, the model reaches 94.5% accuracy on the MNIST test set. This is another instance of recovering a model functionally similar to the original, despite the new model training only on its erroneous predictions.
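The same teacher-to-student transfer through out-of-domain inputs can be sketched with stand-ins. In the snippet below, a fixed linear rule plays the teacher (the analogue of the MNIST CNN), a shifted Gaussian plays the out-of-domain data (the analogue of FashionMNIST), and a logistic-regression student trains only on the teacher's labels for those inputs; all of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "teacher" trained on domain A: a fixed linear rule, so the
# sketch stays fully self-contained.
w_t = np.array([1.0, -1.0])
teacher = lambda x: (x @ w_t > 0).astype(int)

# Out-of-domain inputs: a shifted, broader distribution than the
# teacher's own domain.
X_ood = rng.normal(loc=2.0, scale=1.5, size=(500, 2))
y_ood = teacher(X_ood)  # argmax labels, "nonsensical" for these inputs

# Train a student ONLY on the teacher-labeled out-of-domain data
# (gradient descent on the logistic loss).
w, b = np.zeros(2), 0.0
for _ in range(3000):
    z = np.clip(X_ood @ w + b, -30, 30)
    p = 1 / (1 + np.exp(-z))
    w -= 0.1 * (X_ood.T @ (p - y_ood)) / len(X_ood)
    b -= 0.1 * (p - y_ood).mean()

# The student recovers the teacher's boundary, so it agrees with the
# teacher on in-domain data it has never seen.
X_test = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
agree = (((X_test @ w + b) > 0).astype(int) == teacher(X_test)).mean()
print(f"student/teacher agreement on in-domain data: {agree:.0%}")
```

Because the teacher's labels on the out-of-domain inputs are still consistent with its decision boundary, the student inherits that boundary even though it never sees an in-domain example.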
Summary
These results show that training a model using mislabeled adversarial examples is a special case of learning from prediction errors. In other words, the perturbations added to adversarial examples in Section 3.2 of Ilyas et al. (2019) are not necessary to enable learning.
Response Summary: Note that since our experiments work across different architectures, “distillation” in weight space does not occur. The only distillation that can arise is “feature space” distillation, which is in fact exactly our hypothesis. In particular, feature-space distillation would not work in World 1 — if the adversarial examples we generated did not exploit useful features, we should not have been able to “distill” a useful model from them. (In fact, one might think of normal model training as just “feature distillation” of the humans that labeled the dataset.) Furthermore, the hypothesis that enough model-consistent points suffice to recover a model seems to be disproven by Preetum’s “bugs-only dataset” and other settings.
Response: Since our experiments work across different architectures, “distillation” in weight space cannot arise. Thus, from what we understand, the “distillation” hypothesis suggested here is referring to “feature distillation” (i.e. getting models which use the same features as the original), which is actually precisely our hypothesis too. Notably, this feature distillation would not be possible if adversarial examples did not rely on “flipping” features that are good for classification (see World 1 and World 2) — in that case, the distilled model would only use features that generalize poorly, and would thus generalize poorly itself.
Moreover, we would argue that in the experiments presented (learning from mislabeled data), the same kind of distillation is happening. For instance, a moderately accurate model might associate “green background” with “frog”, thus labeling “green” images as “frogs” (e.g., the horse in the comment’s figure). Training a new model on this dataset will thus associate “green” with “frog”, achieving non-trivial accuracy on the test set (similarly for the “learning MNIST from Fashion-MNIST” experiment in the comment). This corresponds exactly to learning features from labels, akin to how deep networks “distill” a good decision boundary from human annotators. In fact, we find these experiments a very interesting illustration of feature distillation that complements our findings.
We also note that an analogy to logistic regression here is only possible due to the low VC-dimension of linear classifiers (namely, these classifiers have dimension $d$). In particular, given any classifier with VC-dimension $k$, we need at least $k$ points to fully specify the classifier. In contrast, neural networks have been shown to have extremely large VC-dimension (in particular, bigger than the size of the training set). So even though labeling $d+1$ random points model-consistently is sufficient to recover a linear model, it is not necessarily sufficient to recover a deep neural network. For instance, Milli et al. are not able to reconstruct a ResNet-18 using only its predictions on random Gaussian inputs. (Note that we are using a ResNet-50 in our experiments.)
Finally, it seems that the only potentially problematic explanation for our experiments (namely, that enough model-consistent points can recover a classifier) is disproved by Preetum’s experiment. In particular, Preetum is able to design a dataset where training on mislabeled inputs that are model-consistent does not at all recover the decision boundary of the original model. More generally, the “model distillation” perspective raised here is unable to distinguish between the dataset created by Preetum below, and those created with standard PGD (as in our $\widehat{\mathcal{D}}_{det}$ and $\widehat{\mathcal{D}}_{rand}$ datasets).