Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessWashington state will require labels on AI images and set limits on chatbotsHacker News AI TopCan we ever trust AI to watch over itself?Hacker News AI TopAI models will scheme to protect other AI models from being shut downHacker News AI Topciflow/torchtitan/173837: Update on "[c10d] add profiling name to NCCL collective"PyTorch Releasesciflow/trunk/173837: Update on "[c10d] add profiling name to NCCL collective"PyTorch Releasesciflow/torchtitan/179229: [inductor] makes cuda 13.0 cross compliation works (#179229)PyTorch Releasesciflow/inductor/179229: [inductor] makes cuda 13.0 cross compliation works (#179229)PyTorch ReleasesShow HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloudHacker News AI Topciflow/torchtitan/177628: UpdatePyTorch Releasesciflow/trunk/177621: UpdatePyTorch Releases[R] ICML Anonymized git repos for rebuttalReddit r/MachineLearningciflow/torchtitan/179402PyTorch ReleasesBlack Hat USADark ReadingBlack Hat AsiaAI BusinessWashington state will require labels on AI images and set limits on chatbotsHacker News AI TopCan we ever trust AI to watch over itself?Hacker News AI TopAI models will scheme to protect other AI models from being shut downHacker News AI Topciflow/torchtitan/173837: Update on "[c10d] add profiling name to NCCL collective"PyTorch Releasesciflow/trunk/173837: Update on "[c10d] add profiling name to NCCL collective"PyTorch Releasesciflow/torchtitan/179229: [inductor] makes cuda 13.0 cross compliation works (#179229)PyTorch Releasesciflow/inductor/179229: [inductor] makes cuda 13.0 cross compliation works (#179229)PyTorch ReleasesShow HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloudHacker News AI Topciflow/torchtitan/177628: UpdatePyTorch Releasesciflow/trunk/177621: UpdatePyTorch Releases[R] ICML Anonymized git repos for rebuttalReddit r/MachineLearningciflow/torchtitan/179402PyTorch Releases
AI NEWS HUBbyEIGENVECTOREigenvector

A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Adversarial Examples are Just Bugs, Too

Distill PubAugust 6, 20191 min read0 views
Source Quiz

Refining the source of adversarial examples

We demonstrate that there exist adversarial examples which are just “bugs”: aberrations in the classifier that are not intrinsic properties of the data distribution. In particular, we give a new method for constructing adversarial examples which:

  • Do not transfer between models, and

  • Do not leak “non-robust features” which allow for learning, in the sense of Ilyas-Santurkar-Tsipras-Engstrom-Tran-Madry .

We replicate the Ilyas et al. experiment of training on mislabeled adversarially-perturbed images (Section 3.2 of ), and show that it fails for our construction of adversarial perturbations.

The message is, whether adversarial examples are features or bugs depends on how you find them — standard PGD finds features, but bugs are abundant as well.

We also give a toy example of a data distribution which has no “non-robust features” (under any reasonable definition of feature), but for which standard training yields a highly non-robust classifier. This demonstrates, again, that adversarial examples can occur even if the data distribution does not intrinsically have any vulnerable directions.

Background

Many have understood Ilyas et al. to claim that adversarial examples are not “bugs”, but are “features”. Specifically, Ilyas et al. postulate the following two worlds: As communicated to us by the original authors.

  • World 1: Adversarial examples exploit directions irrelevant for classification (“bugs”).

In this world, adversarial examples occur because classifiers behave poorly off-distribution, when they are evaluated on inputs that are not natural images. Here, adversarial examples would occur in arbitrary directions, having nothing to do with the true data distribution.

  • World 2: Adversarial examples exploit useful directions for classification (“features”). In this world, adversarial examples occur in directions that are still “on-distribution”, and which contain features of the target class. For example, consider the perturbation that makes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect that this perturbation transfers to other classifiers trained to distinguish cats vs. dogs.

Our main contribution is demonstrating that these worlds are not mutually exclusive — and in fact, we are in both. Ilyas et al. show that there exist adversarial examples in World 2, and we show there exist examples in World 1.

Constructing Non-transferrable Targeted Adversarial Examples

We propose a method to construct targeted adversarial examples for a given classifier $f$, which do not transfer to other classifiers trained for the same problem.

Recall that for a classifier $f$, an input example $(x, y)$, and target class $y_{targ}$, a targeted adversarial example is an $x’$ such that $||x - x’||\leq \eps$ and $f(x’) = y_{targ}$.

The standard method of constructing adversarial examples is via Projected Gradient Descent (PGD)

PGD is described in the appendix.

which starts at input $x$, and iteratively takes steps ${x_t}$ to minimize the loss $L(f, x_t, y_{targ})$. That is, we take steps in the direction xL(f,xt,ytarg)-\nabla_x L(f, x_t, y_{targ}) where $L(f, x, y)$ is the loss of $f$ on input $x$, label $y$.

Note that since PGD steps in the gradient direction towards the target class, we may expect these adversarial examples have feature leakage from the target class. For example, suppose we are perturbing an image of a dog into a plane (which usually appears against a blue background). It is plausible that the gradient direction tends to make the dog image more blue, since the “blue” direction is correlated with the plane class. In our construction below, we attempt to eliminate such feature leakage.

An illustration of the image-manifold for adversarially perturbing a dog to a plane. The gradient of the loss can be thought of as having an on-manifold “feature component” and an off-manifold “random component”. PGD steps along both components, hence causing feature-leakage in adversarial examples. Our construction below attempts to step only in the off-manifold direction.

Our Construction

Let ${f_i : \R^n \to \cY}_i$ be an ensemble of classifiers for the same classification problem as $f$. For example, we can let ${f_i}$ be a collection of ResNet18s trained from different random initializations.

For input example $(x, y)$ and target class $y_{targ}$, we perform iterative updates to find adversarial attacks — as in PGD. However, instead of stepping directly in the gradient direction, we step in the direction

Formally, we replace the iterative step with x_{t+1} \gets \Pi_\eps\left( x_t

  • \alpha( \nabla_x L(f, x_t, y_{targ}) + \E_i[ \nabla_x L(f_i, x_t, y)]) \right)$$ where $\Pi_\eps$ is the projection onto the $\eps$-ball around $x$._

(xL(f,xt,ytarg)+\Ei[xL(fi,xt,y)])-\left( \nabla_x L(f, x_t, y_{targ}) + \E_i[ \nabla_x L(f_i, x_t, y)] \right)

That is, instead of taking gradient steps to minimize $L(f, x, y_{targ})$, we minimize the “disentangled loss”

We could also consider explicitly using the ensemble to decorrelate, by stepping in direction $\nabla_x L(f, x, y_{targ}) - \E_i[ \nabla_x L(f_i, x, y_{targ})]$. This works well for small $\epsilon$, but the given loss has better optimization properties for larger $\epsilon$.

L(f,x,ytarg)+\Ei[L(fi,x,y)]L(f, x, y_{targ}) + \E_i[L(f_i, x, y)] This loss encourages finding an $x_t$ which is adversarial for $f$, but not for the ensemble ${f_i}$.

These adversarial examples will not be adversarial for the ensemble ${f_i}$. But perhaps surprisingly, these examples are also not adversarial for new classifiers trained for the same problem.

Experiments

We train a ResNet18 on CIFAR10 as our target classifier $f$. For our ensemble, we train 10 ResNet18s on CIFAR10, from fresh random initializations. We then test the probability that a targeted attack for $f$ transfers to a new (freshly-trained) ResNet18, with the same targeted class. Our construction yields adversarial examples which do not transfer well to new models.

For $L_{\infty}$ attacks:

Attack Success Transfer Success

PGD 99.6% 52.1%

Ours 98.6% 0.8%

For $L_2$ attacks:

Attack Success Transfer Success

PGD 99.9% 82.5%

Ours 99.3% 1.7%

Adversarial Examples With No Features

Using the above, we can construct adversarial examples which do not suffice for learning. Here, we replicate the Ilyas et al. experiment that “Non-robust features suffice for standard classification” (Section 3.2 of ), but show that it fails for our construction of adversarial examples.

To review, the Ilyas et al. non-robust experiment was:

  • Train a standard classifier $f$ for CIFAR.

  • From the CIFAR10 training set $S = {(X_i, Y_i)}$, construct an alternate train set $S’ = {(X_i^{Y_i \to (Y_i + 1)}, Y_i + 1)}$, where $X_i^{Y_i \to (Y_i +1)}$ denotes an adversarial example for $f$, perturbing $X_i$ from its true class $Y_i$ towards target class $Y_i+1 (\text{mod }10)$. Note that $S’$ appears to humans as “mislabeled examples”.

  • Train a new classifier $f’$ on train set $S’$. Observe that this classifier has non-trivial accuracy on the original CIFAR distribution.

Ilyas et al. use Step (3) to argue that adversarial examples have a meaningful “feature” component.

However, for adversarial examples constructed using our method, Step (3) fails. In fact, $f’$ has good accuracy with respect to the “label-shifted” distribution $(X, Y+1)$, which is intuitively what we trained on.

For $L_{\infty}$ attacks:

Test Acc on CIFAR: $(X, Y)$ Test Acc on Shifted-CIFAR: $(X, Y+1)$

PGD 23.7% 40.4%

Ours 2.5% 75.9%

Table: Test Accuracies of $f’$

For $L_2$ attacks:

Test Acc on CIFAR: $(X, Y)$ Test Acc on Shifted-CIFAR: $(X, Y+1)$

PGD 33.2% 27.3%

Ours 2.8% 70.8%

Table: Test Accuracies of $f’$

Adversarial Squares: Adversarial Examples from Robust Features

To further illustrate that adversarial examples can be “just bugs”, we show that they can arise even when the true data distribution has no “non-robust features” — that is, no intrinsically vulnerable directions.

We are unaware of a satisfactory definition of “non-robust feature”, but we claim that for any reasonable intrinsic definition, this problem has no non-robust features. Intrinsic here meaning, a definition which depends only on geometric properties of the data distribution, and not on the family of classifiers, or the finite-sample training set.

We do not use the Ilyas et al. definition of “non-robust features,” because we believe it is vacuous. In particular, by the Ilyas et al. definition, every distribution has “non-robust features” — so the definition does not discern structural properties of the distribution. Moreover, for every “robust feature” $f$, there exists a corresponding “non-robust feature” $f’$, such that $f$ and $f’$ agree on the data distribution — so the definition depends strongly on the family of classifiers being considered.

In the following toy problem, adversarial vulnerability arises as a consequence of finite-sample overfitting, and label noise.

The problem is to distinguish between CIFAR-sized images that are either all-black or all-white, with a small amount of random pixel noise and label noise.

A sample of images from the distribution.

Formally, let the distribution be as follows. Pick label $Y \in {\pm 1}$ uniformly, and let X :={(+1+η\eps)ηif Y=1(1+η\eps)ηif Y=1X := \begin{cases} (+\vec{\mathbb{1}} + \vec\eta_\eps) \cdot \eta & \text{if $Y=1$}\\ (-\vec{\mathbb{1}} + \vec\eta_\eps) \cdot \eta & \text{if $Y=-1$}\\ \end{cases}

where $\vec\eta_\eps \sim [-0.1, +0.1]^d$ is uniform $L_\infty$ pixel noise, and $\eta \in {\pm 1} \sim Bernoulli(0.1)$ is the 10% label noise.

A plot of samples from a 2D-version of this distribution is shown to the right.

Notice that there exists a robust linear classifier for this problem which achieves perfect robust classification, with up to $\eps = 0.9$ magnitude $L_\infty$ attacks. However, if we sample 10000 training images from this distribution, and train a ResNet18 to 99.9% train accuracy,

We optimize using Adam with learning-rate $0.00001$ and batch size $128$ for 20 epochs.

the resulting classifier is highly non-robust: an $\eps=0.01$ perturbation suffices to flip the class of almost all test examples.

The input-noise and label noise are both essential for this construction. One intuition for what is happening is: in the initial stage of training the optimization learns the “correct” decision boundary (indeed, stopping after 1 epoch results in a robust classifier). However, optimizing for close to 0 train-error requires a network with high Lipshitz constant to fit the label-noise, which hurts robustness.

Left: The training set (labels color-coded). Middle: The classifier after 10 SGD steps. Right: The classifier at the end of training. Note that it is overfit, and not robust.

Figure adapted from .

Addendum: Data Poisoning via Adversarial Examples

As an addendum, we observe that the “non-robust features” experiment of (Section 3.2) directly implies data-poisoning attacks: An adversary that is allowed to imperceptibly change every image in the training set can destroy the accuracy of the learnt classifier — and can moreover apply an arbitrary permutation to the classifier output labels (e.g. swapping cats and dogs).

To see this, recall that the original “non-robust features” experiment shows:Using our previous notation, and also using vanilla PGD to find adversarial examples.

  1. If we train on distribution $(X^{Y \to (Y+1)}, Y+ 1)$ the classifier learns to predict well on distribution $(X, Y)$.

By permutation-symmetry of the labels, this implies that:

  1. If we train on distribution $(X^{Y \to (Y+1)}, Y)$ the classifier learns to predict well on distribution $(X, Y-1)$.

Note that in case (2), we are training with correct labels, just perturbing the inputs imperceptibly, but the classifier learns to predict the cyclically-shifted labels. Concretely, using the original numbers of Table 1 in , this reduction implies that

an adversary can perturb the CIFAR10 train set by $\eps=0.5$ in $L_2$, and cause the learnt classifier to output shifted-labels 43.7% of the time (cats classified as birds, dogs as deers, etc).

This should extend to attacks that force arbitrary desired permutations of the labels.

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

feature

Knowledge Map

Knowledge Map
TopicsEntitiesSource
A Discussio…featureDistill Pub

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 165 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Products