Click the emoji for the Final Writeup: 😈

Click the emoji for our GitHub: 🐙

Title

Adversarial Attack for Image Classification

Who

Sudatta Hor (shor1), Michael Donoso (mdonoso), Kyler Hwa (khwa)

Introduction

Neural networks are vulnerable to adversarial attacks: attackers can intentionally design inputs that cause a model to make mistakes. In our work, we attack convolutional neural networks (CNNs) used to classify images in the MNIST dataset.

Many adversarial attacks exist (Biggio's attack, Szegedy's L-BFGS, Goodfellow's FGSM), and they can generate adversarial examples quickly. However, the defensive distillation strategy proposed by Papernot et al. (2016) was shown to be effective against these attacks. The attack proposed by Carlini and Wagner defeats this distillation defense.

In our work, we implement Carlini and Wagner's attack. The Carlini-Wagner attack is a state-of-the-art attack in adversarial machine learning and is frequently used to benchmark the robustness of machine learning models.

Related Work

Data

We will generate adversarial examples on the MNIST dataset.

Methodology

For image classification, we use simple CNNs following the model architecture and parameters in Carlini and Wagner (2016). The models can be trained on our local machines or with Google Colab resources.
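As a rough sketch of what that classifier looks like in code (assuming we mirror the MNIST layer sizes reported in the paper: two blocks of 3×3 convolutions with max pooling, followed by two 200-unit fully connected layers; the Adam optimizer here is our own simplification, not the paper's training setup):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_mnist_cnn():
        # Layer sizes loosely follow Carlini and Wagner (2016) for MNIST.
        model = models.Sequential([
            layers.Input(shape=(28, 28, 1)),
            layers.Conv2D(32, 3, activation="relu"),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(64, 3, activation="relu"),
            layers.Conv2D(64, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Flatten(),
            layers.Dense(200, activation="relu"),
            layers.Dense(200, activation="relu"),
            layers.Dense(10),  # logits; softmax is folded into the loss
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
        return model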

The most difficult part of the project is the algorithm that generates adversarial examples (a set of perturbed MNIST digits). The paper presents three attack methods that vary in the distance metric used (L0, L2, and L∞). The L2 method is the most straightforward because the distance is differentiable and we can use gradient descent; the other two require less obvious algorithms.
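To make the L2 formulation concrete, below is a simplified sketch of the targeted L2 attack as we understand it from the paper: a tanh change of variables keeps the perturbed image in [0, 1], and gradient descent minimizes the L2 distance plus c times a hinge on the logits of the target class. The constants (learning rate, number of steps, c) are placeholders; the paper additionally binary-searches over c and keeps the best example found, which we omit here.

    import tensorflow as tf

    def cw_l2_attack(model, x, target, c=1.0, steps=1000, lr=0.01, kappa=0.0):
        # model  : maps images in [0, 1] to logits
        # x      : original images, shape (batch, 28, 28, 1), values in [0, 1]
        # target : one-hot float target labels, shape (batch, 10)
        x = tf.convert_to_tensor(x, dtype=tf.float32)
        # Change of variables: x_adv = 0.5 * (tanh(w) + 1) always lies in [0, 1].
        w = tf.Variable(tf.atanh(tf.clip_by_value(2 * x - 1, -0.999999, 0.999999)))
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                x_adv = 0.5 * (tf.tanh(w) + 1)
                logits = model(x_adv)
                target_logit = tf.reduce_sum(target * logits, axis=1)
                other_logit = tf.reduce_max((1 - target) * logits - target * 1e4, axis=1)
                # f(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa)
                f = tf.maximum(other_logit - target_logit, -kappa)
                l2 = tf.reduce_sum(tf.square(x_adv - x), axis=[1, 2, 3])
                loss = tf.reduce_sum(l2 + c * f)
            grads = tape.gradient(loss, [w])
            opt.apply_gradients(zip(grads, [w]))
        return 0.5 * (tf.tanh(w) + 1)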

We may also implement the distillation defense model or a simpler attack method (Biggio, Szegedy’s L-BFGS Attack, FGSM) to compare with Carlini and Wagner’s algorithm.
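If we do add FGSM as a baseline, the implementation is small: one gradient step of size ε in the direction of the sign of the loss gradient. A minimal untargeted sketch, assuming a model that returns logits and integer class labels:

    import tensorflow as tf

    def fgsm_attack(model, x, labels, epsilon=0.1):
        # Untargeted FGSM: x_adv = x + epsilon * sign(grad_x loss(x, y))
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        x = tf.convert_to_tensor(x, dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(x)
            loss = loss_fn(labels, model(x))
        grad = tape.gradient(loss, x)
        return tf.clip_by_value(x + epsilon * tf.sign(grad), 0.0, 1.0)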

Metrics

What constitutes “success”? Our work is successful if we can generate perturbed MNIST images that are misclassified by the model. Attacks can be targeted, where the goal is for the model to return a chosen target class t, or untargeted, where the goal is for the model to return any incorrect classification.

What experiments do you plan to run? We plan to experiment with different types of attacks using different distance metrics (L0, L2, and L∞). As noted in the Methodology section, we may also implement the distillation defense model or a simpler attack method (Biggio's attack, Szegedy's L-BFGS attack, FGSM) to compare with Carlini and Wagner's algorithm.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? For the untargeted attack, accuracy does apply, but our goal is to reduce the accuracy of the underlying model: we want to generate a set of adversarial images that significantly decreases the model's accuracy. Accuracy is also applicable to the targeted attack, but in a different sense. Rather than measure the model's accuracy against the correct labels, we want the model to predict the chosen incorrect label in a high percentage of trials for each targeted attack.
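As a concrete (hypothetical) helper for both notions, where x_adv is a batch of adversarial images and the names are our own placeholders:

    import numpy as np

    def untargeted_success_rate(model, x_adv, true_labels):
        # Fraction of adversarial images the model no longer classifies correctly.
        preds = np.argmax(model.predict(x_adv), axis=1)
        return np.mean(preds != true_labels)

    def targeted_success_rate(model, x_adv, target_label):
        # Fraction of adversarial images classified as the chosen target t.
        preds = np.argmax(model.predict(x_adv), axis=1)
        return np.mean(preds == target_label)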

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. The authors aimed to propose a new adversarial attack algorithm that generates imperceptible adversarial examples. In other words, they were hoping to generate adversarial examples that cause a target model to misclassify inputs while maintaining a low perturbation distance. They also compare their results to other attacks (some of which their new method renders obsolete).

  • Base goal: Generate a set of perturbed MNIST images that successfully attacks an MNIST classifier.
  • Target goal: Generate multiple sets of perturbed images that successfully attack an MNIST classifier and compare the results between different attacks.
  • Stretch goal: Generate multiple sets of perturbed images that successfully attack an MNIST classifier, implement a defense algorithm against these attacks, and compare the results between the attacks/defense.

Ethics

What broader societal issues are relevant to adversarial attacks and the work Carlini and Wagner have done?

Adversarial attacks, including the attacks Carlini and Wagner propose, have broad societal implications because the machine learning systems they deceive have such a wide range of applications (e.g., from self-driving cars to fraud detection). Specifically, if an attacker can craft an imperceptible perturbation to an input and successfully deceive a system, it could cause a self-driving car accident or enable a fraudster to bypass a security system.

Another societal issue that adversarial attacks touch upon is the proliferation of bias and discrimination in machine learning outcomes. Namely, attacks can manipulate a model to produce discriminatory results, such as misidentifying individuals of certain races in facial recognition models.

These papers serve as indicators of the need for more robust and secure machine learning systems, especially as adoption becomes more widespread. We will likely witness an “arms race” between attackers and engineers as new techniques and models surface.

How are you planning to quantify or measure error or success? What implications does quantification have?

When considering adversarial attacks such as the Carlini-Wagner attack, the most common metrics are the success rate, the perturbation distance, and the transferability rate.

The success rate is the percentage of adversarial examples that the neural network classifies as the target class (e.g., a chosen MNIST digit). It measures the effectiveness of the attack: a high success rate indicates that the attack reliably deceives the neural network.

The perturbation distance is the amount of distortion the attack introduces to the original input. This metric indicates how perceptible the adversarial example is: a low perturbation distance means the example is visually similar to the original and therefore harder to detect.

Finally, the transferability rate is the percentage of adversarial examples generated against one DNN architecture that also fool another DNN architecture. In other words, adversarial examples created for one model can often be transferred to another model, even if the models are trained on different datasets and use different algorithms. Transferability gives a sense of the generalizability of the attack.
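As a rough sketch of how the latter two metrics could be computed (the array shapes and names are our own placeholders):

    import numpy as np

    def mean_l2_distance(x, x_adv):
        # Average L2 perturbation introduced by the attack.
        return np.mean(np.sqrt(np.sum((x_adv - x) ** 2, axis=(1, 2, 3))))

    def transferability_rate(target_model, x_adv, true_labels):
        # Fraction of adversarial examples, crafted against a different model,
        # that also fool this independently trained model.
        preds = np.argmax(target_model.predict(x_adv), axis=1)
        return np.mean(preds != true_labels)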

Together, these metrics shape how an attack is interpreted. For example, a high success rate combined with a low perturbation distance indicates that the attack is effective while also being imperceptible, and a high transferability rate indicates that the attack is a threat to multiple systems.

Division of Labor

Among the three group members, each of us will choose an adversarial attack or defense algorithm to implement. There are many attacks and defenses, and comparing any three of them will be insightful.


Midpoint Reflection

Introduction

Neural networks are vulnerable to adversarial attacks: attackers can intentionally design inputs that cause a model to make mistakes. In our work, we attack convolutional neural networks (CNNs) used to classify images in the MNIST dataset.

Many adversarial attacks exist (Biggio's attack, Szegedy's L-BFGS, Goodfellow's FGSM), and they can generate adversarial examples quickly. However, the defensive distillation strategy proposed by Papernot et al. (2016) was shown to be effective against these attacks. The attack proposed by Carlini and Wagner defeats this distillation defense.

In our work, we implement Carlini and Wagner's attack. The Carlini-Wagner attack is a state-of-the-art attack in adversarial machine learning and is frequently used to benchmark the robustness of machine learning models.

Challenges

The hardest part of the project so far has been understanding the Carlini and Wagner attack paper. The paper is both mathematical and technical, and it was difficult to understand all of the details. We also had some trouble understanding the existing codebase, as it uses a deprecated version of TensorFlow. Hence, it took some time to revert to an old version of TensorFlow to experiment with their codebase.

Insights

At this point, we have measured the accuracy of our CNN on the MNIST data. The model performs as expected, reaching 99% accuracy by the second epoch.

We plan to wrap up the L2 adversarial attack and compare the model's loss on the original MNIST images with its loss on the adversarial examples.

Plan

Currently, we are on track with our project. We need to dedicate more time to fleshing out the L2 attack and choosing potential visualizations. We intend to display examples of the original MNIST image with its classification, the perturbation, and the adversarial image with its misclassification, as sketched below. We are also considering a line graph comparing the loss across epochs for the CNN on the original MNIST dataset and on the adversarial examples.
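A minimal matplotlib sketch of the side-by-side figure we have in mind (the array and label names are placeholders, not our final code):

    import matplotlib.pyplot as plt

    def show_adversarial_example(original, adversarial, clean_label, adv_label):
        # original, adversarial: (28, 28) arrays with values in [0, 1]
        perturbation = adversarial - original
        fig, axes = plt.subplots(1, 3, figsize=(9, 3))
        axes[0].imshow(original, cmap="gray")
        axes[0].set_title(f"Original: {clean_label}")
        axes[1].imshow(perturbation, cmap="gray")
        axes[1].set_title("Perturbation")
        axes[2].imshow(adversarial, cmap="gray")
        axes[2].set_title(f"Adversarial: {adv_label}")
        for ax in axes:
            ax.axis("off")
        plt.show()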

Additionally, we are considering including an analysis of other attacks and distillation models, as well as discussing transferability, where attacks generated on one model are applied to another model.

