From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes

Attacks vs. unintentional failure modes

- [Instructor] If one of the apps on your phone crashes, do you immediately worry, oh no, my phone is under attack by cyber criminals? Or do you maybe shake it off thinking it's a problem with one of your apps, or maybe your phone was just due for a reboot? The reality is that apps and systems can fail for both intentional and unintentional reasons. Both of these types of failures are important because each type can impact overall AI security and reliability. The primary taxonomy we'll use to walk through failure modes is Microsoft and Harvard University designed, and it includes both intentional and unintentional failures. As we discuss each type of failure, we'll highlight what each one is and how it can occur. AI is moving quickly and standards continue to emerge. Throughout these modules, we'll highlight two other important taxonomies, MITRE Atlas, an AI focused compliment to the classic Mitre attack framework and the OWASP top 10 for LLM applications 2025. Since these taxonomies are attack focused, they primarily target intentional failures, i.e., attacks, so we'll cover them primarily in the chapters on unintentional failures. It's important to understand both intentional and unintentional failures. While both can lead to system malfunction, the underlying root causes can require different mitigation approaches. Intentional failures constitute attacks that are adversarial in nature. In other words, someone trying, on purpose, to disrupt the system or use it to their own advantage. A common adversarial attack that's familiar to most people is ransomware. Many intentional attacks against AI exploit features unique to these technologies. In a predictive system or systems that analyze past data for trends, like ML driven image classification systems, this can mean causing it to return a false or incorrect result. A classic way to accomplish this is via perturbation, introducing noise to an image. For example, by changing the right pixels in a way not visible to the human eye, an attacker could cause a photo of a penguin to appear to an AI classification system as a bicycle. For generative systems or systems that create new content based on learned patterns like ChatGPT or Claude, an attack could take the form of prompt injections where an attacker directly or indirectly hijacks legitimate prompts to bring about unwanted behavior. For example, sending directives to override the AI's behavioral restrictions or guardrails. To create a system that is resilient to intentional attacks, developers, designers, and defenders, must understand the types of attacks and how they work in order to build the right controls. The other way that AI can fail is unintentionally, without anyone deliberately causing it. Going back to our crashing phone example, it could be that the designers didn't stress test the app. In this case, no attacker actively tried to make the app crash, but by not testing it thoroughly, the app failed under normal conditions. Common corruption or the introduction of natural perturbations of noise, is an example of unintentional failure in AI. This unintentional failure could be disastrous in certain circumstances. For example, passengers in a self-driving car that uses computer vision to read road signs could be put at risk if weather conditions or normal wear and tear render a sign unreadable to the AI. Stop signs have been designed to be easily recognizable by humans in varying lighting and weather conditions. But if parts of the sign have worn off with age or if they become obscured by rain or snow, chances are good you and I would still recognize that red octagon as a stop sign. But if the autonomous vehicle's AI powered classification was only trained to recognize the ideal case, then the weather corrupted sign might not be recognized as a stop sign at all. What this means is, without taking common corruption into account, an old stop sign could be the difference between life and death.

Contents