Stories by Raphael Bottino on Medium

Simulating, Detecting and Responding to S3 Ransomware Attacks

Raphael Bottino — Mon, 21 Apr 2025 16:10:33 GMT

I am fascinated by the world of possibilities that Cloud Computing enables people and organizations to achieve. When it comes to security, tools and frameworks such as the Shared Responsibility Model make following good security practices easier than ever. I am equally fascinated by new attack vectors that Cloud Computing enables bad actors to achieve, though.

Not long ago, Halcyon published a really interesting article about a concerning new ransomware campaign targeting Amazon S3 buckets. This is a new kind of ransomware, one that only exists in the cloud, thanks to the cloud: it takes the security features built into AWS to help organizations achieve security and compliance by encrypting Amazon S3 Objects, and uses them to encrypt for ransom instead.

I am not going to go over many details about the attack itself, since there are many great articles out there going over them already, like the one from Halcyon themselves or this one from SentinelOne. So why are we here, then? I believe the kind of information these articles bring is priceless, but I also believe that one should be able to programmatically validate whether their own environment is susceptible to this kind of attack, and whether they can detect and respond in case it is.

This article is about understanding how S3 encryption works, how you can use the S3 Ransomware Simulator to test your own environment, and how you can programmatically detect this kind of attack, respond to it, and prevent it.

The Attack

At its core, the attack is simple, but it requires understanding a bit of how encryption works in Amazon S3.

You might have heard that, since January 5, 2023, Amazon S3 automatically applies encryption to all new object uploads at no additional cost and with no impact on performance. And that’s great news! But there are different ways to encrypt an object in AWS, so first we need to go over them.

Understanding S3 Encryption

Client-side encryption — This is the most straightforward type of Encryption in S3. You/your application encrypt the data, before uploading it to S3, with a key that you own and manage, even if outside of AWS.
Server-side encryption: Amazon S3 managed keys (SSE-S3) — This Encryption method is the one that is enabled by default since 2023. You send your objects and they are encrypted by AWS Server-side, with Amazon S3 managed keys, and each object is encrypted with a unique key.
Server-side encryption: AWS KMS keys (SSE-KMS) — AWS KMS is a managed service to create and manage keys. Here you send your object and AWS uses server-side encryption leveraging these KMS keys to encrypt them, in case a compliance standard you must adhere to requires you to have full control of the encryption keys.
Dual-layer server-side encryption: AWS KMS keys (DSSE-KMS) — Not that different from SSE-KMS. However, some compliance standards require you to apply multilayer encryption to your data, so DSSE-KMS applies two layers of encryption to the objects.
Server-side encryption: customer-provided keys (SSE-C) — It’s like Client-side and Server-Side encryptions had a baby. Like in Client-Side, AWS doesn’t host/manage your keys. Like SSE, you don’t need to worry about encrypting your objects before uploading to the bucket. Here you provide the key as part of the upload request and AWS will encrypt the object on upload, but never save the key anywhere. This is the one we care about.

Why Is it Effective?

Given that an attacker has access to rewrite a victim’s S3 Objects, picking SSE-C as the encryption method is the most effective way to guarantee that only they can recover the objects. Since the attacker can create one unique key per victim, they can use this key to overwrite the objects, and the only way to recover access to these files is to use the key that belongs only to the attacker.

Replicating The Attack

In order to programmatically detect and respond to this kind of attack, we first need to be able to programmatically replicate it. When it comes to the S3 API, I am fairly familiar with the GetObject and PutObject actions. But in my mind it wouldn’t make much sense for an attacker to download (GetObject) every single object in a bucket in order to upload (PutObject) them back, while encrypting the data. So I started researching the best way to encrypt existing objects in a bucket.

That research led me to, funnily enough, AWS’s own blog page, where a blog post on how to encrypt existing objects described some of the best techniques. Even though the article uses the AWS CLI to encrypt the existing objects, and my goal is to use Python’s Boto3 SDK, it led me exactly to what I was looking for: the CopyObject action. In summary, I just need to make a CopyObject request where the source and destination of the copy are the same object, while making sure the request also carries the proper encryption parameters.
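To make the idea concrete, here is a minimal sketch of such a request, not the simulator's actual code. It assumes a Boto3 S3 client is passed in; the helper names `build_in_place_copy_params` and `encrypt_in_place` are hypothetical. The parameter names (`CopySource`, `SSECustomerAlgorithm`, `SSECustomerKey`) are the ones Boto3's `copy_object` accepts, and Boto3 takes care of encoding the key and computing its MD5.

```python
def build_in_place_copy_params(bucket: str, key: str, sse_key: bytes) -> dict:
    """Build CopyObject parameters that copy an object onto itself using SSE-C."""
    if len(sse_key) != 32:
        raise ValueError("SSE-C with AES256 expects a 256-bit (32-byte) key")
    return {
        "Bucket": bucket,
        "Key": key,
        # Source and destination are the same object.
        "CopySource": {"Bucket": bucket, "Key": key},
        # The encryption change is what makes a self-copy a valid request.
        "SSECustomerAlgorithm": "AES256",
        "SSECustomerKey": sse_key,
    }


def encrypt_in_place(s3_client, bucket: str, key: str, sse_key: bytes):
    """Send the request through an S3 client, e.g. boto3.client("s3")."""
    return s3_client.copy_object(**build_in_place_copy_params(bucket, key, sse_key))
```

Only run something like this against a test bucket that you own: without the key, the object can no longer be read.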

The Code

First and foremost, you can follow along in your own environment. You can find the code in the S3 Ransomware Simulator GitHub repository. It tries, as much as possible, to mimic the behavior of an attacker exploiting your own AWS environment.

The behavior goes as below:

It enumerates all the buckets available in that account, if the flag --all-buckets is used;
It generates and saves to disk an AES-256 encryption key to be used in the attack;
For each of the buckets, or just the one in case the flag --bucket-name was used, it will:
Check if it can PutObject in the bucket, dropping a dummy object
Check if it can GetObject in the bucket, getting the previously uploaded dummy file
Delete the dummy file
Considering all permissions are in place, and the flag --encrypt-objects was provided, it will:
List and encrypt all objects
Drop a fake ransom note

An example of the execution can be seen below:

$ python3 attacker.py --bucket-name raphabot-no-ransomware --encrypt-objects

S3 Bucket Encryption Tool with SSE-C

Processing specified bucket: raphabot-no-ransomware
Generated AES-256 encryption key for SSE-C: M+a4reQycj3pBBZyYs1KE9XpOcdyT7kGq1Mu+q5u+vM=
Key MD5: S2k8nSe8W9C7A2JO+Nr4mw==

Checking bucket: raphabot-no-ransomware
  GetObject permission: Yes
  PutObject permission: Yes

Processing bucket: raphabot-no-ransomware
  Encrypting: regular-file.txt
  Encrypted 1 files in raphabot-no-ransomware using SSE-C
  Ransom note dropped in raphabot-no-ransomware.

Encryption key saved to encryption_key.bin
WARNING: This key is required to decrypt your files. Store it securely!

Encryption complete. Total files encrypted: 1
Warning: Without the encryption key, your files cannot be recovered!
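The two values printed at the top of that output are presumably the Base64-encoded key and the Base64-encoded MD5 digest of the raw key, which is the format the SSE-C request headers expect. A small sketch of how such a pair can be derived with the standard library (the function name `describe_sse_c_key` is hypothetical, and this is not necessarily how the simulator does it):

```python
import base64
import hashlib
import os


def describe_sse_c_key(raw_key: bytes) -> tuple[str, str]:
    """Return (Base64 key, Base64 MD5 of the raw key) for an SSE-C key."""
    key_b64 = base64.b64encode(raw_key).decode("ascii")
    md5_b64 = base64.b64encode(hashlib.md5(raw_key).digest()).decode("ascii")
    return key_b64, md5_b64


# A fresh random 256-bit key, as AES-256 requires.
key_b64, md5_b64 = describe_sse_c_key(os.urandom(32))
print(f"Key: {key_b64}")
print(f"Key MD5: {md5_b64}")
```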

Detection

In order to respond, we first need to detect. If you have ever read about logging and monitoring in Amazon S3, you know there are many different options to do so. To understand if our buckets are being targeted by a ransomware attack, however, some options are better than others.

So I defined criteria for how I’d listen to events. Whatever method I picked had to:

Be cheap/free
Be scalable
Be simple
Be fast at notifying of the event

If you’ve been around for a while, you might know that the most traditional way to listen to events in an S3 bucket is to use Event Notification. At first, this looked like a great option, since it is built to have event notifications delivered in seconds, it is free and, although originally only supporting SNS and SQS, since November of 2021 it supports EventBridge. If you are not familiar with Amazon EventBridge, the gist is that it is a serverless service that makes it easier to build scalable event-driven applications.

Scalability doesn’t end with the performance of detection, though, and I also want to be able to deploy this detection across many buckets at scale, as code. This is where this solution starts to fall apart. Despite CloudFormation obviously supporting S3 Buckets, the S3::Bucket resource type creates an Amazon S3 bucket, it doesn’t update an existing one. An alternative would be using Custom Resources, but this would come with its own set of challenges when it comes to scale (applying the event notification across multiple buckets before the maximum Lambda timeout, for instance), complexity (writing code to take into consideration any exceptions) and security (keeping the code dependencies up to date).

Another great option would be using CloudTrail. CloudTrail comes enabled by default, logs management events across AWS services also by default, and it is free. So, chances are that you are already using CloudTrail. It is fast, with AWS suggesting that CloudTrail publishes log files about every 5 minutes, but real world testing shows that the delay is considerably lower than that. It doesn’t come without its own set of challenges, though.

Yes, CloudTrail is enabled by default and it’s free… for management events. Events that happen within buckets, like CopyObject, are called Data events, which are not enabled by default and cost $0.10 per 100,000 data events delivered. However, when it comes to scalability and simplicity of deployment, it couldn’t be a better match. Either through the Console or via CloudFormation, one can create a new CloudTrail Trail (you gotta love AWS naming!) to listen to one, some or all Buckets. As you can imagine, even if you filter to listen to data events of only the most critical S3 Buckets, listening to all data events can get pretty expensive pretty quickly. The good news is that CloudTrail enables us to use advanced event selectors to filter which events we are listening to.
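To get a feel for why filtering matters, here is a back-of-the-envelope sketch using the price quoted above. The event volumes are made-up numbers for illustration only; plug in your own.

```python
PRICE_PER_100K_DATA_EVENTS_USD = 0.10  # price quoted in the text above


def data_event_cost_usd(events: int) -> float:
    """Estimated CloudTrail charge for a number of delivered data events."""
    return events / 100_000 * PRICE_PER_100K_DATA_EVENTS_USD


# Hypothetical monthly volumes: every data event vs. CopyObject events only.
all_events = 1_000_000_000
copy_events_only = 2_000_000

print(f"All data events: ${data_event_cost_usd(all_events):,.2f}")
print(f"CopyObject only: ${data_event_cost_usd(copy_events_only):,.2f}")
```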

Creating an Advanced Selector like the one below enables us to listen only to the CopyObject event for the selected Buckets:

[
  {
    "Name": "CopyObject",
    "FieldSelectors": [
      {
        "Field": "eventCategory",
        "Equals": [
          "Data"
        ]
      },
      {
        "Field": "resources.type",
        "Equals": [
          "AWS::S3::Object"
        ]
      },
      {
        "Field": "eventName",
        "Equals": [
          "CopyObject"
        ]
      }
    ]
  }
]
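For intuition, the selector above can be mirrored in a few lines of Python that inspect a CloudTrail record once it reaches your detection code. This is a sketch, not the repository's detector. The extra check for the SSE-C algorithm header inside `requestParameters` is an assumption about how the request shows up in the record, so verify it against a real event from your own trail.

```python
# Assumed name of the SSE-C field inside requestParameters (verify in your trail).
SSE_C_PARAM = "x-amz-server-side-encryption-customer-algorithm"


def matches_copyobject_selector(event: dict) -> bool:
    """Mirror the three field selectors: category, resource type, event name."""
    resource_types = {r.get("type") for r in event.get("resources", [])}
    return (
        event.get("eventCategory") == "Data"
        and "AWS::S3::Object" in resource_types
        and event.get("eventName") == "CopyObject"
    )


def looks_like_sse_c_copy(event: dict) -> bool:
    """Narrow the match down to CopyObject calls that carry an SSE-C header."""
    params = event.get("requestParameters") or {}
    return matches_copyobject_selector(event) and SSE_C_PARAM in params
```

Filtering on the SSE-C field in code keeps the trail cheap (only CopyObject events are delivered) while cutting down on false positives from legitimate copies.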

In summary, using CloudTrail, we are able not just to deploy a “detector” at scale easily, but also to cheaply run it at scale as well. The proof is that you can find this Detection defined as CloudFormation code that you can apply today in the same S3 Ransomware Simulator repository.

Response

Once we detect an attempt, we need to be able to respond to it. The response can and will look different based on different organization preferences. Some would not ever dream of making a change in their AWS environment automatically, others would like to have humans (like you!) notified so they can take action, but some would be fine to take at least some proactive action automatically based on these events, while further investigation is ongoing. For the purposes of this blog post, we will follow AWS’ own best practices on how to remediate if there is unauthorized activity in an AWS account.

The first step to remediate the compromise of an AWS identity is to understand what kind of identity it is, since the remediation steps for each kind of identity are different. The type of identity used in an API call can be determined by checking the type attribute of the userIdentity object in the CloudTrail event. Three types of identity matter here:

IAM User: When one creates an IAM User and wants to make a request against an AWS service, they need to generate a long lived pair of access key id and secret. Shows up as IAMUser in CloudTrail.
Assumed Role: This is generally used when an application/service, not a person, needs to access AWS resources. Assuming a role leads to AWS Security Token Service generating a short-lived pair. Shows up as AssumedRole in CloudTrail.
Identity Center User: AWS IAM Identity Center streamlines and simplifies workforce user access to applications or AWS accounts. A request made on behalf of an IAM Identity Center user shows up as IdentityCenterUser in CloudTrail.

Now let’s talk about the actual remediation: blocking this identity from making further requests in the AWS account.

For an IAM User, for instance, you could disable the user’s Access Keys. It’s good to remember that AWS recommends, as a best practice, using temporary security credentials (such as IAM roles) instead of creating long-term credentials like access keys. So that’s probably a good idea anyway 😅

For an Assumed Role, you have options. You could update the role to remove this access, you could attach a policy denying CopyObject actions to S3… the options are close to limitless! If you want to be precise, you can restrict requests from the attacker’s IP address.
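The branching described above can be sketched as a small function that reads the userIdentity element of the CloudTrail event and returns which IAM call to make. This is an illustration under assumptions, not the actual state machine: the function name `plan_response` and the policy name are hypothetical, and since CopyObject is authorized through s3:PutObject on the destination, the sample policy denies that action. `update_access_key` and `put_role_policy` are existing Boto3 IAM client methods.

```python
import json


def plan_response(event: dict) -> dict:
    """Map a CloudTrail event to the IAM call that blocks the identity."""
    identity = event.get("userIdentity") or {}
    kind = identity.get("type")

    if kind == "IAMUser":
        # Deactivate the long-lived access key used in the request.
        return {
            "api": "iam.update_access_key",
            "params": {
                "UserName": identity["userName"],
                "AccessKeyId": identity["accessKeyId"],
                "Status": "Inactive",
            },
        }

    if kind == "AssumedRole":
        # The role behind the session is named in sessionContext.sessionIssuer.
        role_name = identity["sessionContext"]["sessionIssuer"]["userName"]
        deny_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Deny", "Action": "s3:PutObject", "Resource": "*"}
            ],
        }
        return {
            "api": "iam.put_role_policy",
            "params": {
                "RoleName": role_name,
                "PolicyName": "DenyS3PutObjectDuringIncident",  # hypothetical name
                "PolicyDocument": json.dumps(deny_policy),
            },
        }

    # Anything else is left for a human to review.
    return {"api": None, "params": {}}
```

Returning a plan instead of calling AWS directly keeps the decision logic easy to test, and lets you put an approval step in front of the actual change if your organization prefers that.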

Example workflow that I set up for this Response:

The workflow below guarantees that, in case of an attack where the identity is either an IAM User or an Assumed Role, the identity will be invalidated automatically.

Response State Machine

You can find this sample Response defined as CloudFormation code in the same S3 Ransomware Simulator repository.

Prevention

Of course, preventing this kind of attack is better than remediating it. Perfect security is a pipe dream that we all chase, but there are some actions that you can take today to make your environment safer against this kind of threat. Here’s a list of some of them.

Restrict SSE-C Usage

The most effective action that you can take, in case your organization isn’t using SSE-C, is to block its usage at least in the most critical S3 Buckets. Using Amazon S3 condition keys, you can update your Bucket Policy adding something like the following:

{
    "Version": "2012-10-17",
    "Id": "PutObjectPolicy",
    "Statement": [
        {
            "Sid": "RestrictSSECObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-important-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption-customer-algorithm": "false"
                }
            }
        }
    ]
}
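The "Null": "false" condition can look backwards at first. With the Null operator, "true" matches when the key is absent from the request and "false" matches when it is present, so the statement above denies any PutObject that carries the SSE-C algorithm header. A tiny sketch that mimics just this operator (a simplified model for illustration, not the real IAM policy engine):

```python
SSE_C_CONDITION_KEY = "s3:x-amz-server-side-encryption-customer-algorithm"


def null_condition_matches(request_keys: dict, condition_key: str, expected: str) -> bool:
    """Mimic the Null operator: "true" means key absent, "false" means key present."""
    is_absent = condition_key not in request_keys
    return is_absent == (expected == "true")


def is_denied_by_sse_c_policy(request_keys: dict) -> bool:
    """Would the Deny statement above apply to this PutObject request?"""
    return null_condition_matches(request_keys, SSE_C_CONDITION_KEY, "false")


print(is_denied_by_sse_c_policy({SSE_C_CONDITION_KEY: "AES256"}))  # True
print(is_denied_by_sse_c_policy({}))  # False
```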

Restrict CopyObject

If your applications are not using the CopyObject action, it might be a good idea to block it in your most critical S3 Buckets. However, as pointed out by Jason Kao, one can’t simply block the CopyObject action, since there is no dedicated IAM action for it. But if you look closely at the CopyObject API, it uses the same PUT HTTP verb as PutObject. One of the main differences is the collection of x-amz-copy-source headers. So, if we craft our bucket policy to block any PutObject request that contains the x-amz-copy-source header, we are effectively blocking any CopyObject request.

{
    "Version": "2012-10-17",
    "Id": "CopyObjectPolicy",
    "Statement": [
        {
            "Sid": "RestrictCopyObject",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-important-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-copy-source": "false"
                }
            }
        }
    ]
}

Object Versioning

This is by far the easiest to implement, with the best result. It is also the most expensive. Enabling Object Versioning in your most critical buckets will guarantee that, in case of a Ransomware attack, or even an accidental overwrite or deletion, you can still recover the original Object. Example on how to enable it using the AWS CLI:

aws s3api put-bucket-versioning --bucket my-important-bucket --versioning-configuration Status=Enabled

Now, in case the identity that the attacker is assuming has full access to the objects in the bucket, they can still delete the older versions using s3:DeleteObjectVersion. You might want to deny this action as well:

{
    "Version": "2012-10-17",
    "Id": "DeleteObjectVersionPolicy",
    "Statement": [
        {
            "Sid": "RestrictDeleteObjectVersion",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:DeleteObjectVersion",
            "Resource": "arn:aws:s3:::my-important-bucket/*"
        }
    ]
}
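With versioning in place, recovery comes down to finding the last version written before the attack and copying it back over the encrypted one. Here is a sketch of the selection step, assuming entries shaped like the Versions list returned by Boto3's `list_object_versions` (Key, VersionId, LastModified); the function name `pick_restore_version` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def pick_restore_version(versions: list, key: str, attack_time: datetime) -> Optional[str]:
    """Return the VersionId of the newest version of `key` written before the attack."""
    candidates = [
        v for v in versions
        if v["Key"] == key and v["LastModified"] < attack_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]


# Illustrative data: one clean version, one version written by the attacker.
versions = [
    {"Key": "regular-file.txt", "VersionId": "v1",
     "LastModified": datetime(2025, 4, 1, tzinfo=timezone.utc)},
    {"Key": "regular-file.txt", "VersionId": "v2",
     "LastModified": datetime(2025, 4, 20, tzinfo=timezone.utc)},
]
attack_time = datetime(2025, 4, 15, tzinfo=timezone.utc)
print(pick_restore_version(versions, "regular-file.txt", attack_time))  # v1
```

The chosen VersionId can then be passed in the CopySource of a CopyObject request to make the clean version current again.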

Pretty please, avoid using hardcoded secrets

This one is self-explanatory and a core principle of proper cloud and application security: avoid using hardcoded secrets. Sooner or later, one way or another, hardcoded secrets always find a way to get into the wrong hands. Secret Scanning should be part of your CI/CD pipelines.

Conclusion

Although Ransomware is far from being a novel kind of attack, even in Cloud environments such as AWS, this kind of attack grows as cloud usage ramps up. As organizations move more and more critical customer data to the cloud, while adopting techniques such as Workload Isolation to reduce blast radius, it’s more important than ever to have a full grasp of the environment’s posture.

Posture isn’t enough, however. Different teams will have different levels of maturity and you must be ready to detect and respond to this kind of event in near real time. Make sure you are listening to, and reacting to, your CloudTrail events. A good way to understand if your organization is ready for this, and to get ready if it isn’t, is using the provided code to simulate, detect and respond to this kind of attack in your own AWS accounts.

Originally published at https://raphabot.com on April 21, 2025.

Understanding Different Techniques for Vulnerability Prioritization

Raphael Bottino — Tue, 10 Sep 2024 03:40:16 GMT

Let's shield those vulnerabilities!

In the last article, we discussed how vulnerability is indeed a growing problem. More code, more CVEs, accelerated deployment speed are all contributors to the challenge of trying to protect your organization's code against bad actors. We also concluded that the strategy on how we approach vulnerability prioritization also needed to evolve.

There’s a silver lining: As much as vulnerability grew as a problem, possible tools and techniques to find, prioritize, and fix vulnerabilities also grew.

Vulnerability prioritization evolution

The Open Worldwide Application Security Project (OWASP), founded in 2001, is a nonprofit foundation that works to improve the security of software. Among many different projects, they maintain the OWASP Juice Shop, which probably is the most modern and sophisticated purposefully insecure web application that can be used for security research. We are going to use it for the examples below.

First, we want to understand which dependency vulnerabilities, if any, are part of this application. For that job, there's no better tool than a Software Composition Analysis (SCA) scanner, which tries to detect publicly disclosed vulnerabilities contained within a project’s dependencies. There are multiple different SCA scanners in the market, both in commercial and open-source flavors. OWASP itself maintains Dependency Check, an open-source SCA. Another great option that is commercial but has a free tier for open-source codebases is Snyk Open Source, which we will use here.

If you run a Snyk Open Source scan against juice-shop’s repository, at least as of 08/10/2024, you’d find the following:

959 total dependencies. Despite the application having 72 direct dependencies, each of them might have their own dependencies — which are called transitive dependencies to the original app — and they also might bring vulnerabilities into the application.
174 vulnerable paths. To understand this, imagine you are pinpointing a given vulnerability in one of the dependencies. Draw a line from this dependency, crossing through this dependency's dependency and all the way to the actual application. You'll find 174 of these in the application.
77 known unique vulnerabilities. As you might've guessed, since this number is lower than the number of vulnerable paths, there are vulnerabilities that repeat themselves in different paths, like the application depending on two different dependencies that depend on the same one vulnerable dependency. See the image below for a visual explanation.

Example of two vulnerable paths that lead to the same unique vulnerability.

This is a purposefully vulnerable application, as mentioned before, so we are definitely expecting multiple vulnerabilities here. However, having a modern application vulnerable to 77 unique vulnerabilities isn’t unheard of, so let’s discuss how we could prioritize them.

The usual way

The Mandalorian thinking that he actually knows the way.

The U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) maintains the National Vulnerability Database (NVD). The NVD is synchronized with CVE such that any updates to the CVE List — explained in the previous article — appear in the NVD, which augments the CVE List with additional enrichment.

One of the 77 unique vulnerabilities found by the SCA scan above was CVE-2023-37466. The NVD has an entry on this CVE, and it can be seen below:

NVD entry for the CVE-2023-37466

The traditional approach to vulnerability prioritization has been to use the Common Vulnerability Scoring System (CVSS) score. This is because there's a well-defined and tested formula, revised regularly, for how that score is calculated, defining a qualitative measure of severity. So, the expectation is that an organization should handle a vulnerability with CVSS score 10 prior to one with score 9.9.

The NVD entry goes beyond CVSS score, though. The score is calculated using different metrics and data, for instance what the Attack Vector is, whether privileges are required to exploit the vulnerability, etc. For instance, one could prioritize vulnerabilities that have the Network as Attack Vector only if the application is, indeed, exposed in the network.

This is a basic way of looking at risk, instead of looking simply at vulnerability severity, since a network exploitable vulnerability in a network exposed application is riskier than the same vulnerability in a non-exposed application.

Will it be exploited, though?

Expanding the idea of looking at risk versus severity, how useful would it be if one could predict the chances of any given vulnerability being exploited if directly exposed to the internet? Very, and that's why the good folks at the global Forum of Incident Response and Security Teams (FIRST) have been doing exactly that since 2021.

From their own definition, the Exploit Prediction Scoring System (EPSS) is a data-driven effort for estimating the likelihood (probability) that a software vulnerability will be exploited in the wild. For any given CVE, an EPSS entry is provided and it contains:

epss : the EPSS score representing the probability [0–1] of exploitation in the wild in the next 30 days (following score publication)
percentile : the percentile of the current score, the proportion of all scored vulnerabilities with the same or a lower EPSS score

This data is free and couldn’t be easier to access. In case you have an SCA tool, check whether this data is already there, since many of them, like Snyk, include it. Or simply make an HTTP GET request to the EPSS API, no authentication required. Example:

curl -s https://api.first.org/data/v1/epss\?cve\=CVE-2023-37466 | jq .

{
  "status": "OK",
  "status-code": 200,
  "version": "1.0",
  "access-control-allow-headers": "x-requested-with",
  "access": "public",
  "total": 1,
  "offset": 0,
  "limit": 100,
  "data": [
    {
      "cve": "CVE-2023-37466",
----->"epss": "0.008380000",<-----
----->"percentile": "0.824310000",<-----
      "date": "2024-09-09"
    }
  ]
}

With that we can easily see that the EPSS of CVE-2023-37466 is 0.008380000, meaning that the probability of this CVE being exploited in the wild in the next 30 days is only 0.838%.
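The conversion from the API's string value to a percentage can be sketched in a few lines. The sample payload below is trimmed from the response shown above; the function name `epss_percent` is hypothetical.

```python
import json


def epss_percent(api_response: str, cve: str) -> float:
    """Extract a CVE's EPSS probability from an API response, as a percentage."""
    payload = json.loads(api_response)
    for entry in payload["data"]:
        if entry["cve"] == cve:
            # The API returns the probability as a string in the range [0, 1].
            return round(float(entry["epss"]) * 100, 3)
    raise KeyError(cve)


sample = (
    '{"data": [{"cve": "CVE-2023-37466", "epss": "0.008380000", '
    '"percentile": "0.824310000", "date": "2024-09-09"}]}'
)
print(epss_percent(sample, "CVE-2023-37466"))  # 0.838
```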

Now let's plot all vulnerabilities that have a CVE in a two-dimensional scatter plot, where each CVE is positioned according to its CVSS Score (x axis) and its EPSS Probability (y axis). Let's now draw a 45 degree line from (0,0) all the way to the top. The vulnerabilities closer to the top of this line can be considered riskier and should be prioritized first, because they combine a high CVSS score with a high EPSS probability, while vulnerabilities close to its bottom can be deprioritized. See below:

Graphical plot of EPSS Score compared to Base CVSS Score for juice-shop vulnerabilities that have a published CVE

In summary, EPSS is another excellent tool that the AppSec community can leverage to help prioritize vulnerabilities. In this case, we can see that CVE-2019-10744, with a CVSS score of 9.1 and EPSS score of 0.2082, is riskier than the CVE-2023-37466 discussed above, since it lands closer to the top of the line that we drew.
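One way to turn the diagonal-line idea into a ranking is to project each point onto that line and sort by how far along it the projection falls. The sketch below is one possible interpretation, not a standard formula: it assumes CVSS is first scaled from 0-10 down to 0-1 so both axes are comparable, and the identifiers and scores are made up for illustration.

```python
import math


def diagonal_position(cvss: float, epss: float) -> float:
    """Distance along the 45 degree line of the point's projection onto it."""
    x = cvss / 10.0  # assumption: scale CVSS to [0, 1] to match EPSS
    return (x + epss) / math.sqrt(2)


def rank_by_risk(vulns: list) -> list:
    """Sort vulnerabilities from closest to the top of the line to the bottom."""
    return sorted(
        vulns,
        key=lambda v: diagonal_position(v["cvss"], v["epss"]),
        reverse=True,
    )


# Illustrative values only.
vulns = [
    {"id": "VULN-A", "cvss": 9.8, "epss": 0.008},
    {"id": "VULN-B", "cvss": 9.1, "epss": 0.20},
    {"id": "VULN-C", "cvss": 5.3, "epss": 0.01},
]
print([v["id"] for v in rank_by_risk(vulns)])  # ['VULN-B', 'VULN-A', 'VULN-C']
```

Note how VULN-B outranks VULN-A despite its lower CVSS score, because its exploitation probability is much higher.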

Am I even running the vulnerable code?

There’s another data point that has been gaining some traction recently on helping to prioritize vulnerability handling. Let’s pretend for a moment that we have a “math” package, version 1.0, where there’s a critical vulnerability on the “multiply” function that allows Code Injection. This means anyone running an application that depends on version 1.0 of this package would be running an application that depends on a vulnerable package. But is the application really vulnerable?

If said application indeed uses the vulnerable multiply function, this dependency should be updated as soon as possible to a non-vulnerable version. But what if the application never used the vulnerable multiply function in the first place? Although I personally believe that this vulnerable dependency shouldn’t be ignored, since in the future the app developer might decide to use this function without remembering or knowing that it is vulnerable, many believe this vulnerability should have its priority lowered when compared to an identical vulnerability whose vulnerable code is actually executed, since it is less risky. This concept is widely known as reachability.

See a representation of this concept below:

Representation of the Reachability concept.

In the example above, considering both vulnerabilities have the same CVSS and EPSS scores, it makes more sense to invest time to fix the vulnerable dependency xyz 1.0 than fixing math 1.0, since the vulnerable multiply function isn’t reached by the application.
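At its simplest, reachability is a graph search: starting from the application's entry point, can you follow calls until you hit the vulnerable function? A toy sketch over a hand-written call graph follows. The function names in the graph are hypothetical, and real tools build this graph through static or runtime analysis, which is the hard part.

```python
from collections import deque


def is_reachable(call_graph: dict, entry: str, target: str) -> bool:
    """Breadth-first search from the entry point to the target function."""
    seen, queue = {entry}, deque([entry])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


# Hypothetical call graph: the app uses math.add, but never math.multiply.
call_graph = {
    "app.main": ["math.add", "xyz.parse"],
    "xyz.parse": ["xyz.unsafe_eval"],
}
print(is_reachable(call_graph, "app.main", "math.multiply"))    # False
print(is_reachable(call_graph, "app.main", "xyz.unsafe_eval"))  # True
```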

Although I couldn’t find an open source tool that enables anyone to do this kind of analysis easily, some commercial SCA scanners, Snyk’s included, can find reachable vulnerabilities and leverage this data to help its user prioritize vulnerability remediation.

What about understanding which vulnerability to fix among different applications?

So far we’ve discussed strategies for prioritizing vulnerabilities within a single application. The reality is that one might be responsible for prioritizing vulnerability fixes across many different applications, some of which you may at first not even be aware of. How to prioritize them?

For this scenario let’s pretend a new high-profile critical vulnerability was released, just like Log4Shell once was, and one is trying to figure out, first, whether their applications are affected and, second, what the order of priority should be when it comes to fixing it across multiple different applications.

Dependency tracking

The easiest way to figure out if an application is affected by a specific CVE, even without running it against any kind of scanner, is to know what dependencies the application has. There are many tools out there that one could leverage, including the aptly named Dependency Track from OWASP or, you guessed it, Snyk.

The simplest way to achieve that is to make sure that you have a Software Bill of Materials (SBOM) for each of the current builds of your projects. SBOM is a big, hot topic on its own and going over it in detail is beyond the scope of this article. What you should know right now is that an SBOM is, in the most simplistic definition, a document that keeps track of all dependencies that are part of a given application and their versions. So one could just check if the CVE’s affected dependency and versions are part of any of their applications’ SBOMs.

Graphical representation of an SBOM
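That lookup can be sketched as follows, assuming each SBOM has already been loaded into a dictionary with a components list of name and version pairs (a simplified, CycloneDX-like shape). The application names and the set of vulnerable versions are illustrative, and real matching usually needs proper version-range handling.

```python
def affected_apps(sboms: dict, package: str, vulnerable_versions: set) -> list:
    """Return the applications whose SBOM lists a vulnerable version of `package`."""
    hits = []
    for app, sbom in sboms.items():
        for component in sbom.get("components", []):
            if (
                component.get("name") == package
                and component.get("version") in vulnerable_versions
            ):
                hits.append(app)
                break  # one match is enough to flag the application
    return sorted(hits)


# Hypothetical applications and their (tiny) SBOMs.
sboms = {
    "payments-api": {"components": [{"name": "log4j-core", "version": "2.14.1"}]},
    "docs-site": {"components": [{"name": "log4j-core", "version": "2.17.1"}]},
    "batch-jobs": {"components": [{"name": "commons-io", "version": "2.11.0"}]},
}
print(affected_apps(sboms, "log4j-core", {"2.14.0", "2.14.1"}))  # ['payments-api']
```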

Application context that you didn’t know you had

Now that one knows that applications are affected by this fictional CVE, how to prioritize among them? Unfortunately, the reality is that much of the workforce involved in protecting applications is unaware of their context. But maybe they might have access to more context than they realized at first.

It’s not rare to find organizations where the development team is using some kind of Internal Developer Portal (IDP), such as the CNCF’s Backstage, originally created by Spotify. Some other examples that you might be familiar with are ServiceNow CMDB, Atlassian Compass, Datadog Service Catalog, Harness, OpsLevel, etc.

This kind of platform hosts interesting metadata that might be valuable for this assessment. See the example below:

Sample page for Backstage’s Service Catalog

At a glance we can see which System an application belongs to, its owner, and whether it’s in production and running. This is huge! A vulnerability in an application that is running in production and is part of some kind of Payment system is definitely riskier than the same vulnerability in an application that is still in staging and belongs to some kind of minor system.

If you are in charge of vulnerability prioritization in your organization, make sure you check with your Dev or SRE teams if your organization is already using some kind of IDP in order to help you do your prioritization job.

Is it even in the wild?

But deployed doesn’t necessarily mean public facing. Most likely, an application that is public facing will be in a riskier position than one that is internal facing only.

Although there are countless ways, depending on the tech stack, to verify if an application is public facing, we can use Kubernetes as an example here. See below:

$ kubectl get ingress my-app
NAME     HOSTS   ADDRESS                PORTS   AGE
my-app   *       DNS-Name-Of-Your-ALB   80      15m

$ curl DNS-Name-Of-Your-ALB
Hello World!

This way we quickly checked that this application has an Application Load Balancer attached to it and is reachable from the internet, meaning it poses a higher risk than one that isn't.

Example of two applications that depend on the same vulnerable dependency, while just one is exposed to the internet.
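Catalog metadata and exposure can be combined into a rough ordering of which application to fix first. The sketch below uses hypothetical field names (`lifecycle`, `public_facing`) and arbitrary weights; treat it as a starting point to adapt, not as a scoring standard.

```python
def context_score(app: dict) -> int:
    """Higher score means fix first. Weights are arbitrary, for illustration."""
    score = 0
    if app.get("lifecycle") == "production":
        score += 2
    if app.get("public_facing"):
        score += 1
    return score


def fix_order(apps: list) -> list:
    """Application names ordered from highest to lowest contextual risk."""
    return [a["name"] for a in sorted(apps, key=context_score, reverse=True)]


# Hypothetical applications that share the same vulnerable dependency.
apps = [
    {"name": "internal-report", "lifecycle": "production", "public_facing": False},
    {"name": "checkout", "lifecycle": "production", "public_facing": True},
    {"name": "prototype", "lifecycle": "experimental", "public_facing": False},
]
print(fix_order(apps))  # ['checkout', 'internal-report', 'prototype']
```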

Show me the Memory

What if more than one critical application is in production and public facing? Which one is in a riskier position then? An additional strategy might be to figure out if the vulnerable dependency is even loaded in memory.

This isn't trivial, since getting this information on demand usually means instrumenting your runtime environment and/or your application. So, if your organization is using an AppSec platform like Snyk, or maybe a continuous profiling platform, you might be able to get this information to help your prioritization task.

Conclusion

In the first article of this series, I explained how my journey to find out whether there actually were more published vulnerabilities today than ever before led me to a resounding yes. But I also mentioned how it triggered my curiosity about how modern high-performing AppSec teams are handling this constant stream of new vulnerabilities.

In this article, we saw how the techniques to help prioritize vulnerabilities based on Risk also evolved over time. Tools like EPSS, Reachability and IDPs can and should be used to help identify where the highest risk lies in your organization so you can focus on what matters the most: protecting your customer data.

I hope you have learned as much as I did while I researched and wrote the two longest articles I've ever written. Were you already aware of these tools? Which one do you use the most? Which one was the most surprising to you?

A Rapidly Growing Problem Named Vulnerability

Raphael Bottino — Thu, 29 Aug 2024 13:01:52 GMT

Hollywood representation of a developer/cybersecurity professional/hacker (although not wearing a hoodie, so maybe not?). Photo by Mikhail Nilov.

Talking to friends and practitioners in the space, a recurring topic is how it seems like the number of vulnerabilities we need to deal with is growing, and how challenging it has become to prioritize which of them to fix first. Although I agree with this premise, the reality is that we are only humans and maybe we are only under the impression that there are more vulnerabilities to deal with than ever. After all, we have also been busier than ever, and maybe we simply have less time to deal with this problem.

Throughout the research that led to this article, I set out to find out if this premise is really true. Are we really dealing with more vulnerabilities than ever? But finding out if this is indeed true isn't as satisfying as understanding the why behind it, so I also tried my best to explain the growth (or lack thereof) of vulnerabilities over time with actual data.

Are we collectively generating more vulnerabilities?

The first part of this research was a simple yes or no question: are we collectively generating more vulnerabilities? To answer this question, though, we need to get a bit into the weeds of cybersecurity vulnerabilities. If you are already familiar with the terms below, feel free to skip to the next chart.

The first term we need to understand is CVE. The Common Vulnerabilities and Exposures (CVE) program, maintained by the MITRE Corporation and sponsored by the U.S. Department of Homeland Security (DHS) Cybersecurity and Infrastructure Security Agency (CISA), is a dictionary or glossary of vulnerabilities that have been identified for specific code bases, such as software applications or open libraries.

So, the CVE List is a list of all CVE IDs, and allows interested parties to acquire the details of vulnerabilities by referring to a unique identifier known as the CVE ID. It is important to note that not all vulnerabilities necessarily make their way into getting an assigned CVE ID for a variety of reasons, but the CVE List is by far the most recognized way to learn about cybersecurity vulnerabilities.

Thanks to the CVE program's efforts, answering my original question turned out to be easier than I initially anticipated. They have a dedicated page for CVE metrics that includes the number of published CVE records. It's important to know that a CVE Record contains descriptive data (i.e., a brief description and at least one reference) about a vulnerability associated with a CVE ID.

With this free data in hand, I plotted it in the chart below to help us visualize the number of published CVEs over time:

Graphical representation of the number of published CVEs from 2013 onwards, divided by quarter.

It doesn't take a lot of effort to see a clear pattern of growth in newly published CVEs from 2017 onwards, so the hypothesis is clearly true. If we zoom in on the window from 2020 to 2023, we can see that the number of published CVEs grew over 57% in this period alone! To give us a better perspective on these numbers, in 2023 alone there were 79 new published CVE IDs a day on average. That's over 3 new published vulnerabilities an hour!
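As a quick sanity check on the per-hour figure, the arithmetic can be sketched in a couple of lines of Python. The only input is the 79-per-day average quoted above; the variable names are mine.

```python
# Average number of newly published CVEs per day in 2023 (figure quoted above).
cves_per_day = 79

# Convert the daily average into an hourly average.
cves_per_hour = cves_per_day / 24

print(f"{cves_per_hour:.2f} new CVEs per hour")
```

The result lands a little above 3, which matches the "over 3 an hour" claim.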

But why?

This would have been a really lame article, and research, if it was only about confirming the hypothesis that we are generating more vulnerabilities than ever. I wanted to understand, and share with you, the why behind this growth. The reality is, though, that I can't, with 100% certainty, pinpoint reasons behind this growth.

This shouldn't stop us from speculating on some of the reasons, though. So my first hypothesis for this growth is simple: the growth in the number of vulnerabilities should be (as close to) directly proportional to the growth in lines of code developed. To prove this hypothesis, however, one needs to understand how much the codebase grew in the same period.

The challenge here is that there's no absolute way to know how much the global codebase grew in this period. We can approximate it, though, using data from GitHub, the most used SCM platform, as a proxy for how much code was created.

Luckily for me, GitHub maintains a repository, called innovationgraph, that contains structured data files of public activity on GitHub itself. One of the metrics that is captured and shared publicly is the number of “git push” operations over time. For the uninitiated, “git push” is the command a developer executes when submitting or removing code to a remote git server, which is GitHub in this case. And although growth in the number of “git push” operations doesn't necessarily mean growth in the number of lines of code, for this exercise, let's imagine that the average number of lines of code added per “git push” hasn't changed over time and is positive.

With that in mind, let's visualize this data, plotted in the chart below:

Graphical representation of the number of “git push” in GitHub from 2020 onwards.

As I expected, codebases on GitHub grew over time. If we zoom in on the window from 2020 to 2023 again, we see something interesting: by this proxy, the codebase grew by over 40% in this period, which is significantly lower than the 57% increase in vulnerabilities for the same period. This goes against my hypothesis that the growth of vulnerabilities was (close to) directly proportional to the growth of the codebase. But why such a difference between growth rates?
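To make the gap between the two growth rates concrete, here is a rough sketch that divides one growth factor by the other. It uses the rounded 57% and 40% figures from the text, so the result is only an approximation, and it inherits the assumption that "git push" volume is a fair proxy for code growth.

```python
# Growth factors for 2020 -> 2023, taken from the rounded figures in the text.
cve_growth = 1.57   # published CVEs grew by roughly 57%
push_growth = 1.40  # "git push" volume grew by roughly 40%

# If vulnerabilities grew in direct proportion to code, this ratio would be 1.0.
cves_per_push_ratio = cve_growth / push_growth

print(f"CVEs per push changed by about {(cves_per_push_ratio - 1) * 100:.0f}%")
```

Under these assumptions, the number of published CVEs per push went up by roughly 12%, which is the disparity the rest of this section tries to explain.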

Again, I can only speculate here. The cybersecurity discipline matured a lot in the past few years and we can look at this number with a positive spin, congratulating ourselves for collectively improving at detecting and disclosing vulnerabilities. Or, if we are the glass-half-empty type, we could be lamenting how code quality, when it comes to security, has decreased.

Quick sidetrack: I want to add a touch of personal opinion here. It's almost impossible to find content today that doesn't speak to Generative AI (GenAI), and this one won't be different. If lower quality code is indeed the reason why we saw a disparity between the growth of vulnerabilities and the growth of "git push", we can speculate on the impact of GenAI on the future of vulnerabilities. As you might know, GenAI is trained on existing content to generate new content, and developer copilots (GenAI agents purpose-built to generate code) are no different. So, if code quality is lower than ever, copilots will generate lower quality code as well. Worse: more code is then pushed to repositories, which will eventually lead to faster growth of (low quality) codebases and, as a consequence, of vulnerabilities.

Deployment Frequency

Vulnerable code doesn't become an actual vulnerability just because the codebase was changed. A vulnerability only exists when said vulnerable code is part of a new version of a piece of software or open source library, so it needs to be deployed first. With that in mind, I decided to research modern deployment frequency to understand if that could also impact the number of vulnerabilities.

To help me understand how the deployment frequency changed in the last few years, I resorted to DORA, the DevOps Research and Assessment group. DORA has 4 Software delivery performance metrics that many organizations leverage to measure their own efficiency and maturity level when it comes to delivering value to their customers.

One of these metrics is Deployment Frequency, or how often an organization can successfully deploy to production, which is exactly the data I need to answer the question I had in mind. Every year since 2014, with the exception of 2020, DORA has released the Accelerate State of DevOps Report, which includes the results of a benchmark assessment of DevOps performance across hundreds of organizations, so this data can help us understand how they evolved over time.

Although organizations are divided into Elite, High, Medium or Low performance tiers based on certain criteria, simply showing how the percentage of organizations in each tier evolved over time wouldn't be enough, as the criteria for each tier also evolved over time. Instead, I've plotted a chart below that represents the percentage of organizations capable of deploying multiple times a week, over the years:

Graphical representation of the percentage of organizations that deploy at least once a week from 2021 onwards.

As expected, despite a dip in 2022 that DORA theorizes may be a consequence of the pandemic, the number of organizations deploying software more than once a week grew by over 88% from 2021 to 2023. This adds another dimension to the AppSec practitioner's job: they not only need to deal with more vulnerabilities than ever, and codebases larger than ever as we saw before, but they also need to deal with deployments that are faster than ever. So they have less time to assess code quality if they don't want to slow down this fast deployment pipeline!

But we have an increased workforce… Right?

So far, we proved that the number of vulnerabilities grew, that there are more code changes to analyze than ever and that organizations are deploying to production at an increased rate. The impact of all these changes could be minimized, however, if we had more people working on protecting these applications.

Getting good data on how much the Cyber Security job market grew in the last few years, let alone the AppSec market, proved to be tough. Getting this data for the American market, however, turned out to be easier. Using the website CyberSeek as a reference, I captured the data and plotted the following chart, which includes both open and filled positions in the Cyber Security job market in the USA:

Graphical representation of the number of Cyber Security positions in the USA from 2020 onwards.

Unfortunately, and as one would probably expect, the Cyber Security job market, at least in the US, grew by less than 10% from 2020 to 2023. Of course these aren't global numbers, but I believe they are a good representation of the global growth.

Nevertheless, that means that the number of people working on securing applications everywhere didn't grow as fast as the number of vulnerabilities.

Conclusion

When I set out to answer the question of whether there were more published vulnerabilities today than ever before, I must confess that I expected the answer to be "of course yes". But I wanted to be sure. As we saw, that's exactly what happened. But we couldn't stop there; we had to understand the why.

As we saw, developers are pushing code more frequently than ever as well, while organizations are maturing their development processes to also deploy these changes at a frequency that just a few years ago would have been unimaginable to most.

All this means that we need to be smarter. Trying to deal with this influx of vulnerabilities at the speed of DevOps doesn't allow us to use the same processes and tools that we have been using so far. In part 2 of this article, we will discuss how modern tools like SCA, EPSS and diverse techniques can help us minimize the impact of the growth of vulnerabilities and of deployment speed on our AppSec programs.

Stay tuned.

Get to know your AWS Managed Policies

Raphael Bottino — Fri, 30 Sep 2022 01:36:49 GMT

Understand what an AWS Managed Policy is and how a simple step can ensure you are using the appropriate one for your need

"A studio photo of an IT professional wondering while facing their computer" according to DALL-E

Have you ever googled AWS managed policies list? What about AWSLambdaExecute statements? Recently, once again, I found myself in a similar situation.

But let's start from the beginning.

What is an AWS Managed Policy?

From its documentation:

An AWS managed policy is a standalone policy that is created and administered by AWS. Standalone policy means that the policy has its own Amazon Resource Name (ARN) that includes the policy name. (…) AWS managed policies are designed to provide permissions for many common use cases. (…) AWS managed policies make it easier for you to assign appropriate permissions to users, groups, and roles than if you had to write the policies yourself.

In short, it is a special kind of IAM Policy that is curated and maintained by AWS and enables you to move faster, focusing more on your code and less on permissions, leaving the latter to the pros at AWS.

But how do you know if the service you are working with has a managed policy that you can use for your benefit?

Service Specific Managed Policies

Reading the documentation, of course! Let's say the service in question here is AWS Lambda. A quick google search reveals the "Identity-based IAM policies for Lambda" page. There, as you can see below, three different managed policies are suggested:

Let's say you now need to use Amazon Polly so your awesome bot can have an Alexa-like voice. Again, a quick search will take you to its documentation, which lists two managed policies:

Let's move to something more complex and powerful, like AWS Systems Manager, a service so comprehensive that it almost feels like multiple services in one. Googling will show you there are multiple SSM-related AWS Managed Policies to use. What are the statements of AmazonSSMPatchAssociation, for instance?

You don't need to exercise your Google-fu

If you know which managed policy you need more information on, you are good: the AWS CLI is all you need, plus a bit of copying and pasting.

First you run aws iam get-policy --policy-arn arn:aws:iam::aws:policy/AmazonSSMPatchAssociation. See below:

https://medium.com/media/99d6e5f109362cee8fafa84233592060/href

Take note of the DefaultVersionId value, v1 in this example. Now, we run aws iam get-policy-version --policy-arn arn:aws:iam::aws:policy/AmazonSSMPatchAssociation --version-id v1:

https://medium.com/media/bc78ab0940f86db63f8bddf9a7104111/href

Now we have what we were looking for, the Managed Policy statements. With that information in hand, we can make an informed decision about whether this policy matches the use case requirements.
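The same two-step lookup can be sketched in Python. This is a minimal sketch, not the code behind the embeds above: the function name is mine, it assumes a boto3 IAM client is passed in, and it assumes the policy has no path prefix in its ARN (some managed policies live under a path such as service-role/, which this sketch does not handle).

```python
def get_managed_policy_document(iam_client, policy_name):
    """Return the default version's document for an AWS managed policy.

    iam_client is assumed to be a boto3 IAM client, e.g. boto3.client("iam").
    """
    # Assumption: the policy has no path prefix in its ARN.
    policy_arn = f"arn:aws:iam::aws:policy/{policy_name}"

    # Step 1: look up the policy to learn which version is the default.
    policy = iam_client.get_policy(PolicyArn=policy_arn)["Policy"]
    version_id = policy["DefaultVersionId"]

    # Step 2: fetch that version, which carries the actual statements.
    version = iam_client.get_policy_version(
        PolicyArn=policy_arn, VersionId=version_id
    )
    return version["PolicyVersion"]["Document"]
```

With boto3 installed and credentials configured, calling get_managed_policy_document(boto3.client("iam"), "AmazonSSMPatchAssociation") should return the same statements as the two CLI commands.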

Pro Tip

If you do that enough, this can quickly become a tedious process. So let's fix that. Below you can find a Bash function that takes an AWS Managed Policy name as a parameter and outputs all the information that you might need.

https://medium.com/media/6b1862401295e64c52edfc94046b9643/href

Another possible solution is to get access to all (currently) 973 AWS Managed Policies. The GitHub user Gene Wood was nice enough to write a gist with that list and the code he used to generate it.

There is a problem, though. AWS is always releasing new services and features and this list was last updated almost 3 years ago. How can one have an always up-to-date list of all AWS Managed Policies and all of its statements?

Search No More

So I don't go through this pain again, and so others can also avoid it, I hacked together a simple website that, once a day, updates itself to make sure you have an accessible and updated list of all AWS Managed Policies right from your browser.

Introducing… awsmanagedpolicies.io!

awsmanagedpolicies.io is a simple-to-use, always up-to-date, accessible site that lists all of the AWS Managed Policies in a simple way, with a simple-but-it-works search bar to filter down the list.

If you click any of the entries, it expands to show you the definition of said AWS Managed Policy:

If you just need a JSON file that always has the most recent list of AWS Managed Policies and their definitions, you can bookmark this link instead!

Architecture

Of course, this website is 100% built on top of AWS and 100% Serverless! Its infrastructure was defined using CDK (TypeScript), and it contains, among other things, a lambda to fetch the latest AWS Managed Policies, an S3 Bucket to host the files, and a CloudFront distribution to serve the content to you.

As soon as I publish its code on GitHub and write an article on what it was like to develop the site and how it works, I'll update this article with the links.

Conclusion

AWS Managed Policies are a great way to kick start your newest project. However, always make sure you are using the appropriate one. The best way to do it is by verifying its statements via the CLI or through the website awsmanagedpolicies.io.

Full Disclosure

I decided to finally buy the domain, finalize the website, and write this article a few days ago. I started this project a year and a half ago, give or take. Little did I know that, today, there is already an amazing solution to this problem, created by the AWS Hero Ian Mckay, called aws.permissions.cloud.

I highly recommend going to his site if you need more than just a list of the AWS Managed Policies and their definitions, such as metrics on how many AWS Managed Policies there are, whether a policy might expose a resource to the public, etc.

How to Hit AWS Step Functions Limitations…

Raphael Bottino — Fri, 03 Jun 2022 13:36:15 GMT

…and how to overcome them.

This is part two of a two-part series on my learnings as a first-time user of AWS Step Functions. You can find part one here.

TL;DR

I was able to implement a better architecture for my application and make it work. Until it didn't: it broke on buckets with enough objects, because I wrote a bad recursion. After fixing that, I realized AWS Step Functions and Lambda have yet another limitation that I wasn't aware of. But I got it fixed.

Curious? Keep reading.

Introduction

After hearing feedback from a few readers of the last article, and finally having some spare time on hand, I set out to implement a proper solution for my challenge. As a refresher, I had a lambda that would run for too long and, as a consequence, would time out fairly often depending on the input. It was a prime candidate for implementing the same logic using AWS Step Functions, and a great excuse to finally use the service.

Revisiting the Proposed Architecture

What I had in mind by the end of the last article was to create an architecture similar to the one below, where I have 2 different workflows in AWS Step Functions:

The architecture proposed in the previous article.

The first one lists all objects inside a bucket. Then, for every hundred objects, it would invoke the second workflow. Then, the second workflow would generate a pre-signed URL for each of the objects in the input array, and push it to a queue.

However, when I started to implement it, I decided to go with a different approach. There would still be two Workflows, but they would work slightly differently than originally proposed.

First Workflow: Starter

Above you can see the first workflow. It is quite simple, actually. I call it the Starter workflow, since it’s the first one to run, and all it does is list all keys in a bucket. Then it starts the second workflow using this array with all keys as input.

Second Workflow: The loop.

This is where things get interesting. To avoid running into the previous problem of reaching the maximum number of historical events (see the first article), the first step of this workflow is to select (up to) the first 500 keys in the original array, because I know from previous tests that I won't run into this problem with that many keys.

Then, in parallel, two distinct pieces of logic are executed. On the left side of the diagram, for each of the up to 500 keys, we have exactly what we had before: a lambda that takes the key, generates a pre-signed URL, and pushes it to a queue. On the right side, there is a Choice State that checks if the original array, minus the 500 keys that are being processed, still has any keys left. If there aren't any, that's pretty much it. However, if there are, it will execute this second workflow all over again for the remaining keys. All that means this won't ever hit the historical events limit and, as a bonus, there is a lot of concurrency going on, speeding up the process.

After some time trying to implement the two new workflows, I ran into some challenges, such as getting the Choice State wrong so that the workflow always called itself again, sending the recursion of the State Machine into an infinite loop. But, after some coding, and a few more mistakes, I got it done.
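The control flow of this second workflow can be sketched in plain Python. This is only an illustration of the batching logic, not Step Functions code: the function names are mine, process_key stands in for the lambda that generates the pre-signed URL and pushes it to the queue, and the loop runs sequentially where the real workflow starts a new execution and runs its branches in parallel.

```python
BATCH_SIZE = 500  # batch size used in the article


def split_batch(keys, batch_size=BATCH_SIZE):
    """Split keys into the batch to process now and the keys left for later."""
    return keys[:batch_size], keys[batch_size:]


def run_loop_workflow(keys, process_key):
    """Mimic the loop workflow: handle one batch, then run again on the rest.

    Returns how many workflow executions were needed.
    """
    executions = 0
    remaining = list(keys)
    while True:
        executions += 1
        batch, remaining = split_batch(remaining)
        for key in batch:   # left branch: one task per key
            process_key(key)
        if not remaining:   # right branch: the Choice State
            return executions
```

For example, a bucket with 1,200 keys would take three executions under this scheme: two full batches of 500 and a final one of 200.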

I did it! Or did I?

That was it. I did it. I was excited that I finally did it. I right away sent my code to Felipe, a friend who has a big interest in this, to have him test it in his account, so I could be sure I wouldn't run into a new take on the old classic: "But it runs on my AWS account!"

But I did. It didn't work in his account.

I knew there were more objects in the bucket he used for testing the code than I had in mine, but I couldn't understand why it would not work. After all, my new Step Functions workflow filters just the first 500 keys to work with and that was the only issue I had found beforehand. When troubleshooting the execution, I realized the second workflow was never triggered, so the problem was in my simple lambda to list all the keys.

https://medium.com/media/b62bba791fe1f2410d718920df990123/href

The code is fairly straightforward, but I did something wrong here. I used recursion poorly. As I mentioned in the previous article, each API call returns a page with a thousand objects. If I need more, I need to make the same call, passing the previous call’s NextContinuationToken as the ContinuationToken parameter. So I was calling the function over and over again, stacking them on each other and… for a bucket with enough objects, using all the memory allocated to my Lambda function, which was blocking it from moving along.

https://medium.com/media/41a7ffdf6e86b62799fa0a7ab1b27161/href

After changing the original code to the above, I removed the recursion and Felipe tested it again. And this time it didn't fail! At least not because the function was using all the memory available…
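For readers who can't see the embedded code, the non-recursive approach can be sketched in Python as below. This is not the code from the embed above: the function name is mine, and s3_client is assumed to be a boto3 S3 client, whose list_objects_v2 call returns IsTruncated and NextContinuationToken when there are more pages.

```python
def list_all_keys(s3_client, bucket):
    """Collect every object key in a bucket, one page at a time, without recursion."""
    keys = []
    kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # Ask for the next page, starting where this one ended.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Because each page is handled inside a loop, no call frames pile up on the stack, no matter how many pages the bucket has.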

Yet another limitation

There I was, finding yet another limitation. Both Step Functions and Lambda (for asynchronous requests) have a limit of 256KB for their payload.

This payload is too big for this lambda function

The array of keys generated by the Function had so many keys that it was bigger than 256KB, breaking the continuity of the workflow. Again, just like all of my challenges so far, that’s on me. RTFM.

AWS has a recommendation for Step Functions that require passing large payloads around. Just don't. Instead, AWS recommends using S3 to save the payload and passing the object ARN around so the next step in the workflow can read the payload from there. I quickly changed my code to use this approach and, finally, the code works as expected!
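A minimal sketch of that pattern in Python could look like the following. The function name, the shape of the returned dictionary and the exact byte threshold are my assumptions for illustration; s3_client is assumed to be a boto3 S3 client, and the sketch passes the bucket and key around rather than a full ARN.

```python
import json

# Assumption: treat the 256KB limit as 256 * 1024 bytes of serialized JSON.
MAX_PAYLOAD_BYTES = 256 * 1024


def offload_if_too_large(s3_client, bucket, object_key, payload):
    """Return the payload inline if small enough, else a pointer to a copy in S3."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_PAYLOAD_BYTES:
        return {"inline": payload}
    # Too big to pass between states: store it and hand over only its location.
    s3_client.put_object(Bucket=bucket, Key=object_key, Body=body)
    return {"s3_bucket": bucket, "s3_key": object_key}
```

The next step in the workflow would then check which form it received and, if it got a pointer, read the object from S3 before continuing.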

Conclusion

I learned so much getting this code to a state that I am comfortable sharing it with my peers, but I also wasted a good amount of time just because I decided to do it instead of first reading at least a bit of the service documentation.

I still highly recommend using AWS Step Functions IF you are comfortable with these limitations and the workarounds to make your code work and maintain it long-term. I also recommend reading its best practices before you attempt writing your first line of code. Also, as I was writing this article, AWS released a Step Functions workshop that looks really promising.

Are you feeling more comfortable with AWS Step Functions after this two-part series? Are you ready to start using it or do you think you should just use Lambda? Let me know in the comments section!

How to Hit AWS Step Functions Limitations… was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to use, and NOT use, AWS Step Functions

Raphael Bottino — Fri, 20 May 2022 02:59:50 GMT

TL;DR

I had a problem with a Lambda Function timing out and decided to give AWS Step Functions a spin to solve my problem. I found an incredible service, but also hit some limitations that you might hit one day as well.

Curious? Keep reading.

Introduction

Just recently I found the perfect excuse to finally try out Step Functions beyond running some kind of Hello World. Trend Micro, the company that I work for, has a pretty interesting solution called Cloud One File Storage Security, or FSS for short. This is a technology that scans S3 objects for malware as soon as they hit the bucket. Then, it allows you to do anything with that scan result downstream, from tagging as malicious to promoting the clean files to a different bucket. In summary, a great security tool for a builder to have in hand when there is a need to answer about compliance.

However, many customers aren't builders and just want to answer a simple, yet hard-to-answer question: "Is there any malware in my S3 buckets right now?" The usual answer has been "Simple! Move/copy all your objects from one bucket to another, so a scan is triggered and an answer is given." Sure, it works, but it's far from great. Then we have a problem: how to scan every single object in a bucket without asking the customer to move objects around?

My original approach

FSS works in a really neat way, and you can read more about its architecture in its public docs. The key part to understand is that there is an AWS Lambda Function that keeps listening for S3:ObjectCreated in a specific bucket. Whenever an object is uploaded to this bucket, triggering said event, the function is invoked, creating a pre-signed URL for the new object and sending it to a "Scan queue".

Oversimplified FSS Architecture

So my solution to the problem was darn simple: let me write some code that loops through all the objects in the bucket and, for each, pretty much does the same thing: creates a pre-signed URL and pushes it to the SQS queue.

Problem solved: it worked beautifully. However, some users started to report that many files would remain unscanned. After some quick troubleshooting, it was easy to find the issue. It turns out the function was timing out, despite its timeout being set to 15 minutes, AWS Lambda's maximum. The process of generating a pre-signed URL and pushing a message to the queue takes a few seconds per object, and when there are a few hundred of them, it's easy math to see that 15 minutes aren't enough.
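To spell out that easy math, here is a back-of-the-envelope sketch in Python. The 3 seconds per object is an assumed value standing in for "a few seconds"; the real number was not measured for this example.

```python
LAMBDA_TIMEOUT_SECONDS = 15 * 60  # AWS Lambda's maximum timeout
SECONDS_PER_OBJECT = 3            # assumption: "a few seconds" per object

# Largest number of objects a single sequential invocation could finish.
max_objects = LAMBDA_TIMEOUT_SECONDS // SECONDS_PER_OBJECT

print(max_objects)  # 300
```

Under that assumption, a single invocation runs out of time at around 300 objects, so any bucket with more than a few hundred objects is left partially unscanned.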

Rearchitecting for AWS Step Functions

If you are not familiar with AWS Step Functions, it "is a low-code, visual workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic", all according to AWS itself. The highlights are mine, exactly to show how perfect this service would be for my use case. I can decouple the logic of my one lambda that often times out, first listing the bucket objects and then distributing the result of that to parallel function executions that will each generate a pre-signed URL for one object and send it to the queue.

It gets better. Relatively recently, AWS Step Functions released a pretty neat feature, the possibility of integrating directly with other services without writing any code. I can have my workflow list all the objects in the bucket and return an array to me, again, all without writing any code.

After dragging a few blocks and trimming down my code to work for just one object, a process that took me just a few minutes, my new application was ready. I was in awe of how easy it was.

The Workflow for my new bucket scanner.

To my surprise, it just worked. And it was fast, way faster than what I had previously. But the title of this article is about how to also NOT use AWS Step Functions, so you can imagine where this is going… I had a problem.

This time, to take no chances, I tested my code not with a few hundred objects, but with thousands instead. If you are familiar with AWS APIs, or pretty much any API for that matter, you must be familiar with the concept of pagination (if you are not, there is this easy to read article by developer Johanne Andersen), and calling the S3 API through a Workflow is no different: it first returns the first page (duh), which is limited to 1k objects. The problem is that there is no way to paginate further using the service! Meaning that, in my case, I was able to start a scan for only the first thousand objects, nothing else. After some research, I found out that even AWS Hero Ben Bridts added support for pagination to his wishlist for the service, so I'm not the only one missing the feature.

Dropping Direct Service Integration

AWS Step Functions was so easy to work with, and such a big improvement over my original implementation, that I decided to stick with it, changing the approach slightly. Instead of using the integrated API call provided by the service, I came up with the brilliant idea of writing my own lambda function to just return the list of objects, paginating through all of them. AWS' SDK is nice enough to automate the pagination on my behalf, making writing this code fairly easy.

Second attempt.

Again, it was surprisingly easy to change my workflow to work this way. And, right away, I could see that the first step was returning the bucket's entire list of objects. It was going to work. Then, a few thousand objects into the "For each object" map, the workflow failed with:

The execution reached the maximum number of history events (25000).

What does that even mean? It looks like each workflow execution has a hard quota of 25 thousand entries in the execution history, and there is no escaping that. At least not an escape that is as simple as using AWS Step Functions itself. The documentation suggests implementing a pattern that uses an AWS Lambda function that can start a new execution of the state machine to split ongoing work across multiple workflow executions. So I can't go beyond a few thousand objects without rearchitecting the workflow.

Conclusion

AWS Step Functions is a brilliant, easy-to-use service with a low learning curve, and the fact that I ran into these challenges doesn't mean I don't recommend it, because I do. I started using it before first trying to understand its limitations and hit some of them, but even then I drastically improved the readability, performance, and overall capability of my code in a short amount of time.

By the way, if you are as curious as I am and want to know how I solved this problem, the answer is that I haven't yet. Whenever I have some bandwidth, the architecture that I want to follow looks similar to the one below:

Proposed workflows to solve the original problem

I believe the proposed architecture would avoid both the pagination and historical events limitations. In the meantime, however, to make sure Trend customers can run a full scan of their buckets, I wrote a Python script that runs locally on their computer, really similar to the first Lambda function flow that I presented.

Have you ever had a similar challenge? Have you ever used AWS Step Functions or are you a guru already? What do you think of my proposed solution? I'd love to hear more!

Note: About not being able to paginate over the API results using the AWS Step Functions service integrations, the cloud engineer Thomas Laue pointed me to a Steven Smiley article that goes over how to handle the pagination inside of the workflow.

Update

I published a follow up to this story right here, where I go over the new approach and other limits that I hit in AWS Step Functions.

How to use, and NOT use, AWS Step Functions was originally published in AWS Tip on Medium, where people are continuing the conversation by highlighting and responding to this story.

10 Tips to Kick Start Your Cloud Career

Raphael Bottino — Mon, 14 Jun 2021 21:01:25 GMT

Skycrafters insights on how to start your cloud career sooner rather than later

The cloud career winding road

Are you ready to join the Cloud Computing market? Whether you are new to IT or coming from the datacenter world, here you can find a few tips that I put together. You don't need to trust me though, as this is a compilation of a chunk of the collective knowledge of the Skycrafters community. More details on what Skycrafters is will come later, because being aware of it is a great tip on its own.

1 — How Certifications are Perceived 👀

Of course, first, we are going to debate certification. After all, it usually is a common target for those new to a space to get to learn more about it.

There is a long debate in our forum about whether or not you should pursue certification as part of your cloud career roadmap. Based on the discussion, it's clear that holding a certification doesn't necessarily show that you are really knowledgeable in the content itself, but it for sure shows eagerness to learn. It's common sense that certifications are no substitute for hands-on experience, but the goal of achieving one might be a great way to motivate yourself to learn a new skill, and also a great way to open doors.

Keep in mind that a certification won’t necessarily make your resume shine brighter than others, but it might actually be a requirement for the role. Watch out, though. If you start to collect certifications, don’t share them all publicly, like in an email signature or LinkedIn, as it might come off as bragging!

2- Preparing yourself for the Certification Exam 📚

Now we go to step 2. Once you have decided to pursue a certification, you'll have countless hours of study ahead of you, and training will come in handy to give some structure to it. An interesting question, also previously discussed by Skycrafters, is whether there is any difference between taking your training in person or online. With some parts of the world slowly going back to business as usual (as much as possible), in-person learning is now, once again, a possibility. But remote training is always available, no matter where you are! Which one works best for you? Let's go over some of the pros and cons of each.

Online 💻

Pros👍:

Being able to playback content faster or slower depending on familiarity with the topic
Easier, since it can be done from anywhere
Sometimes, free

Cons👎:

Easy to have your mind wandering to something else (after all, your Slack and phone are right there!)
The loneliness of not having someone to discuss the coursework with on a daily basis

In Person 👩‍🏫

Pros👍:

Closeness to other people to exchange ideas and thoughts
A live instructor to consult with in real time

Cons👎:

Harder to deal with boring topics/classes
Usually more expensive

Which one is the best? This is a personal choice based on how you value each of the bullet points above. I’d personally pick in person training anytime, because I get easily distrac… Sorry, I was checking my phone. Where were we again?

3- Be aware of vendor lock-in in the Cloud 🔒

I personally think this tip is really interesting. As someone who is learning a new technology, especially if you are aiming at a particular certification, it's easy to get 100% focused on just one Cloud Service Provider for a while, like AWS for instance, and that's fine. Most cloud concepts are interchangeable between providers. To prove the point, you can even check the official Azure documentation on how their services compare to AWS'.

Just make sure you don’t ignore that there are other options in the market, which have the potential to perform better for particular use cases than your provider of choice. Companies know that, which leads us to the next tip…

4- The dream of cloud-agnostic 💭

Cloud-agnostic. You are going to hear this term a lot over your cloud career. However, the story isn't as simple as most like to paint it. Vendor lock-in is a true challenge that the Skycrafters community has been discussing, and, to many, escaping it is simply a pipe dream. A dream because a truly cloud-agnostic environment would be able to run on any provider's environment, or even locally on your computer/data center/cloud. A truly cloud-agnostic architecture has the potential to enable organizations to pursue lower costs, faster time to market, access to state-of-the-art technologies and avoidance of the issue mentioned in the previous tip, lock-in. There are also benefits beyond the technical, like a "get out of jail free" card in case your current vendor becomes a competitor in your space with a brand new service release.

Skycrafters consider it a dream, however, because it can be expensive, in different ways, to run and operate such a particular workload. If you try to escape from using Amazon SQS, for instance, you might want to leverage open-source solutions like RabbitMQ, which is awesome! However, now you and your team need to deploy and maintain a new piece of the platform that isn't directly delivering value to your customers. Cloud computing is all about making the most of the Shared Responsibility Model, and running your own infrastructure services isn't the way to maximize it.

5- Kubernetes to the rescue? ⛴

Kubernetes is viewed as a way to minimize vendor lock-in through its open architecture. It can take servers, no matter if they are running on AWS, Azure, GCP, on premises, etc., and transform them into computing capacity for one big cluster spread across all of them. But it isn't a bed of roses either, according to those who have experience with it. Running it yourself can be really painful, which is exactly why most providers also offer their own flavor of Kubernetes-as-a-Service. But wouldn't using it take you back to the lock-in stage? Don't use it just because it's a hot topic, as many do. If you ever learn and use it, make sure it's a solution that tackles your pains and solves your challenge.

The goal here is to make clear that you shouldn't create a vendor lock-in situation for yourself. Make sure you don't just understand how to use the services of your provider of choice, but also the reasoning and concepts behind them.

6- Infrastructure as Code 📄

Now that you have mastered all the cloud knowledge you were seeking and learned all the cool stuff that your cloud provider of choice has to offer (this was a joke, you won't ever know it all and that's fine!), you also need to learn that you are rarely going to use its dashboard to build anything for real. The dashboard is great for labs, tests, demos or learning something new, but not for a production environment. A production environment requires predictability, agility, consistency, minimization of risk and reproducibility. If you try a thousand times to create a simple S3 bucket in AWS via its dashboard, it's almost guaranteed that you are going to make a mistake at least once and, even if you nailed it, it would take you a lot of time. Hence, Infrastructure as Code (IaC).

IaC is a way to describe your infrastructure as, you guessed it, code. Just as software is defined in lines of code, so is the infrastructure. You can write code that defines a thousand different S3 buckets and, as you execute it, you would reliably and quickly have a thousand buckets. No mistakes made.

To give you a better idea of what IaC looks like, here's a quick example:

Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      # S3 bucket names must be lowercase and globally unique
      BucketName: my-really-cool-bucket
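The "many buckets from a few lines of code" idea can also be sketched in Python. This is only an illustration, not one of the IaC tools discussed here: the function name and the logical IDs are made up, and BucketName is left out on purpose so that CloudFormation generates a unique name for each bucket.

```python
import json


def build_bucket_template(count):
    """Return a CloudFormation template (as a dict) declaring `count` buckets."""
    resources = {
        # Logical IDs must be alphanumeric; the bucket name itself is
        # omitted so CloudFormation picks a globally unique one.
        f"S3Bucket{i}": {"Type": "AWS::S3::Bucket"}
        for i in range(1, count + 1)
    }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


if __name__ == "__main__":
    # CloudFormation accepts JSON as well as YAML templates.
    print(json.dumps(build_bucket_template(3), indent=2))
```

Keep in mind that CloudFormation enforces a per-template resource quota and AWS accounts have a bucket quota, so a real thousand-bucket deployment would likely need several stacks and a quota increase.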

As our members previously discussed, there are a lot of different IaC flavors, some native, some open-source, some multi-cloud. To name a few that you might want to check out, we have:

CloudFormation: AWS native and YAML/JSON based
Azure Resource Manager (ARM) templates: Azure native and JSON based
Terraform: Open-source, multi-cloud and it uses its own DSL (Domain-Specific Language)
CDK: Newer AWS native offering that is open-source and lets you code in your favorite programming language
Bicep: Newer Azure native offering that is open-source and has its own DSL
Pulumi: Newer open-source offering that is multi-cloud and lets you code in your favorite programming language

And since IaC is easily replicable, you can take your time to build one really well crafted and documented template and reuse it across your projects, organization, and even publicly share with other members of the community! Which brings us to the next tip…

7- Best practices in the cloud ✅

This is another great topic our members are discussing. Building confidently in the cloud can be challenging. Often, we use a technology and find out later that we could have been using it better. There are hundreds of different ways to build in the cloud using the providers' services. And, despite the default configuration for many of them being "good enough", "good enough" often doesn't cut it.

That's exactly why many providers offer their own set of best practices, usually called a Well-Architected Framework, or WAF for short. Taking AWS as an example here, their WAF is divided into five pillars: Cost Optimization, Operational Excellence, Security, Performance Efficiency and Reliability. Each pillar has its own set of white papers that thoroughly explain how to achieve state-of-the-art usage of their services, while understanding the balance between the five pillars.

As it can take a while to build well-architected architectures for your projects, the combination of WAF with IaC is really powerful. Whenever you write your own IaC templates that build well-architected architectures, you can reuse them across your applications, saving time and bringing your environment to the forefront of what the cloud providers can offer.

8- Hybrid Cloud ☁️

Hybrid cloud is a really hot topic right next to multi-cloud that our community is debating. First, let’s take NIST’s definition for it:

The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).

Although NIST's cloud computing definition is taken as the de facto way to describe it, the hybrid cloud approach is mostly used when there is a combination of one or more public providers and a private cloud to support an organization's IT needs.

This is particularly interesting for organizations that have restrictions on how they process certain types of data, like banks. This allows them to leverage the public cloud to easily and cheaply scale as needed while keeping the customer's data local to their data center.

It's important to note, however, that being able to pull this kind of scenario off is really challenging, since it can be particularly hard to separate the data access through Identity and Access Management across clouds. Also, dedicated links between the provider data center and your organization can be necessary because of the latency introduced by internet access, and they can be quite expensive.

9- Complement your studies with Podcasts 🎧

This is probably the easiest and most fun tip. You probably can't spend the entire day studying; at least I'm sure I can't. At some point you are going to find yourself doing something manual and boring that doesn't take much of your brain power. Riding the subway or driving a car to the office (remember those?) is a good example, but it might also be cleaning your place or mowing the lawn, it doesn't matter. This is the perfect time to expand your cloud skills by listening to a good podcast. Our members have compiled a good set of podcasts that I'm happy to share with you here:

Screaming in the Cloud, by Corey Quinn
The Idealcast with Gene Kim
Talking Serverless, by Ryan Jones
The New Stack Makers, by The New Stack
Mik + One, by Dr. Mik Kersten
Girls In Tech podcast
Blinkist
Cloud Security Podcast

Keep in mind that not all of them are necessarily cloud related, but they might help you develop other skills and areas of expertise to help you in your cloud career. After all, being technical isn't binary, and the cloud space isn't just for those at the far-right end of that spectrum.

10- Skycrafters are here to help❤️

Cloud computing can get challenging really quickly, but it can also be exciting and fun, especially when you have a number of peers to work and innovate with. It doesn't matter if you are a seasoned cloud practitioner or if you are just starting out, Skycrafters can be a place for you to network, find answers quickly and bounce ideas off other members while learning in the process.

Skycrafters is home to great curated content, amazing open-source code to use or contribute to, and a safe place where I hope you can grow your cloud career and skills.

What are you waiting for? Skycrafters is 100% free, no gimmicks, and joining it can be a stepping stone for your cloud career and those around you.

A version of this post was originally posted at: https://skycrafters.io/blog/2021/06/01/kick-starting-your-cloud-career/

Amazon AppFlow — How to Leverage It

Raphael Bottino — Wed, 29 Apr 2020 03:47:54 GMT

Extra! Extra! Amazon AppFlow is Released

Have you heard about Amazon AppFlow? It's a brand new service from AWS that allows you to easily integrate SaaS applications such as Salesforce and Marketo with AWS services such as Amazon S3, or with destinations such as Snowflake.

Look: Amazon AppFlow logo!

Yet another day, yet another new AWS release. Even in challenging times, with the current Covid-19 pandemic still slowing the global economy down, AWS shows that it is in full-throttle mode and released a new service last week called Amazon AppFlow.

What is it?

Quick summary on how Amazon AppFlow works.

Pretty much paraphrasing the announcement, Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between Software-as-a-Service (SaaS) applications and AWS services, in just a few clicks. As with pretty much everything else on AWS, with AppFlow you can run data flows at nearly any scale at the frequency you choose, paying just for flow runs and data processed, with no upfront charges. For those who are security-aware (if you aren't, you should be!), AppFlow automatically encrypts data in motion.

…and what does it mean?

It means zero time invested in learning both the source's and the destination's APIs. With a few clicks you can, for instance, back up all customer support cases from Salesforce to S3 on a weekly basis, or push a daily list of new leads from Marketo to Snowflake, allowing your team to quickly understand your leads' behavior, all with no coding required.

But since I said that you should be security-aware, you might be thinking… "Can I leverage that for security purposes?" Yes! You can leverage this non-security related service to help you with your security.

Using it for Security

Trend Micro is the only Security vendor to be an AppFlow launch partner, which allows AWS and Trend Micro Cloud One customers to create AppFlow flows using Workload Security data as input, easily moving data from this security service to different destinations.

OK. AppFlow looks cool. The Salesforce and Marketo examples look cool. Having a security vendor like Trend Micro as a launch partner also looks cool. But how do you use it?

Let's get our hands dirty

Of course, being the technically curious person that I am, reading the release notes and examples is definitely not enough. I need to get my hands dirty. So, feel free to follow me on this journey.

Creating our First Flow

First, of course, let's hit the AppFlow dashboard.

Amazon AppFlow dashboard

If we click the bright orange "Create flow" button, we will be taken to the first step in creating our first flow. For this flow, I decided the name would be "CloudOneWorkloadSecurity-Computers" and I moved to the next step, without setting any of the optional settings.

Step 1. Really easy so far.

On Step 2 we can see exactly where AppFlow shines. I picked Trend Micro as Source and all it requires to be able to fetch data from Cloud One is an API secret. Again, no coding required.

Did you expect to see my API secret here?

As soon as I added my API secret, AppFlow presented me with the different object options that it can retrieve from Cloud One. For launch, only "Computers" and "Policies" are available, as you can see below, but we should expect to see more options later down the road.

Object options

Then I picked "Amazon S3" as my destination, deciding on my bucket and a prefix to the objects.

Step 2. Still easy!

Now we move to Step 3. Clicking on the drop-down "Choose source fields", we can decide on which fields we care about for this flow. I clicked first on "Map all fields directly", but because Cloud One is so thorough, I quickly realized it had way more information than I needed for this use case. So I selected only the 9 fields that I care about.

1, 2, 3… 9 fields!

On the following step, I could pick to run the flow on demand or to set a schedule for it. I decided, for this example, to run it daily.

Step 4. I can't believe it is that easy.

And that's it. The flow is ready to be used. And so I did.

Done!

In a little over 10 seconds, AppFlow fetched my Computers info from Cloud One Workload Security and dumped it to an S3 bucket.

Details on the flow execution.

Clicking the "View data" link takes me straight to the bucket, where I can see the lonely file there. Downloading it shows me exactly what I expected, information taken straight from Cloud One.

Data straight from Cloud One

Houston, we have a problem…

There is a problem with that, though… There isn't a ton of value in this flow by itself, plus, my hands didn't get that dirty. If you just wanted to know what AppFlow is and how to use it, the article ends here for you. Thanks for stopping by! If you, like myself, like to get your hands dirty, let's move to the next stage.

Working with the data

After the daily run of this flow, I want to work with the generated data automatically, as soon as it hits the S3 bucket. The idea is to go through the generated data, process it and write the result to another bucket. For this example, I decided to generate a daily JSON-compatible array of computers whose current state is different from "active", which means they probably have some kind of connectivity issue with the Cloud One manager. The final result is something similar to the diagram below:

The diagram below.

Before we go any further, it's important to note that the project (which has its code available on my GitHub) has its infrastructure built using AWS CDK (TypeScript), while the Lambda code was built using JavaScript. If you are not familiar with CDK, I highly recommend the CDK Workshop documentation.

CDK stack code.

The code above describes the project infrastructure, generating a CloudFormation stack with a destination S3 bucket, a Lambda function and the proper permissions. Since I wanted to trigger this Lambda as soon as the source bucket received the data, I tried for a while to add this trigger to the code with no success; until I remembered, of course, that I wouldn't be able to do it — CloudFormation doesn't support adding event triggers to existing buckets.

After creating the infrastructure, I went ahead and coded the last missing piece: the Lambda itself.

The Lambda function code

The code is pretty straightforward. First, it downloads the newly added data to the Lambda execution environment. Then, it processes the data. Since the original file has one JSON-described computer per line instead of an array of objects, I trimmed the file (to remove any white space from the end of it) and split it into an array of strings. Since each string represents an object, I mapped the array to return the objects that each string represents and, then, filtered out all objects where the state is active, since they are not relevant for us. Finally, all the non-active computers were written to the destination bucket.
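For readers who prefer to see the transformation spelled out, here is a minimal Python sketch of the trim, split, parse and filter steps described above. It is not the project's actual Lambda code, which is written in JavaScript, and the `state` field name is an assumption for illustration; the real field name in the Cloud One export may differ.

```python
import json


def non_active_computers(raw_text, state_field="state"):
    """Turn a JSON-per-line export into a list of non-active computers.

    `state_field` is a hypothetical field name used for illustration.
    """
    # Trim trailing white space, then split into one string per computer.
    lines = raw_text.strip().split("\n")
    # Parse each non-empty line into an object.
    computers = [json.loads(line) for line in lines if line.strip()]
    # Keep only the computers whose state is not "active".
    return [c for c in computers if c.get(state_field) != "active"]


if __name__ == "__main__":
    sample = (
        '{"hostName": "web-1", "state": "active"}\n'
        '{"hostName": "db-1", "state": "offline"}\n'
    )
    # Serialize the result as a JSON array, ready to be written to a bucket.
    print(json.dumps(non_active_computers(sample)))
```

The download from the source bucket and the upload to the destination bucket are left out on purpose, since they depend on how the function is wired to S3.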

After deploying the above stack, the last step is to manually connect the source bucket to it. Go to the bucket properties, click on Events and create an "All object create events" notification for it. Make sure to select the newly created Lambda to receive the notification. Now, for every AppFlow run, this Lambda will also be triggered.
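The same connection can be scripted instead of clicked. The sketch below is an assumption-laden alternative to the console steps, not something from the original project: it expects `s3_client` to be a `boto3.client("s3")` instance, and the bucket name and function ARN are placeholders you would supply.

```python
def lambda_notification(function_arn):
    """Build a notification configuration for all object-created events."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def connect_bucket_to_lambda(s3_client, bucket_name, function_arn):
    """Point the bucket's object-created events at the Lambda function."""
    # Note: this call replaces the bucket's whole notification configuration.
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=lambda_notification(function_arn),
    )
```

Two caveats apply: any notification already configured on the bucket is overwritten by this call, and S3 must be allowed to invoke the function, otherwise the request is rejected.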

Bucket Events.

If you run the flow manually again to test the environment, you should see a new file in your new bucket, listing only the Cloud One computers that currently aren't in the "active" state.

Resources:

[1] https://aws.amazon.com/new/

[2] https://aws.amazon.com/blogs/aws/new-announcing-amazon-appflow/

[3] https://docs.aws.amazon.com/appflow/latest/userguide/what-is-appflow.html

[4] https://blog.trendmicro.com/trend-micro-integrates-with-amazon-appflow/

[5] https://github.com/raphabot/AppFlowWorkloadSecurityDemo

[6] https://cdkworkshop.com

Shift Well-Architecture Left. By Extension, Security Will Follow

Raphael Bottino — Mon, 13 Apr 2020 16:58:14 GMT

A story on how Infrastructure as Code can be your ally in Well-Architecting and securing your Cloud environment

Using Infrastructure as Code (IaC for short) is the norm in the Cloud. CloudFormation, CDK, Terraform, Serverless Framework, ARM… the options are endless! And there are so many of them because IaC makes total sense! It allows Architects and DevOps engineers to version the application infrastructure just as developers already version the code. So any bad change, no matter if in the application code or in the infrastructure, can be easily inspected or, even better, rolled back.

For the rest of this article, let's use CloudFormation as reference. And, if you are new to IaC, check how to create a new S3 bucket on AWS as code:

https://medium.com/media/3f635e6744cf32313846ab1fd7762bcf/href

Pretty simple, right? And you can easily create as many buckets as you need using the above template (if you plan to do so, remove the BucketName line, since names are globally unique on S3!). For sure, way simpler and less prone to human error than clicking a bunch of buttons in the AWS console or running commands in the CLI.

Well, it's not that simple…

Although this is a functional and useful CloudFormation template that correctly follows all of CloudFormation's rules, it doesn't follow the rules of something bigger and more important: the AWS Well-Architected Framework. This amazing tool is a set of whitepapers describing how to architect on top of AWS from 5 different views, called Pillars: Security, Cost Optimization, Operational Excellence, Reliability and Performance Efficiency. As you can see from the pillar names, an architecture that follows it will be more secure, cheaper, easier to operate, more reliable and better performing.

The 5 Well-Architected Framework Pillars

Among other issues, this template will generate an S3 bucket that doesn't have encryption enabled, doesn't enforce said encryption and doesn't log any kind of access to it, even though the Well-Architected Framework recommends all three. Even worse, these misconfigurations are really hard to catch in production and not visibly alerted by AWS. Even the great security tools provided by them, such as Trusted Advisor or Security Hub, won't give an easy-to-spot list of buckets with those misconfigurations. Not for nothing, Gartner states that 95% of cloud security failures will be the customer's fault¹.

The DevOps movement brought to the masses a methodology of failing fast, which is not exactly compatible with the above scenario, where a failure is often only found out when unencrypted data is leaked or the access log is needed. The question is, then, how to improve it? Spoiler alert: the answer lies in the IaC itself :)

Shifting Left

Even before making sure a CloudFormation template is following AWS' own best practices, the first obvious requirement is to make sure that the template is valid. A fantastic open-source tool called cfn-lint is made available by AWS on GitHub² and can be easily adopted in any CI/CD pipeline, failing the build if the template is not valid and saving precious time. To shorten the feedback loop even further and fail even faster, the same tool can be adopted in the developer's IDE³ as an extension, so the template can be validated as it is coded. Pretty cool, right? But it still doesn't help us with the misconfiguration problem that we created with that really simple template at the beginning of this post.

Conformity⁴ provides, among other capabilities, an API endpoint to scan CloudFormation templates against the Well-Architected Framework, and that's exactly how I know that template is not adhering to its best practices. This API can be integrated into your pipeline, just like cfn-lint. However, I wanted to move this check further left, just like the cfn-lint extension I mentioned before.

The Cloud Conformity Template Scanner Extension

With that challenge in mind, but also with my own need to quickly scan my templates for misconfigurations, I came up with a Visual Studio Code extension that, leveraging Conformity's API, allows the developer to scan the template as it is coded. The extension can be found here⁵ or by searching for "Conformity" in your IDE.

After installing it, scanning a template is as easy as running a command in VS Code. Below, it is running against our example template:

This tool allows anyone to shift misconfiguration and compliance checking as far left as possible, right into developers' hands. To use the extension, you'll need a Conformity API key. If you don't have one and want to try it out, Conformity provides a 14-day free trial, no credit card required. If you like it but feel that this time period is not enough for you, let me know and I'll try to make it available to you.

But… What about my bucket template?

Oh, by the way, if you are wondering what an S3 bucket CloudFormation template looks like when following the best practices, take a look:

https://medium.com/media/4e74aa265a60d66b8e8f2a0b2cac208f/href

Not as simple, right? That's exactly why this kind of tool is really powerful, allowing developers to learn as they code and organizations to fail the deployment of any resource that goes against the AWS recommendations.
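To give a feel for what a template check does under the hood, here is a toy Python sketch. It is in no way Conformity's implementation or rule set: it only looks for two real AWS::S3::Bucket properties, BucketEncryption and LoggingConfiguration, in a template that has already been parsed into a dictionary, and the function name and messages are made up for illustration.

```python
# Properties a bucket is expected to declare, with the finding to report
# when one is missing. This tiny rule set is an assumption for illustration.
REQUIRED_BUCKET_PROPERTIES = {
    "BucketEncryption": "encryption at rest is not configured",
    "LoggingConfiguration": "access logging is not configured",
}


def scan_buckets(template):
    """Return a list of findings for S3 buckets missing required properties."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::S3::Bucket":
            continue
        properties = resource.get("Properties", {})
        for prop, message in REQUIRED_BUCKET_PROPERTIES.items():
            if prop not in properties:
                findings.append(f"{logical_id}: {message}")
    return findings
```

A pipeline step could fail the build whenever the returned list is not empty. Enforcing encryption on upload is normally done through a bucket policy, a separate resource that this sketch does not inspect.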

References

[1] https://www.gartner.com/smarterwithgartner/why-cloud-security-is-everyones-business

[2] https://github.com/aws-cloudformation/cfn-python-lint

[3] https://marketplace.visualstudio.com/items?itemName=kddejong.vscode-cfn-lint

[4] https://www.cloudconformity.com/

[5] https://marketplace.visualstudio.com/items?itemName=raphaelbottino.cc-template-scanner