Stories by Max Krog on Medium

GCP Serverless Design Pattern: Adhering to rate & concurrency limits with Cloud Tasks

Max Krog — Fri, 11 Sep 2020 13:40:41 GMT

Even though I consider myself knowledgeable about the multiple GCP products related to data engineering, I had not heard of a use case for Cloud Tasks before.

This post aims to shed some light on the use case for Cloud Tasks by bringing a specific problem to the table and discussing it from a PubSub vs Cloud Tasks perspective.

The Challenge

As part of a customer-data-segmentation project I encountered the challenge of sending user data to the Google Ads Remarketing Audience API, which has the following restrictions:

1. Every request can only contain around 50.000 user records. Processing time grows exponentially with the number of records per request.
2. Only one (1) request can be processed per Google Ads account at a time. Submitting another request simultaneously causes all ongoing requests to error out.
3. Every request takes between one and five minutes to process.

Some context on the project:

The data that needed to be sent to the API would be arriving in Cloud Storage from BigQuery in the form of a CSV file. Every expected CSV file would contain between 50.000 and 4.000.000 records.

The orchestration engine (Cloud Composer) was only responsible for running the business logic in BigQuery and saving the result to Cloud Storage. Cloud Composer does not handle the outgoing data pipeline, as that pipeline should be reactive & serverless.

The customer base that was to be segmented contained around 4 million customers and each customer would belong to at least one segment. The orchestration engine would run one business logic query in BigQuery per segment to be pushed to Google Ads, resulting in 7 different files arriving in Cloud Storage during the span of a few minutes.

Partial Solution

When a new file lands in Cloud Storage, an event can be submitted to PubSub or fed immediately to a Cloud Function. More information about this here.

By setting up a Cloud Function to be triggered from the Cloud Storage bucket we can solve API restriction 1 (max 50k records per request) by splitting the file into smaller parts.
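A minimal sketch of the splitting step, assuming the file is a CSV with a header row. The helper name `chunk_records` and the tiny in-memory sample are made up for illustration; a real function would read the file from Cloud Storage instead.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_REQUEST = 50_000  # API restriction 1


def chunk_records(
    rows: Iterable[List[str]], size: int = MAX_RECORDS_PER_REQUEST
) -> Iterator[List[List[str]]]:
    """Yield consecutive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Usage with a small in-memory CSV and a tiny chunk size.
sample = io.StringIO("id\n1\n2\n3\n4\n5\n")
reader = csv.reader(sample)
header = next(reader)  # assumption: the file starts with a header row
chunks = list(chunk_records(reader, size=2))
print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```

Because the rows are consumed lazily, only one chunk has to be held in memory at a time.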

But what now? Simply looping over the original file and pushing the partial chunks of records into Google Ads is a very fragile way to handle things. Just consider the following:

How long do we expect the function to run?

4.000.000 / 50.000 = 80 parts. Let’s say each part of 50.000 records takes 5 minutes to transfer. That is 400 minutes, almost 7 hours of continuous runtime, which is far beyond the maximum execution time Cloud Functions support.
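The estimate can be checked with a few lines of Python. The 9 minute timeout is an assumption (the maximum for 1st gen Cloud Functions); the other numbers come from the restrictions above.

```python
import math

records = 4_000_000
records_per_request = 50_000   # API restriction 1
minutes_per_request = 5        # worst case from API restriction 3
function_timeout_minutes = 9   # assumption: 1st gen Cloud Functions limit

parts = math.ceil(records / records_per_request)
total_minutes = parts * minutes_per_request

print(parts)                                     # 80
print(total_minutes)                             # 400
print(round(total_minutes / 60, 1))              # 6.7
print(total_minutes > function_timeout_minutes)  # True
```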

What if we get another file landing in GCS when we’re already transferring one?

As stated in the previous section, our orchestration engine will produce one file per segment. That means that when we’ve started pushing the chunks of the first segment to Google Ads, another will appear and trigger Cloud Function execution in parallel. As per point two (2) of the API restrictions this would cause both requests to fail. Without some way of ensuring that only one concurrent dispatch can happen we would have to spread out the time between the segments arriving in Cloud Storage by quite a bit.

The problem with Cloud PubSub (for this challenge)

My initial approach was to follow the architecture outlined in this solution architected by Google: A serverless integration solution for GMP. This solution makes use of a combination of Initiator and Transport functions, as well as 3 PubSub Topics. To understand their proposed architecture, head to the Architectural Overview section of the solution.

I believe that the architecture used in that solution can be slimmed down a bit and explained like this:

1. A new file lands in Cloud Storage and triggers the Cloud Function Initiator.
2. Initiator splits the file into several smaller parts and publishes them to the PubSub topic Operation Log.
3. After finishing (2), Initiator sends an empty message to the Operation Trigger PubSub topic.
4. Operation Trigger pushes the empty message via a subscription to Operation Executer.
5. Operation Executer starts by pulling a message from Operation Log. If no message remains in Operation Log, nothing more is done and the cycle ends here.
6. If a message was retrieved from Operation Log, Operation Executer tries to push it to Google Ads.
7. If the previous step (6) was successful, Operation Executer acknowledges the previously pulled message from Operation Log and publishes an empty message to Operation Trigger. Continue from (4).

Problems with this approach

I believe there are several problems with this approach:

If Operation Executer errors or fails, the whole cycle could potentially be broken. If the error is caused by the Google Ads API, you need to catch it in your code and send an empty message to Operation Trigger to continue the cycle. This is not in line with the Fail Fast philosophy (more on that below) and is more demanding to develop.
If the error above happens when there’s only one (1) message left in Operation Log, that message cannot be pulled again until its acknowledgement deadline has passed. The subsequent execution of Operation Executer would conclude that there are no more messages left and end the cycle.
To adhere to restriction 2 (concurrent requests) of the Google Ads API, there can only be one message flowing between Operation Trigger and Operation Executer. This would be violated when the second segment file arrives in Cloud Storage.
I believe the architecture is complicated to debug, and it is hard to understand what’s going on.
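The second problem can be reproduced with a toy in-memory model. This is only a sketch: `OperationLog`, `run_cycle` and `flaky_push` are hypothetical stand-ins written for this illustration, and none of it uses the real PubSub API. The model assumes a 600 second acknowledgement deadline and one execution per minute.

```python
class OperationLog:
    """Toy stand-in for the Operation Log subscription (not the PubSub API)."""

    def __init__(self, messages, ack_deadline):
        self._messages = dict(enumerate(messages))
        self._leased_until = {}
        self._ack_deadline = ack_deadline

    def pull(self, now):
        """Return (id, message) for the first message that is not leased, else None."""
        for message_id, message in self._messages.items():
            if self._leased_until.get(message_id, 0) <= now:
                self._leased_until[message_id] = now + self._ack_deadline
                return message_id, message
        return None

    def ack(self, message_id):
        del self._messages[message_id]

    def remaining(self):
        return len(self._messages)


def run_cycle(log, push, seconds_per_execution=60):
    """Run Operation Executer until it finds nothing to pull; return the end time."""
    now = 0
    while True:
        pulled = log.pull(now)
        if pulled is None:
            return now  # nothing to pull: the cycle ends here
        message_id, message = pulled
        try:
            push(message)
            log.ack(message_id)
        except RuntimeError:
            pass  # error caught; the next execution is triggered anyway
        now += seconds_per_execution


already_failed = set()


def flaky_push(message):
    """Hypothetical Google Ads call that fails once for the last part."""
    if message == "part-2" and message not in already_failed:
        already_failed.add(message)
        raise RuntimeError("Google Ads API error")


log = OperationLog(["part-1", "part-2"], ack_deadline=600)
ended_at = run_cycle(log, flaky_push)
print(ended_at, log.remaining())  # 120 1
```

The cycle stops after two minutes while one part is still unacknowledged, because the failed message stays leased until its deadline passes.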

This solution would work better if we didn’t have the restrictions of the output API (Google Ads) that we do.

As this approach was not viable without some hard thinking, I decided to look for alternative approaches.

Introducing Cloud Tasks

Cloud Tasks is a distributed task queue. You define one or many queues to which you can send tasks. Queues are what they sound like. Tasks are things to be done, usually defined as ‘run this HTTP request and wait until you get a 200/OK response code back; if you don’t, try again in X amount of time’.

On a queue level you have the following (and more) settings:

Max dispatches per second: How quickly can this queue process new tasks?
Max concurrent dispatches: How many tasks can be running/executing at the same time?
Max attempts: How many attempts can be made on a task before it’s put in a “failed” state?
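As a rough sketch, the three queue settings above map onto the `rate_limits` and `retry_config` fields of a queue in the `google-cloud-tasks` Python client. The project, location and queue names are placeholders, the values are example choices, and `build_queue` is a helper made up for this illustration.

```python
def build_queue(project: str, location: str, queue_id: str) -> dict:
    """Describe a queue that dispatches one task at a time."""
    return {
        "name": f"projects/{project}/locations/{location}/queues/{queue_id}",
        "rate_limits": {
            "max_dispatches_per_second": 1,  # how quickly new tasks are dispatched
            "max_concurrent_dispatches": 1,  # how many tasks may run at once
        },
        "retry_config": {
            "max_attempts": 10,  # example value, not a recommendation
        },
    }


def create_queue(project: str, location: str, queue_id: str):
    """Create the queue; requires `pip install google-cloud-tasks` and credentials."""
    from google.cloud import tasks_v2  # imported lazily so the sketch loads without it

    client = tasks_v2.CloudTasksClient()
    parent = f"projects/{project}/locations/{location}"
    queue = build_queue(project, location, queue_id)
    return client.create_queue(request={"parent": parent, "queue": queue})


queue = build_queue("my-project", "europe-west1", "segments")
print(queue["rate_limits"]["max_concurrent_dispatches"])  # 1
```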

On a task level you have the following (and more) settings:

Task type: We’ll only be discussing the HTTP Target type in this post.
HTTP Task httpMethod: The HTTP Method (GET/POST)
HTTP Task Request body: The HTTP request body. Max 100KB.

Please note that Cloud Tasks is not a message queue (like PubSub). It’s only interested in the task definition, and the request body of at most 100KB is meant to describe where to find the data, not to carry it.

Real life queue. Image credit: Alexander Popov, Unsplash, https://unsplash.com/photos/Xbh_OGLRfUM

Full Solution with Cloud Tasks

To integrate Cloud Tasks with our partial solution, we can break the big segment files arriving in Cloud Storage into smaller parts and create a task for each part. We’ll send all of these tasks to a task queue that we have configured with Max concurrent dispatches set to one (1), to avoid overrunning the Google Ads API.

As Cloud Tasks does not execute the task itself (it only calls an HTTP endpoint and waits for a 200/OK code back), we put the actual Google Ads API call inside a Cloud Function.

Architectural Overview

Architecture diagram utilising Cloud Tasks to break up one large request into multiple smaller requests.

1. BigQuery writes the resulting output file to GCS.
2. Upon finalising the write to GCS, the Cloud Function Task Creator is triggered from GCS.
3. Task Creator reads the file and splits it into several smaller parts, saving each part to another GCS bucket and creating a task for each part in the Segments queue in Cloud Tasks. Task Creator gets the target API and other attributes from the filename of the triggering file in GCS and sends these along with the path to the partial file in the Task Body.
4. The tasks in the Segments queue are processed one by one, each invoking the Cloud Function Task Handler.
5. On every invocation, Task Handler decodes the Task Body and retrieves the path to the partial file in GCS as well as the API configuration. It then pushes the partial file to Google Ads and returns a 200/OK response code to Cloud Tasks when finished.
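The task creation step could look roughly like this with the `google-cloud-tasks` client. The body layout (`part_path`, `api_config`), the helper names and the handler URL are assumptions made for this sketch, and authentication of the call to the handler function is left out.

```python
import json

MAX_BODY_BYTES = 100 * 1024  # body limit mentioned earlier in the post


def build_task_body(part_path: str, api_config: dict) -> bytes:
    """Encode a pointer to the partial file; the records themselves stay in GCS."""
    body = json.dumps({"part_path": part_path, "api_config": api_config}).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("task body too large; send a pointer, not the data")
    return body


def enqueue_part(project: str, location: str, queue_id: str, handler_url: str, body: bytes):
    """Create one HTTP task; requires `pip install google-cloud-tasks` and credentials."""
    from google.cloud import tasks_v2  # imported lazily so the sketch loads without it

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue_id)
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": handler_url,  # hypothetical URL of the Task Handler function
            "headers": {"Content-Type": "application/json"},
            "body": body,
        }
    }
    return client.create_task(request={"parent": parent, "task": task})


body = build_task_body("gs://parts-bucket/segment-a/part-0001.csv", {"api": "google_ads"})
print(json.loads(body)["part_path"])  # gs://parts-bucket/segment-a/part-0001.csv
```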

What does this architecture give us?

Cloud Tasks acts as a buffer between our input and output, ensuring the output adheres to the rate limits of the API.

It also helps our development speed by supplying a layer for retries, enabling us to write integration code that fails fast. Should the Task Handler function fail, just let it fail and default to sending an HTTP error code back to Cloud Tasks. Cloud Tasks will retry the task in due time. As long as the error was on the Google Ads side and the queue’s max attempts are not exhausted, the task will eventually execute successfully.
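A fail-fast Task Handler can be sketched as below. The body fields and `push_to_google_ads` are hypothetical; the point is only that there is no error handling around the outgoing call, so any exception surfaces as an HTTP error and Cloud Tasks schedules a retry.

```python
import json
from typing import Callable, Tuple


def handle_task(body: bytes, push_part: Callable[[str, dict], None]) -> Tuple[str, int]:
    """Decode the task body and push one part; exceptions are deliberately not caught."""
    payload = json.loads(body)
    push_part(payload["part_path"], payload["api_config"])
    return "OK", 200  # only reached when the push succeeded


def push_to_google_ads(part_path: str, api_config: dict) -> None:
    """Hypothetical placeholder for downloading the part and calling Google Ads."""
    raise NotImplementedError


def task_handler(request):
    """HTTP Cloud Function entry point; `request` is the incoming Flask request."""
    return handle_task(request.get_data(), push_to_google_ads)


# Usage with a fake push function that just records what it was asked to send.
pushed = []
result = handle_task(
    b'{"part_path": "gs://parts-bucket/part-0001.csv", "api_config": {}}',
    lambda path, config: pushed.append(path),
)
print(result, pushed)  # ('OK', 200) ['gs://parts-bucket/part-0001.csv']
```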

Scaling out

It’s also easy to scale out. To support more accounts in Google Ads (let’s say for an advertiser active in several countries) we only need to create additional queues in Cloud Tasks. We add a bit of code to Task Creator so it can choose a task queue dynamically, as well as some code to Task Handler so it can get the API credentials/configs dynamically. Other than that, exactly the same functions can be used.
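One possible way to choose the queue dynamically is to derive it from the filename. The naming convention `<account>__<segment>.csv`, the account names and the queue names below are all invented for this sketch; the post does not specify the actual filename format.

```python
from typing import Tuple

# Hypothetical mapping: one queue per Google Ads account.
QUEUE_BY_ACCOUNT = {
    "ads-se": "segments-se",
    "ads-no": "segments-no",
}


def parse_filename(name: str) -> Tuple[str, str]:
    """Split 'folder/<account>__<segment>.csv' into (account, segment)."""
    stem = name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    account, segment = stem.split("__", 1)
    return account, segment


def queue_for(name: str) -> str:
    """Return the queue that serialises requests for the file's account."""
    account, _segment = parse_filename(name)
    return QUEUE_BY_ACCOUNT[account]


print(queue_for("exports/ads-se__high_value.csv"))  # segments-se
```

Since each account gets its own queue with max concurrent dispatches set to one, accounts are processed in parallel while requests within one account stay sequential.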

Final words

This use case was clearly not intended to be solved by PubSub, and I’m happy that it wasn’t, since it taught me a lot about the limitations with PubSub.

Cloud Tasks is clearly aimed at this domain of problems and it’s a really easy service to pick up. I strongly encourage everyone to play around with it, since it might just be one of the best (serverless) services for dealing with poorly performing APIs.

I’m planning to play a bit with combining Cloud Tasks and PubSub for this integration challenge. I have an idea for how they could work together.

Feel free to reach out or leave a comment if you want to discuss this further or disagree with any of the points I’ve made. I’m very much here to learn 👨‍💻🤓

GCP Serverless Design Pattern: Adhering to rate & concurrency limits with Cloud Tasks was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

My local environment setup for Data Engineering on GCP

Max Krog — Fri, 04 Sep 2020 15:42:53 GMT

My personal reference sheet for analytics engineering on GCP

This guide is intended to be a handy reference for myself when I’m looking for a specific link or command, or setting up a new workstation. I’m expecting this guide to grow with more commands over time.

I’ll be covering the following areas in this guide:

Homebrew
Pyenv
Virtual env
Google Cloud Python Client
Google Auth Library
Authentication from environment variables
Handy ~/.bashrc or ~/.zshrc lines

Homebrew, Pyenv & Virtual env

This holy trinity should be taught in all tutorials. If you want control and understanding of your local environment, the combination of homebrew, pyenv & venv is the only way to go.

Homebrew

Homebrew is the de facto package manager for macOS/Linux. Find more information about how to install it here.

Commands to know:

brew doctor                 # Performs a health check on your install
brew install *name*         # For installing CLI-based applications
brew install --cask *name*  # For installing GUI-based applications (replaces the older 'brew cask install')

Pyenv

Forget about ‘installing python’. Install pyenv (with Homebrew) and use it to install and select Python versions. Pyenv basically intercepts the command ‘python’ in your terminal and makes it point to the specific Python version you want. More information on pyenv can be found here.

brew install pyenv

To make sure every terminal session has pyenv initiated you need to put an init script in your .bash_profile or .zshrc. Information on how to do this can be found at point 3 of ‘Basic GitHub Checkout’ here.

When you’ve got pyenv working you can install your preferred python version like so:

pyenv install 3.7.8   #Installs python 3.7.8

pyenv global 3.7.8    #Sets python 3.7.8 to be your global version

Type ‘python’ in your terminal and watch the magic of pyenv.

python --version
> Python 3.7.8

Venv

Packages in Python are by default installed to a global packages folder. If you want to ensure your code performs the same in the cloud as on your local computer, this is not ideal.

Venv solves this problem by creating virtual environments that are project specific. Packages can be installed to this virtual environment instead of the global scope. By utilizing a requirements.txt file to keep track of packages you want installed for a specific project you can ensure consistency between the cloud and your local development environment.

The venv module has been bundled with Python since version 3.3.

To create a venv-config folder run the following in your terminal:

python -m venv venv-config      #Creates the venv-config folder
source venv-config/bin/activate #Takes you inside the virtual env

You are now inside the virtual environment. Feel free to install any packages you want. Preferably these are listed in a requirements.txt file and can be installed with this command:

pip install -r requirements.txt

When you want to leave the virtual environment, type:

deactivate

GCP client libraries

To be able to access Google’s services you need a client library that can be imported and used as a module in your Python code. These come in two forms: one for all discovery-based APIs and one for interacting with services on GCP. There’s a slight overlap between the two; my advice is to use the second one when available.

Google Cloud Python Client

Supports all GCP services. Please note that this is just a container repo; every client has its own specific library. For example, the Cloud Storage API client can be found here.

All client libraries can be pip installed (or put in requirements.txt) in the format:

google-cloud-*service*

And imported in your code (main.py) like so:

from google.cloud import *service*
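A few concrete instances of the pattern are listed below as a small lookup. Note that the import name does not always equal the package suffix: some clients are imported through a versioned module such as `pubsub_v1` or `tasks_v2`.

```python
# pip package name -> import line for the matching client library
CLIENT_IMPORTS = {
    "google-cloud-storage": "from google.cloud import storage",
    "google-cloud-bigquery": "from google.cloud import bigquery",
    "google-cloud-pubsub": "from google.cloud import pubsub_v1",
    "google-cloud-tasks": "from google.cloud import tasks_v2",
}

for package, import_line in CLIENT_IMPORTS.items():
    print(f"{package:<24}{import_line}")
```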

Please note that there’s also a Google API Python Client. This library is intended for the discovery-based APIs, that is to say Google’s products outside of GCP, for example the Google Analytics API. I found this rather confusing at first.

Google Auth Library

This library contains the authorization layer depended upon by all the google-cloud-*service* libraries. For clarity I like putting this in the requirements.txt when I’m specifically using it:

google-auth

And this is how you go about creating a client with specific credentials:

from google.oauth2 import service_account
from google.cloud import storage

credentials = service_account.Credentials.from_service_account_file(
    'path_to_service_account_key.json',
    scopes=['https://www.googleapis.com/auth/devstorage.read_only']
)

storage_client = storage.Client(credentials=credentials)

A full list of available scopes can be found here.

Authentication from environment variables

~/.bashrc or ~/.zshrc

Depending on your shell, one of these files is executed when you start a new shell session. By placing initialization calls and the definition of handy environment variables here you…

GOOGLE_APPLICATION_CREDENTIALS

When deploying functions or apps to GCP, a service account is automatically made available to all clients created in the code. The client libraries find it through Application Default Credentials, which first look at the environment variable GOOGLE_APPLICATION_CREDENTIALS and, when running on GCP, otherwise use the service account attached to the runtime.

In order to get the same behaviour when developing and testing code locally, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the absolute path of the service account JSON key that you want to use for local development.

export GOOGLE_APPLICATION_CREDENTIALS="/path_to_key.json"

With this you can initialize clients in this way:

from google.cloud import storage

#Client with credentials from environment variable.
storage_client = storage.Client()
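If a client fails to authenticate locally, a quick check of the environment variable can save some time. This is a small sketch using only the standard library; the helper name `credentials_path_problem` is made up for this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def credentials_path_problem(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of what is wrong with the variable, or None if it looks fine."""
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    path = Path(value)
    if not path.is_absolute():
        return f"{value} is not an absolute path"
    if not path.is_file():
        return f"{value} does not exist"
    return None


# Prints None when the variable points at an existing key file.
print(credentials_path_problem())
```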

Handy ~/.bashrc or ~/.zshrc lines

Instead of memorizing venv-related commands, give them easy to remember aliases:

alias venv="python -m venv venv"
alias venva="source venv/bin/activate"
alias pipi="pip install -r requirements.txt"

pyenv:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"

eval "$(pyenv init -)"