From the course: Data Pipeline Automation with GitHub Actions Using R and Python
Data refresh process
- [Instructor] In this video, we will focus on the data refresh process. Recall that the goal of the data refresh process is to keep our normalized table aligned with the most recent data available in the source. When triggered, the function's main logic is to check whether new data is available on the data source and, if so, to pull the incremental data, process it, and append it to the normalized data. Note that in some cases, you may want to pull data beyond the incremental data. For example, let's assume that you are working with sales data and restatements may have occurred during the last seven days due to the company's product return policy. In this case, each time the pipeline refreshes the data, you may want to re-pull the last seven days in addition to the incremental data. This adds some complexity to the process, as you will have to drop the overlapping observations when appending the data back to the normalized table, to ensure that the append process won't create duplicates or data gaps. In our case, we will set the refresh function to pull only new data points.

Let's now focus on the logic of the data refresh process that we will deploy on GitHub Actions. The refresh process is set inside a Quarto doc. One of the main reasons I love using Quarto docs to run my code on GitHub Actions is that they are a great way to communicate the refresh process when running code on a remote server: each run generates an HTML report that you can customize according to your needs. We'll dive into more details about the functionality of this process later in this chapter. The refresh process leverages a set of helper functions that handle the log capturing, the data quality tests, and, when applicable, appending the data back to the normalized table. We'll review those functions in the next video.

Once the process is triggered, it loads the series information from the series.json file. This file defines the metadata of the series we want to pull from the API and their corresponding routes. This enables us to seamlessly onboard new subregions or remove existing ones without having to hard-code the changes. After the function pulls the series information, it starts to build a data profile for each series in the list. First, it pulls the metadata to identify the last data point available for each series in the normalized table and, based on this, calculates the starting point for the GET request. Then, using the eia_metadata function, we send a GET request to pull the metadata of each series and check whether new data is available on the API. By comparing the timestamp of the most recent data point available on the API with the one available in the normalized table, the function decides whether to send a GET request to pull the incremental data. If new data points are available, the function extracts them from the API, transforms them into a DataFrame using the eia_backfill function, and appends them back to the normalized table. Either way, whether new data is available or not, the function creates a log entry and updates the metadata.

Throughout the course, we will simulate a real-life data automation process using GitHub Actions, with the caveat that, for learning purposes, we will save the files back to the repository as opposed to a database. The goal here is to practice the deployment of a pipeline and learn how to work with GitHub Actions. Generally, you should avoid storing large files in a GitHub repository.
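To make the flow above concrete, here is a minimal Python sketch of the refresh logic. The file paths, column names, the series.json layout, and the signatures of eia_metadata and eia_backfill are assumptions made for illustration only; the course's actual helper functions are reviewed in the next video.

```python
# Illustrative sketch only -- paths, column names, and helper signatures are assumptions.
import json
from datetime import datetime, timezone

import pandas as pd

# Assumed module name for the course's helper functions (reviewed in the next video);
# the real eia_metadata / eia_backfill may have different signatures.
from eia_api import eia_backfill, eia_metadata


def refresh(normalized_path="data/normalized.csv", series_path="settings/series.json"):
    # Load the normalized table and the series definitions (API route per series)
    normalized = pd.read_csv(normalized_path, parse_dates=["period"])
    with open(series_path) as f:
        series_index = json.load(f)

    log_rows = []
    for series in series_index["series"]:  # assumed series.json layout
        # Last data point available locally for this series
        local = normalized.loc[normalized["series_id"] == series["id"], "period"]
        local_end = local.max() if not local.empty else None

        # Most recent period available on the API (assumed to be a comparable timestamp)
        api_end = eia_metadata(api_path=series["api_path"])["end"]

        if local_end is None or api_end > local_end:
            # Pull only the incremental window and append it to the normalized table
            new_rows = eia_backfill(api_path=series["api_path"],
                                    start=local_end, end=api_end)
            normalized = pd.concat([normalized, new_rows], ignore_index=True)
            # Guard against duplicate observations after the append
            normalized = normalized.drop_duplicates(subset=["series_id", "period"])
            updated = True
        else:
            updated = False

        # Log the outcome whether or not new data was found
        log_rows.append({
            "series_id": series["id"],
            "refresh_time": datetime.now(timezone.utc).isoformat(),
            "updated": updated,
            "local_end": local_end,
            "api_end": api_end,
        })

    normalized.sort_values(["series_id", "period"]).to_csv(normalized_path, index=False)
    pd.DataFrame(log_rows).to_csv("metadata/refresh_log.csv", index=False)
```

The key design point is that a log entry is written on every run, whether or not new data was found, which is what keeps the GitHub Actions execution auditable through the generated report.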
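For the restatement scenario mentioned earlier, where the last seven days are re-pulled on top of the incremental data, one possible way to avoid duplicates or gaps is to let the freshly pulled rows overwrite the overlapping ones before saving the table. The series_id and period column names below are assumptions, not the course's actual schema.

```python
import pandas as pd


def append_with_repull(normalized: pd.DataFrame, repulled: pd.DataFrame) -> pd.DataFrame:
    """Append a re-pulled window (e.g., the last seven days plus the incremental data)
    while keeping the newest version of any overlapping observation."""
    combined = pd.concat([normalized, repulled], ignore_index=True)
    # keep="last" lets the re-pulled rows win over older, possibly restated values
    combined = combined.drop_duplicates(subset=["series_id", "period"], keep="last")
    return combined.sort_values(["series_id", "period"]).reset_index(drop=True)
```

Using keep="last" means the re-pulled values replace any stale, restated observations, while rows that did not change are simply deduplicated, so the append creates neither duplicates nor gaps.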