From the course: Data Pipeline Automation with GitHub Actions Using R and Python

Data backfilling

- [Instructor] So far we reviewed the different components of the data pipeline. In this video, we'll review the data backfilling process. Let's first define what data backfill is and why we need it. Data backfill is typically defined as the initial load of the historical data of the dataset, which in our case is the load of all the historical data of the four sub-region series. As we have close to six years of hourly data, this is a pull of about 50,000 observations per series, or an overall pull of about 200,000 data points. For comparison, the regular refresh process loads about 24 observations per call if the refresh process runs daily. This means that the data load of the backfill is more than 2,000 times bigger than the regular refresh process (roughly 200,000 data points versus about 96 per daily run across the four series). And this is also why you would typically prefer to run the backfill process locally and not on the server.

The backfill process follows fairly similar steps to the data refresh process we saw earlier. The main difference is that it initializes the dataset rather than checking for differences. It starts by loading the JSON file with the series information, building the profile, and going to the API to load the data based on the time range that we provide. It then generates metadata and appends, or pushes, the data to the normalized table. Note that if any data was available before, it will override the metadata in the normalized tables.

Let's now open VS Code and review where you can find those files. The backfill functions, the R version and the Python version, run inside a Quarto doc. Here on the left side you can see the Python version, named data_backfill_py; it's a Quarto doc. Similarly, on the right side you can see the R version, named data_backfill_R, and again it's a Quarto doc. Both of those versions mirror each other. You can find those files under the python and R folders. Once rendered, they generate HTML output. So if you go over here under the Python version, you can see that there is a file named data_backfill_py, an HTML file with the rendered output of the backfill. Similarly, in the R folder, you can find the corresponding R file rendered into HTML.

Let's now go to the browser and review the outputs of those files. You can see on the left side I have the Python version, and on the right side I have the R version. Given that both of them are alike, let's look at the Python version. I'm not going to go line by line; I'll just explain, at a high level, the process that this function is doing. We start, like before, by loading the libraries. You can see that we are loading additional libraries here, such as pandas and NumPy, to process the data, and we're going to use libraries such as json to read the series.json file with the metadata of the series that we want to pull. In addition to the local script that we used before, eia_api, we also load the supporting functions for the process from the eia_data file. After we load the series metadata, we start to set the parameters for this pull. Given that we want to run a backfill and pull all the historical data, we set the start point to the first data observation that is available for those four series, which is July 1st, 2018, and we set the end point to the most recent date during the runtime of this backfill. We then set the API parameters for the backfill function, such as the offset, load the API key, and set the paths of the metadata file and the data folder. Then we start by generating the metadata.
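To make this setup step a bit more concrete, here is a minimal sketch of what it could look like in Python. The series.json file, the pandas/NumPy imports, and the general parameters (start date, offset, API key, metadata and data paths) come from the video; the exact paths, field names, and values below are assumptions for illustration, not the course's actual code.

```python
import json
import os
import datetime

import pandas as pd   # loaded to process the pulled data later
import numpy as np

# Load the series metadata from the series.json file
# (path and structure are assumptions for illustration)
with open("series.json", "r") as f:
    series = json.load(f)

# Backfill window: first available observation through the run time
start = datetime.datetime(2018, 7, 1, 0)
end = datetime.datetime.now().replace(minute=0, second=0, microsecond=0)

# API parameters for the backfill pull (names and values are illustrative)
offset = 5000                                  # rows per API call, assumed limit
api_key = os.getenv("EIA_API_KEY")             # API key read from an environment variable
meta_path = "metadata/backfill_metadata.csv"   # where the pull metadata will be written
data_path = "data/"                            # folder for the normalized data
```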
We then compare this with the metadata available on the API to make sure that the start and end points align with the ones that we set over here. And this is where the magic happens. This for loop iterates over the different series, pulls the data, and processes it (a rough sketch of the loop appears after this paragraph). This is where the cleaning of the data and the unit tests are done. As you can see, just to give an indication of progress, it prints each series that ran during this process. Then we generate the metadata and append it along with the data. You can see in the metadata that all the series have some missing values, and this is one of the things that you need to be familiar with in the data. In this case, all four series have 98 missing values in a given period due to some missing data in the source data. Given that we know that this is okay, we are going to set the success flag to True, then we update the process and set it also to True, and this is where it's done. And last but not least, we visualize the series, as in the second sketch below. This enables us to eyeball the series and see if there is anything that our tests didn't catch and that we want to give some attention to. In the following video, we are going to review the refresh process, which, similarly to the backfill process, is done in a Quarto doc.
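The loop described above might look roughly like this sketch, continuing from the setup example. The helpers eia_backfill and create_metadata are hypothetical stand-ins for the supporting functions in the eia_api and eia_data files, and the field and column names are assumptions rather than the course's exact ones.

```python
# Hypothetical sketch of the backfill loop; eia_backfill and create_metadata
# stand in for the course's helper functions in eia_api / eia_data.
data = None
meta = None

for s in series["series"]:
    print("Pulling series:", s["name"])        # progress indication, one line per series

    # Pull the full history for this series between start and end
    df = eia_backfill(api_key=api_key, series_id=s["name"],
                      start=start, end=end, offset=offset)

    # Generate the pull metadata (row count, time range, number of missing values)
    m = create_metadata(data=df, start=start, end=end, series=s["name"])

    data = df if data is None else pd.concat([data, df])
    meta = m if meta is None else pd.concat([meta, m])

# The 98 missing hourly values per series are a known gap in the source data,
# so we mark the pull as successful and flag the process as updated.
meta["success"] = True
meta["update"] = True

# Append, or push, the data and metadata to the normalized tables (CSV files here)
data.to_csv(data_path + "series_data.csv", index=False)
meta.to_csv(meta_path, index=False)
```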
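And the final eyeball check could be as simple as plotting each series, for example with matplotlib as in the sketch below (the course itself may use a different plotting library, and the period, value, and subba column names are assumptions carried over from the previous sketch).

```python
import matplotlib.pyplot as plt

# One panel per sub-region series for a quick visual sanity check
fig, axes = plt.subplots(nrows=len(series["series"]), figsize=(10, 8), sharex=True)
for ax, s in zip(axes, series["series"]):
    subset = data[data["subba"] == s["name"]]   # "subba" column name is an assumption
    ax.plot(pd.to_datetime(subset["period"]), subset["value"])
    ax.set_title(s["name"])

fig.tight_layout()
plt.show()
```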
