From the course: Data Pipeline Automation with GitHub Actions Using R and Python

Data refresh output

- [Instructor] In the previous video, we reviewed the outputs of the backfill functions. We saw that there are R and Python versions rendered as Quarto docs. In this video, we will review the data refresh functions and their outputs. The data refresh functions run inside a Quarto doc, similar to the backfill functions. And you can see on the screen over here that there are two versions. On the left side, there is the R version, data_refresh_R, which is a Quarto doc, and on the right side, there is the Python version, which is named the same but ends with py. You can find those files in the corresponding R and Python folders in the course repository. Once the data pipeline runs and executes the process, it will render those files and save them in the docs folder. If we open the docs folder, you can see two folders here: data_refresh_python for the Python files, where you can see the HTML file, and similarly, for the R version, data_refresh_R with its own HTML version. Let's go ahead and see the outputs in the browser. Like before, there are two versions. Previously, we saw the Python version of the backfill. This time, let's go ahead and check the R version. Both of them mirror each other. We start by loading the libraries. In this case, we are using almost the exact same libraries as before, but we are also using the gt library to plot the outputs in a table format and the jsonlite library to load the series.json file. We start by loading the JSON file and creating the mapping of the series. We also set the API parameters. For example, we set the template for the facets, define the offset, and load the API key. In addition, we define the metadata and data paths for the output files. Then we start the process of identifying whether there is any incremental data in the API. We use the get_metadata function to load the metadata from the local metadata file and compare it with the metadata available in the API.
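The incremental-data check described above can be sketched roughly as follows. This is a minimal illustration, not the course's actual code: the function name check_for_updates and its arguments are hypothetical, and the real pipeline reads the timestamps from the local metadata file and the API metadata endpoint.

```python
from datetime import datetime, timedelta

def check_for_updates(local_last: datetime, api_last: datetime) -> dict:
    """Compare the local table's most recent timestamp with the API's.

    The series is hourly, so the next expected data point is one hour
    after the last one stored locally. If the API's latest timestamp is
    not at or past that point, there is nothing new to pull.
    """
    start = local_last + timedelta(hours=1)  # starting point of the GET request
    return {
        "start": start,
        "updates_available": api_last >= start,
    }

# Example matching the video: local table ends Feb 28, 8:00,
# and the API's most recent timestamp is also Feb 28, 8:00.
status = check_for_updates(datetime(2024, 2, 28, 8), datetime(2024, 2, 28, 8))
print(status["updates_available"])  # prints False -> skip the refresh step
```

If the API had a data point at nine o'clock or later, updates_available would flip to True and the refresh step would run.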
So this is the output over here for the FALSE (indistinct). And as you can see, the timestamp of the most recent data point in the local normalized table is February 28th, eight o'clock in the morning. So if we request the next data point, the starting point of the GET request should be one hour later, nine o'clock on February 28th. Then we compare it with the most recent timestamp in the API, and we see that it matches the ending point. This means that no new or incremental data points are available in the API, and the updates_available flag is set to FALSE, which means that we will skip the next step of refreshing the data. If updates_available were TRUE, then the function would go ahead and execute this process, which loops over each series and pulls the corresponding data points. One thing that I like to do when I'm generating this type of documentation is to leave messages. So for example, here I'm printing what the output was, and you can see that for the series, there were no new data points available. In any case, we will capture metadata even if we didn't pull new points. And as you can see, some of the parameters here are not relevant for this pull, because we didn't pull any new data, so we set them as missing values, or NAs. We end by appending the new metadata, and we plot the series as before.
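The refresh step just described follows a simple pattern: loop over the series mapping only when updates are available, print a message for each series, and capture a metadata row either way, filling the irrelevant fields with missing values when nothing was pulled. Here is a hedged sketch of that pattern; refresh, pull_fn, and the metadata fields are illustrative names, not the course's actual functions.

```python
def refresh(series_mapping: dict, updates_available: bool, pull_fn) -> list:
    """Pull incremental data per series, capturing metadata either way."""
    metadata_rows = []
    for series_id in series_mapping:
        if updates_available:
            rows = pull_fn(series_id)  # pull the new data points for this series
            print(f"{series_id}: pulled {len(rows)} new data points")
            metadata_rows.append({"series": series_id, "rows": len(rows)})
        else:
            # Leave a message in the rendered doc, and record the metadata
            # row with NA (here float("nan")) for the parameters that are
            # not relevant to an empty pull.
            print(f"{series_id}: no new data points available")
            metadata_rows.append({"series": series_id, "rows": float("nan")})
    return metadata_rows

# Example: two series, no updates available, so no pull happens.
meta = refresh({"S1": {}, "S2": {}}, updates_available=False, pull_fn=lambda s: [])
```

The new metadata rows would then be appended to the existing metadata file, so the next run can repeat the same comparison against an up-to-date local state.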