From the course: Data Pipeline Automation with GitHub Actions Using R and Python
ETL supporting functions - GitHub Tutorial
- [Instructor] In the previous video, we reviewed the refresh process. In this video, we will review the ETL supporting functions. When building a process, I typically prefer to break it down into small mini processes when applicable, and then functionalize each of them. This makes the process smoother and simpler to maintain. We'll use this approach for our ETL process.

We can break down the refresh process into the following three mini processes. First is the data processing, for example, pulling the data from the API and transforming it from JSON objects into a DataFrame object. Next is the metadata: creating and updating the metadata tables and logs. And third and last is handling the append process of new data to the normalized table.

To support those mini processes, I created the following five functions. First is create_metadata. As the name implies, the function creates the metadata table for a given data input. It then runs some unit tests to evaluate whether the data refresh was successful and whether we can append the new data to the normalized table. Second, the load_metadata function is a helper function that reads the series details and merges them with the metadata logs. Third is the get_metadata function. This function checks whether there are any new incremental data points, or any difference between the data in the source and the normalized table, by comparing the normalized log with the corresponding metadata available in the API. Fourth, append_metadata, as the name implies, appends the new metadata created during the refresh process to the metadata table. And last but not least is append_data, which appends new data points to the normalized table.

In addition, we will use the eia_metadata and the eia_backfill functions that we saw in the previous chapter.
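To make the division of labor concrete, here is a minimal Python sketch of three of these functions. It is an illustration only, not the course's implementation: the function names come from the video, but the signatures, the field names (`period`, `value`, `success`, and so on), and the specific checks are assumptions, and plain lists of dictionaries stand in for DataFrames so that the sketch has no dependencies.

```python
from datetime import datetime, timezone


def create_metadata(new_rows, start, end):
    """Build one metadata record for a batch of rows and run simple checks.

    Hypothetical signature: `start` and `end` are the period bounds that
    were requested from the API, as sortable ISO-style strings.
    """
    periods = [row["period"] for row in new_rows]
    checks = {
        "has_rows": len(new_rows) > 0,
        "no_missing_values": all(row.get("value") is not None for row in new_rows),
        "start_match": bool(periods) and min(periods) == start,
        "end_match": bool(periods) and max(periods) == end,
    }
    return {
        "start": start,
        "end": end,
        "n_obs": len(new_rows),
        **checks,
        # The batch may be appended only if every check passed.
        "success": all(checks.values()),
        "time": datetime.now(timezone.utc).isoformat(),
    }


def append_data(table, new_rows, meta):
    """Return the table with the new rows appended, but only on success."""
    if not meta["success"]:
        return table
    return table + new_rows


def append_metadata(log, meta):
    """Return the metadata log with the new record appended."""
    return log + [meta]


# Usage with made-up values: one existing row and two new rows.
table = [{"period": "2024-01-01", "value": 1.0}]
new_rows = [
    {"period": "2024-01-02", "value": 2.0},
    {"period": "2024-01-03", "value": 3.0},
]
meta = create_metadata(new_rows, start="2024-01-02", end="2024-01-03")
table = append_data(table, new_rows, meta)
log = append_metadata([], meta)
```

In this sketch the metadata record is logged whether or not the checks pass, so a failed refresh still leaves a trace, while the data itself is appended only when `success` is true.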