From the course: Data Pipeline Automation with GitHub Actions Using R and Python
Data pipeline architecture - GitHub Tutorial
Data pipeline architecture
- [Host] In the previous video, we reviewed the data scope and pipeline requirements. In this video, we'll review the data pipeline architecture for automating the California subregions' demand for electricity data. We'll use the following deployment. Let's now break it down into its different components, starting with the EIA API, our source of raw data. In the previous chapter, we reviewed how to set up and send a GET request to pull metadata and data from the API using the EIA metadata and EIA backfill functions. The pipeline's supporting functions will leverage those functions to extract data from the API.

The second component is the data pipeline, whose main functionality is to check whether new data is available in the API and refresh the data when applicable. In addition, this function also collects metadata on each step, enabling us to monitor the health of the data pipeline. The process is deployed on GitHub Actions, and we'll dive into more details about the deployment in the next chapter.

In the local environment, we have the backfill function. The goal of this function is to restart the pipeline whenever needed and backfill all the historical data. Typically, it is good practice to separate the backfill process from the refresh process, as the initial pull of the historical data might be heavy and require more computing resources than are available on the scheduler, in this case GitHub Actions.

Last but not least is the data visualization component. We'll deploy a simple dashboard on GitHub Pages that will enable us to view the data and track the logs. Once the data pipeline finishes updating the data with the new data points, GitHub Actions will update the dashboard with the new data. We'll focus on this component in chapter four of this course. Throughout the rest of this chapter, we'll focus on the data refresh and backfill functions.
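To make the refresh step concrete, here is a minimal Python sketch of the "check for new data, refresh when applicable, and log each run" logic described above. It is an illustration under stated assumptions, not the course's actual implementation: the names `RefreshLog` and `plan_refresh`, the log fields, and the one-hour series frequency are all hypothetical, and the two timestamps stand in for values that would come from the locally stored data and from the API metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RefreshLog:
    """One log record per pipeline run (hypothetical schema)."""
    checked_at: datetime          # when the check ran
    local_end: datetime           # last timestamp already stored locally
    api_end: datetime             # last timestamp the API reports as available
    update_needed: bool           # True when the API is ahead of the local data
    start: Optional[datetime]     # first timestamp to request, if any
    end: Optional[datetime]       # last timestamp to request, if any


def plan_refresh(
    local_end: datetime,
    api_end: datetime,
    step: timedelta = timedelta(hours=1),  # assumed series frequency
    now: Optional[datetime] = None,
) -> RefreshLog:
    """Decide whether a refresh is needed and which window to pull."""
    checked_at = now or datetime.now(timezone.utc)
    if api_end > local_end:
        # New data exists: request everything after the last local point.
        return RefreshLog(checked_at, local_end, api_end, True,
                          local_end + step, api_end)
    # Nothing new: record the check so the run still shows up in the logs.
    return RefreshLog(checked_at, local_end, api_end, False, None, None)


if __name__ == "__main__":
    local_end = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
    api_end = datetime(2024, 1, 1, 5, tzinfo=timezone.utc)
    print(plan_refresh(local_end, api_end))
```

Returning a log record even when no update is needed means every scheduled run leaves a trace, which is what makes it possible to monitor the health of the pipeline from the dashboard.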