From the course: Data Pipeline Automation with GitHub Actions Using R and Python

Data pipeline scope and requirements

- [Instructor] In this video, we will define the data scope and the data pipeline requirements. Before getting started with the scope and requirements, let's first define what a data pipeline is. A simple definition of a data pipeline is the process of moving data from one data source to another. In most cases, it includes intermediate steps such as data processing, cleaning, transformation, aggregation, and creating new fields. This process is also known as ETL, which stands for Extract, Transform, and Load. Typically, the different stages of the data in this process are referred to as Raw for the data source, Calculation for the intermediate steps, and Normalized for the final output. Moving forward, we'll refer to the EIA API as the source of raw data and to the processed data as normalized. In the previous chapter, we saw the process of pulling data from the API to our local machine, where the API in this case is our raw data source, which comes in a JSON format, and our final output, or normalized table, was the DataFrame object.

The term data pipeline by itself does not define the level of automation of the process. It can range from a completely manual process, such as copying files by hand or running a script locally on your machine to process the data, to a fully automated process like the one we will build in this course. With that, let's go ahead and define our data scope for this course. First, we want to pull hourly demand for electricity by subregion, where the geographic scope is the California Independent System Operator subregions, which include the following four: Pacific Gas and Electric, San Diego Gas and Electric, Southern California Edison, and Valley Electric Association. We also want to refresh the data daily.

Now that we have defined the data scope, we can go ahead and define the data pipeline requirements. First, we want the pipeline to be fully automated, meaning that once we deploy it in production, it should run automatically without user intervention. We want the data pipeline to have a high level of customization; think about the scenario where you want to add a new subregion without having to go in and manually change the code. We want to have data quality checks and unit tests in place to ensure the quality of the data. We also want to monitor the health of the pipeline. And last but not least, we'll create and deploy the pipeline with both R and Python. The supporting files for the R data pipeline are under the R folder in the course repository; similarly, the supporting files for the Python data pipeline are under the Python folder. In the next video, we'll review the data pipeline architecture.
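
To make the Raw, Calculation, and Normalized stages more concrete, here is a minimal Python sketch of the extract and transform steps, assuming the EIA hourly demand by subregion route. The endpoint path, facet names, and response fields are assumptions for illustration and may differ from the exact calls used later in the course; an API key is assumed to be stored in the `EIA_API_KEY` environment variable.

```python
import os
import requests
import pandas as pd

# Extract (the "Raw" stage): pull the raw JSON payload from the API.
# NOTE: the URL, parameters, and response structure below are illustrative
# assumptions, not the exact API contract used in the course.
API_URL = "https://api.eia.gov/v2/electricity/rto/region-sub-ba-data/data/"
params = {
    "api_key": os.environ["EIA_API_KEY"],  # assumes the key is set as an env variable
    "frequency": "hourly",
    "data[]": "value",
    "facets[parent][]": "CISO",  # California ISO parent region (assumed facet name)
}
raw = requests.get(API_URL, params=params, timeout=60).json()

# Transform (the "Calculation" stage): flatten the JSON records into a table
# and parse the types we care about.
df = pd.DataFrame(raw["response"]["data"])
df["period"] = pd.to_datetime(df["period"])
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Load (the "Normalized" stage): the final output is a DataFrame, which we
# could persist, for example, as a CSV file.
df.to_csv("ciso_hourly_demand.csv", index=False)
```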
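
One way to meet the customization requirement is to keep the subregion list in a settings file rather than hardcoding it in the pipeline. The file name, structure, and subregion codes below are hypothetical illustrations, not the exact layout of the course repository.

```python
import json

# Hypothetical settings file: adding a new subregion means editing this file,
# not the pipeline code. For example, settings.json might look like:
# {
#   "frequency": "hourly",
#   "subregions": ["PGAE", "SDGE", "SCE", "VEA"]
# }
with open("settings.json") as f:
    settings = json.load(f)

for subregion in settings["subregions"]:
    # The pipeline loops over whatever subregions the settings file defines,
    # so no code changes are needed when a new one is added.
    print(f"Refreshing data for subregion: {subregion}")
```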
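
For the data quality requirement, a check can be as simple as a function that validates the refreshed table before it is saved, and the same function can be reused inside a unit test. The column names and rules below are illustrative assumptions rather than the exact checks implemented in the course.

```python
import pandas as pd

def check_hourly_demand(df: pd.DataFrame) -> bool:
    """Run basic quality checks on the normalized demand table (illustrative rules)."""
    checks = {
        "table is not empty": len(df) > 0,
        "no missing values": not df["value"].isna().any(),
        "no negative demand": (df["value"] >= 0).all(),
        "no duplicate timestamps per subregion": not df.duplicated(["subba", "period"]).any(),
    }
    for name, passed in checks.items():
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
    return all(checks.values())

# A unit test (run with pytest, for example) can wrap the same function:
def test_hourly_demand_quality():
    df = pd.read_csv("ciso_hourly_demand.csv", parse_dates=["period"])
    assert check_hourly_demand(df)
```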
