From the course: Data Pipeline Automation with GitHub Actions Using R and Python
Data quality checks
- [Instructor] We'll conclude this chapter by reviewing the pipeline's data quality checks, or data unit tests. Let's start by defining the term. Data quality checks, or data unit tests, are the process of evaluating the data's structure and values against a set of deterministic and non-deterministic assumptions.

Examples of deterministic assumptions are the data structure and the data attributes: the number of columns and their types, such as numeric, string, or time objects, as well as the field names and the value range. For example, in our electricity data we do not expect negative values or duplications.

Likewise, examples of non-deterministic assumptions, or expectations, are missing values, the value distribution, and delays. For the value distribution, in our electricity data set we can measure the mean and the standard deviation and set a threshold for when we want to alert, for example when the standard deviation is higher than the mean. As for delays, if we expect to refresh the data every 24 hours, but the function could not identify new data available in the API in the last seven days, we should raise a red flag.

It is recommended to set a unit test between two processes, for example when moving data from source A to source B. The unit test is the gatekeeper: it will allow the new data to be appended to source B only if the new data passes the unit tests we set. Likewise, it will prevent the append of new data if the data fails the quality checks. This will prevent a potential series of data integrity issues in our pipeline and, with the use of monitoring, will enable us to address those issues on time.

Here is the metadata that our pipeline collected throughout the refresh process. The metadata creation functions run a sequence of tests, for example checking whether the timestamps of the data we collected align with the start and end arguments we set in the GET request.
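The checks described above can be sketched in Python with pandas. This is a minimal illustration, not the course's actual code: the function name `check_data_quality`, the `EXPECTED_COLUMNS` schema, and the seven-day staleness window are hypothetical stand-ins for the electricity data set discussed here.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical schema for the electricity data described above
EXPECTED_COLUMNS = {"period": "datetime64[ns]", "value": "float64"}
MAX_STALENESS = timedelta(days=7)  # raise a red flag after 7 days with no new data


def check_data_quality(df, now=None):
    """Run deterministic and non-deterministic checks; return per-test results."""
    now = now or datetime.now()
    results = {}

    # Deterministic checks: data structure and attributes
    results["columns_match"] = list(df.columns) == list(EXPECTED_COLUMNS)
    results["types_match"] = all(
        str(df[c].dtype) == t for c, t in EXPECTED_COLUMNS.items() if c in df
    )
    results["no_negative_values"] = bool((df["value"] >= 0).all())
    results["no_duplications"] = not df.duplicated(subset="period").any()

    # Non-deterministic checks: expectations about the values
    results["no_missing_values"] = not df["value"].isna().any()
    # Alert when the standard deviation is higher than the mean
    results["std_below_mean"] = df["value"].std() <= df["value"].mean()
    # Delay check: the most recent timestamp must be within the staleness window
    results["data_is_fresh"] = (now - df["period"].max()) <= MAX_STALENESS

    # The overall success flag is true only if every individual test passed
    results["success"] = all(results.values())
    return results
```

Each check maps to one of the assumptions above; the overall `success` flag is what the gatekeeper between two processes would consult before allowing an append.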
It then, based on that logic, determines whether the data is good to be appended or not, and sets the success flag accordingly. The function will append the new data to the normalized table if and only if the success flag is set to true. Last but not least, the comments field provides some information about which tests ran and failed, or warnings that were generated throughout the process. This enables the user to identify the status of the tests. This table will be the basis of our pipeline monitoring process. We'll dive into more details in chapter four.
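The append-or-reject step above can be sketched as follows. This is an illustrative guess at the logic, not the course's implementation: `append_if_valid` is a hypothetical helper that takes the per-test results dictionary (with a boolean `success` key, as in the sketch of the checks), appends only on success, and records a metadata row with the success flag and a comments field.

```python
import pandas as pd


def append_if_valid(new_data, normalized_table, results):
    """Append new_data to the normalized table only if all quality checks passed."""
    # Metadata row that the monitoring table is built from
    log_entry = {
        "refresh_time": pd.Timestamp.now(),
        "success": results["success"],
        # The comments field lists which tests failed, if any
        "comments": "; ".join(
            f"{name} failed"
            for name, ok in results.items()
            if name != "success" and not ok
        )
        or "all tests passed",
    }
    if results["success"]:
        # New data passed the unit tests: allow the append
        normalized_table = pd.concat(
            [normalized_table, new_data], ignore_index=True
        )
    # On failure the table is returned unchanged, preventing integrity issues
    return normalized_table, log_entry
```

Returning the table unchanged on failure is what makes the unit test a gatekeeper: bad data never reaches source B, and the log entry surfaces the failure for the monitoring process covered in chapter four.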