News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.
Write-Audit-Publish (WAP) is a pattern in data engineering that gives teams greater control over data quality. It was popularized by Netflix back in 2017 in by Michelle Winters at the DataWorks Summit called “Whoops the Numbers are wrong! Scaling Data Quality @ Netflix.”
The name is fairly self-descriptive:
What is WAP not (in this context)?
Why is WAP useful for data engineers?
The data engineering world has always lagged behind its software engineering brethren.
Concepts like source control were well established in software engineering for a decade or more before data engineers realised that there might just be something in the idea of not emailing around files called DIM_DATE_V1_FINAL_REVISED_v2_PROD.sql. (In fairness, it took a shift away from the old mindset of the established vendors too, in parallel with the emergence of the modern data stack for things to really click).
Write-Audit-Publish is very similar—or perhaps the same, if you squint—to the idea of Blue-Green deployments in the software engineering world. As data engineers we can learn a lot from established and proven patterns, and the Blue-Green one is a good example of this.
Why wouldn’t we want to adopt this, perhaps other than inertia and fear of something new? WAP is a perfect fit for both regular data pipelines as well as one-off data processing jobs.
Wanna Read More?
👉🏻Check out the full article that I wrote:
---
Full disclosure: I work for Treeverse, the company behind lakeFS.