From the course: Advanced Data Processing: Batch, Real-Time, and Cloud Architectures for AI
Batch feature engineering
- [Instructor] In this video, we will discuss architectures for feature engineering in batch AI. Let's walk through a reference template pipeline for batch feature engineering. Feature engineering typically requires data from multiple sources. These data sources could be files, databases, or cloud services. Each data source usually has a corresponding data extractor or transfer job. These jobs run periodically, connect to the data sources, and fetch batches of records. Data acquired from the data sources is saved in a data lake in its raw form. This becomes a local copy of the data from which further processing can be done with repeatability. From here, a series of jobs can be run to transform the data. Data cleansing and extraction jobs can filter and merge data sets. The outputs of these jobs can be stored in temporary data stores. Another set of feature transformation jobs can then run to transform the data into forms that can be consumed by machine learning. This processed and transformed data is…
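Below is a minimal sketch of the pipeline stages described above, using pandas for the batch jobs. The file paths, column names, and feature calculations are illustrative assumptions for this sketch, not details from the course; a production pipeline would typically run these steps as scheduled jobs against a real data lake and feature store.

```python
# Sketch of a batch feature engineering pipeline:
# extract -> raw data lake -> cleanse/merge -> feature transformation -> feature store.
# Paths and column names below are hypothetical.
import pandas as pd

RAW_LAKE_DIR = "data_lake/raw"                       # raw copies of extracted source data
FEATURE_STORE_PATH = "feature_store/customer_features.parquet"

def extract_to_lake():
    """Periodic transfer jobs: pull a batch from each source into the raw data lake."""
    orders = pd.read_csv("source_systems/orders.csv")           # file source (assumed)
    customers = pd.read_json("source_systems/customers.json")   # another source (assumed)
    orders.to_parquet(f"{RAW_LAKE_DIR}/orders.parquet", index=False)
    customers.to_parquet(f"{RAW_LAKE_DIR}/customers.parquet", index=False)

def cleanse_and_merge() -> pd.DataFrame:
    """Data cleansing and extraction job: filter bad records and merge data sets."""
    orders = pd.read_parquet(f"{RAW_LAKE_DIR}/orders.parquet")
    customers = pd.read_parquet(f"{RAW_LAKE_DIR}/customers.parquet")
    orders = orders.dropna(subset=["customer_id", "amount"])    # drop incomplete rows
    return orders.merge(customers, on="customer_id", how="inner")

def transform_features(merged: pd.DataFrame) -> pd.DataFrame:
    """Feature transformation job: aggregate into ML-ready features."""
    return merged.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        order_count=("amount", "count"),
        avg_order_value=("amount", "mean"),
    ).reset_index()

if __name__ == "__main__":
    extract_to_lake()
    merged = cleanse_and_merge()          # intermediate output could go to a temporary store
    features = transform_features(merged)
    features.to_parquet(FEATURE_STORE_PATH, index=False)   # publish for ML consumption
```

Each function here stands in for what would normally be a separate scheduled job, so the raw, intermediate, and final outputs can be regenerated independently and repeatably.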