Feature Store for ML

Page maintained by: https://dcatkth.github.io/

What is a Feature Store?

The ‘Feature Store’ is an emerging concept in data architecture that is motivated by the challenge of taking ML applications into production. Technology companies like Uber and Gojek have published popular reference architectures and open source solutions, respectively, for ‘Feature Stores’ that address some of these challenges.

The concept of Feature Stores is nascent, and we see a need for education and information on the topic. Most innovative products are now driven by machine learning, and features are at the core of what makes these machine learning systems effective. But many challenges remain in the feature engineering life-cycle. Developing features from big data is an engineering-heavy task, with challenges in both scaling data processing and serving features in production systems.

Benefits of Feature Stores for ML

  • Track and share features between data scientists, including a version-controlled feature repository

  • Process and curate feature values while preventing data leakage

  • Ensure parity between training and inference data systems

  • Serve features for ML-specific consumption profiles including model training, batch and real-time predictions

  • Accelerate ML innovation by reducing the data engineering process from months to days

  • Monitor data quality to rapidly identify data drift and pipeline errors

  • Empower legal and compliance teams to ensure compliant use of data

  • Bridge the gap between data scientists and data & ML engineers

  • Lower total cost of ownership through automation and simplification

  • Faster Time-To-Market for new model-driven products

  • Improved model accuracy: readily available, curated features improve model performance

  • Improved data quality via data -> feature -> model lineage

Feature Store Concepts

Consistent Features – Online & Offline

If feature engineering code is not the same in the training and inferencing systems, there is a risk that the two code paths diverge and, therefore, that predictions become unreliable, as the features computed online may not match those used in training. One solution is to have feature engineering jobs write their feature data to both an online and an offline database. Both training and inferencing applications read their features when they make predictions – online applications may need low-latency (real-time) access to that feature data. The other solution is to use shared feature engineering libraries, which works if your online application and your training application can both use the same shared libraries (e.g., both are JVM-based).
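The dual-write approach can be sketched as follows. This is a minimal, hypothetical example: the function and store names are made up, and the in-memory list/dict stand in for real systems such as a data lake (offline) and a key-value store (online).

```python
def compute_features(raw_rows):
    """Derive features from raw events; one shared code path for both stores."""
    return [
        {"user_id": r["user_id"],
         "avg_spend": r["total_spend"] / max(r["n_orders"], 1)}
        for r in raw_rows
    ]

offline_store = []   # append-only history, used to build training sets
online_store = {}    # latest value per key, used at inference time

def dual_write(feature_rows):
    for row in feature_rows:
        offline_store.append(row)           # full history for training
        online_store[row["user_id"]] = row  # latest value for serving

raw = [{"user_id": "u1", "total_spend": 120.0, "n_orders": 4}]
dual_write(compute_features(raw))
```

Because one job computes the features and writes to both stores, training and serving cannot drift apart at the feature-computation step.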

Time Travel

“Given these events in the past, what were the feature values of the customer at the time of that event?” (Carlo Hyvönen)

Time travel is not normally found in databases – you cannot typically query the value of some column at some point in time. You can work around this by ensuring that all schemas defining feature data include a datetime/event-time column. However, recent data lake table formats, such as Delta Lake, Apache Hudi, and Apache Iceberg, have added support for time-travel queries by storing all updates, enabling queries on the old values of features.
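The event-time workaround above amounts to a point-in-time join: for each labelled event, look up the most recent feature value known at or before the event time, so no future data leaks into training. A minimal sketch with Pandas (all column names and values are made up for illustration):

```python
import pandas as pd

# Feature history: one row per (user, event_time) with the feature value then.
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "event_time": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-01-20"]),
    "avg_spend": [10.0, 15.0, 25.0],
})

# Labelled events we want to build training rows for.
labels = pd.DataFrame({
    "user_id": ["u1"],
    "event_time": pd.to_datetime(["2023-01-12"]),
    "churned": [0],
})

# merge_asof with direction="backward" picks, per label, the latest
# feature row whose event_time is <= the label's event_time.
training_rows = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("event_time"),
    on="event_time", by="user_id", direction="backward",
)
```

Here the 2023-01-12 label picks up the feature value from 2023-01-10, not the later 2023-01-20 value, which is exactly the leakage-prevention property described above.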

Feature Engineering

Uber's Michelangelo added a domain-specific language (DSL) to support engineering features from raw data sources (databases, data lakes). However, it is also popular to use general-purpose frameworks like Apache Spark/PySpark, Pandas, Apache Flink, and Apache Beam.
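As a small illustration of feature engineering with one of these general-purpose frameworks, here is a Pandas sketch that aggregates raw transaction rows into per-user features (the column names are hypothetical):

```python
import pandas as pd

# Raw transaction events, as they might land from an operational database.
transactions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "amount": [10.0, 30.0, 5.0],
})

# Aggregate raw events into reusable per-user features.
user_features = (
    transactions.groupby("user_id")["amount"]
    .agg(total_spend="sum", n_orders="count", avg_spend="mean")
    .reset_index()
)
```

The same aggregation logic translates directly to PySpark or Flink for larger data volumes; the point is that the derived features, not the raw events, are what get written to the feature store.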

Materialize Train/Test Data?

Training data for models can either be streamed directly from the feature store into models, or it can be materialized to a storage system, such as S3, HDFS, or a local filesystem. When multiple frameworks are used for ML – TensorFlow, PyTorch, Scikit-Learn – materializing train/test data into the native file format for each framework (.tfrecords for TensorFlow, .npy for PyTorch) is recommended.

Common file formats for ML frameworks:

  • .tfrecords (TensorFlow/Keras)

  • .npy (PyTorch, Scikit-Learn)

  • .csv (Scikit-Learn, others)

  • .petastorm (TensorFlow/Keras, PyTorch)

  • .h5 (Keras)
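As a minimal sketch of materialization, the following writes a train/test split to .npy files that any framework able to read NumPy arrays can consume. The split, paths, and data are illustrative assumptions, not a recommended pipeline:

```python
import os
import tempfile
import numpy as np

# Toy feature matrix and labels standing in for data read from the feature store.
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = (X[:, 0] > 8).astype(np.int64)

split = 8  # simple 80/20 split; a real pipeline would shuffle/stratify
out_dir = tempfile.mkdtemp()

# Materialize each split to its own file on the storage system.
np.save(os.path.join(out_dir, "X_train.npy"), X[:split])
np.save(os.path.join(out_dir, "y_train.npy"), y[:split])
np.save(os.path.join(out_dir, "X_test.npy"), X[split:])
np.save(os.path.join(out_dir, "y_test.npy"), y[split:])

# Training jobs then load the materialized arrays independently.
X_train = np.load(os.path.join(out_dir, "X_train.npy"))
```

Materialized files decouple training jobs from the feature store: the same snapshot can be re-read for reproducible experiments across frameworks.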

Online Feature Store

Models may have been trained with hundreds of features, but online applications may receive only a few of those features from a user interaction (userId, sessionId, productId, datetime, etc.). The online feature store is used by online applications to look up the missing features and build a feature vector that is sent to an online model for predictions. Online models are typically served over the network, as this decouples the model's lifecycle from the application's lifecycle. The latency, throughput, security, and high availability of the online feature store are critical to its success in the enterprise.
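The lookup-and-assemble step can be sketched as follows. The store contents, feature names, and ordering are hypothetical, and the dict stands in for a real low-latency key-value store:

```python
# Online store: precomputed features keyed by entity id (hypothetical values).
online_store = {
    "u1": {"avg_spend": 30.0, "n_orders": 4, "days_since_last_order": 2},
}

# The model expects features in a fixed order, matching how it was trained.
FEATURE_ORDER = ["avg_spend", "n_orders", "days_since_last_order", "session_length"]

def build_feature_vector(user_id, request_features):
    """Merge request-time values with stored features into one model input."""
    stored = online_store.get(user_id, {})
    merged = {**stored, **request_features}  # request-time values take precedence
    return [merged.get(name, 0.0) for name in FEATURE_ORDER]

# The application only knows the request-time feature; the rest is looked up.
vector = build_feature_vector("u1", {"session_length": 12.5})
```

The assembled vector is then sent over the network to the model server, keeping the application unaware of how most features were computed.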
