MLOps: The Role of Feature Stores

Feature stores are like data warehouses for data science. Their primary goal is to enable data scientists to short-circuit the time it takes to go from data ingestion to ML model training & inference. As such, they fill an important gap in the MLOps Lifecycle.


by Pronojit Saha, Dr. Arnab Bose & Navin Ahuja

What is a Feature Store?

A feature store is a data management system for machine learning features, covering both the feature engineering code and the feature data. It is a central vault for storing documented, curated, and access-controlled features that can be used by many different ML models across the organization. It ingests data from various sources and executes defined transformations, aggregations, validations, and other operations to create features. A feature store registers the available features and makes them ready to be discovered and consumed by ML training pipelines and inference services.
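As a concrete illustration, here is a minimal sketch of how a feature group might be defined and registered with the open-source Feast library (mentioned later in this article). The entity, file path, and feature names are hypothetical, and the exact API varies between Feast versions.

```python
# Hypothetical Feast feature repository file; names and paths are illustrative only.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The business object that the features describe.
customer = Entity(name="customer", join_keys=["customer_id"])

# Source of the (already transformed) feature values.
purchase_stats_source = FileSource(
    path="data/customer_purchase_stats.parquet",
    timestamp_field="event_timestamp",
)

# A named, documented group of features that any model in the organization can reuse.
customer_purchase_stats = FeatureView(
    name="customer_purchase_stats",
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name="avg_basket_value_30d", dtype=Float32),
        Field(name="num_orders_30d", dtype=Int64),
    ],
    source=purchase_stats_source,
)
```

Running `feast apply` against a repository containing such definitions registers the feature view in the store's registry, which is what makes the features discoverable by training pipelines and inference services.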

MLOps Lifecycle

Fig 1: MLOps Lifecycle (Image by author)

We define the MLOps lifecycle at a high level to include the following components, shown in Fig. 1, across the development (“Dev”) and production (“Prod”) environments.

Development Environment

  1. DataOps: The first step of the MLOps lifecycle covers all aspects of data, from building a data ingestion pipeline that acquires data from various sources, to data verification and validation via validation logic built into that pipeline, to data preparation by processing, transforming, and finally persisting the data with proper versioning. These steps comprise the data pipeline, which is built and tested in a Dev environment on batch or real-time data, as the case may be.
  2. ModelOps: Once data is acquired, exploratory data analysis (EDA) is done to get initial insights and generate reports. The understanding gained from EDA informs the next step, feature engineering. An ML algorithm is then developed using these features. The algorithm code is ideally versioned using a version control system (for example, git), and tests are carried out to ensure the code functions as expected. The data and model pipelines are then deployed for execution and testing in the Dev environment. Once the pipelines are deployed, various experiments are carried out with different input values and model architectures, and the results are evaluated in terms of model metrics. Furthermore, model interpretation and error analysis are done to understand the model better and improve its performance. Finally, the best model from these experiments is chosen to be deployed in Prod. A minimal sketch of such a Dev pipeline follows this list.
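The sketch below strings together validation, feature engineering, and model training as a simplified Dev pipeline in Python; the column names, validation rules, and model choice are assumptions rather than a prescription.

```python
# Illustrative Dev-environment pipeline: validate -> build features -> train -> evaluate.
# Column names, validation rules, and the model are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Data verification & validation: fail fast if the ingested data is malformed."""
    assert {"customer_id", "amount", "tenure_days", "churned"}.issubset(df.columns)
    assert (df["amount"] >= 0).all(), "negative purchase amounts found"
    return df.drop_duplicates(subset="customer_id")


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: derive model inputs from the prepared data."""
    out = df.copy()
    out["log_amount"] = np.log1p(out["amount"])
    return out[["log_amount", "tenure_days", "churned"]]


def train_and_evaluate(features: pd.DataFrame) -> RandomForestClassifier:
    """Train one candidate model and report a metric for comparing experiments."""
    X = features.drop(columns="churned")
    y = features["churned"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"validation AUC: {auc:.3f}")
    return model


if __name__ == "__main__":
    raw = pd.read_parquet("data/raw_customer_data.parquet")  # hypothetical path
    model = train_and_evaluate(build_features(validate(raw)))
```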

Production Environment

  1. DataOps & ModelOps: Using a serving tool, the data and model pipelines built in Dev are deployed in Prod. After deployment, the data pipeline acquires, validates, and prepares the data on which inference is to be done. The model pipeline then uses this inference data to generate predictions and reports.
  2. ProductionOps: Across DataOps & ModelOps in Prod, the data and the model’s performance are constantly monitored to ensure there is no drift. Various metrics help us do this, and you can learn more about them in our earlier article in this series. This monitoring initiates an important feedback loop back to the development cycle for model re-training and/or other changes (if required), completing the MLOps lifecycle. Furthermore, various governance (user access control, ethics, bias, etc.) and security (data, model) aspects are also handled here. A minimal drift-check sketch follows this list.
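As one simple illustration of drift monitoring, the sketch below computes the Population Stability Index (PSI) for a single numeric feature, comparing its training distribution against what the model sees in production. The sample data, bin count, and alert threshold are assumptions; in practice a dedicated monitoring tool would track many features and metrics.

```python
# Illustrative drift check: Population Stability Index (PSI) for one numeric feature.
# Sample data, bin count, and the 0.2 threshold are conventions, not fixed rules.
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare two samples of the same feature; larger values indicate more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])  # keep production values inside the bins
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Hypothetical samples: the training distribution vs. what is seen at inference time.
train_sample = np.random.normal(loc=0.0, scale=1.0, size=10_000)
prod_sample = np.random.normal(loc=0.3, scale=1.1, size=10_000)

psi = population_stability_index(train_sample, prod_sample)
print(f"PSI = {psi:.3f}")  # a common rule of thumb flags PSI > 0.2 as significant drift
```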

The Gap in the MLOps Lifecycle

One might wonder why the feature store has been left out of the discussion above. Organizations have been implementing various flavors of feature stores, and for the most part they are treated as part of ModelOps. However, the feature store should be carved out separately (as shown in Fig. 2 below), as it enables FeatureOps, leading to faster turnaround and experimentation during ModelOps. It also reduces redundancy in ModelOps across the many use cases inside an organization.

Fig 2: The Gap in MLOps Lifecycle — FeatureOps

The Gap: FeatureOps is the gap in the present MLOps lifecycle. In its absence, organizations and data scientists do a lot of duplicate work, creating the same features again and again and then validating them for different use cases. This increases time-to-market significantly. Further, new data scientists may fail to include important features in their modeling that senior team members have already built. This impedes agile development and effective team collaboration. The sketch below illustrates how features registered once in a feature store can be reused for both training and inference.
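Continuing the earlier hypothetical Feast example, this sketch consumes the same registered features twice: once as point-in-time-correct historical features for training, and once from the online store for low-latency inference. Feature names and entity IDs are made up, and the exact API varies by Feast version.

```python
# Illustrative reuse of registered features (hypothetical names, Feast API).
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at the feature repository

features = [
    "customer_purchase_stats:avg_basket_value_30d",
    "customer_purchase_stats:num_orders_30d",
]

# Training: point-in-time correct historical features joined onto labelled entities.
entity_df = pd.DataFrame(
    {
        "customer_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2021-06-01", "2021-06-01"]),
    }
)
training_df = store.get_historical_features(entity_df=entity_df, features=features).to_df()

# Inference: the very same features, served from the low-latency online store.
online_features = store.get_online_features(
    features=features,
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```

Because both calls resolve to the same feature definitions, every team consumes identical, already validated features instead of re-implementing them per use case.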

We will go into more detail on FeatureOps in the next article, where we cover the major components and functionalities of a feature store. For now, it is worth noting that none of the big ML platform providers (AWS, GCP, Azure) had a well-defined product for this functionality until 2020. Amazon announced the SageMaker Feature Store in December 2020, followed by Google, which released its MLOps platform Vertex AI with a feature store component in May 2021. Databricks then released a public preview of its feature store implementation, supported on the Azure platform, in June 2021. Apart from these, startups like Hopsworks and Tecton and open-source projects like Feast are leading the charge in the feature store space. Abzooba Inc. has an MLOps platform called xpresso.ai that builds on many of the concepts discussed here. In short, a multitude of efforts is being made across the industry to fill this gap in the MLOps lifecycle.

Conclusion

It is commonly said that 80% of a data scientist’s time goes into data wrangling: sourcing data, ingesting it, cleaning it, and featurizing it. A well-designed feature store makes features available in a single place and short-circuits that time. It also promotes team collaboration and reuse, and leads to faster model experimentation and inference outcomes.
