Splice Machine’s Monte Zweben explains how feature stores can help cut down the monotonous parts of a data scientist’s job.
Many people pursue a career in data science because they love solving problems. But sometimes the work can feel a little like Groundhog Day, according to Monte Zweben, CEO of real-time AI company Splice Machine.
Zweben, who previously worked as the deputy chief of AI at NASA’s Ames Research Center and sits on the advisory board for Carnegie Mellon University’s School of Computer Science, believes feature stores can help. These are shareable repositories of features that can automate the flow of data into machine learning models.
‘Spending all your time on monotonous work can lead to unhappiness with the job’
– MONTE ZWEBEN
Can you explain what the Groundhog Day effect is for data scientists?
Work as a data scientist follows a cycle: log in, clean data, define features, build and test a model. But not all parts of the cycle are created equal: data preparation can take as much as 80pc of a data scientist’s time.
No matter what project you’re working on, most days you’re cleaning data and converting raw data into features that machine learning models can understand. The monotonous void of data prep blends hours together and makes each day identical to the one before it.
With one person, it’s annoying to have to repeat the same work all the time; with a team, each person building features slightly differently can lead to inconsistent results.
Does this pose an issue?
From a productivity perspective, it’s incredibly inefficient for one person to repeat their own work multiple times. That’s time and money spent on unnecessary tasks, which makes models slower to get up and running.
From an employee perspective, spending all your time on monotonous work can lead to unhappiness with the job and increase employee turnover. For the business as a whole, lacking a centralised data process can also lead to inconsistent decisions across the company.
If different people are defining features differently across a company, this can cause models and business decisions to differ based on feature definitions. Lifetime value of a customer (LVC) is a great example. One team might define the lifetime value as a customer’s total past spending, while another might include the customer’s projected value in the LVC.
Inconsistent definitions like these can mean customers are treated differently depending on which team’s numbers are used, and that can affect customer retention in the long term.
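To make the LVC example concrete, here is a minimal Python sketch of two teams scoring the same customers with different definitions. The `Customer` fields, the two function names and the sample figures are all hypothetical, invented purely for illustration; they do not come from Splice Machine or any real data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # All fields are hypothetical, chosen only to illustrate the point.
    customer_id: str
    past_orders: tuple
    projected_annual_spend: float
    expected_years_remaining: float


def lvc_past_spend(c: Customer) -> float:
    """Team A's definition: total past spending only."""
    return sum(c.past_orders)


def lvc_with_projection(c: Customer) -> float:
    """Team B's definition: past spending plus projected future value."""
    return sum(c.past_orders) + c.projected_annual_spend * c.expected_years_remaining


customers = [
    Customer("a", (500.0, 300.0), 50.0, 1.0),  # big spender so far, little expected later
    Customer("b", (100.0,), 400.0, 5.0),       # small spender so far, large expected value
]

# Each team picks its "most valuable" customer using its own definition.
top_for_team_a = max(customers, key=lvc_past_spend).customer_id
top_for_team_b = max(customers, key=lvc_with_projection).customer_id
```

With these made-up numbers, the two teams disagree about which customer is the most valuable, even though both are looking at the same underlying data. A single shared definition removes that ambiguity.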
What are feature stores? How can they benefit data workers?
A feature store is a shareable repository of features made to automate the input, tracking and governance of data into machine learning models. Feature stores compute and store features, enabling them to be registered, discovered, used and shared across a company.
A feature store makes sure features are always up to date for predictions and maintains the history of each feature’s values in a consistent manner, so that models can be easily trained and retrained.
Feature stores enable total model transparency, guarantee consistent training and can serve models real-time updates of aggregate data sets.
How do feature stores work?
A feature store is a repository of features, feature sets and feature values, along with their feature history. The feature store has a set of services that interact with this repository. These include defining features, searching for features, retrieving the current value of features, associating metadata with those features, defining a training set from groups of features, and backfilling new features into training sets.
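The services listed above can be sketched as a toy in-memory class. This is an illustration under stated assumptions, not how Splice Machine or any other product implements a feature store: the class name `InMemoryFeatureStore`, its method names and the integer timestamps are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    description: str
    tags: set = field(default_factory=set)       # metadata used for search
    history: dict = field(default_factory=dict)  # entity_id -> [(timestamp, value), ...]


class InMemoryFeatureStore:
    """Hypothetical toy store; real products persist and scale this."""

    def __init__(self):
        self._features = {}

    def define(self, name, description, tags=()):
        if name in self._features:
            raise ValueError(f"feature {name!r} is already defined")
        self._features[name] = Feature(name, description, set(tags))

    def search(self, tag):
        return sorted(n for n, f in self._features.items() if tag in f.tags)

    def write(self, name, entity_id, timestamp, value):
        rows = self._features[name].history.setdefault(entity_id, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])  # keep history ordered by time

    def current(self, name, entity_id):
        rows = self._features[name].history.get(entity_id)
        return rows[-1][1] if rows else None

    def as_of(self, name, entity_id, timestamp):
        """Latest value known at `timestamp`, or None if there was none yet."""
        rows = self._features[name].history.get(entity_id, [])
        idx = bisect_right([t for t, _ in rows], timestamp)
        return rows[idx - 1][1] if idx else None

    def training_set(self, names, labelled_events):
        """labelled_events: iterable of (entity_id, timestamp, label)."""
        return [
            {**{n: self.as_of(n, entity, ts) for n in names}, "label": label}
            for entity, ts, label in labelled_events
        ]


store = InMemoryFeatureStore()
store.define("total_spend", "Sum of past orders", tags={"customer", "spend"})
store.write("total_spend", "c1", 1, 100.0)
store.write("total_spend", "c1", 5, 250.0)

latest = store.current("total_spend", "c1")
at_time_3 = store.as_of("total_spend", "c1", 3)
rows = store.training_set(["total_spend"], [("c1", 3, 1)])
```

Because each value is stored with a timestamp, a training row built for an event at time 3 sees the value that was known then, not the later one. Backfilling, in this sketch, would amount to defining a new feature, writing its historical values and rebuilding the training set with the extra name.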
In some implementations, feature stores have user interfaces that call those services, and in others they are just APIs.
Feature stores are fed by pipelines that transform raw data into features. These features can then be defined, declared into groups, and assigned metadata that makes them easier to search for. Once the features are in the store, they are used to create training views and training sets, and to serve features to models. These mechanisms allow feature stores to automate data transformation, serve aggregate features in real time and monitor models in real time.
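As a rough sketch of that flow, the Python below turns raw transactions into aggregate features, declares a feature group, and then uses the same group both to build training rows and to serve features for a single customer. The data, the `FEATURE_GROUP` name and the helper functions are hypothetical, and a real pipeline would run continuously against a database rather than over a list in memory.

```python
from collections import defaultdict

# Hypothetical raw data arriving from an upstream system.
raw_transactions = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
]


def build_features(transactions):
    """Pipeline step: aggregate raw rows into per-customer features."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tx in transactions:
        totals[tx["customer_id"]] += tx["amount"]
        counts[tx["customer_id"]] += 1
    return {
        cid: {
            "total_spend": totals[cid],
            "order_count": counts[cid],
            "avg_order_value": totals[cid] / counts[cid],
        }
        for cid in totals
    }


# A feature group: the named subset of features a given model uses.
FEATURE_GROUP = ("total_spend", "avg_order_value")


def training_rows(table, group, labels):
    """Build one row per labelled customer: feature values, then the label."""
    return [
        [table[cid][name] for name in group] + [labels[cid]]
        for cid in sorted(labels)
        if cid in table
    ]


def serve(table, group, customer_id):
    """Look up the same feature group for one customer at prediction time."""
    return {name: table[customer_id][name] for name in group}


feature_table = build_features(raw_transactions)
train = training_rows(feature_table, FEATURE_GROUP, {"c1": 1, "c2": 0})
online = serve(feature_table, FEATURE_GROUP, "c1")
```

The point of the design is that training and serving read from one feature table through one group definition, so the model sees features computed the same way in both places.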
How would you recommend data workers get on board with feature stores?
My number-one recommendation is to prepare for the future, even if you only have a few models in production right now. I’ve seen so many data workers struggle to scale an ad-hoc data architecture. Within 10 years, the most successful companies will have hundreds or even thousands of machine learning models running simultaneously; this will be impossible to manage without a feature store.
If you’re on the fence, just try one out! They’re easy to use and will seriously change your data workflow in the best way possible.
Are there any resources on the topic you would recommend?
Featurestore.org is a great central location for lots of information about feature stores. The Towards Data Science blog on Medium has some great content on feature stores, too.