NoSQL for Real-Time Feature Engineering and ML Models

Building User Profiles with Streaming Data

Ben Weber
Towards Data Science
11 min read · Sep 7, 2020


For the majority of my data science career, I’ve built machine learning models using data fetched from a data warehouse or lake. With this approach, you can create a feature vector for each user by applying SQL commands to transform several events into a user summary. However, one of the main issues with this approach is that it is a batch operation and may not include the most recent events for a user, if it takes minutes or hours for events sent from an application to show up in the data store.

If you need to use recent data points when applying predictive models, then it may be necessary to build a streaming data pipeline that applies feature engineering in near real-time to build user summaries. For example, an e-commerce web site may want to send a notification to users that add items to a cart, but do not check out, based on the output of an ML model. This use case requires a model to use data based on recent web session activity, which will likely not be available in a data warehouse due to latency introduced through batching and ETL operations. For this scenario, it may be necessary to incrementally build a user summary based on a streaming data set in order to provide up-to-date model predictions for determining whether or not to send a notification to the user.

I’m a data scientist working for a mobile game publisher, and we encounter similar scenarios, where we need to use recent session data to determine how to personalize the experience for users. I’ve recently been exploring NoSQL data stores as a way of building near real-time data products that perform feature engineering and model application with minimal latency. While this post will walk through a sample pipeline using Python, data scientists who want to get hands-on with building production-grade systems should explore Java or Go, as I’ve discussed here. In this post I’ll introduce NoSQL for data science, cover options for deploying Redis, and walk through a sample Flask application that performs real-time feature engineering using Redis.

NoSQL for Data Science

NoSQL data stores are an alternative to relational databases that focus on minimizing latency while limiting the types of operations that can be performed. While there is a broad set of services that fall within the concept of NoSQL, one of the most common implementations is a key-value data store. With a key-value store, you can save and retrieve elements from a data store with sub-millisecond latency, but you cannot query across the contents within the data store. You need a key which is used as an index for saving a value, such as a user ID, and the same key is used when retrieving a value. While this set of operations may seem limiting for data scientists used to working with relational databases, it provides latency and throughput performance that is orders of magnitude faster than traditional databases.

This type of workflow, where you perform create-read-update-delete (CRUD) operations on data sets, may not be familiar for data scientists, because it is common to work with batch processes, where SQL can be used to transform events into user summaries. With a key-value data store there are no SQL commands that can be used to aggregate data sets, and instead, user summaries need to be built incrementally, where new data received by the system is used to update the user record, such as incrementing the session count each time a user opens the game on a phone.

Using a NoSQL data store enables a data product to update user summaries in near real-time as new data is received. This means that the feature vector for a user will lag by only seconds for each new data point, versus minutes or hours when using traditional data warehousing workflows. There is a trade-off when using this approach, which is that the types of features that you can use are limited versus the options available when using a SQL database. For example, you cannot count distinct values or calculate a median value when updating data incrementally, since you are working with single events rather than historical events for a user. However, most systems that use real-time data will also include features from a delayed process that provides a more complete user profile. For example, with streaming data it’s not possible to count the distinct number of modes played by a user (unless 1-hot feature encoding is used), but there could be a batch process that calculates this value with additional lag that is appended to the feature vector.
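As a minimal sketch of what incremental feature engineering looks like, counters, flags, and running means can all be maintained one event at a time. The field names below are hypothetical, but the running-mean update is the standard incremental formula:

```python
# A minimal sketch of incremental feature updates: counters, flags, and
# running means can be maintained one event at a time, but medians and
# distinct counts cannot be recovered from a running summary alone.
# The profile and event fields here are hypothetical.

def update_profile(profile, event):
    """Update a user summary in place with a single new event."""
    n = profile["sessions"] + 1
    profile["sessions"] = n
    # running mean: new_mean = old_mean + (x - old_mean) / n
    profile["avg_session_minutes"] += (
        event["session_minutes"] - profile["avg_session_minutes"]) / n
    # a flag can be set from a single event, with no history needed
    profile["has_purchased"] = profile["has_purchased"] or event["purchased"]
    return profile

profile = {"sessions": 0, "avg_session_minutes": 0.0, "has_purchased": False}
profile = update_profile(profile, {"session_minutes": 10.0, "purchased": False})
profile = update_profile(profile, {"session_minutes": 20.0, "purchased": True})
```

After these two events the profile holds two sessions, a 15-minute average, and a purchase flag, all without ever storing the historical events themselves.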

Data scientists should explore NoSQL solutions when they need to build user profiles that are updated incrementally versus batch processes. A common use case for this type of workflow is when you need to provide personalized treatments for users based on recent session data. For example, users that have previously opted into notifications may be more likely to interact with future notifications within a new game. A mobile game publisher can leverage this session data to provide personalized experiences for users.

To provide a concrete example of what this looks like, we’ll use the Kaggle NHL data set as a streaming data source, where hockey player profiles are updated incrementally based on events sent via REST calls. The output of this sample application will be user profiles stored in Redis that can then be used to apply ML models in real time. We’ll focus on the feature engineering steps rather than the model application steps, but show how the user profiles can be used for model predictions.

We’re going to use Redis as our NoSQL solution for this exercise. While there are similar alternatives to Redis, such as Memcached, Redis is a standard that works across multiple programming languages and has managed implementations across popular cloud infrastructures. Redis is meant to be viewed as an ephemeral cache, which means that it might not be the best approach if your user profiles need long-term persistence. However, if you’re building ML models for targeting new users it’s a great choice.

With NoSQL for data science projects, you typically use the data store to update user profiles based on real-time data being passed to the data product. For streaming data, this endpoint can be set up to process data as REST commands, or be set up to process data streams from tools such as Kinesis and Kafka. We’ll walk through a sample deployment for setting up a REST endpoint that incrementally updates user profiles using Redis.

Redis Deployments

There are a variety of options for setting up a Redis instance to cache data for your machine learning project. Here are some of the different options that are available.

  1. Mock Implementations: fakeredis for Python, jedis-mock for Java
  2. Local Deployments: Build and run locally, or use Docker
  3. Hosted Deployments: Run Redis on a cluster in the Cloud
  4. Managed Deployments: AWS ElastiCache, GCP Memorystore, Redis Cloud

When getting started with Redis, using a mock implementation is great for learning the interface to Redis. Also, a mock implementation can be great for setting up unit testing for your application. Once you want to get a read on the potential throughput of your application, it’s good to move to a local instance of Redis running on your machine directly or through Docker. This will enable you to test connecting to Redis, and to profile your application more accurately.

Once you want to put your system into production, you’ll want to move to a Redis cluster. This can be a hosted solution on a cloud platform, where you are responsible for provisioning and monitoring Redis on a cluster of machines, or a managed solution that handles the overhead of maintaining a cluster while providing the same Redis interface. Managed solutions are great for getting up and running with Redis in a production application, but there are some factors to consider when choosing between a hosted and a managed solution for Redis, or other NoSQL solutions:

  1. What are your latency requirements?
  2. What are your memory requirements?
  3. What are your throughput requirements?

With a hosted solution, you can make sure that the Redis cluster is co-located with your data product to ensure minimal latency between your service and the Redis instances. Managed solutions can offer similar locality: with GCP Memorystore, you can place the Redis instance and your service within the same availability zone, which results in sub-millisecond latency for Redis commands, but you do lose the ability to fine-tune your instance configuration when using a managed approach.

The likely factor that will determine a hosted versus managed approach is the anticipated cost of the cluster. With Memorystore you are charged per GB per hour, and there are different tiers of pricing based on capacity. There may also be charges for read or write commands, which is part of the pricing for DynamoDB on AWS. If the costs seem reasonable, then using a managed option may be preferred because it can reduce the amount of DevOps required by your team to maintain the cluster. If you have large memory requirements, such as more than 1TB per region, then a managed solution may not be able to scale to your use case.

For this post, we’ll stick to the mock implementation of Redis, to keep everything related to Python coding, and because Redis is a vast topic that readers should explore in more detail beyond this post.

A Real-time Application in Python

To simulate building feature vectors in real time, we’ll use the Kaggle NHL data set. The game_skater_stats.csv file in this data set provides player-level game summaries, such as the number of shots, goals, and assists completed during a game. We’ll read in this file, filter the rows to a single player, and then iterate through the events and send them to a Flask endpoint that will update the user profile. After sending the event, we’ll also call the endpoint to get an updated player score using a simple linear regression model. The goal is to show how a Flask endpoint can be set up to process a streaming data set and serve real-time model predictions. The full code for this post is available as a Jupyter notebook on GitHub.

For readers new to Redis that want to get started with Python, it’s useful to refer to the redis-py documentation for additional details about the Python interface. For this post, we’ll demonstrate basic functionality using a mock Redis server that implements a subset of this interface. To get started, install the following Python libraries:

pip install pandas
pip install fakeredis
pip install flask

Next, we’ll run through CRUD commands using Redis. The code snippet below shows how to start an in-process Redis service, and retrieve a record using the key 12345. Since we haven’t stored any records yet, the print statement will output None. The remainder of the snippet shows how to create, read, update, and delete records.

To create a record in Redis, which is a key-value entry, we can use the set command, which takes key and value parameters. You can think of the key as an index that Redis uses for retrieving the value. The code above checks for the record and if a record is not found a new user profile is created as a dictionary and the object is saved to Redis using the player ID as the key.

Next, we read the record using the get command in Redis, which retrieves the most recent value saved to the data store. If no record is found, then a value of None is returned. We then translate the serialized value returned from Redis back into a dictionary using the json library.

Next, we update the user summary by incrementing the sessions value and then using the set command to save the updated record to Redis. If you run the create, read, update commands multiple times, then you’ll see the session count update with each run.

The last operation shown above is the delete operation, which can be performed using an expiration or by deleting the key. The delete command immediately removes the key-value pair from the data store, while the expire command will delete the pair once the specified number of seconds has passed. Using the expire command is a common use case, because you may only need to maintain recently updated data.

Next, we’ll pull the NHL data set into memory as a Pandas dataframe and then iterate over the frame, translate the rows into dictionaries, and send the events to the endpoint that we’ll set up. For now, you’ll want to comment out the post and get commands, since the server is not yet running.

For this example, we filter the dataframe to rows with the player ID equal to 8467412, which results in 213 records. Once we have the endpoint set up, we can try iterating through all of the records in order to test the performance of the endpoint.
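The loop described above can be sketched as follows. A tiny stand-in dataframe is used here so the example is self-contained; in the full pipeline the rows come from game_skater_stats.csv, and the post and get calls stay commented out until the server is running:

```python
import json
import pandas as pd

# In the full pipeline this frame comes from the Kaggle NHL file:
#   df = pd.read_csv("game_skater_stats.csv")
# A tiny stand-in frame with a few of its columns is used here so the
# sketch is self-contained.
df = pd.DataFrame({
    "player_id": [8467412, 8467412, 8471234],
    "goals":     [1, 0, 2],
    "assists":   [0, 2, 1],
    "shots":     [4, 3, 5],
})

# filter to a single player to simulate that player's event stream
df = df[df["player_id"] == 8467412]

# translate each row into a dictionary and send it to the endpoint
for _, row in df.iterrows():
    event = dict(row)
    # commented out until the Flask server from the next section is running
    # requests.post("http://localhost:5000/update", json=event)
    # result = requests.get("http://localhost:5000/score",
    #                       params={"player_id": event["player_id"]})
    print(json.dumps(event, default=int))
```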

The code snippet below shows the code for a sample Flask application that sets up two routes. The /update route implements the feature engineering workflow in the application and the /score route implements model application using a simple linear model.

The update route uses the CRUD pattern to update user profiles as new data is received, but it does not include a delete or expiration step. As new events are sent to the server, the application will fetch the most recent player summary, update the record with new data points, and then save the updated profile. This means that the profiles are updated in real-time as data is streamed to the server, minimizing the amount of latency in the model pipeline. One thing to note when using this approach for feature engineering is that only a subset of features can be used compared to SQL, since aggregation commands are not available. You can update counters, set flags, and calculate running means, but you can’t perform operations such as calculating a median or counting distinct values.

The score route takes a player ID and returns a score based on the retrieved feature vector, if available. It parses the player ID from the query string and then fetches the corresponding player summary. The values in this summary are combined with coefficients from a hard-coded linear regression model to return a model prediction. The result is an endpoint that we can call to get model predictions in real-time using up-to-date data.

We now have a mock service that shows how to perform feature engineering and model application in real-time using Redis as a data store. In practice, we’d swap out the mock Redis implementation with a Redis cluster, use a model store for the ML model to apply, run the Flask application using a WSGI server such as gunicorn, and scale the application using tools such as Docker and Kubernetes. There are many different approaches for putting this workflow into production, but we’ve demonstrated the core loop with a simple service.

Conclusion

NoSQL tools are useful for data scientists to explore when they need to move from batch to streaming workflows for building predictive models. By using NoSQL data stores, the latency in a workflow can be reduced from minutes or hours to just seconds, enabling ML models to use up-to-date inputs. While data scientists may not typically be hands-on with these types of data stores, there are a variety of ways to get started with tools such as Redis, and it’s possible to prototype services even within Jupyter notebooks.

A variety of concerns arise when scaling a service for real-time feature engineering with Redis. One concern is concurrent updates to a key, which can occur when different threads or servers update the same profile; this can be handled with the check-and-set pattern. On the model application side, there are also issues such as model maintenance, and it’s common to have separate services for building the feature vectors and for applying models.

Redis and other NoSQL solutions are just one way of implementing real-time feature engineering for data science workflows. Another approach is to use streaming systems such as Kafka or Kinesis, and stream processing tools such as Kinesis Analytics or Apache Flink. It’s good to explore different approaches in order to find the best solution that will work for your organization’s data platform and services.

Ben Weber is a distinguished data scientist at Zynga. We are hiring!
