All of your statistics knowledge, modeling mojo, domain expertise, and hard-won Apache Spark skills have paid off. You've trained a machine-learning (ML) model on months of customer data and it's performing well. You've “soft-trialed” it for recent batches of data and validated everything, and it's time to let the model drive a new product for your business.
But there's one more important question facing you and your business: how will you deploy your Spark ML project?
This article is here to help you answer that question. We'll take a close look at the ways Spark ML models can be put into production, which patterns work best in which situations, and why.
Based on Spark's job execution mechanisms, there are a few guiding principles to keep in mind when working on a Spark ML project. When considered in conjunction with your organization's needs and resources, these principles can help you choose a successful deployment strategy.
Given these guiding principles, there are several ways you can deploy a Spark ML model. In this article, we'll look at all of the widely used approaches, broken down into a few handy categories.
This approach is the simplest one to think about. Just as you wrote code to train a model on a Spark cluster, you can write code to perform predictions on a cluster. Using spark-submit and/or a job scheduler, you can run a batch job (perhaps every 5 minutes, 60 minutes, or 24 hours) wherein you create a DataFrame of inbound records, load the persisted Spark ML model (via Spark ML Pipelines persistence), call “transform” to do the prediction, and write out the results.
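The steps of such a batch job can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not a definitive job: the function name `run_batch_prediction`, its path arguments, and the choice of Parquet for input and output are hypothetical and chosen only for illustration. `PipelineModel.load` and `transform` are the standard Spark ML Pipelines calls for loading a saved pipeline and predicting with it.

```python
def run_batch_prediction(spark, model_path, input_path, output_path):
    """Score one batch of inbound records with a persisted Spark ML pipeline.

    `spark` is an existing SparkSession. All paths are hypothetical
    placeholders supplied by the caller or the job scheduler.
    """
    # Imported inside the function so this file can be imported on a
    # machine that does not have Spark installed.
    from pyspark.ml import PipelineModel

    # 1. Create a DataFrame of inbound records (Parquet is an assumption).
    inbound = spark.read.parquet(input_path)

    # 2. Load the model that the training job saved via Pipelines persistence.
    model = PipelineModel.load(model_path)

    # 3. Predict: transform() returns the input plus the pipeline's output columns.
    scored = model.transform(inbound)

    # 4. Write out the results, replacing any previous output at that path.
    scored.write.mode("overwrite").parquet(output_path)
    return scored
```

A scheduler would typically call this from a script launched with spark-submit, passing in a session obtained from `SparkSession.builder.getOrCreate()`.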
In this pattern, the sizing and composition of the Spark cluster should be based on the volume of data that you are predicting on. Depending on your needs, it may be a very different cluster from the one you needed to train the model.
The batch-predict pattern is perfectly straightforward and is ideal for performing parallel predictions on large numbers of records using Spark's scale-out architecture. But if you need continuous predictions rather than occasional processes run on a scheduled basis, this deployment model will not be appropriate.
In this pattern, we avoid the scheduled spin-up of a Spark cluster and instead keep a Spark app running all the time, with our ML model loaded up and ready to go. Clients submit data for prediction, we perform prediction on that record or group of records, and the result gets returned to the client through an outbound channel.
There are several variants of this pattern, but what they all have in common is that we're scheduling a Spark job for each batch of predictions, and that job incurs several hundred milliseconds (at least) of overhead. Thus, this approach is not suitable for “effectively real-time” predictions.
For example, if you are trying to predict fraudulent commerce transactions or calculate ad bids and your SLA requires a response in 15 milliseconds, this pattern will not work. On the other hand, delayed scenarios like spam classification or data mining for email marketing would likely work fine.
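The shape of such an always-on prediction loop can be sketched with nothing but the Python standard library. This is an illustrative sketch, not a production server: `drain_batch`, `serve_once`, and the `(request_id, features)` request format are hypothetical names, and `predict_batch` is a stand-in for whatever wraps the Spark call (building a DataFrame from the batch and calling “transform” on the already-loaded model). Grouping pending requests means the per-job overhead is paid once per batch rather than once per record.

```python
import queue
from typing import Callable, List, Sequence


def drain_batch(inbound: queue.Queue, max_batch: int) -> list:
    """Pull up to max_batch pending requests without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(inbound.get_nowait())
        except queue.Empty:
            break
    return batch


def serve_once(
    inbound: queue.Queue,
    outbound: queue.Queue,
    predict_batch: Callable[[Sequence[Sequence[float]]], List[float]],
    max_batch: int = 100,
) -> int:
    """Run one prediction "job" over the pending requests.

    Each request is a (request_id, features) tuple. Returns the number of
    requests served, so a caller can sleep briefly when nothing was pending.
    """
    batch = drain_batch(inbound, max_batch)
    if not batch:
        return 0
    request_ids = [request_id for request_id, _ in batch]
    features = [record for _, record in batch]
    # One call per batch: with Spark behind it, this is where the
    # job-scheduling overhead is incurred.
    predictions = predict_batch(features)
    for request_id, prediction in zip(request_ids, predictions):
        outbound.put((request_id, prediction))
    return len(batch)
```

A long-running service would call `serve_once` in a loop, with the inbound and outbound queues replaced by whatever channel the clients actually use.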
Variants of this approach include:
Of these variants, I favor approach 3, using a pool of local model instances and a REST (or custom) job server for operational simplicity. I prefer to avoid the complexity of streaming if I don't absolutely need it, and operating a stateless prediction service using containerized Spark local mode instances is more straightforward than scaling clusters (given that we don't really need a cluster here).
As mentioned earlier, once a model is built, it does not in principle depend on a clustered environment to perform individual predictions. The greatest advantage of running a model outside of Spark is that we can perform predictions with lower latency, allowing us to meet the low-millisecond SLAs for cases like fraud detection or ad bidding mentioned above.
Again, there are several distinct approaches to running Spark ML models outside of Spark.
First, we can re-implement the model in another language or environment. We extract properties of the model (e.g., weights and intercepts for linear regression, or splits and weights in a tree ensemble) and use those to implement the same math in our web service platform (e.g., Node.js). This pattern is straightforward but has some limitations; some models could be quite complicated to re-implement. Moreover, as the model is retrained in the future, the new iterations will need to be implemented on the web service side, eventually necessitating an infrastructure to test, convert, and deploy models automatically.
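For a linear regression, the re-implementation amounts to a dot product plus an intercept. The sketch below uses Python rather than Node.js and assumes the weights have been exported from the trained model as a small JSON document; the JSON layout and the names `LinearModelParams`, `params_from_json`, and `predict` are hypothetical. In Spark ML, the values themselves would come from the fitted model's `coefficients` and `intercept`.

```python
import json
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LinearModelParams:
    """Parameters extracted from a trained linear regression model."""
    coefficients: Tuple[float, ...]
    intercept: float


def params_from_json(text: str) -> LinearModelParams:
    """Parse the (assumed) export format: {"coefficients": [...], "intercept": x}."""
    raw = json.loads(text)
    return LinearModelParams(
        coefficients=tuple(float(w) for w in raw["coefficients"]),
        intercept=float(raw["intercept"]),
    )


def predict(params: LinearModelParams, features: Sequence[float]) -> float:
    """Compute intercept + sum(w_i * x_i), the same math the trained model applies."""
    if len(features) != len(params.coefficients):
        raise ValueError(
            f"expected {len(params.coefficients)} features, got {len(features)}"
        )
    weighted = sum(w * x for w, x in zip(params.coefficients, features))
    return params.intercept + weighted


exported = '{"coefficients": [2.0, -1.0], "intercept": 0.5}'
params = params_from_json(exported)
print(predict(params, [3.0, 4.0]))  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```

Note that any feature preprocessing in the original pipeline (scaling, encoding, and so on) would have to be re-implemented as well, which is part of why this approach becomes costly as models grow more complex.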
Enter Predictive Model Markup Language (PMML), a cross-platform standard for representing machine learning models. Most Spark ML models can be exported (or converted via other libraries) to PMML, where they are represented by an XML-based document that is independent of Spark. The PMML description can then be loaded and executed as part of a separate application, which could contain other features or merely function as a prediction service.
For example, the JPMML project provides open-source Java code for importing and running PMML models as well as for converting Spark ML Pipelines to PMML; the related Openscoring.io offers pre-built services for prediction on a variety of platforms; and Pipeline.io provides a flow for model conversion and serving along with a suite of supporting infrastructure.
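To make the idea of a Spark-independent model document concrete, here is a minimal sketch that reads a linear regression out of PMML using only the Python standard library. The embedded document is a trimmed, hand-written fragment for illustration (a real export also carries elements such as a header, data dictionary, and mining schema), and the function names are hypothetical. A real service would rely on a full evaluator such as those from the JPMML project rather than hand-rolled parsing like this.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Hand-written, simplified PMML fragment (an assumption, not a real export).
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-1.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def load_regression_table(pmml_text: str) -> Tuple[float, Dict[str, float]]:
    """Return (intercept, {field name: coefficient}) from the first RegressionTable."""
    root = ET.fromstring(pmml_text)
    for element in root.iter():
        if local_name(element.tag) == "RegressionTable":
            intercept = float(element.get("intercept", "0"))
            coefficients = {
                child.get("name"): float(child.get("coefficient"))
                for child in element
                if local_name(child.tag) == "NumericPredictor"
            }
            return intercept, coefficients
    raise ValueError("no RegressionTable element found")


def predict(intercept: float, coefficients: Dict[str, float],
            record: Dict[str, float]) -> float:
    """Score one record, given as a mapping from field name to value."""
    return intercept + sum(w * record[name] for name, w in coefficients.items())


intercept, coefficients = load_regression_table(PMML_DOC)
print(predict(intercept, coefficients, {"x1": 3.0, "x2": 4.0}))  # 2.5
```

The point of the sketch is that nothing in it imports or depends on Spark: once the model is in PMML, the serving application only needs an evaluator for that format.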
An alternative is to run actual Spark models (i.e., persisted ML Pipelines) outside of Spark via a helper library that “resembles” Spark APIs. At Spark Summit Europe 2016, Hollin Wilkins and Mikhail Semeniuk presented MLeap and Combust.ML, projects that aim to do exactly that [6].
We are still somewhat in the early days for this set of approaches, but the level of interest and variety of technology being brought to bear on the problem suggest it will mature rapidly. These options, taken together, should allow you to select a deployment plan for your Spark ML model in a real-time online prediction service with a minimum of operational overhead.
Apache Spark has proven to be an excellent environment for building machine learning models using high-level constructs as well as training with datasets that cannot be accommodated on a single server.
The deployment strategies presented here are intended to fill the gap that arises when we need to operationalize those models in a different computing environment (Spark or otherwise) in order to meet business requirements.