All of your statistics knowledge, modeling mojo, domain expertise, and hard-won Apache Spark skills have paid off. You've trained a machine-learning (ML) model on months of customer data and it's performing well. You've “soft-trialed” it for recent batches of data and validated everything, and it's time to let the model drive a new product for your business.
But there's one more important question facing you and your business: how will you deploy your Spark ML project?
This article is here to help you answer that question. We'll take a close look at the ways Spark ML models can be put into production, which patterns work best in which situations, and why.
Spark Execution and Spark ML: Key Principles
Based on Spark's job execution mechanisms, there are a few guiding principles to keep in mind when working on a Spark ML project. When considered in conjunction with your organization's needs and resources, these principles can help you choose a successful deployment strategy.
- Performing a prediction with a Spark ML model is typically a very lightweight operation for each incoming data point, taking milliseconds or less to complete.
- Running a Spark job (e.g., making predictions on a data set and then moving those predictions somewhere) on a running cluster is significantly more expensive (perhaps by hundreds of milliseconds up to seconds), mainly due to job scheduling overhead.
- Starting up a new Spark application cluster (or a new SparkContext/SparkSession) is even more expensive, taking anywhere from several seconds to tens of seconds.
- Spark is optimized for throughput on big data computations, so on a suitable cluster, predicting for 1 million rows may be just as quick as predicting for 1,000 rows. Predicting in bulk makes sense for Spark — but it may or may not make sense for your business.
- Nothing about a trained Spark ML model conceptually requires it to be executed in Spark or on a cluster, which opens up additional deployment possibilities.
Given these guiding principles, there are several ways you can deploy a Spark ML model. In this article, we'll look at all of the widely used approaches, broken down into a few handy categories.
Batch Mode Bulk Prediction on a Spark Cluster
This approach is the simplest one to think about. Just as you wrote code to train a model on a Spark cluster, you can write code to perform predictions on a cluster. Using spark-submit and/or a job scheduler, you can run a batch job (perhaps every 5 minutes, 60 minutes, or 24 hours) wherein you create a DataFrame of inbound records, load the saved Spark ML model (via Spark ML Pipelines persistence), call “transform” to do the prediction, and write out the results.
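A minimal sketch of such a batch job might look like the following (the paths and column names here are hypothetical, and the job would be packaged and launched on a schedule via spark-submit):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object BatchPredict {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-predict")
      .getOrCreate()

    // Load the persisted pipeline (hypothetical path)
    val model = PipelineModel.load("hdfs:///models/churn-pipeline")

    // Read the inbound records accumulated since the last run
    val inbound = spark.read.parquet("hdfs:///data/inbound/latest")

    // transform() appends a "prediction" column to the DataFrame
    val scored = model.transform(inbound)

    // Write the results somewhere downstream consumers can find them
    scored.select("id", "prediction")
      .write.mode("overwrite")
      .parquet("hdfs:///data/predictions/latest")

    spark.stop()
  }
}
```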
In this pattern, the sizing and composition of the Spark cluster should be based on the volume of data that you are predicting on. Depending on your needs, it may be a very different cluster from the one you needed to train the model.
The batch-predict pattern is perfectly straightforward and is ideal for performing parallel predictions on large numbers of records using Spark's scale-out architecture. But if you need continuous predictions rather than periodic scheduled runs, this deployment model will not be appropriate.
Exposing a Prediction Service on a Long-Running Spark Application
In this pattern, we avoid the scheduled spin-up of a Spark cluster and instead keep a Spark application running all the time, with our ML model loaded up and ready to go. Clients submit data for prediction, we perform prediction on that record or group of records, and the results are returned to the client through an outbound channel.
There are several variants of this pattern, but what they all have in common is that we're scheduling a Spark job for each batch of predictions, and that job incurs several hundred milliseconds (at least) of overhead. Thus, this approach is not suitable for “effectively real-time” predictions.
For example, if you are trying to predict fraudulent commerce transactions or calculate ad bids and your SLA requires a response in 15 milliseconds, this pattern will not work. On the other hand, delayed scenarios like spam classification or data mining for email marketing would likely work fine.
Variants of this approach include:
1. Spark Streaming: Inbound records come from a streaming source, such as Kafka, S3, HDFS, or Kinesis; prediction is performed for each micro-batch; and the output is written to, e.g., another Kafka topic or Cassandra for client consumption (a minimal sketch follows this list).
2. Jobserver: A REST service is exposed via Spark Jobserver[1], Livy[2], or your own custom code in front of Spark. Clients make requests for predictions and receive either an immediate response or a token they can use to retrieve prediction results later.
3. Spark Local Mode plus Streaming/Jobserver: This approach is similar to options 1 and 2 above, except that instead of having one or more multi-node Spark clusters, we define an elastic, horizontally scaling pool of single-node Spark local mode instances (i.e., a Spark driver with no separate executors) and load balance requests across this pool of machines.
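For the streaming variant, here is a minimal sketch using Structured Streaming (a swap-in for classic DStream-based Spark Streaming; the broker address, topics, paths, and record schema are all hypothetical):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val spark = SparkSession.builder().appName("stream-predict").getOrCreate()

// Load the model once at startup; it stays resident for the life of the app
val model = PipelineModel.load("hdfs:///models/churn-pipeline")

// Hypothetical schema for the inbound JSON records
val schema = new StructType()
  .add("id", StringType)
  .add("feature1", DoubleType)
  .add("feature2", DoubleType)

// Read micro-batches of records from a Kafka topic
val inbound = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "inbound-records")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("r"))
  .select("r.*")

// Score each micro-batch and publish results to an outbound topic
model.transform(inbound)
  .selectExpr("id AS key", "CAST(prediction AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "predictions")
  .option("checkpointLocation", "hdfs:///checkpoints/stream-predict")
  .start()
  .awaitTermination()
```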
Of these variants, I favor approach 3, using a pool of local mode instances and a REST (or custom) job server, for operational simplicity. I prefer to avoid the complexity of streaming if I don't absolutely need it, and operating a stateless prediction service using containerized Spark local mode instances is more straightforward than scaling clusters (given that we don't really need a cluster here).
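To make approach 3 concrete, here is roughly what each instance in the pool would do, a minimal sketch assuming a hypothetical model path and feature schema; the HTTP layer (Jobserver, Livy, or custom) is omitted:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Hypothetical shape of an incoming prediction request
case class Record(feature1: Double, feature2: Double)

object LocalPredictionService {
  // Local mode: the driver runs tasks itself, so there are no executors
  // to provision and nothing cluster-shaped to manage
  val spark = SparkSession.builder()
    .appName("prediction-service")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Load once at startup; serve many requests against the resident model
  val model = PipelineModel.load("/models/churn-pipeline")

  // A request handler turns a payload into a one-row DataFrame and scores it
  def predict(r: Record): Double =
    model.transform(Seq(r).toDF())
      .select("prediction")
      .head()
      .getDouble(0)
}
```

Because each instance is stateless apart from the loaded model, the pool can scale horizontally behind an ordinary load balancer.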
Building a Prediction Service Without Running Spark
As mentioned earlier, once a model is built, it does not in principle depend on a clustered environment to perform individual predictions. The greatest advantage of running a model outside of Spark is lower prediction latency, which lets us meet the low-millisecond SLAs of the fraud detection and ad bidding scenarios mentioned above.
Again, there are several distinct approaches to running Spark ML models outside of Spark.
First, we can re-implement the model in another language or environment. We extract properties of the model (e.g., weights and intercepts for linear regression, or splits and weights in a tree ensemble) and use those to implement the same math in our web service platform (e.g., Node.js). This pattern is straightforward but has some limitations; some models could be quite complicated to re-implement. Moreover, as the model is retrained in the future, the new iterations will need to be implemented on the web service side, eventually necessitating an infrastructure to test, convert, and deploy models automatically.
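As a small illustration of what "extracting the math" looks like, this sketch pulls the fitted parameters out of a trained linear regression model (the path is hypothetical); the serving-side re-implementation is then just a dot product plus an intercept:

```scala
import org.apache.spark.ml.regression.LinearRegressionModel

// Load a trained model (hypothetical path) and extract its parameters
val model = LinearRegressionModel.load("/models/linreg")
val weights = model.coefficients.toArray
val intercept = model.intercept

// The re-implemented prediction is the same math: w . x + b.
// Porting this loop to Node.js, Go, etc. is trivial; porting a large
// tree ensemble or a whole feature pipeline is much less so.
def predict(features: Array[Double]): Double =
  features.zip(weights).map { case (x, w) => x * w }.sum + intercept
```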
Enter the Predictive Model Markup Language (PMML), a cross-platform standard for representing machine learning models. Most Spark ML models can be exported to PMML directly or converted via other libraries; the result is an XML-based document that is independent of Spark. The PMML description can then be loaded and executed as part of a separate application, which could contain other features or merely function as a prediction service.
For example, the JPMML project provides open-source Java code for importing and running PMML models as well as for converting Spark ML Pipelines to PMML[3]; the related Openscoring.io offers pre-built services for prediction on a variety of platforms[4]; and Pipeline.io provides a flow for model conversion and serving along with a suite of supporting infrastructure[5].
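As a sketch of the conversion step, assuming the JPMML-SparkML library's PMMLBuilder API (version compatibility with your Spark release should be checked against the project's documentation; `trainingDf` and the paths are hypothetical):

```scala
import java.io.File

import org.apache.spark.ml.PipelineModel
import org.jpmml.sparkml.PMMLBuilder

// The converter needs the fitted pipeline plus the schema of the
// DataFrame it was trained on (trainingDf is assumed to be that DataFrame)
val pipelineModel = PipelineModel.load("/models/churn-pipeline")
val pmmlFile = new PMMLBuilder(trainingDf.schema, pipelineModel)
  .buildFile(new File("churn-model.pmml"))

// The resulting XML document can now be served by any PMML engine,
// e.g., an Openscoring instance, with no Spark dependency at runtime
```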
An alternative is to run actual Spark models (i.e., persisted ML Pipelines) outside of Spark via a helper library that “resembles” Spark APIs. At Spark Summit Europe 2016, Hollin Wilkins and Mikhail Semeniuk presented MLeap and Combust.ML, projects that aim to do exactly that[6][7].
We are still somewhat in the early days for this set of approaches, but the level of interest and variety of technology being brought to bear on the problem suggest it will mature rapidly. These options, taken together, should allow you to select a deployment plan for your Spark ML model in a real-time online prediction service with a minimum of operational overhead.
Conclusion
Apache Spark has proven to be an excellent environment for building machine learning models using high-level constructs as well as training with datasets that cannot be accommodated on a single server.
The deployment strategies presented here are intended to fill the gap that arises when we need to operationalize those models in a different computing environment (Spark or otherwise) in order to meet business requirements.