Moving Beyond Hadoop: Enterprise Data Science

Most large organizations have a substantial investment in Hadoop/Spark, as well as in other related data collection and storage systems. These JVM-based systems may continue to support ETL and basic analytic use cases, but when it comes to more advanced predictive analytics and line-of-business machine learning applications (e.g., fraud/anomaly detection), they often fail to provide value, obstruct IT evolution, and hold back the overall business.

There are both technical and business issues:

On the technical side, Spark tuning and extension is complex even for experienced data engineers and has proven to be beyond the abilities of most data science teams. Even when data features are successfully extracted, the Spark/Hadoop ecosystem offers only a minimal, self-contained set of functionality for sophisticated analysis: you don't need (or want) a billion records to train a linear model, and if you want a higher-capacity model like XGBoost or deep neural nets, Spark/Hadoop doesn't help. It adds cost and complexity without adding any effectiveness.
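As a rough illustration of that last point, consider how such a model is actually trained once features exist: a manageable sample and ordinary Python libraries, with no cluster framework in the loop. This is only a sketch; the file path, column names, and parameters below are hypothetical.

```python
# Hypothetical sketch: train a gradient-boosted model on an extracted
# feature sample using plain Python tooling (pandas, scikit-learn, xgboost).
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Pull a manageable sample of already-extracted features (placeholder path).
features = pd.read_parquet("features_sample.parquet")
X = features.drop(columns=["is_fraud"])
y = features["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train directly -- no Spark or Hadoop involved at this stage.
model = xgb.XGBClassifier(n_estimators=200, max_depth=6)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```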

The business issues, meanwhile, are equally challenging. The commercial support ecosystem for open-source Hadoop/Spark is collapsing. In 2019, Hortonworks merged into Cloudera, which has never turned a profit; the merger was followed by a depressed stock price and talent flight. MapR was never large and was likewise dissolved into HPE in 2019, with no commitment to future open-source development. Databricks is succeeding, but it offers support only to customers of its own software: it is fundamentally a SaaS company rather than an OSS support company, and it has already begun separating its business from the OSS/Spark ecosystem.

"Third-party" support/consulting organizations like Accenture and Booz-Allen are not particularly effective in this area and are not in a position to offer a dynamic future for this tech.

In short: the technology is limited, and the commercial support that most large organizations require in order to use OSS is rapidly approaching a dead end. Meanwhile, the power of and business opportunity in machine learning/AI keeps growing, and that space is dominated by more open, non-profit-supported, Python-based tooling.

What is a practical path forward?

Two major technology paths are empowering data scientists to use the skills they already have in order to address vastly larger datasets (as well as more advanced modeling techniques).

The first -- in wide use for large-scale scientific computing and now gaining ground in the enterprise -- is Dask. Dask is a simple but powerful scheduler that scales Pythonic code from a single laptop to thousands of compute nodes, and does so in a way that is friendlier to users -- and, in particular, to data scientists.
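To make that concrete, here is a minimal sketch of the Dask pattern: the same pandas-style code runs on a laptop or, by pointing the client at a remote scheduler, across a cluster. The paths and column names are illustrative assumptions, not part of any particular deployment.

```python
# Minimal Dask sketch: pandas-style code that scales from laptop to cluster.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Start a local cluster; in production this could instead be a remote
# scheduler address, e.g. Client("scheduler-host:8786").
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Lazily read many files as one logical DataFrame (placeholder path).
df = dd.read_csv("s3://example-bucket/transactions-*.csv")

# Familiar pandas-style operations, executed in parallel across partitions.
totals_by_merchant = df.groupby("merchant_id")["amount"].sum()
print(totals_by_merchant.compute())  # .compute() triggers the parallel work
```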

The second is the widespread adoption of GPUs (graphics processing units, named for their original role) for data science and analytics. GPU compute delivers orders-of-magnitude speedups on analytics and modeling tasks at lower cost and, when coupled with Python interfaces, is now easy for data scientists, analysts, and engineers to use. NVIDIA, the preeminent maker of GPU hardware, dominates this market, and its chips and "RAPIDS" software tooling are replacing Intel-based CPUs wherever performance matters.
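A short sketch of what this looks like in practice, using RAPIDS cuDF, which mirrors the pandas API while running on an NVIDIA GPU. The file and columns are placeholders; a CUDA-capable GPU and the cudf package are assumed.

```python
# RAPIDS cuDF sketch: the same DataFrame idioms, executed on the GPU.
import cudf

# Load data directly into GPU memory (placeholder file).
gdf = cudf.read_csv("transactions.csv")

# pandas-style groupby/aggregation, now GPU-accelerated.
summary = gdf.groupby("merchant_id")["amount"].mean()
print(summary.head())
```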

Even better, these two sets of tools -- Dask and Python-enabled GPU compute -- work together as part of a single ecosystem. Together with emerging GPU-based SQL tools, they form the heart of the next generation of data and AI computation. Well-known enterprises that train with ProTech have already adopted this combination of Dask + RAPIDS + GPUs and accomplish more work, with far fewer headaches, at a lower cost than with their previous Hadoop-based systems.
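For completeness, here is a hedged sketch of the combined pattern, using dask_cudf to partition a dataset across one or more GPUs while keeping the same DataFrame-style code. The cluster setup, file paths, and column names are illustrative assumptions only.

```python
# Dask + RAPIDS sketch: multi-GPU DataFrames with the same high-level code.
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker per local GPU; a multi-node setup would point at a remote scheduler.
cluster = LocalCUDACluster()
client = Client(cluster)

# Each partition is a cuDF DataFrame living in GPU memory (placeholder path).
ddf = dask_cudf.read_csv("s3://example-bucket/transactions-*.csv")

fraud_rate = ddf.groupby("merchant_id")["is_fraud"].mean()
print(fraud_rate.compute().head())
```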

Join ProTech and our data science specialist, Adam Breindel, for a short webinar, "Moving beyond Hadoop: bigger, faster, easier enterprise big data." Adam brings more than 20 years of industry experience and a passion for teaching and sharing his expertise.