May 11th, 2016

January 2013 Introduction to Spark and Shark

At the heart of our data transformation and productization approach are a small number of important tools including Apache Spark and the nascent Apache Airflow (incubating). I first heard of Apache Spark in January 2013 while working at Amazon Web Services (“AWS”). I was part of the team that launched Apache Accumulo on Elastic MapReduce (“EMR”). During this time I was able to collaborate with a team member who was implementing a similar capability with Spark and Shark. At that time, Spark and Shark were both UC Berkeley AMPLab projects, and their similar sounding names confused more than a handful of people. In the years since, and continuing into the present, there have continued to be many misconceptions about Spark. Some of these misconceptions include Spark only working with in-memory data sets, that Spark was only intended for machine learning, that Spark could not scale like MapReduce, etc. Spark has grown so fast and quickly, that we believe some of its best capabilities are not well understood. Spark is not perfect, but in our real-world implementation experience we have found Spark to be one of the most ubiquitous, effective solutions for almost every conceivable data requirement. It has become the Swiss Army Knife for the Data Economy.

Spark and Its Subsystems

Early on, we made a deliberate decision early to use Scala for all of our Spark development work. We had years of experience tuning Java Virtual Machines (“JVM’s”), and most of us had functional programming backgrounds due to a recent immersion in the Clojure programming language. Unrelated to Spark, we had recently run into Python’s global interpreter lock (“GIL”) when operating on a petabyte-scale ingest project, and we definitely wanted to make sure we did not find ourselves in that boat again anytime soon. We still use Python quite a bit, but Scala was an easy choice overall with several positive outcomes detailed below.
Spark’s MLlib Machine Learning capabilities get the lion’s share of attention these days, and we are currently using Spark MLlib on several customer projects including with one of the East Coast’s premier data science organizations. Apache Zeppelin is a central part of that project as well.
What has really impressed us, and that does not get as much attention, is Spark’s DataSource and DataFrame application programming interfaces (“API’s”). At the time of writing, these are grouped into the SparkSQL module. These API’s are hidden gems, and it’s not entirely intuitive that they would be found inside the SparkSQL module. SparkSQL has its roots with the previously discussed Shark project.


Durable, Fast Data Pipelines using Spark DataSources API

Building durable and fast data pipelines is a critical prerequisite for analyzing data. Unfortunately, in our opinion, the term “data science” has become too synonymous with just analyzing data, when in fact acquiring, transforming, and reliably storing data are equally critical aspects. The modern enterprise acquires data from so many different sources that a multitude of Extract-Transform-Load (“ETL”) tools are usually required to just get data into a format so it can be analyzed. In one on-going customer project where B23 is enabling the data transformation of an entire enterprise, the legacy ETL process was comprised of dozens of shell, python, and ruby scripts launched from Jenkins’ servers. The aggregate data was near petabyte scale, and having so many diverse scripts running on Jenkins was not a scalable approach to taming this environment.
With this particular project we are collecting data from several dozen relational databases all running on AWS, an on-premise Hadoop cluster repository that is being migrated to AWS, Apache Kafka with an Apache Storm topology, Hbase, Cassandra, and several “drop zones” where data is placed in a specific location by third parties on a period basis.
A single “logical” data repository was required to model fact and dimension tables for business analysts, with the ultimate goal to enable more sophisticated machine learning and gain deeper insight into their product offerings. We use the term “logical” since we store some data directly on HDFS and some on S3 with an appropriately configured Hive metastore to query both sources. This hybrid approach allowed us to recognize cost savings for higher-latency data queries. Enterprise business intelligence (“BI”) users had previously used Hive to perform ad-hoc queries on the entire corpus of data, and connected their BI visualization tools to a MySQL data store for querying a subnet of the data. Both approaches were slow, and in need of an upgrade. Scaling MySQL with FusionIO cards could only go so far. In the new enterprise data platform, Hive-on-Spark is the ad-hoc query tool, and AWS Redshift populated by Spark jobs replaces the MySQL datastore for supporting business intelligence (“BI”) reporting. The difference in speed between Redshift and MySQL is night and day.

In Part III will we discuss Scalability in the Modern Data Economy.


Dave Hirko is a Managing Director and Co-Founder of B23. Prior to co-founding B23, Dave was an Account Executive at Amazon Web Services.