Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

April 25th, 2017

The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing), and it's easy to spawn workers to consume data from each shard (the Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream.

Spark Streaming implements a receiver using the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB) which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics.

When we tried to implement the recommended solution using Spark checkpointing, we found it very difficult to make code changes without breaking our checkpoints. When Spark saves checkpoints, it serializes the classes which define the transformations and then uses that to restart a stopped stream. If you then change the structure of one of the transformation classes, the checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application's checkpoints; however, in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, in combination with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing and pursue a simpler, albeit somewhat hacky, alternative.

Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time. (If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number.) When starting up a Spark...
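For readers new to the receiver-based approach mentioned above, here is a minimal Scala sketch (against the older KinesisUtils receiver API) of how a Spark Streaming application attaches to a Kinesis stream. The application name, stream name, region, batch interval, and checkpoint interval are placeholders rather than the values from our pipeline.

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KinesisConsumerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kinesis-spark-consumer")
    val ssc  = new StreamingContext(conf, Seconds(10))   // batch interval (placeholder)

    val stream = KinesisUtils.createStream(
      ssc,
      "my-kcl-application",                       // KCL app name; also the DynamoDB table used for Kinesis checkpoints
      "my-kinesis-stream",                        // Kinesis stream name (placeholder)
      "https://kinesis.us-east-1.amazonaws.com",  // Kinesis endpoint
      "us-east-1",                                // region
      InitialPositionInStream.TRIM_HORIZON,       // where to start when no Kinesis checkpoint exists yet
      Seconds(10),                                // Kinesis checkpoint interval written to DynamoDB
      StorageLevel.MEMORY_AND_DISK_2)

    // Each record arrives as Array[Byte]; downstream transformations go here.
    stream.map(bytes => new String(bytes, "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this receiver, the Kinesis Client Library checkpoints to DynamoDB at the interval shown, independently of any Spark checkpointing.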

Spark Geospatial Analysis with B23 Data Platform

August 30th, 2016

As a member of the B23 Data Platform development and data science team, I've been excited to help launch new, innovative, and secure features that allow Data Scientists to process data more effectively and quickly than previously possible. We launched B23 Data Platform in early 2016 as a data science orchestration platform. B23 Data Platform is both a marketplace for data and a marketplace of big data Stacks that can be provisioned in minutes. Most automated provisioning tools are just a "blank canvas," but with B23 Data Platform you have access to both data sets and secure Stacks in the cloud. Using B23's EasyIngest capability, data scientists are only a few mouse clicks and several minutes away from analyzing data in a securely provisioned Stack.

Recently, I had the opportunity to work on a project that highlights the capabilities of B23 Data Platform: geospatial analysis using Apache Spark. This involved a large dataset containing network-based location data in geographic coordinates. Specifically, this dataset contained over a billion rows of latitude and longitude measurements with timestamps spanning a multi-year period. The challenge was to figure out how many of these "locations" map to certain points of interest ("POI") each day, starting from this initial, raw dataset. I was able to complete this geospatial enrichment task in the following 5 steps:

1. Acquire POI data
2. Determine how to join the data sets
3. Transform the data
4. Geo-fence the POI data
5. Run the spatial join

Acquire POI Data

My first step was to pull a second dataset containing geospatial data for a particular POI. We used POI data that contains the locations for 1000+ restaurant chains in North America. In one case, I downloaded the data for the restaurant Chipotle, which has 2,076 locations (as of 6/13/16).

Determine How to Join Datasets

Now that I had acquired my datasets, I needed a plan to join them. The first dataset contains about 6.5 million records per day. After doing some quick math, I realized that joining each of these...
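To make steps 4 and 5 concrete, below is a minimal Spark Scala sketch of one common way to geo-fence a small POI table and spatially join it against a very large location dataset: bucket both sides into coarse lat/lon grid cells, broadcast the POI side, and confirm candidate matches with a haversine distance test. The paths, column names, cell size, and radius are illustrative assumptions, not the exact approach used on B23 Data Platform.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object PoiSpatialJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("poi-spatial-join").getOrCreate()
    import spark.implicits._

    val radiusMeters = 100.0  // how close a ping must be to count as "at" a POI (assumption)

    // Great-circle distance in meters between two lat/lon points given in degrees.
    val haversine = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
      val r    = 6371000.0
      val dLat = math.toRadians(lat2 - lat1)
      val dLon = math.toRadians(lon2 - lon1)
      val a = math.pow(math.sin(dLat / 2), 2) +
        math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
      2 * r * math.asin(math.sqrt(a))
    }

    // Coarse grid key (roughly 1 km cells) so the billion-row side joins against a small,
    // broadcast POI table instead of cross-joining every point with every POI.
    // (A production version would also match neighboring cells to avoid edge effects.)
    def cellKey(lat: Column, lon: Column): Column =
      concat_ws(":", floor(lat * 100), floor(lon * 100))

    val locations = spark.read.parquet("s3://example-bucket/locations/")  // columns: lat, lon, ts (assumed)
      .withColumn("cell", cellKey($"lat", $"lon"))

    val poi = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("s3://example-bucket/poi/chipotle.csv")                        // columns: poi_id, poi_lat, poi_lon (assumed)
      .withColumn("cell", cellKey($"poi_lat", $"poi_lon"))

    // Candidate pairs share a grid cell; the distance test keeps only true geo-fence hits.
    val matched = locations
      .join(broadcast(poi), Seq("cell"))
      .where(haversine($"lat", $"lon", $"poi_lat", $"poi_lon") <= radiusMeters)

    // Daily counts of location pings that fall inside each POI's geo-fence.
    matched.groupBy(to_date($"ts").as("day"), $"poi_id").count().show()
  }
}
```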

The Modern Data Economy — Part III of III: Scaling in the Data Economy

May 12th, 2016

Iterative Spark Development

Previously, we discussed how Apache Spark and its SparkSQL module were a key component of ingesting diverse data sources for a customer project, using the DataSources and DataFrame APIs. As we began picking off individual data sources, the number of Spark jobs began to increase significantly. We experimented with a smaller number of larger jobs and, conversely, a larger number of smaller jobs. All jobs were written in Scala and version controlled in an enterprise GitHub environment. We deployed jobs to the production environment using our favorite automation technology, forgoing Jenkins for the time being.

Understanding the differences between Spark Resilient Distributed Datasets ("RDDs") and DataFrames was fairly straightforward, since several of us had worked with Python's pandas library in the past and we all knew Spark's RDD concepts. We have since become intimately knowledgeable about both the DataFrame and DataSources APIs as we read, joined, and wrote data across a variety of platforms including MySQL, AWS Aurora, Kafka, Hive, HDFS, and Amazon Redshift.

We are particularly excited about our custom-developed SQL ingest program, written in Scala, that is capable of ingesting data from 40+ different databases, all with various schemas. It writes data directly to Hive several orders of magnitude faster than the prior Apache Sqoop implementation. This library "discovers" the schema in a relational datastore, creates an appropriate Hive table, and reads and writes the data in parallel using a selectable partition column. Tuning for parallelization across many Spark executors was critically important for success. The core code is approximately 25 lines of Scala and is capable of using both the MySQL and MariaDB JDBC drivers. Our Scala code to write from Hive to Redshift is even smaller, and extremely performant when parallelized across many Spark executors.

Scaling Out with YARN

We expect that during peak periods hundreds of parallelized Spark jobs will be running and managed by YARN. Running Spark on YARN...
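As an illustration of the ingest pattern described above (discover a table's partition-column bounds, read the table in parallel chunks over JDBC, and write it straight to Hive), here is a minimal Scala sketch; the connection details, credentials, table names, and partition settings are placeholders, and this is not B23's actual ingest library.

```scala
import org.apache.spark.sql.SparkSession

object JdbcToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("jdbc-to-hive-ingest")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder connection details; the same pattern works with the MySQL or MariaDB JDBC driver.
    val jdbcUrl   = "jdbc:mysql://db.example.com:3306/source_db"
    val table     = "orders"
    val partCol   = "id"   // selectable partition column used to split the read
    val numChunks = 64     // degree of read parallelism across Spark executors
    val password  = sys.env.getOrElse("DB_PASSWORD", "")

    // Discover the partition column's bounds so the read can be split evenly.
    val bounds = spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", s"(SELECT MIN($partCol) AS lo, MAX($partCol) AS hi FROM $table) b")
      .option("user", "ingest").option("password", password)
      .load()
      .first()

    // Parallel read: Spark issues numChunks range queries over partCol, one per partition.
    val df = spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", table)
      .option("user", "ingest").option("password", password)
      .option("partitionColumn", partCol)
      .option("lowerBound", bounds.getAs[Number]("lo").longValue())
      .option("upperBound", bounds.getAs[Number]("hi").longValue())
      .option("numPartitions", numChunks)
      .load()

    // Write straight into a Hive table ("ingest" database is a placeholder);
    // the Hive schema is derived from the discovered JDBC schema.
    df.write.mode("overwrite").saveAsTable(s"ingest.$table")
  }
}
```

The partition settings are what turn the read into many concurrent range queries, which is what lets this style of ingest scale across executors.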

The Modern Data Economy — Part II of III: Tools of the Data Economy

May 11th, 2016

January 2013: Introduction to Spark and Shark

At the heart of our data transformation and productization approach are a small number of important tools, including Apache Spark and the nascent Apache Airflow (incubating). I first heard of Apache Spark in January 2013 while working at Amazon Web Services ("AWS"). I was part of the team that launched Apache Accumulo on Elastic MapReduce ("EMR"). During this time I was able to collaborate with a team member who was implementing a similar capability with Spark and Shark. At that time, Spark and Shark were both UC Berkeley AMPLab projects, and their similar-sounding names confused more than a handful of people.

In the years since, and continuing into the present, there have been many misconceptions about Spark. Some of these misconceptions include that Spark only works with in-memory data sets, that it was only intended for machine learning, and that it could not scale like MapReduce. Spark has grown so quickly that we believe some of its best capabilities are not well understood. Spark is not perfect, but in our real-world implementation experience we have found Spark to be one of the most ubiquitous, effective solutions for almost every conceivable data requirement. It has become the Swiss Army Knife for the Data Economy.

Spark and Its Subsystems

Early on, we made a deliberate decision to use Scala for all of our Spark development work. We had years of experience tuning Java Virtual Machines ("JVMs"), and most of us had functional programming backgrounds due to a recent immersion in the Clojure programming language. Unrelated to Spark, we had recently run into Python's global interpreter lock ("GIL") on a petabyte-scale ingest project, and we definitely wanted to make sure we did not find ourselves in that boat again anytime soon. We still use Python quite a bit, but Scala was an easy choice overall, with several positive outcomes detailed below. Spark's MLlib Machine Learning capabilities get the lion's share of attention these...

The Modern Data Economy — Part I of III: Status of the Data Economy

May 10th, 2016

Dave Hirko is a Managing Director and Co-Founder of B23. Prior to co-founding B23, Dave was an Account Executive at Amazon Web Services.

Industrial Revolution Meets Modern Data Economy

When we started B23 several years ago, our initial focus was to enable a variety of Hadoop distributions on the Cloud, specifically on AWS, since that was our background and experience at the time. During this period, we met with many executives at the leading Hadoop companies who wished us luck and implied that running Hadoop on the Cloud was likely just a niche. They said bare metal was the only way to go. Fast forward several years (and over a billion dollars of investment in this area overall), and we estimate that a significant number of customers are running Hadoop, Spark, and other distributed processing applications in the Cloud. Entire "unicorn" software companies have bet their business models on running these platforms in the Cloud. The mainstream economy is following suit. The Cloud, combined with the appropriate level of automation, is a more secure, agile, and competitive platform even for the most data-sensitive organizations. The top players in the Data Economy understand this premise and have embraced it totally.

The path forward is definitely not for the faint of heart, and having a capable guide on this journey is important. Scaling technology is hard and important, and finding great people is even harder and more important. Our company has been fortunate to have worked on some of the most cutting-edge data modernization projects in the world. Our backgrounds prior to B23 share that commonality. We have continued to attract and retain many of the best software developers and data scientists in the country, and we continue to seek out the best. Our amazing team, a pragmatic realization of how software can enable the modern data enterprise, and an understanding of the pulse of our customers have allowed us to become market leaders in this new economy. B23's roots started with challenging preconceived notions of "viability,"...