The Modern Data Economy — Part III of III: Scaling in the Data Economy

May 12th, 2016

Iterative Spark Development

Previously, we discussed how Apache Spark and its SparkSQL module were key components for ingesting diverse data sources with the DataSources and DataFrame APIs on a customer project. As we began picking off individual data sources, the number of Spark jobs grew significantly. We experimented with a small number of larger jobs and, conversely, a larger number of smaller jobs. All jobs were written in Scala and version-controlled in an enterprise GitHub environment. We deployed jobs to the production environment using our favorite automation technology, forgoing Jenkins for the time being.

Understanding the differences between Spark's Resilient Distributed Datasets ("RDDs") and DataFrames was fairly straightforward, since several of us had worked with Python's Pandas library in the past and we all knew Spark's RDD concepts. We have since become intimately familiar with both the DataFrame and DataSources APIs as we read, joined, and wrote data across a variety of platforms, including MySQL, AWS Aurora, Kafka, Hive, HDFS, and Amazon Redshift.

We are particularly excited about our custom-developed SQL ingest program, written in Scala, which can ingest data from 40+ different databases, each with its own schema. It writes data directly to Hive several orders of magnitude faster than the prior Apache Sqoop implementation. The library "discovers" the schema in a relational datastore, creates an appropriate Hive table, and reads and writes the data in parallel using a selectable partition column; a sketch of this pattern appears at the end of this post. Tuning for parallelization across many Spark executors was critical to its success. The core code is approximately 25 lines of Scala and works with both the MySQL and MariaDB JDBC drivers. Our Scala code to write from Hive to Redshift is even smaller, and extremely performant when parallelized across many Spark executors; it is also sketched below.

Scaling Out with YARN

We expect that during peak periods hundreds of parallelized Spark jobs will be running and managed by YARN. Running Spark on YARN...
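As a rough illustration of what an individual job's resource request to YARN looks like, here is a minimal Scala sketch. The configuration keys are standard Spark settings, but the application name, queue, and sizing values are placeholders rather than our actual production settings, and the master URL would normally be passed by spark-submit rather than set in code.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; the master ("yarn-client" or "yarn-cluster" in
// Spark 1.x) is normally supplied by spark-submit rather than set in code.
object YarnJobSettingsSketch {
  def buildContext(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("sql-ingest-orders")
      .set("spark.executor.instances", "8")   // executors this job requests from YARN
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "4g")
      .set("spark.yarn.queue", "etl")         // YARN queue shared by the ingest jobs
    new SparkContext(conf)
  }
}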
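For reference, here is a minimal sketch of the parallel JDBC-to-Hive ingest pattern described earlier. It is not our production library; the connection URL, table, partition column, bounds, and parallelism are all placeholder values, and error handling is omitted.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Hypothetical names throughout: the URL, table, partition column, and
// bounds are placeholders, not values from the production system.
object JdbcToHiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-to-hive"))
    val hiveContext = new HiveContext(sc)

    // Parallel JDBC read: Spark splits the table into numPartitions range
    // queries on the partition column and runs them across the executors.
    // Swap in org.mariadb.jdbc.Driver to use the MariaDB connector instead.
    val df = hiveContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/sales")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "orders")
      .option("partitionColumn", "order_id")   // numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "50000000")
      .option("numPartitions", "64")           // degree of read parallelism
      .load()

    // The schema is "discovered" from the source's JDBC metadata, so
    // saveAsTable can create a matching Hive table and write in parallel.
    df.write.mode(SaveMode.Overwrite).saveAsTable("orders_ingest")
  }
}

In practice, the choice of partition column, its bounds, and the partition count drive the read parallelism, which is where most of the tuning effort went.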
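And a similarly minimal sketch of the Hive-to-Redshift write path, assuming the open-source spark-redshift data source (which stages data in S3 and loads it with a Redshift COPY) is on the classpath; the cluster endpoint, credentials, table names, and S3 bucket are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Placeholder cluster endpoint, credentials, table names, and S3 bucket;
// S3 credentials for the staging directory must be configured separately.
object HiveToRedshiftSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-to-redshift"))
    val hiveContext = new HiveContext(sc)

    // Read the curated Hive table as a DataFrame.
    val df = hiveContext.table("analytics_orders")

    // The connector stages DataFrame partitions to S3 and issues a Redshift
    // COPY, so the write parallelizes across the Spark executors.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://cluster:5439/warehouse?user=etl&password=***")
      .option("dbtable", "analytics.orders")
      .option("tempdir", "s3n://staging-bucket/redshift-tmp/")
      .mode(SaveMode.Append)
      .save()
  }
}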