
The Modern Data Economy — Part III of III: Scaling in the Data Economy

May 12th, 2016

Iterative Spark Development

Previously, we discussed how Apache Spark and its Spark SQL module were key components in ingesting diverse data sources for a customer project, using the DataSources and DataFrame APIs. As we began picking off individual data sources, the number of Spark jobs grew significantly. We experimented with a smaller number of larger jobs and, conversely, a larger number of smaller jobs. All jobs were written in Scala and version-controlled in an enterprise GitHub environment. We deployed jobs to the production environment using our favorite automation technology, forgoing Jenkins for the time being. Understanding the differences between Spark's Resilient Distributed Dataset ("RDD") and DataFrames was fairly straightforward, since several of us had worked with Python's pandas library in the past and we all knew Spark's RDD concepts. We have since become intimately familiar with both the DataFrame and DataSources APIs as we read, joined, and wrote data across a variety of platforms including MySQL, AWS Aurora, Kafka, Hive, HDFS, and Amazon Redshift.
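The pandas analogy is what made the transition easy: an RDD is a distributed collection of opaque records accessed positionally, while a DataFrame attaches a schema so operations can reference named, typed columns. As a loose, Spark-free illustration of the difference (plain Python here, nothing below is Spark API):

```python
# RDD-style: rows are plain tuples; the meaning of each field lives in your head.
rdd_like = [(1, "alice", 31.0), (2, "bob", 27.5)]
total = sum(row[2] for row in rdd_like)  # positional access

# DataFrame-style: rows carry a schema, so operations reference named columns,
# which is also what lets Spark SQL optimize queries over them.
schema = ("id", "name", "score")
df_like = [dict(zip(schema, row)) for row in rdd_like]
total_named = sum(row["score"] for row in df_like)

print(total, total_named)  # same result either way
```

The schema is the point: it is what allows the same code to read from MySQL one day and Redshift the next without positional bookkeeping.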

We are particularly excited about our custom-developed SQL ingest program, written in Scala, which is capable of ingesting data from 40+ different databases with varying schemas. It writes data directly to Hive several orders of magnitude faster than the prior Apache Sqoop implementation. The library "discovers" the schema in a relational datastore, creates an appropriate Hive table, and reads and writes the data in parallel using a selectable partition column. Tuning the parallelization across many Spark executors was critical to its success. The core code is approximately 25 lines of Scala and works with both the MySQL and MariaDB JDBC drivers. Our Scala code to write from Hive to Redshift is even smaller, and extremely performant when parallelized across many Spark executors.
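The parallel read hinges on Spark's JDBC options (partitionColumn, lowerBound, upperBound, numPartitions): the driver splits the partition column's value range into one WHERE clause per partition, and each executor issues its own query against the source database. Here is a simplified, illustrative Python sketch of that range-splitting; the function name is ours, and Spark's actual Scala implementation handles more edge cases:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on a numeric column into one WHERE clause per
    partition, roughly as Spark's JDBC data source does.
    Assumes num_partitions >= 2."""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            # First partition also sweeps up NULLs and values below lowerBound.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is unbounded above so no rows past upperBound are lost.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

for pred in jdbc_partition_predicates("id", 0, 1000000, 4):
    print(pred)
```

Each predicate becomes a separate Spark task, which is why choosing a partition column with an even value distribution matters so much: a skewed column leaves most executors idle while one does all the work.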

Scaling Out with YARN

We expect that during peak periods hundreds of parallelized Spark jobs will be running, managed by YARN. Running Spark on YARN was an obvious choice given the underlying Hadoop infrastructure we had already put in place. We have not reached the nirvana of dynamic allocation with Spark on YARN, and still have to explicitly allocate cluster resources for different Spark jobs. For several of us, it felt like tuning C programs to allocate and deallocate memory, just at a different scale. On the other hand, we have observed significant performance gains from Spark's relatively new support for rack awareness, which was a pleasant surprise.
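Without dynamic allocation, that explicit tuning lives in per-job settings. A sketch of the kind of configuration involved (the property keys are real Spark-on-YARN settings; the numbers are placeholders for illustration, not our production values):

```properties
spark.master                        yarn
spark.dynamicAllocation.enabled     false
spark.executor.instances            20
spark.executor.cores                4
spark.executor.memory               8g
spark.yarn.executor.memoryOverhead  1024
spark.driver.memory                 4g
```

Every job gets its own version of these numbers, and getting them wrong means either starving the job or starving the rest of the cluster; hence the comparison to manual memory management in C.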

Apache Airflow

Running hundreds of Spark jobs, understanding their dependencies, monitoring performance, and catching exceptions is a logistical challenge for any enterprise. We have standardized on Apache Airflow, a job scheduler that originated at Airbnb. We evaluated several job schedulers, including Azkaban, which we had worked with previously, and found that Airflow had the most potential to meet our diverse set of needs. As developers, modeling DAGs as Python code rather than a markup language was appealing. Scaling Airflow was challenging, as we could find little documentation in user forums or elsewhere online. Since we embraced Airflow, it has become part of the Apache ecosystem, which we believe will benefit the project tremendously. There are many undocumented "tricks" to executing tasks and configuring Airflow that we discovered only through trial and error, and we hope to have a chance to contribute these back to the Airflow community. We have since scaled out our Airflow implementation to use Python's Celery library for distributed task execution, with a backend Amazon RDS instance as the backing data store.
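To make the DAG-as-code idea concrete without depending on Airflow itself, here is a minimal stdlib-only Python sketch: tasks are declared as objects, dependencies are wired up with the same `>>` style Airflow uses, and a topological sort recovers a valid execution order, which is the scheduling core of any such tool. The `Task` class and function names are ours for illustration, not Airflow's API:

```python
from collections import deque

class Task:
    """Minimal stand-in for an Airflow operator: a named node in a DAG."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Mimic Airflow's `a >> b` syntax: run `other` after `self`.
        self.downstream.append(other)
        return other  # returning `other` lets dependencies chain: a >> b >> c

def execution_order(tasks):
    """Kahn's algorithm: return task_ids in a valid dependency order,
    failing loudly if the graph contains a cycle (i.e., is not a DAG)."""
    indegree = {t: 0 for t in tasks}
    for t in tasks:
        for d in t.downstream:
            indegree[d] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t.task_id)
        for d in t.downstream:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: not a DAG")
    return order

# Declare a tiny pipeline in code, the way an Airflow DAG file would.
ingest = Task("ingest_mysql")
transform = Task("transform_hive")
load = Task("load_redshift")
ingest >> transform >> load

print(execution_order([ingest, transform, load]))
```

Because the pipeline is ordinary Python, it can be generated in loops, parameterized, and unit-tested like any other code, which is exactly the advantage over a static markup definition.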

B23 Data Platform — Scaling the Productization Process

All of the best practices we have developed and incorporated into our solutions are based on real-world implementation experience. As we codify those best practices into our automation software, we strengthen the "productization" process, making it more antifragile. While helping several of our customers graduate to the Elite Data tier, we decided a new, scalable software solution was needed to help customers productize solutions. We needed software to do what our engineering services were already doing — building scalable analytical processing systems with durable data pipelines — and to do it faster and more consistently than a human. From this idea, B23 Data Platform was born. B23 Data Platform is a software orchestration platform accessible to anyone on the web, with the twist that the data and resources are located in our customers' environments. B23 Data Platform communicates only via standard cloud APIs and standard automation techniques to securely orchestrate entire analytical environments. Customer data never leaves our customers' scope of control. Every aspect of these orchestrated solutions is accessible to customers, allowing for complete transparency with respect to security and data access, which is mandatory within the Data Economy.

B23 Data Platform is also a marketplace for launching scalable distributed processing solutions such as Apache Spark, Apache Zeppelin, Apache Hadoop, Elasticsearch, H2O, StreamSets, and a number of other tools — securely, in minutes, and with only a few mouse clicks.

Once these stacks are running, ingesting data takes just a few more mouse clicks using the B23 EasyIngest capability. EasyIngest automatically creates functional data pipelines for a variety of data sources and data types, and lets users pick to which running stacks the data should flow. It makes systems that were once complex to build, configure, and operate extremely convenient to use.

Once you are finished analyzing your data in your environment, simply stop the stack and stop paying for unused resources. When you are ready to start back up, it’s only a few mouse clicks and minutes to build an entirely new environment.

Our best practices were not developed in a vacuum or by watching others, but through a very diverse set of customers and use cases in which common challenges and solutions emerged. B23 Data Platform makes it even easier for our customers to compete more effectively using their data.

See for yourself at no additional cost.

Dave Hirko is a Managing Director and Co-Founder of B23. Prior to co-founding B23, Dave was an Account Executive at Amazon Web Services.

#MA #ArtificialIntelligence #DataScience #MachineLearning #ApacheSpark #ApacheAirflow

