The Modern Data Economy — Part III of III: Scaling in the Data Economy

May 12th, 2016

Iterative Spark Development

Previously, we discussed how Apache Spark and its SparkSQL module were key components for ingesting diverse data sources using the DataSources and DataFrame APIs for a customer project. As we began picking off individual data sources, the number of Spark jobs grew significantly. We experimented with a small number of larger jobs and, conversely, a large number of smaller jobs. All jobs were written in Scala and version controlled in an enterprise GitHub environment. We deployed jobs to the production environment using our favorite automation technology, forgoing Jenkins for the time being.

Understanding the differences between Spark's Resilient Distributed Dataset ("RDD") and DataFrames was fairly straightforward, since several of us had worked with Python's pandas library in the past and we all knew Spark's RDD concepts (a short contrast is sketched below). We have since become intimately knowledgeable with both the DataFrame and DataSources APIs as we read, joined, and wrote data across a variety of platforms, including MySQL, AWS Aurora, Kafka, Hive, HDFS, and Amazon Redshift.

We are particularly excited about our custom-developed SQL ingest program, written in Scala, that is capable of ingesting data from 40+ different databases, each with its own schema. It writes data directly to Hive several orders of magnitude faster than the prior Apache Sqoop implementation. The library "discovers" the schema in a relational datastore, creates an appropriate Hive table, and reads and writes the data in parallel using a selectable partition column. Tuning the parallelization across many Spark executors was critically important for success. The core code is approximately 25 lines of Scala and works with both the MySQL and MariaDB JDBC drivers. Our Scala code to write from Hive to Redshift is even smaller, and extremely performant when parallelized across many Spark executors. Sketches of both patterns follow at the end of this excerpt.

Scaling Out with YARN

We expect that during peak periods hundreds of parallelized Spark jobs will be running, managed by YARN. Running Spark on YARN...
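To make the RDD-versus-DataFrame contrast concrete, here is a minimal, illustrative sketch (not code from the project) of the same aggregation written against both APIs in Spark 1.x-era Scala; the `Sale` case class and its values are invented for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

case class Sale(category: String, amount: Double)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-df"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val sales = Seq(Sale("grocery", 12.5), Sale("fuel", 40.0), Sale("grocery", 7.25))

    // RDD: opaque lambdas that Spark executes as-is, partition by partition;
    // the engine cannot look inside them to optimize.
    val rddTotal = sc.parallelize(sales)
      .filter(_.category == "grocery")
      .map(_.amount)
      .sum()

    // DataFrame: declarative column expressions that SparkSQL's Catalyst
    // optimizer can analyze and rearrange, much like a pandas workflow.
    val dfTotal = sales.toDF()
      .filter($"category" === "grocery")
      .agg(sum($"amount"))
      .first()
      .getDouble(0)

    println(s"RDD total: $rddTotal, DataFrame total: $dfTotal")
    sc.stop()
  }
}
```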
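The parallel JDBC-to-Hive ingest pattern described above can be sketched in roughly the stated 25 lines. This is a hedged approximation, not the actual customer code: the connection URL, table names, bounds, and partition counts are placeholders you would supply per source database.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object JdbcToHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-to-hive"))
    val hiveContext = new HiveContext(sc)

    // "Discover" the source schema: Spark infers the DataFrame schema from
    // JDBC metadata, so no hand-written DDL is required.
    val df = hiveContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:mysql://db.example.com:3306/sales",   // placeholder endpoint
      "driver"          -> "org.mariadb.jdbc.Driver",                  // or com.mysql.jdbc.Driver
      "dbtable"         -> "orders",
      // A selectable numeric partition column lets Spark issue one bounded
      // query per partition, so many executors read the table in parallel.
      "partitionColumn" -> "order_id",
      "lowerBound"      -> "1",
      "upperBound"      -> "50000000",
      "numPartitions"   -> "64"
    )).load()

    // Create the Hive table from the inferred schema and write the
    // partitions in parallel, replacing the slower Sqoop import.
    df.write.mode("overwrite").saveAsTable("warehouse.orders")

    sc.stop()
  }
}
```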
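The Hive-to-Redshift write is similarly compact. A minimal sketch follows using the open-source spark-redshift data source; the post does not name the exact mechanism, so treat the library choice, cluster endpoint, S3 staging bucket, and table names as assumptions, with S3 credentials supplied through the Hadoop configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveToRedshift {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-to-redshift"))
    val hiveContext = new HiveContext(sc)

    // Each executor stages its partitions to S3 in parallel; Redshift then
    // loads the staged files in bulk, which is what makes the job fast when
    // spread across many Spark executors.
    hiveContext.table("warehouse.orders")   // hypothetical Hive table
      .write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/analytics?user=etl&password=changeme")
      .option("dbtable", "public.orders")
      .option("tempdir", "s3n://example-staging/redshift/")
      .mode("append")
      .save()

    sc.stop()
  }
}
```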

B23 Data Platform: Enabling Data

February 12th, 2016

I've spent my career working on data problems and building data products. During this time, I've spent far more time fumbling with infrastructure and moving data than performing value-add data science or data analytics. I learned to live with those frustrations, and my customers learned to live with the amount of time it took to ask a question of a new dataset. As a result of these experiences, I'm very excited to introduce B23 Data Platform to people like me and stakeholders like mine.

In our previous post (Welcome to B23 Data Platform, the Next Big Thing in Big Data) we introduced several of the main concepts behind why we built B23 Data Platform. In this blog, I'll show you how we use them.

As a data scientist, I should not have to worry about arcane items like private subnets, bastion hosts, and internet gateways. However, somebody in my enterprise does worry about those things. With B23 Data Platform, I have the confidence that I'm protecting my resources without getting in the way of completing productive work.

A Space embodies the set of secure infrastructure resources within your preferred cloud provider. These secure cloud resources will host your data pipeline(s). We use industry best practices to lock down these assets, and we create them in your cloud account. You have complete control and transparency over your resources and data. To create a Space you simply log in, choose your cloud provider, give the Space a name, and enter your credentials. A few minutes later, you have a running Space. It is that simple.

Steps to Create a Space

The big data and data science ecosystem has become very crowded with tools. Tools exist for a reason. Some are easy to differentiate, while the distinctions between others are much more nuanced; this presents both challenges and opportunities. The challenge stems from the paradox of choice. The opportunity, however, is to build a data application tailored exactly to your business needs. We enable data scientists by...