Spark Geospatial Analysis with B23 Data Platform

August 30th, 2016

As members of the B23 Data Platform development and data science team, we’ve been excited to keep launching innovative, secure features that let data scientists process data more quickly and effectively than was previously possible. We launched B23 Data Platform in early 2016 as a data science orchestration platform. B23 Data Platform is both a marketplace for data and a marketplace of big data Stacks that can be provisioned in minutes. Most automated provisioning tools are just a “blank canvas,” but B23 Data Platform gives you access to both data sets and secure Stacks in the cloud. Using B23’s EasyIngest capability, data scientists are only a few mouse clicks and several minutes away from analyzing data in a securely provisioned Stack.

Recently, I had the opportunity to work on a project that highlights the capabilities of B23 Data Platform: geospatial analysis using Apache Spark. The project used a large dataset of network-based location data in geographic coordinates. Specifically, the dataset contained over a billion rows of latitude and longitude values with timestamps spanning a multi-year period. The challenge was to determine how many of these “locations” map to certain points of interest (“POI”) each day, starting from this raw dataset.

I was able to complete this geospatial enrichment task in the following 5 steps:

1. Acquire POI data
2. Determine how to join the data sets
3. Transform the data
4. Geo-fence the POI data
5. Run spatial join

Acquire POI Data

My first step was to pull a second dataset containing geospatial data for a particular POI. We used POI data that contains the locations of 1000+ restaurant chains in North America. In one case, I downloaded the data for the restaurant Chipotle, which had 2,076 locations (as of 6/13/16).

Determine How to Join Datasets

Now that I had acquired my datasets, I needed a plan to join them. The first dataset contains about 6.5 million records per day.
After doing some quick math, I realized that joining each of these...
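To make the geo-fence and spatial join steps concrete, here is a minimal sketch in plain Python (not Spark, and not necessarily the method used in this project). It assumes a circular geo-fence of 100 m around each POI and a grid of 0.01-degree cells; the radius, the cell size, the function names, and the sample coordinates are all illustrative assumptions. The only figures taken from the post are the 6.5 million daily records and the 2,076 POI locations, used to show how many distance checks a naive all-pairs join would need per day.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres
FENCE_RADIUS_M = 100.0        # assumed geo-fence radius (not from the post)
CELL_DEG = 0.01               # assumed grid cell size in degrees (not from the post)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cell(lat, lon):
    """Map a coordinate to the integer grid cell that contains it."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_poi_index(pois):
    """Bucket POIs by grid cell so each record only checks nearby POIs."""
    index = defaultdict(list)
    for poi_id, lat, lon in pois:
        index[cell(lat, lon)].append((poi_id, lat, lon))
    return index


def spatial_join(records, poi_index):
    """Return (record_id, poi_id) pairs where the record falls inside a POI's fence.

    Checking the 3x3 block of neighbouring cells is sufficient as long as the
    fence radius is smaller than one cell, which holds for these assumed values
    at North American latitudes.
    """
    matches = []
    for rec_id, lat, lon in records:
        ci, cj = cell(lat, lon)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for poi_id, plat, plon in poi_index.get((ci + di, cj + dj), ()):
                    if haversine_m(lat, lon, plat, plon) <= FENCE_RADIUS_M:
                        matches.append((rec_id, poi_id))
    return matches


# Distance checks per day for a naive all-pairs join, using the post's figures.
naive_pairs_per_day = 6_500_000 * 2_076

# Made-up sample data for illustration only.
pois = [("poi_a", 38.8977, -77.0365), ("poi_b", 40.7484, -73.9857)]
records = [
    ("r1", 38.8978, -77.0364),  # a few metres from poi_a
    ("r2", 38.9500, -77.0365),  # several kilometres from poi_a
    ("r3", 40.7480, -73.9857),  # tens of metres from poi_b
]
matches = spatial_join(records, build_poi_index(pois))
print(naive_pairs_per_day, matches)
```

The grid index is one common way to avoid comparing every record against every POI. The same bucketing idea could carry over to a Spark job, for example by keying both data sets on the cell, though that is an assumption about how one might structure it rather than a description of the project.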

B23 COVER STORY – ANALYTICS MAGAZINE: Predictive Analytics in the Cloud: It’s all about the Data

During the 1992 Presidential election, the Clinton team coined the phrase “the economy, stupid,” as an easy way to remember one of the most important themes of the campaign. For the cloud, and especially predictive analytics in the cloud, it’s not the economy but the data that makes all the difference.

These days, as the cloud makes storing enterprise data easier and more affordable for companies of any size, every business is a data business, whether it knows it or not. And that will be truer still as the Internet of Things begins to collect and contribute data to enterprise systems from nearly every household and business device or appliance. You have to assume that the volume of enterprise data will increase (possibly exponentially) every year. Most organizations are already overwhelmed with data and can’t process it fast enough.

Enter the cloud. There’s a natural synergy between the cloud and analytics. The cloud allows you to scale out horizontally easily and quickly, which in turn enables you to look across silos of data to identify developing trends. Most companies that are struggling with a move to the cloud are concerned in particular with how to migrate data to this new computing environment, and that’s where they’re going wrong. New technology makes it much more practical to build data repositories from scratch in the cloud rather than migrate existing data to it. After that, complex data analysis can be underway in minutes rather than months (if at all).

Let’s look back at the cloud, and at ways to make the best use of the technology when putting predictive analytics to work on an enterprise scale.

“Power Company” of the Internet Age

In the past, on-premises data collection and management was limited because IT resources were finite and expensive. That has changed with the cloud. Think of it as analogous to the power company at the turn of the 20th century.
In the late 1880s when electricity was just coming on the scene in industry, every business built its own generating capacity at great expense to the...