Introducing B23 Dataflow Monitoring

Monitoring data flows is a key differentiator for organizations that operate production Artificial Intelligence (“AI”) and Machine Learning (“ML”) workflows and need accurate, reliable, and responsive outcomes. B23 is a pioneer in AIOps and in monitoring production dataflows at scale. This blog highlights three points:

- Background of dataflow monitoring
- Why dataflow monitoring is so important for AIOps
- Detail on B23 Dataflow Monitoring for AIOps

Background of Dataflow Monitoring

On behalf of our customers, B23 manages and operates production, enterprise-scale data lakes as centralized stores for diverse types of data. The enterprise data lake serves diverse business purposes, often supporting workloads for AI/ML and business intelligence (“BI”).

Data lakes are not static. They are often connected by a complex series of rivers and streams, some flowing into the lake from external sources and some leaving the lake to downstream consumers. We call these rivers and streams “dataflows.” There are also dataflows within the data lake, where datasets are transformed or fused to suit customer needs and then written back to the same data lake environment. These internal flows include extract-transform-load (“ETL”) pipelines and AI/ML training or inference pipelines.

A common data lake contains data with very different characteristics:

- Structured vs. unstructured
- Different serialization formats (comma-separated values, Parquet, JSON)
- Different naming conventions
- Different delivery cadences (continuous, bursty, periodic, ad hoc)

As data flows in and out of the data lake, it often crosses inter-organizational and intra-organizational boundaries. A variety of legacy monitoring and alerting tools exists for infrastructure, networking, and applications. These tools monitor traditional “operations” data such as system availability, processing latency, network throughput, CPU utilization, and disk utilization. The commercial cloud providers offer a...
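One concrete signal behind the characteristics listed above is delivery cadence. As a minimal sketch of what a per-dataflow cadence check could look like in Python (the function name, thresholds, and grace period below are illustrative assumptions, not B23's product API):

```python
from datetime import datetime, timedelta, timezone

def cadence_alert(last_delivery: datetime, expected_every: timedelta,
                  grace: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if a dataflow has missed its expected delivery cadence."""
    # How long it has been since the feed last delivered data.
    overdue = datetime.now(timezone.utc) - last_delivery
    # Alert only once the expected interval plus a grace period has elapsed.
    return overdue > expected_every + grace

# Example: an hourly feed last seen two hours ago should trigger an alert.
last_seen = datetime.now(timezone.utc) - timedelta(hours=2)
print(cadence_alert(last_seen, expected_every=timedelta(hours=1)))  # True
```

In practice a monitor like this would run per dataflow, with cadences learned from history rather than hard-coded.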

B23 Announces B23 Data Platform integration with the Google Kubernetes Engine (“GKE”)

Today we are excited to announce our B23 Data Platform integration with the Google Kubernetes Engine (“GKE”).

Kubernetes is an exciting technology that helps customers orchestrate complex containerized workloads using templatized configurations. In many ways, Kubernetes is an extension of the same automation and orchestration concepts we started developing with cloud-based virtual machines five years ago when we introduced the B23 Data Platform. That’s why it made perfect sense to enhance our existing data platform offerings with Kubernetes services from multiple cloud vendors, extending our data engineering and applied machine learning workloads to even more environments.

B23 has been “productionizing” the difficult and non-differentiated data engineering activities for Fortune 20 companies, financial services firms, large cybersecurity companies, leading telecommunications providers, and many other firms for several years. This integration will make it easier to serve more customers who prefer Google Cloud Platform (“GCP”) as they ask B23 to build, manage, and operate complex data pipelines.

This video shows a brief overview of how we have simplified the process of extending customer machine learning workloads onto GKE:

https://www.youtube.com/watch?v=cwiW0JEe8Lc&t

B23 provides managed data engineering and applied machine learning services for its customers so they can focus on extracting the business value of data, not on commoditized engineering. Building and operating durable data analysis infrastructure, and running algorithms at scale on a 24/7 basis, are challenges that most modern organizations face today. By partnering with B23, our customers’ business analysts, data scientists, and machine learning engineers are free to focus on their core competency: performing the data analysis that will be most impactful to their business.

The B23 Data Platform supports a variety of data-processing and analysis-centric software. We support both open source software, as well...
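To make the idea of orchestrating containerized workloads from templatized configurations concrete, here is a minimal sketch using the official Kubernetes Python client against an existing GKE context in kubeconfig; the job name, image, and command are hypothetical and are not part of the B23 Data Platform:

```python
from kubernetes import client, config

# Pick up credentials for the currently active cluster (e.g., a GKE context).
config.load_kube_config()

# A templatized batch workload: the same structure can be reused with
# different images, commands, and resource settings.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="example-etl-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="etl",
                    image="gcr.io/example-project/etl:latest",  # hypothetical
                    command=["python", "run_pipeline.py"],       # hypothetical
                )],
            )
        )
    ),
)

# Submit the job to the cluster's default namespace.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```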

Four Reasons Why Data Engineering is a Zero-Sum Game for Most Organizations

September 17th, 2018

Data engineering is hard, and getting it exactly right produces a single outcome: machine learning engineers and quants can now do their jobs effectively. The analysis of data, and the subsequent execution of those insights, is the competitive differentiator and core competency of a business, its heart and soul. Data engineering is the commoditized heavy lifting every organization needs to perform to get the analysis correct. This is why we see data engineering as a zero-sum game. Getting data engineering right means organizations are just breaking even; it simply allows other employees to do their jobs properly. Getting it wrong means everything and everyone dependent on data engineering cannot operate effectively. Outsourcing the commoditized heavy-lift data engineering is the least risky and most cost-efficient path to the economic and market-leading competitive advantages organizations need to compete.

Prioritize Algorithm Development Over Data Engineering

Modern organizations should prioritize and invest in the algorithm development, quantitative research, and machine learning aspects of data science. These activities can make or break firms that use data for a competitive advantage. Applying machine learning in a meaningful way, using data formatted specifically for those algorithms, is not a trivial task. To be successful, organizations should recognize the undifferentiated and differentiated activities associated with extracting insight from data, and decouple the activities required to get data into a specific format (or schema) to support those algorithms from the development and tuning of the algorithms themselves.

Race Car Drivers and Data Mechanics

An interesting social phenomenon we’ve observed over the past several years is that we have yet to meet a data engineer who wasn’t secretly plotting a career change to become a machine learning engineer and/or quant, with a more data science-centric job title to boot. If machine learning engineers and quants...

Announcing Jupyter Notebook on the B23 Data Platform

March 7th, 2018

B23 is happy to announce we’ve added Jupyter Notebook as the latest stack on our platform. Jupyter has quickly become a favorite with Data Scientists because of its notebook format and its support for many programming languages, including R, Scala, and Python. The B23 Data Platform gives Data Scientists access to their preferred data processing tools in a secure, automated environment targeted specifically to their business needs.

According to a recent Harvard Business Review survey, 80% of organizations believe that the inability of teams to work together on common data slows the organization’s ability to quickly reach business objectives. The B23 Data Platform can help these organizations boost their data science team’s productivity with notebook collaboration and sharing tools like Jupyter. Thanks to easier data access and computing power paired with rich web user interfaces, open source capabilities, and scalable cloud data-processing solutions, Jupyter Notebook adds another favored power tool to the B23 Data Platform.

With just a few button clicks, the Jupyter stack launches. Open the Jupyter Notebook URL and you are ready to start coding!

The B23 Data Platform is an open, secure, and fast marketplace for big data, data science, artificial intelligence, and machine learning tools. In minutes, Data Scientists can securely analyze their data in the cloud, with the freedom to use familiar tools like Apache Spark, Apache Zeppelin, R Studio, H2O, and Jupyter Notebook. Discover a better way to analyze your data with B23. ...
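Once the notebook URL opens, a first cell might look like the following minimal sketch; the CSV file name is hypothetical, and pandas is assumed to be available in the stack:

```python
import pandas as pd

# Load a dataset accessible to the notebook (file name is hypothetical).
df = pd.read_csv("sample_data.csv")

# In a notebook cell, the last expression renders as a table inline.
df.describe()
```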

Spark Geospatial Analysis with B23 Data Platform

August 30th, 2016

As a member of the B23 Data Platform development and data science team, I’ve been excited to continue launching innovative and secure features that allow Data Scientists to process data more effectively and quickly than previously possible. We launched B23 Data Platform in early 2016 as a data science orchestration platform. B23 Data Platform is both a marketplace for data and a marketplace of big data Stacks that can be provisioned in minutes. Most automated provisioning tools are just a “blank canvas,” but with B23 Data Platform you have access to both data sets and secure Stacks in the cloud. Using B23’s EasyIngest capability, data scientists are only a few mouse clicks and several minutes away from analyzing data in a securely provisioned Stack.

Recently, I had the opportunity to work on a project that highlights the capabilities of B23 Data Platform: geospatial analysis using Apache Spark. It used a large dataset containing network-based location data in geographic coordinates. Specifically, this dataset contained over a billion rows of latitude and longitude measurements with timestamps over a multi-year period. The challenge was to figure out how many of these “locations” map to certain points of interest (“POI”) each day, starting from this initial, raw dataset. I was able to complete this geospatial enrichment task in the following five steps (a sketch of the geofence-and-join approach appears below):

1. Acquire POI data
2. Determine how to join the data sets
3. Transform the data
4. Geo-fence the POI data
5. Run the spatial join

Acquire POI Data

My first step was to pull a second dataset containing geospatial data for a particular POI. We used POI data that contains the locations of 1,000+ restaurant chains in North America. In one case, I downloaded the data for the restaurant Chipotle, which had 2,076 locations (as of 6/13/16).

Determine How to Join the Datasets

Now that I had acquired my datasets, I needed a plan to join them. The first dataset contains about 6.5 million records per day. After doing some quick math, I realized that joining each of these...
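The geofence-and-join approach described above can be sketched in PySpark. This is a minimal illustration under stated assumptions, not the exact pipeline from the post: the column names, input paths, and square geofence size are hypothetical, and the join relies on broadcasting the small POI table so each location row is checked only against bounding boxes:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("poi-geofence-join").getOrCreate()

# Hypothetical inputs: billions of (lat, lon, ts) rows and ~2,000 POI rows.
locations = spark.read.parquet("locations.parquet")
pois = spark.read.csv("chipotle_pois.csv", header=True, inferSchema=True)

# Square geofence around each POI; 0.01 degrees of latitude is roughly 1 km.
# This is a coarse bounding box, not a true great-circle radius.
fence = 0.01
fenced = F.broadcast(
    pois.select(
        "poi_id",
        (F.col("poi_lat") - fence).alias("lat_min"),
        (F.col("poi_lat") + fence).alias("lat_max"),
        (F.col("poi_lon") - fence).alias("lon_min"),
        (F.col("poi_lon") + fence).alias("lon_max"),
    )
)

# Spatial join via range conditions: each location row matches only the
# POIs whose bounding box contains it.
hits = locations.join(
    fenced,
    locations.lat.between(fenced.lat_min, fenced.lat_max)
    & locations.lon.between(fenced.lon_min, fenced.lon_max),
)

# Daily visit counts per POI.
daily = hits.groupBy(F.to_date("ts").alias("day"), "poi_id").count()
daily.show()
```

Broadcasting the roughly 2,000 fenced POIs keeps the range-condition join from degenerating into a full cross join of both datasets, which is exactly the problem the quick math in the post exposes.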