Introducing B23 Dataflow Monitoring
Monitoring dataflows is a key differentiator for organizations that operate production Artificial Intelligence (“AI”) and Machine Learning (“ML”) workflows in order to derive accurate, reliable, and responsive outcomes. B23 is a pioneer in the area of AIOps and monitoring production dataflows at scale. This blog covers three topics:
Background of Dataflow Monitoring
Why Dataflow Monitoring is so important for AIOps
Details of B23 Dataflow Monitoring for AIOps
Background of Dataflow Monitoring
On behalf of our customers, B23 manages and operates production, enterprise-scale data lakes as centralized stores for diverse types of data. The enterprise data lake serves diverse business purposes, often supporting workloads for AI/ML and business intelligence (“BI”).
Data lakes are not static. They are often connected by a complex series of rivers and streams, some flowing into the lake from external sources and some leaving the lake to downstream consumers. We call these rivers and streams “dataflows.” There are also dataflows within the data lake, where datasets are transformed or fused to suit customer needs, and then written back to the same data lake environment. These internal flows include extract-transform-load (“ETL”) pipelines and AI/ML training or inference pipelines.
A common data lake contains data that can have very different characteristics:
Structured vs unstructured
Different serialization formats (comma-separated values, parquet, JSON)
Different naming conventions
Different delivery cadences (continuous, burst-y, periodic, ad hoc)
As data flows in and out of the data lake, it often crosses inter-organizational and intra-organizational boundaries.
A variety of legacy monitoring and alerting tools exists for infrastructure, networking, and applications. These tools monitor traditional “operations” data such as system availability, processing latency, network throughput, CPU utilization, disk utilization, etc. The commercial cloud providers offer a wealth of operational metrics for all of their infrastructure, platform, and application services. It is critical to monitor these existing metrics. However, the data itself (rather than the systems manipulating the data) has become an important enterprise asset. As systems and people become more reliant on data, particularly those applications utilizing AI/ML algorithms, it is very important that data pipelines be robust, transparent, auditable, governed, and observable.
B23 thinks differently about how to monitor and manage dataflows. This blog will introduce our philosophy and tooling for monitoring data flows.
Why is Dataflow Monitoring Important?
At a very basic level, well-operated data lakes should have service level agreements (“SLAs”) in place for data input sources and for downstream consumers. The goal of these SLAs is to give AI/ML-centric consumers of these dataflows a predictable set of operating conditions for the delivery of their data. SLAs are almost always a type of contractual obligation, and monitoring is critical for data engineers and data infrastructure teams to recognize potential disruptions so that they can maintain their SLAs.
Aside from the contractual obligations, production data lakes should have some form of anomaly detection when mission-critical systems consume from the data lake. Artificial and real anomalies may occur frequently in a data pipeline. A real data anomaly may occur with a surge or gap in data caused by real-world events, such as an extreme weather event causing a surge in data. Artificial data anomalies may occur when an upstream or dependent system suffers an outage. In either case, it is critical that downstream data consumers, systems, and algorithms maintain an awareness when volume or composition of data changes suddenly. This data may be unfit for use in AI/ML model training and/or model inference.
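To make the idea of a sudden volume change concrete, here is a minimal Python sketch of one common approach: comparing today's record count against recent history with a z-score. This is an illustration only, not a description of how B23 Dataflow Manager works. The function name `is_volume_anomaly`, the threshold of 3.0 standard deviations, and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, today, threshold=3.0):
    """Return True when today's count is far from the historical mean.

    history: recent daily record counts, assumed to be free of anomalies.
    threshold: number of standard deviations treated as anomalous (assumed).
    """
    if len(history) < 2:
        return False  # not enough history to estimate variability
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # perfectly stable flow: any change is notable
    return abs(today - mu) / sigma > threshold


# Hypothetical daily record counts for one dataflow.
recent_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

surge = is_volume_anomaly(recent_counts, 5000)   # e.g. a real-world surge
gap = is_volume_anomaly(recent_counts, 0)        # e.g. an upstream outage
normal = is_volume_anomaly(recent_counts, 1003)  # an ordinary day
```

Note that a check like this flags both real and artificial anomalies in the same way. Deciding which one occurred, and whether the data is still fit for training or inference, remains a separate step.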
In addition to discrete anomalies, sometimes a characteristic of a dataset may change slowly over time, resulting in “data drift.” For instance, this may occur when a mobile app starts gaining or losing traffic with a new demographic, such as an increase in new active users from a new country due to a promotion. This may result in “volume drift” where the total amount of data slowly grows or shrinks. This may also result in “content drift” where the attributes or characteristics about that data change. The data used to train an AI/ML model may no longer be valid given these new characteristics. Additionally, a newly deployed AI/ML model may have introduced a new bias that is only apparent once deployed to production.
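The two kinds of drift can be sketched in a few lines of Python. The example below measures “volume drift” as the relative change in mean daily volume between two windows, and “content drift” as the total variation distance between the category shares of a baseline period and a recent period. Both measures are stand-ins chosen for illustration; the function names and sample data are hypothetical, and other drift statistics may suit a given dataset better.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total (values must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(baseline, recent):
    """Half the sum of absolute share differences: 0 = identical, 1 = disjoint."""
    p = category_shares(baseline)
    q = category_shares(recent)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def relative_volume_change(baseline_counts, recent_counts):
    """Relative change in mean daily volume between two windows."""
    base = sum(baseline_counts) / len(baseline_counts)
    recent = sum(recent_counts) / len(recent_counts)
    return (recent - base) / base


# Hypothetical user-country values before and after a promotion.
baseline_countries = ["US"] * 90 + ["CA"] * 10
recent_countries = ["US"] * 70 + ["CA"] * 10 + ["BR"] * 20

content_drift = total_variation_distance(baseline_countries, recent_countries)
volume_drift = relative_volume_change([1000, 1000, 1000], [1100, 1200, 1300])
```

Tracking these values over time, rather than alerting on a single day, is what distinguishes slow drift from a discrete anomaly.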
The graphic below shows different types of behavior that may occur in a given dataflow. Some of these may represent valid or real behavior and some may be anomalies requiring attention.
Figure 1: Different Dataflow Behavior
Finally, different AI/ML model versions, derived from either different training data or different hyperparameters, can lead to different inference behavior once deployed. The difference between model performance and outcomes in these scenarios may be subtle. Monitoring is very important for AI/ML lifecycle management so that data scientists have a transparent view of how data attributes may have changed after a new model is deployed.
B23 Dataflow Manager Overview
At B23, we provide our internal staff and our customers with a common view of the rate and nature of the flow for a given dataset. The flow of data has important attributes that are not readily observed by common operational monitoring solutions. For example, we can perform macro-level dataflow monitoring where we expect the total number of files or bytes to be delivered within a predictable range, without interruption, every day with a specific file or directory naming convention. We may also expect a percentage of a field to be non-null or to fall within some expected statistical distribution. We define and observe these metrics for all dataflows, and we are able to make assertions that pass or fail based on expected behavior. When assertions fail, we find out about it right away. Think unit tests, but on the data itself.
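The “unit tests on the data” idea can be illustrated with a short Python sketch. It computes the fraction of records in which a field is non-null and checks that fraction against an expected range. The helper names, field names, and thresholds here are hypothetical and are not taken from B23 Dataflow Manager.

```python
def non_null_fraction(records, field):
    """Fraction of records where `field` is present and not None."""
    if not records:
        return 0.0
    present = sum(1 for record in records if record.get(field) is not None)
    return present / len(records)


def check_in_range(name, value, lower, upper):
    """Return a (name, passed, value) result rather than raising."""
    return (name, lower <= value <= upper, value)


# Hypothetical records sampled from one delivery of a dataset.
records = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": None},
    {"user_id": 3, "country": "CA"},
    {"user_id": 4},
]

results = [
    check_in_range("user_id non-null",
                   non_null_fraction(records, "user_id"), 0.99, 1.0),
    check_in_range("country non-null",
                   non_null_fraction(records, "country"), 0.90, 1.0),
]

# Names of the assertions that did not hold for this delivery.
failed = [name for name, passed, _ in results if not passed]
```

Returning results instead of raising lets every assertion run on every delivery, so one failure does not hide another.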
Over many years of building and operating production data infrastructure, we have built these features into a cohesive solution: B23 Dataflow Manager.
B23 Dataflow Manager helps data engineers, data scientists, application developers, and infrastructure engineers monitor and understand all of their data pipelines in a single place. Through simple and customizable YAML configurations, B23 Dataflow Manager users can define data flows, set up configurable assertions for expected flow behavior, and configure alert behavior for when an assertion fails.
Assertions include upper and lower boundaries that are expected for a given metric. For example, we might “assert” that an S3 bucket should have a new daily “partition” written using a common directory naming convention (e.g., “data/raw/day=20200105/”) that contains more than 50 files and fewer than 60 files. Over time and with enough history, B23 Dataflow Manager can “learn” the behavior for a given flow and recommend or automatically define assertion boundaries. Some flows may be very consistent whereas others may have large variability in the number of files or bytes. Some flows are expected to grow slowly, and B23 Dataflow Manager can learn those expected growth patterns.
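The file-count example above can be sketched in Python as follows. The code works on a plain list of object keys, which in practice would come from listing the bucket; the key names are made up. The `suggest_bounds` function uses mean plus or minus k standard deviations as a simple stand-in for “learning” boundaries from history. How B23 Dataflow Manager actually learns boundaries is not described here, so treat that part as an assumption.

```python
from statistics import mean, stdev


def partition_prefix(day):
    """Build the daily partition prefix, e.g. data/raw/day=20200105/."""
    return f"data/raw/day={day}/"


def count_files(keys, prefix):
    """Count object keys that fall under the given prefix."""
    return sum(1 for key in keys if key.startswith(prefix))


def file_count_assertion(keys, day, lower=50, upper=60):
    """Pass when the partition holds more than `lower` and fewer than `upper` files."""
    return lower < count_files(keys, partition_prefix(day)) < upper


def suggest_bounds(history, k=3.0):
    """Suggest (lower, upper) as mean -/+ k standard deviations (assumed method)."""
    mu = mean(history)
    sigma = stdev(history)
    return (mu - k * sigma, mu + k * sigma)


# Hypothetical object keys: a complete partition and a short one.
keys = (
    [f"data/raw/day=20200105/part-{i:04d}.parquet" for i in range(55)]
    + [f"data/raw/day=20200106/part-{i:04d}.parquet" for i in range(12)]
)

complete_day_ok = file_count_assertion(keys, "20200105")
short_day_ok = file_count_assertion(keys, "20200106")
lower_bound, upper_bound = suggest_bounds([54, 55, 56, 55, 55])
```

A very consistent flow yields tight suggested bounds, while a highly variable flow yields wide ones, which matches the behavior described above.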
When an assertion fails, B23 Dataflow Manager will send an alert. The type of alert is configurable, and it could be an email, a message on a queue, or a simple log message. Users can even implement and configure custom alert endpoints so that they can be plugged into other commercial or proprietary issue management systems.
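One way to picture configurable and custom alert endpoints is a small dispatcher that routes an alert to named handlers. The Python sketch below is a hypothetical design, not the product's interface: `AlertDispatcher`, the handler names, and the alert fields are all invented for illustration. An in-memory `queue.Queue` stands in for a message queue, and a plain list stands in for an external issue management system.

```python
import logging
import queue
from typing import Callable, Dict, List

Alert = Dict[str, str]
AlertHandler = Callable[[Alert], None]


class AlertDispatcher:
    """Routes a failed-assertion alert to named, pluggable handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, AlertHandler] = {}

    def register(self, name: str, handler: AlertHandler) -> None:
        """Add or replace the handler known by `name`."""
        self._handlers[name] = handler

    def dispatch(self, alert: Alert, targets: List[str]) -> None:
        """Send the alert to each configured target, in order."""
        for name in targets:
            self._handlers[name](alert)


alert_queue = queue.Queue()   # stand-in for a real message queue
tickets: List[Alert] = []     # stand-in for an issue management system

dispatcher = AlertDispatcher()
dispatcher.register("log", lambda a: logging.warning("assertion failed: %s", a))
dispatcher.register("queue", alert_queue.put)
dispatcher.register("tickets", tickets.append)  # a "custom" endpoint

dispatcher.dispatch(
    {"flow": "raw-events", "assertion": "daily_file_count"},
    ["log", "queue", "tickets"],
)
```

Because a handler is just a callable, plugging in another system means registering one more function under a new name.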
Organizations can have very different data infrastructure, and B23 Dataflow Manager is highly extensible to meet different types of flow technologies and patterns. There are several common data persistence stores, and each has different characteristics for how data might flow from one source to another. Often an organization will utilize multiple or all of these data stores, such as:
Object Storage (AWS S3, Google Cloud Storage, Azure Storage)
Streaming Event Log (Apache Kafka)
Scalable Data Warehouse (AWS Redshift, Google BigQuery, Snowflake)
Relational Database (MySQL, Postgres)
For all of these data persistence technologies, we can consistently monitor the characteristics of the flow itself (events, objects, bytes, rows) and the content of objects and events, looking at attributes such as schema changes, null-value counts, filtering percentages, etc.
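As a rough illustration of content-level checks, the Python sketch below infers a simple schema from a batch of records, reports fields that were added, removed, or changed type between two batches, and counts null values per field. The function names and sample records are hypothetical. Real stores expose richer type information than Python type names, so this is only a simplified model of the idea.

```python
from collections import Counter


def infer_schema(records):
    """Map each field name to the set of non-null type names observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            types = schema.setdefault(field, set())
            if value is not None:
                types.add(type(value).__name__)
    return schema


def schema_diff(old, new):
    """Report fields added, removed, or observed with different types."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(f for f in shared if old[f] != new[f]),
    }


def null_counts(records):
    """Count None values per field (fields with no nulls are omitted)."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


# Hypothetical batches from the same flow on consecutive days.
yesterday = [{"id": 1, "amount": 9.5, "country": "US"}]
today = [
    {"id": "2", "amount": None, "region": "EU"},
    {"id": "3", "amount": 4.0, "region": "NA"},
]

changes = schema_diff(infer_schema(yesterday), infer_schema(today))
nulls = null_counts(today)
```

In this sample, `country` was dropped, `region` appeared, and `id` changed from an integer to a string. Each of these could break a downstream consumer without affecting any infrastructure metric.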
Deployed as a stand-alone application, B23 Dataflow Manager runs in any cloud or on-premises environment. Because it runs as a stand-alone application, it operates “out-of-band,” meaning the technology or platform moving the data is not also performing the monitoring.
The need to apply operational discipline and rigor to dataflows has never been more important to succeeding in AI/ML, and B23 is leading the way with its new approach. B23 Dataflow Manager provides an all-in-one AIOps solution for AI/ML workflows, allowing your business to make the hype of AI/ML a reality. If you would like to learn more or see a demo, please contact us at firstname.lastname@example.org.