Not All Kubernetes Services Are Equal. We Should Know.

Kubernetes promises the long-sought capability to fully abstract underlying public cloud, private cloud, and edge infrastructure from the software applications that perform specific functions, or workloads. For B23, the value of Kubernetes is that all of the innovative and ground-breaking data engineering and applied machine learning workloads we have developed and operated over years of experience can be seamlessly deployed in almost any environment that runs Kubernetes.

B23 supports and operates a variety of Kubernetes solutions, including “pure” Kubernetes that we deploy to any arbitrary set of supported server hosts. We also support the public cloud managed Kubernetes services from Google, Amazon, Microsoft, and DigitalOcean. We support integration with an already-running Kubernetes system to address private cloud Kubernetes solutions. Most recently, we support Rancher’s K3s for edge computing solutions (more on that exciting news in a later blog).

We’ve done Kubernetes the “hard way” from scratch, and we’ve done it the “easy way” using cloud managed Kubernetes, or at least we thought managed Kubernetes would be easy. In some cases, the “easy way” was just not so easy. That’s why the “conceptual value” of Kubernetes can differ from its “actual value”: it depends heavily on your cloud service provider. Here are some of the high-level differences we have found in our pursuit of our ultimate goal: infrastructure-agnostic workloads using Kubernetes.
They fall into the following categories:

- Default security features and versions vary by Kubernetes service provider
- Nonexistent or limited built-in support for Kubernetes auto-scaling capabilities across service providers
- Some service providers require proprietary or provider-specific functionality, leading to vendor lock-in
- The workflow and lifecycle management of Kubernetes and hosted workloads vary in capability and complexity
- The SDK ecosystem for programmatically operating managed Kubernetes solutions varies greatly in maturity
- The...

Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

April 25th, 2017

The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing), and it’s easy to spawn workers to consume data from each shard (the Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream.

Spark Streaming implements a receiver using the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB) which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics.

When we tried to implement the recommended solution using Spark checkpointing, it was very difficult to develop any code without breaking our checkpoints. When Spark saves checkpoints, it serializes the classes which define the transformations and then uses them to restart a stopped stream. If you then change the structure of one of the transformation classes, checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application’s checkpoints; however, in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, in combination with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing to pursue a simpler, albeit somewhat hacky, alternative.

Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time.
(If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number) When starting up a Spark...
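The shard/sequence-number mechanics above can be sketched in a few lines of Python using boto3 (our Spark implementation was not written this way; this is only an illustration of the idea). The function names `filter_new_records` and `resume_shard` are ours, not part of any library: checkpoint the last processed sequence number per shard, resume with an `AFTER_SEQUENCE_NUMBER` iterator, and drop any re-delivered records at or below the checkpoint.

```python
def filter_new_records(records, last_seq):
    """Drop records at or before the checkpointed sequence number.

    Kinesis sequence numbers are decimal strings that increase over time
    within a shard, so they can be compared numerically.
    """
    if last_seq is None:
        return list(records)
    return [r for r in records if int(r["SequenceNumber"]) > int(last_seq)]


def resume_shard(stream, shard_id, last_seq, region="us-east-1"):
    """Open a shard iterator positioned just after the last processed record."""
    import boto3  # deferred import: only needed when actually calling AWS

    kinesis = boto3.client("kinesis", region_name=region)
    if last_seq is None:
        # No checkpoint yet: start from the oldest available record.
        resp = kinesis.get_shard_iterator(
            StreamName=stream,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )
    else:
        resp = kinesis.get_shard_iterator(
            StreamName=stream,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=last_seq,
        )
    return resp["ShardIterator"]
```

Because sequence numbers are strictly increasing per shard, the filter makes redelivery idempotent; the aggregation caveat above (duplicate sequence numbers within an aggregated record) still has to be handled separately.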

Introducing B23r — An R package for Importing and Exporting Entire R Environments backed by Amazon’s Simple Storage Service (S3)

January 18th, 2017

The B23 Data Platform is free to use when launching R services on Amazon Web Services (“AWS”). R is an analytics technology that has been slow to adapt to the Cloud. Other analytical solutions such as Spark and Hadoop have evolved over the past several years to include specific capabilities that facilitate their use in the Cloud. B23 has been enabling analytical and distributed processing applications in the Cloud for many years, and we are happy to announce a new, efficient, and effective capability to persist R environments in the Amazon Cloud. It helps organizations keep cloud computing costs low while making it easier for R users to collaborate.

Adapting R for the Ephemeral Nature of Cloud Computing

As data gets bigger and computation more complex, freeing R from the constraints and limitations of running locally on a laptop is often a critical concern. The Cloud is the optimal computing platform for R for a variety of reasons, and our B23 Data Platform was designed to run R optimally in the Amazon Cloud. The high-level benefits of running R in the Cloud include processing data stored locally in the Cloud, using varied computational resources that best suit analysis requirements (as opposed to a laptop with fixed computational resources), terminating compute and storage resources (and therefore no longer incurring costs) when they are no longer needed, and enabling a security framework to securely ingest data directly into an R environment. Unfortunately, once you start working in the Cloud, you’ll face challenges with R that aren’t well defined. In four (4) easy steps below, we describe how to leverage the B23r package to help bridge these gaps.

Challenge

One of the biggest and most basic issues we’ve heard from users is saving and restoring your work when taking advantage of the ephemeral nature of the Cloud.
In 2016, Apache Zeppelin released a capability that allowed data science notebooks to persist...
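B23r itself is an R package, but the underlying workflow it addresses — serialize the working environment, park it in object storage, and pull it back onto a fresh instance — can be sketched in Python as an analogy. These function names are ours for illustration, not B23r’s API; the S3 calls assume boto3 and valid AWS credentials.

```python
import pickle


def pack_environment(env):
    """Serialize a dict of name -> object into bytes, like saving a workspace."""
    return pickle.dumps(env, protocol=pickle.HIGHEST_PROTOCOL)


def unpack_environment(blob):
    """Restore the dict of named objects from serialized bytes."""
    return pickle.loads(blob)


def push_to_s3(blob, bucket, key, region="us-east-1"):
    """Upload the packed environment so it survives instance termination."""
    import boto3  # deferred import: only needed when actually calling AWS

    boto3.client("s3", region_name=region).put_object(
        Bucket=bucket, Key=key, Body=blob
    )


def pull_from_s3(bucket, key, region="us-east-1"):
    """Download a previously saved environment onto a new instance."""
    import boto3

    resp = boto3.client("s3", region_name=region).get_object(Bucket=bucket, Key=key)
    return resp["Body"].read()
```

The round trip is what matters: because the serialized environment lives in S3 rather than on the instance, the compute resources can be terminated at any time without losing work, which is exactly the cost-saving pattern described above.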