Announcing Jupyter Notebook on the B23 Data Platform

March 7th, 2018   B23 is happy to announce we’ve added Jupyter Notebook as the latest stack in our platform. Jupyter has quickly become a favorite with Data Scientists because of its notebook format and its support for many programming languages, like R, Scala, and Python. The B23 Data Platform gives Data Scientists access to their preferred data processing tools in a secure, automated environment targeted specifically to their business needs. According to a recent Harvard Business Review survey, 80% of organizations believe that the inability of teams to work together on common data slows the organization’s ability to quickly reach business objectives. B23 Data Platform can help these organizations boost their data science team’s productivity with notebook collaboration and sharing tools like Jupyter. With easier data access and computing power paired with a rich web user interface, open-source capabilities, and scalable cloud data processing, Jupyter Notebook adds another favorite power tool to the B23 Data Platform. With just a few button clicks, the Jupyter stack launches. Open the Jupyter Notebook URL and you are ready to start coding! The B23 Data Platform is an open, secure, and fast marketplace for big data, data science, artificial intelligence, and machine learning tools. In minutes, Data Scientists can securely analyze their data in the cloud, with the freedom to use familiar tools like Apache Spark, Apache Zeppelin, RStudio, H2O, and Jupyter Notebook. Discover a better way to analyze your data with B23. ...

Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

April 25th, 2017   The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing), and it’s easy to spawn workers to consume data from each shard (the Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream. Spark Streaming implements a receiver using the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB), which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics. When we tried to implement the recommended solution using Spark checkpointing, it was very difficult to develop any code without breaking our checkpoints. When Spark saves checkpoints, it serializes the classes that define the transformations and then uses that state to restart a stopped stream. If you then change the structure of one of the transformation classes, the checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application’s checkpoints; however, in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, in combination with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing and pursue a simpler, albeit somewhat hacky, alternative. Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time. (If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number.) When starting up a Spark...
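
To make the sequence-number approach concrete, here is a minimal, hypothetical sketch in Scala (not the code from this post): it tracks the highest sequence number committed per shard and drops any record at or below that mark. The ConsumedRecord case class and the in-memory map are stand-ins for whatever record wrapper and durable store (for example, a DynamoDB table) a real application would use, and the exactly-once guarantee depends on committing the high-water marks atomically with the batch’s output.

```scala
import scala.collection.mutable

// Hypothetical view of a consumed Kinesis record (shard id, sequence number, payload).
case class ConsumedRecord(shardId: String, sequenceNumber: String, payload: Array[Byte])

object SequenceNumberDedup {
  // Kinesis sequence numbers are large decimal strings, so compare them as BigInt.
  private def seq(s: String): BigInt = BigInt(s)

  // In a real deployment this map would live in a durable store (e.g. DynamoDB),
  // committed together with the batch's output; kept in memory here for illustration.
  private val lastProcessed = mutable.Map.empty[String, BigInt]

  /** Keep only records strictly newer than the last committed sequence number for their shard. */
  def filterNew(batch: Seq[ConsumedRecord]): Seq[ConsumedRecord] =
    batch.filter { r =>
      lastProcessed.get(r.shardId).forall(seq(r.sequenceNumber) > _)
    }

  /** After the batch's side effects are committed, advance the per-shard high-water marks. */
  def commit(batch: Seq[ConsumedRecord]): Unit =
    batch.groupBy(_.shardId).foreach { case (shard, recs) =>
      val maxSeq = recs.map(r => seq(r.sequenceNumber)).max
      lastProcessed.update(shard, lastProcessed.get(shard).fold(maxSeq)(_ max maxSeq))
    }
}
```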

Announcing File-Level Integration with GitHub

February 23rd, 2017   Packing up and moving your big data from on-premises can be daunting, from workload capacity and timing pressures to worries over protecting your proprietary projects. B23 Data Platform has already automated the process of launching software inside virtual private clouds. Now B23’s latest feature offers you the convenience and security of seamless GitHub integration. Want to avoid using a config file or the command line? Not excited about loading and installing tons of software to sync your project files? B23 Data Platform now automatically clones your Git repository when your stack launches. B23 has brought GitHub integration to you and the other 70,000+ organizations that trust the GitHub software-building community. Got a few minutes? B23 lets you select individual files or entire stacks of data to move. Just sit back and relax while your projects securely populate the R environment, where your data resides on one or more nodes without ever touching the Internet, unless you so desire. With OAuth, there’s also no need to provide any sensitive credentials to B23 for authorization. Walking through our GitHub integration, you’ll be presented with the option to link your GitHub account with B23 Data Platform. Then you select one of the GitHub repositories under your account or enter a custom repository name. Finally, you can select individual files to move, or copy the entire repository to your chosen destination. Now we’ve removed another step between you and your data! GitHub integration is currently available for the R and SparklyR stacks. Be sure to check out this feature as we continue to provision more opportunities for automated deployment of development environments. Please reach out to us at info@b23.io for more ways we can help drive your data solutions....

Introducing Integration with Enterprise Authentication for B23 Data Platform

February 22nd, 2017   Imagine how much extra productivity you could unleash if your time weren’t burdened by managing a myriad of username and password combinations. With ever-growing demands for tighter security, increasingly complex passwords are a mental drain. On a personal level it’s frustrating, but on an enterprise level it can be paralyzing. Thankfully, B23 Data Platform now offers seamless Identity Provider Initiated (IdP-initiated) authentication integration with Okta and other identity providers through Security Assertion Markup Language 2.0 (SAML 2.0). This takes away burdens such as managing unique user accounts for yet another cloud application: all users in your enterprise organization can simply authenticate through your preexisting Okta dashboard or other SAML identity provider, with minimal setup required. B23 Data Platform can also be pulled in as another chiclet and launched from users’ familiar application dashboard. What if I’m using Lightweight Directory Access Protocol (LDAP)? If you’ve tied it in to your SAML provider, you’re good to go! B23 appreciates the time and energy your company has already spent integrating Single Sign-On solutions and will continue to incorporate other third-party SAML 2.0 identity provider solutions like Ping, Auth0, and Centrify. If you’re interested in setting up B23 Data Platform for your organization, please reach out to info@b23.io or visit www.b23.io....

The Modern Data Economy — Part II of III: Tools of the Data Economy

May 11th, 2016   January 2013: Introduction to Spark and Shark. At the heart of our data transformation and productization approach are a small number of important tools, including Apache Spark and the nascent Apache Airflow (incubating). I first heard of Apache Spark in January 2013 while working at Amazon Web Services (“AWS”). I was part of the team that launched Apache Accumulo on Elastic MapReduce (“EMR”). During this time I was able to collaborate with a team member who was implementing a similar capability with Spark and Shark. At that time, Spark and Shark were both UC Berkeley AMPLab projects, and their similar-sounding names confused more than a handful of people. In the years since, and continuing into the present, many misconceptions about Spark have persisted. These include that Spark only works with in-memory data sets, that it was intended only for machine learning, that it could not scale like MapReduce, and so on. Spark has grown so quickly that we believe some of its best capabilities are not well understood. Spark is not perfect, but in our real-world implementation experience we have found Spark to be one of the most ubiquitous, effective solutions for almost every conceivable data requirement. It has become the Swiss Army Knife for the Data Economy. Spark and Its Subsystems: Early on, we made a deliberate decision to use Scala for all of our Spark development work. We had years of experience tuning Java Virtual Machines (“JVMs”), and most of us had functional programming backgrounds due to a recent immersion in the Clojure programming language. Unrelated to Spark, we had recently run into Python’s global interpreter lock (“GIL”) while working on a petabyte-scale ingest project, and we definitely wanted to make sure we did not find ourselves in that boat again anytime soon. We still use Python quite a bit, but Scala was an easy choice overall, with several positive outcomes detailed below. Spark’s MLlib machine learning capabilities get the lion’s share of attention these...