Experimenting with Chromebook Data Science in the Cloud

September 26th, 2017

Last spring I gave a talk at the New York R Conference and EARL SF titled “The Missing Manual for Running R on Amazon Cloud”. It was aimed at small (or large) enterprise users looking to build out or outfit a data science team capable of doing effective data science in the cloud, with all the data ingest, security, and usability concerns that come with navigating that space. In recent months I’ve been surprised but overjoyed to see cloud data science championed by citizen data nerds and #rstats community folks in academia. Jeff Leek has been reporting on his second attempt at a Chromebook data science experiment. Brian Caffo has several videos on his YouTube channel about “going cloudy”, including a review of Jeff Leek’s blog post mentioned above.

In an experiment of unknown and unplanned duration, I’ve been leaving my work laptop at work on Friday night, then resisting the urge to go get it on Saturday morning. If I need or want to do anything on a computer, be it R or otherwise, I have to figure out how to do it on my Acer Chromebook 11.

The major things, like working with RStudio Server on AWS, aren’t all that different from how I operate every day at work. I do find that I’m more likely to “cheat” and use a local-cloud hybrid approach to data management when I’m using my work machine, and I like that the Chromebook forces me to honestly evaluate the usability of the cloud data science system we’ve designed. It’s the little things that have me feeling constrained on the Chromebook: taking screenshots, managing them all, editing diagrams, and trying to create slide deck presentations is all a bit of a drag. So far I’ve felt more effective switching to my phone when I need to do that sort of thing. Making an Acer Chromebook 11 feel satisfying to operate is probably a lost cause, but there is something really fun about having all the power of the cloud at your fingertips on one of the cheapest little laptops money can buy. ...

Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

April 25th, 2017

The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing), and it’s easy to spawn workers to consume data from each shard (the Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream.

Spark Streaming implements a receiver that uses the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB) which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics.

When we tried to implement the recommended solution using Spark checkpointing, it was very difficult to develop any code without breaking our checkpoints. When Spark saves checkpoints, it serializes the classes that define the transformations and uses them to restart a stopped stream. If you then change the structure of one of those transformation classes, the checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application’s checkpoints, but in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, combined with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing and pursue a simpler, albeit somewhat hacky, alternative.

Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time. (If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number.) When starting up a Spark...
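To make the sequence-number bookkeeping concrete, here is a minimal sketch of manual checkpointing against a single Kinesis shard, written in R with the paws AWS SDK rather than the Spark code our pipeline actually uses. The stream name, shard id, and local checkpoint file are hypothetical placeholders; a production consumer would persist its checkpoint somewhere durable, as the Kinesis Client Library does with DynamoDB.

```r
# Sketch: manual Kinesis checkpointing via per-shard sequence numbers,
# using the paws AWS SDK for R. Stream/shard names and the checkpoint
# file are illustrative assumptions, not our production setup.
library(paws)

kinesis <- paws::kinesis()

stream_name     <- "example-stream"          # hypothetical
shard_id        <- "shardId-000000000000"    # hypothetical
checkpoint_file <- "checkpoint.txt"          # last processed sequence number

# Resume just after the last checkpointed sequence number, or start at
# the oldest available record if no checkpoint exists yet.
if (file.exists(checkpoint_file)) {
  last_seq <- readLines(checkpoint_file, n = 1)
  iter <- kinesis$get_shard_iterator(
    StreamName             = stream_name,
    ShardId                = shard_id,
    ShardIteratorType      = "AFTER_SEQUENCE_NUMBER",
    StartingSequenceNumber = last_seq
  )$ShardIterator
} else {
  iter <- kinesis$get_shard_iterator(
    StreamName        = stream_name,
    ShardId           = shard_id,
    ShardIteratorType = "TRIM_HORIZON"
  )$ShardIterator
}

resp <- kinesis$get_records(ShardIterator = iter, Limit = 100)

for (rec in resp$Records) {
  message(rawToChar(rec$Data))  # stand-in for real processing
  # Checkpoint only AFTER successful processing. Because sequence
  # numbers within a shard are unique and increasing, the last one
  # seen is the shard's high-water mark.
  writeLines(rec$SequenceNumber, checkpoint_file)
}
```

Checkpointing after processing rather than before is what the approach described above builds on: on restart, anything at or below the recorded sequence number for a shard can be filtered out, which is how duplicates from a replay are avoided.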

Introducing B23r — An R Package for Importing and Exporting Entire R Environments Backed by Amazon’s Simple Storage Service (S3)

January 18th, 2017

The B23 Data Platform is free to use when launching R services on Amazon Web Services (“AWS”).

R is an analytics technology that has been slow to adapt to the Cloud. Other analytical platforms such as Spark and Hadoop have evolved over the past several years to include specific capabilities that facilitate their use in the Cloud. B23 has been enabling analytical and distributed processing applications in the Cloud for many years, and we are happy to announce a new, efficient, and effective capability to persist R environments in the Amazon Cloud, helping organizations keep cloud computing costs low while making it easier for R users to collaborate.

Adapting R for the Ephemeral Nature of Cloud Computing

As data gets bigger and computation more complex, freeing R from the constraints and limitations of running locally on a laptop is often a critical concern. The Cloud is the optimal computing platform for R for a variety of reasons, and our B23 Data Platform was designed to run R optimally in the Amazon Cloud. The high-level benefits of running R in the Cloud include processing data where it is already stored, the ability to use varied computational resources that best suit analysis requirements (as opposed to a laptop with fixed computational resources), the ability to terminate compute and storage resources (and therefore stop incurring costs) when they are no longer needed, and a security framework to securely ingest data directly into an R environment. Unfortunately, once you start working in the Cloud, you’ll face challenges to working with R that aren’t well defined. In the four easy steps below, we describe how to use the B23r package to bridge these gaps.

Challenge

One of the biggest and most basic issues we’ve heard from users is concern about saving and restoring work when taking advantage of the ephemeral nature of the Cloud. In 2016, Apache Zeppelin released a capability that allowed data science notebooks to persist...
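The excerpt cuts off before the four steps, but the underlying pattern — serialize the R environment, push it to S3 before the instance terminates, and pull it back down on a fresh instance — can be sketched with base R plus the aws.s3 package. To be clear, this is not the B23r API itself, just a minimal illustration of the idea it packages up; the bucket name and object key below are hypothetical placeholders.

```r
# Sketch: persisting an R environment to S3 and restoring it later,
# using base R and the aws.s3 package. This is NOT the B23r API;
# bucket and key names are hypothetical.
library(aws.s3)

bucket <- "my-analysis-bucket"        # hypothetical
key    <- "workspaces/session.RData"  # hypothetical

# --- Before terminating the ephemeral instance: snapshot and upload ---
tmp <- tempfile(fileext = ".RData")
save.image(file = tmp)                           # snapshot the global environment
put_object(file = tmp, object = key, bucket = bucket)

# --- On a fresh instance: download and restore ---
tmp2 <- tempfile(fileext = ".RData")
save_object(object = key, bucket = bucket, file = tmp2)
load(tmp2, envir = globalenv())                  # your objects are back
```

Because compute is billed by the hour but S3 storage is cheap, a workflow like this lets you tear down the R instance entirely between sessions and still pick up exactly where you left off.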

UNDER THE RADAR: B23

Bisnow has started a series profiling interesting, yet under-the-radar tech companies in the DC region. A little-known company called B23, which was started by two of the original employees at Amazon Web Services, kicks off our series. Send other suggestions for this column to Bisnow’s tech editor, Tania Anderson…

https://www.bisnow.com/washington-dc/news/tech/under-the-radar-b23-54644