January 18th, 2017
The B23 Data Platform is free to use when launching R services on Amazon Web Services (“AWS”)
R has been slow to adapt to the Cloud. Other analytical platforms, such as Spark and Hadoop, have evolved over the past several years to include capabilities that make them easier to use in the Cloud. B23 has been enabling analytical and distributed processing applications in the Cloud for many years, and we are happy to announce a new capability for persisting R environments in the Amazon Cloud. It helps organizations keep cloud computing costs low while making it easier for R users to collaborate.
Adapting R for the Ephemeral Nature of Cloud Computing
As data gets bigger and computation more complex, freeing R from the constraints of running locally on a laptop is often a critical concern. The Cloud is the optimal computing platform for R for a variety of reasons, and our B23 Data Platform was designed to run R optimally in the Amazon Cloud. The high-level benefits of running R in the Cloud include efficient processing of data already stored in the Cloud, the ability to choose computational resources that best suit the analysis requirements (as opposed to a laptop with fixed computational resources), the ability to terminate compute and storage resources (and therefore stop incurring costs) when they are no longer needed, and a security framework for ingesting data directly into an R environment. Unfortunately, once you start working in the Cloud, you’ll face challenges to working with R that aren’t well defined. In the four steps below, we describe how to use the B23r package to help bridge these gaps.
One of the biggest and most basic concerns we’ve heard from users is how to save and restore work when taking advantage of the ephemeral nature of the Cloud. In 2016, Apache Zeppelin released a capability that allowed data science notebooks to persist in Amazon S3 buckets. B23 Data Platform has enhanced this feature in our Zeppelin stack to make it easier to integrate with S3 as well. No such feature exists natively in R, which can quickly lead to serious financial or productivity blockers for you or your team. Consider the following typical workflow for provisioning and using cloud resources to run R or R Studio on a daily basis:
- Based on your experience and company, it can take minutes to weeks to launch an R instance on AWS.
- For each new R instance you create, it will take a certain amount of time and active engagement to install packages and set up the R or R Studio environment to meet the needs of your current work.
- Running R in a secure Amazon virtual private cloud (“VPC”) costs money. Leaving your R instance running indefinitely may not be an option for you financially.
- For most organizations, securing an Amazon environment is more of an art than a science. Securing ephemeral environments consistently is a significant challenge.
- Launching and terminating R instances on a daily basis can become burdensome.
- As the complexity of your working environment increases, you may start to calculate tradeoffs between the cost of leaving your R instance running and starting over from scratch with each new instance.
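The tradeoff in that last point can be put into rough numbers. The sketch below is only an illustration: the hourly rate, hours of use, and setup time are hypothetical placeholders (not AWS prices), and the two helper functions are made up for this example.

```r
# All rates and hours below are hypothetical placeholders, not AWS prices.

# Daily cost of an instance that is never shut down.
daily_cost_always_on <- function(hourly_rate) {
  24 * hourly_rate
}

# Daily cost of an instance that is launched and terminated each day.
# setup_hours is billed time spent reinstalling packages and restoring work.
daily_cost_ephemeral <- function(hourly_rate, hours_used, setup_hours) {
  (hours_used + setup_hours) * hourly_rate
}

hourly_rate <- 0.25  # hypothetical instance price in USD per hour
always_on <- daily_cost_always_on(hourly_rate)
ephemeral <- daily_cost_ephemeral(hourly_rate, hours_used = 8, setup_hours = 1)

cat("Always on:", always_on, "USD per day\n")
cat("Ephemeral:", ephemeral, "USD per day\n")
```

The ephemeral option only stays cheaper in practice if the setup time is small, which is the part a save/restore workflow is meant to shrink.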
Using B23 Data Platform, leverage B23r on AWS to enhance cloud-based R behind the scenes
Step 1 — Launch a new R instance through B23 Data Platform with the basic configurations of your choice.
Step 2 — Select a bucket from your AWS S3 account from the drop down in the B23 Data Platform user interface, in which you would like to start storing R project files. B23 will automatically create a B23/r-projects folder in the bucket you select if one does not yet exist. You must have at least one bucket available in S3 to get started.
Step 3 — Launch the new R stack within B23 Data Platform. This will take about 10 minutes.
Step 4 — Access the R Studio UI via the Stack URL provided and sign in with username/password
admin/admin. The B23r package library will be installed and loaded as part of the stack launch process. Instructions for all the save/restore functions with B23r will be in an Rmarkdown document in the /home/admin/ working directory. Any project files previously saved with B23r to the S3 bucket you selected will also be in the working directory.
How to use B23r in R Studio
The following four parts will demonstrate Initialize/Save, Restore, Save As, and Refresh Files from the pre-defined S3 bucket connection.
Part 1 How to Save Using B23r — The R Studio project-based approach
Start a new project through R Studio (File > New Project > New Directory).
When you are ready to save, run the B23r functions initStack() and saveStack().
This action will bundle your project directory as well as maintain a record of the packages (including version data) you’ve installed. If you want to retain your workspace as part of the project, make sure to save the environment as an .Rdata file to the project directory. B23 will create a folder tree /B23/r-projects/ in your S3 bucket where all bundled projects will then be stored.
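As a minimal sketch of the workspace step, using only base R rather than B23r itself: save.image() writes the current workspace to a file in the project directory, and installed.packages() lists package names and versions, which is the kind of information a bundle needs to record. The file name workspace.RData is an arbitrary choice for illustration, and how B23r records packages internally is not shown in this post.

```r
# Save every object in the current workspace to a file in the project
# directory so that it is bundled along with the rest of the project.
# "workspace.RData" is an illustrative file name, not a B23r requirement.
workspace_file <- file.path(getwd(), "workspace.RData")
save.image(file = workspace_file)

# Base-R view of installed packages and their versions. This only shows
# the kind of record involved; it is not B23r's own mechanism.
pkgs <- as.data.frame(
  installed.packages()[, c("Package", "Version"), drop = FALSE],
  stringsAsFactors = FALSE
)
head(pkgs)
```

Run this from the project's working directory before saving so the workspace file ends up inside the bundle.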
Part 2 How to Restore Using B23r
Open a previously saved project file in two steps:
- Unbundle the tar.gz project file with the B23r function restoreStack(). For example, if the project file is called test.tar.gz, then the command restoreStack('test.tar.gz') will restore that archived environment. This function takes string input of the full project file name and creates a sub-folder in the working directory for the .Rproj file and any other associated data files. The unbundling process can take several minutes as exact versions of the packages used are installed with the project.
- Open the new folder and click on the .Rproj file to launch the project in R Studio. You may still navigate back to the /home/admin directory at any time through the files navigation console.
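To make the unbundling step more concrete, here is a small base-R sketch of what packing and unpacking a tar.gz project archive looks like. It is not the implementation of restoreStack(), which also reinstalls package versions; it only uses utils::tar() and utils::untar() on a throwaway project in a temporary directory, and the folder name restored is made up for the example.

```r
# Build a throwaway project in a temporary directory.
work_dir <- tempfile("b23r-demo-")
dir.create(file.path(work_dir, "test"), recursive = TRUE)
writeLines("Version: 1.0", file.path(work_dir, "test", "test.Rproj"))

old_wd <- setwd(work_dir)

# Bundle the project folder into test.tar.gz, then delete the original.
tar("test.tar.gz", files = "test", compression = "gzip", tar = "internal")
unlink("test", recursive = TRUE)

# Unbundle the archive into a sub-folder of the working directory.
untar("test.tar.gz", exdir = "restored", tar = "internal")
restored_files <- list.files("restored", recursive = TRUE)

setwd(old_wd)
print(restored_files)
```

The .Rproj file reappears under the new sub-folder, which mirrors the "open the new folder" step above.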
Part 3 How to Save As Using B23r — Versioning projects
Running the B23r saveStack() function without any input parameters will perform the bundle/save using a default naming convention, ProjectName-SystemDate. Saving a project multiple times on the same day using the default naming convention will overwrite the old project file. You may choose to version your project by providing the save function a custom name. Note that saveStack() with a custom name will only work after the initial initStack() and saveStack() functions have been run.
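One way to avoid same-day overwrites is to build the custom name yourself. The helper below is hypothetical (it is not part of B23r), and the date format is an assumption, since this post does not spell out exactly how SystemDate is rendered in the default name.

```r
# Hypothetical helper, not part of B23r: builds a project name that adds a
# version suffix to a ProjectName-Date style name.
versioned_name <- function(project, version, date = Sys.Date()) {
  paste0(project, "-", format(date, "%Y-%m-%d"), "-v", version)
}

custom_name <- versioned_name("test", 2, as.Date("2017-01-18"))
print(custom_name)

# Pass the resulting string to saveStack() as the custom name. The exact
# argument form is not shown in this post, so check the Rmarkdown
# instructions that ship with the stack.
```

Incrementing the version argument on each save keeps earlier bundles from the same day intact.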
Part 4 How to Refresh — Pull in New Projects from S3
To refresh the list of S3 project files available in the working directory, run the B23r refreshS3() function. This function will pull in any new files from the /B23/r-projects folder in your S3 bucket.
About the Author: Kelly O’Briant is a data scientist and lead R package maintainer at B23 LLC. She received her M.S. in Computational Science and Informatics from George Mason University. Kelly is a founder and co-organizer of the Washington DC chapter of R-Ladies Global.