A Pattern for Managing Clusters in the Cloud with Ansible Vault

September 8th, 2016

David Kegley is a Software Engineer at B23 LLC.

If you’ve ever been tasked with standing up a cluster environment in the cloud, you’ve probably used a tool like Ansible to make your life a little easier. By defining the configuration in code, admins can develop their cloud infrastructure in a maintainable and repeatable way. This allows for iteration through trial and error when developing for the cloud.

One drawback of developing in this fashion is that it can be difficult to maintain access to a cluster during development. When large teams require access to a common cloud environment, security is often the first casualty in the development process. Maintaining an individual private key for each developer is impractical when the team is large. Cycling keys is a good idea, but it means redistributing the private key to the entire team every time a key is changed.

Ansible provides some useful functionality that can alleviate some of these maintenance woes. Ansible Vault uses symmetric-key encryption, which allows us to store private keys in our repository while still providing an acceptable level of security for our cloud. Note that you should never store unencrypted private keys in source control. Because the encrypted private keys live in the repository, developers can still access the cluster after the private key has changed. We can cycle the private key frequently without worrying about hindering development. Should a private key be misplaced, cycling keys is as simple as encrypting a new private key and replacing the old one in source control.

So how is this accomplished? First, we will generate a random key to use as our vault password file. This key should only be shared with trusted individuals and it should never be committed to source control.

Generate the key and place it in the keys/ directory using the following command: There are many ways to generate random strings; I’m using date | md5 for the sake of simplicity, but any method will work. Define your cluster for whichever...
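The key-generation step described in this excerpt could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's command: it swaps the `date | md5` approach for Python's `secrets` module, and the file name `vault_password` and the helper `write_vault_password` are hypothetical names chosen for the example.

```python
import os
import secrets
import tempfile
from pathlib import Path


def write_vault_password(path: Path, nbytes: int = 32) -> str:
    """Write a random hex password to `path` and return it.

    Hypothetical helper: the name and the default length are assumptions.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    password = secrets.token_hex(nbytes)  # 2 * nbytes hex characters
    # O_EXCL refuses to overwrite an existing password file;
    # mode 0o600 requests owner-only read/write on POSIX systems.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password + "\n")
    return password


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than a real repository.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "keys" / "vault_password"  # assumed file name
        write_vault_password(target)
        print("wrote", target)
```

A file produced this way could then be supplied to `ansible-vault` through its `--vault-password-file` option when encrypting a private key. Whatever path is used, it should be excluded from source control, as the post advises.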

Discovering Relationships in an AWS VPC using Ansible

A blog post describing our open source Ansible module for discovering relationships in an AWS Virtual Private Cloud.

By Dave Hirko

Launching in the Cloud?

At B23 we build, implement, and configure distributed processing applications for Fortune 500 customers with sensitive data stored in The Cloud. This fall, in the run-up to the Amazon Web Services (AWS) re:Invent conference, we observed many companies from many industries claiming that they could Launch in the Cloud. To us, Launching in the Cloud is about as ambiguous as the term Cloud itself. Having spent many years working with AWS technologies, we were curious…

Yet Another Security Cloud Blog (YASCB)

We started to observe that most of these applications were critically flawed in addressing basic security principles once they were Launched in the Cloud. It wasn’t that AWS was insecure, but that these applications were not using the basic AWS services available to them to enable basic security features. For example, most of the application EC2 hosts were assigned public Internet Protocol (IP) addresses, which made them accessible to anyone on the Internet. Unlike traditional networks that exhibit some form of defense-in-depth, they did not take advantage of AWS’ powerful software-defined networking (SDN) subnet and routing capabilities within a Virtual Private Cloud (VPC). In one egregious case, an application configured a Hadoop cluster where every node in the cluster was allocated a public Elastic IP address. For us, that’s negligent, lazy, or both.

Amazon’s Simple Storage Service, or S3, was another major security challenge for most of these Launched in the Cloud applications. S3 has a very robust policy engine that allows for almost any conceivable way of securing its data contents, yet we continued to find improperly configured S3 buckets. Most of the applications using S3 relied upon manual implementation of security policies, leaving them one button click away from having their contents exposed to the world. Not to mention that no...

Managing Distributed Data Products with Ansible and Ambari

A blog post describing our open source Ansible module for managing Apache Ambari clusters.

By Mark Bittmann

At B23, we believe in the use of automation for provisioning and configuring cloud resources to manage complex data pipelines, whether streaming or bulk processing. Data pipelines have evolved beyond the pattern of writing blocks of data to HDFS and running bulk operations. MapReduce is a freight truck, but you might need 5 Jeeps, 2 Jaguars, and a helicopter.

There now exist many tools for managing asynchronous, distributed data pipelines, with each tool designed for particular access patterns. These include Kafka, Storm, Spark, MapReduce, Hive, Elasticsearch, a handful of NoSQL databases, and yes, the old-fashioned relational database. While we are left with the paradox of choice, these options enable us to build precision data products tuned for a particular business problem. Systems often require multiple compute paradigms and access patterns. We use automation and the cloud to reduce the complexity of such systems while at the same time enforcing strict policies for security and compliance.

At B23, we’ve come to know Ansible very well for instantiating and configuring distributed data products. We often use Apache Ambari for installing and configuring software stacks in the Hadoop ecosystem. It doesn’t manage every tool, but it really simplifies the configuration of the Hadoop ecosystem. Ambari has a very flexible RESTful API and a powerful UI. However, we found ourselves generating a lot of overlapping Ambari Blueprints. We also found ourselves repeating a lot of Ansible uri and wait_for calls to create, stop, start, and delete clusters in Ambari. Lastly, we were using a lot of Jinja2 to inject hostnames pulled midstream from Ansible’s dynamic inventory.

To overcome this, we developed a custom Ansible module for managing clusters, and B23 is pleased to make it available as open source: https://github.com/mbittmann/ambari-ansible-module

I think the module has several benefits. First is the...