Managing Distributed Data Products with Ansible and Ambari
A blog post describing our open source Ansible module for managing Apache Ambari clusters.
By Mark Bittmann
At B23, we believe in the use of automation for provisioning and configuring cloud resources to manage complex data pipelines, whether streaming or bulk processing. Data pipelines have evolved beyond the pattern of writing blocks of data to HDFS and running bulk operations. MapReduce is a freight truck, but you might need 5 Jeeps, 2 Jaguars, and a helicopter.
Many tools now exist for managing asynchronous, distributed data pipelines, each designed for particular access patterns. These include Kafka, Storm, Spark, MapReduce, Hive, Elasticsearch, a handful of NoSQL databases, and, yes, the old-fashioned relational database. While we are left with the paradox of choice, these options enable us to build precision data products tuned for a particular business problem. Systems often require multiple compute paradigms and access patterns. We use automation and the cloud to reduce the complexity of such systems while enforcing strict policies for security and compliance.
At B23, we’ve come to know Ansible very well for instantiating and configuring distributed data products. We often use Apache Ambari for installing and configuring software stacks in the Hadoop ecosystem. It doesn’t manage every tool, but it greatly simplifies configuration of the tools it does cover. Ambari has a very flexible RESTful API and a powerful UI. However, we found ourselves generating a lot of overlapping Ambari Blueprints. We also found ourselves repeating a lot of Ansible uri and wait_for calls to create, stop, start, and delete clusters in Ambari. Lastly, we were using a lot of Jinja2 to inject hostnames pulled midstream from Ansible’s dynamic inventory. To address these problems, we developed a custom Ansible module for managing clusters, and B23 is pleased to make it available as open source:
I think the module has several benefits. First is the simplification of the blueprint. Today, Ambari lets you define a cluster in two JSON files: the blueprint and the host_map. The blueprint defines a generic cluster by mapping a set of cluster components (e.g., NameNode, Nimbus server) to groups (e.g., master, worker). The host_map then maps a set of hosts to those groups. While JSON is the language of REST, YAML is the language of Ansible, so our custom module allows you to define a cluster in the compact, readable form of YAML. Here is an example blueprint defined entirely as Ansible variables:
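To make the blueprint and host_map split concrete, here is a minimal Python sketch of the two documents described above. It is not the module's code: the function names, the blueprint name `storm-hdfs`, the `HDP`/`2.2` stack values, and the hostnames are all placeholders chosen for illustration. The JSON shapes follow the general layout of Ambari blueprints and cluster creation templates.

```python
import json


def build_blueprint(name, stack_name, stack_version, groups):
    """Build a blueprint document from a {group: [component, ...]} mapping."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
        "host_groups": [
            {"name": group, "components": [{"name": c} for c in components]}
            for group, components in groups.items()
        ],
    }


def build_host_map(blueprint_name, hosts_by_group):
    """Build the host mapping that assigns concrete hosts to blueprint groups."""
    return {
        "blueprint": blueprint_name,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": h} for h in hosts]}
            for group, hosts in hosts_by_group.items()
        ],
    }


# Placeholder values, for illustration only.
blueprint = build_blueprint(
    "storm-hdfs", "HDP", "2.2",
    {"master": ["NAMENODE", "NIMBUS"], "worker": ["DATANODE", "SUPERVISOR"]},
)
host_map = build_host_map(
    "storm-hdfs",
    {
        "master": ["master1.example.internal"],
        "worker": ["worker1.example.internal", "worker2.example.internal"],
    },
)

print(json.dumps(blueprint, indent=2))
print(json.dumps(host_map, indent=2))
```

The compact `{group: [components]}` mapping is the part that translates naturally into Ansible variables; the verbose JSON is generated rather than hand-written.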
Because we are defining our cluster with Ansible variables, we can further simplify and reuse cluster definitions by defining the overall service-component mappings as variables:
This allows us to express a cluster in the following concise and completely dynamic template:
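The reuse idea in the last two paragraphs can be sketched in Python as follows. This is only an illustration under stated assumptions, not the module's implementation: the role names (`hdfs_master`, `storm_worker`, and so on), the `SERVICE_COMPONENTS` catalog, and both helper functions are hypothetical. The point is that service-to-component mappings are declared once and each cluster definition only lists which roles a group plays.

```python
# Hypothetical catalog: each reusable role expands to a list of components.
SERVICE_COMPONENTS = {
    "hdfs_master": ["NAMENODE", "SECONDARY_NAMENODE"],
    "hdfs_worker": ["DATANODE"],
    "storm_master": ["NIMBUS", "STORM_UI_SERVER"],
    "storm_worker": ["SUPERVISOR"],
}


def expand_roles(roles, catalog=SERVICE_COMPONENTS):
    """Expand role names into a de-duplicated, order-preserving component list."""
    components = []
    for role in roles:
        for component in catalog[role]:
            if component not in components:
                components.append(component)
    return components


def cluster_definition(group_roles, hosts_by_group):
    """Combine per-group roles and per-group hosts into one cluster definition."""
    return [
        {
            "name": group,
            "components": expand_roles(roles),
            "hosts": list(hosts_by_group.get(group, [])),
        }
        for group, roles in group_roles.items()
    ]


# Placeholder hosts; in practice these would come from inventory variables.
cluster = cluster_definition(
    {"master": ["hdfs_master", "storm_master"], "slave": ["hdfs_worker", "storm_worker"]},
    {"master": ["m1.example.internal"], "slave": ["s1.example.internal", "s2.example.internal"]},
)
for group in cluster:
    print(group["name"], group["components"], group["hosts"])
```

Adding a new service to every cluster then means editing the catalog in one place rather than touching each blueprint.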
Now consider the use of Ansible dynamic inventory. We can use EC2 tags to find all hosts with an Ambari_Group tag matching slave or master and populate the slave_hosts and master_hosts variables without needing to mess with changing hostnames or IP addresses.
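As a rough sketch of that grouping step, the Python below sorts instances into host lists by their Ambari_Group tag. The instance records are simplified, hypothetical dictionaries (real dynamic inventory output has a different and richer shape), and the `private_dns_name` field and hostnames are assumptions made for the example.

```python
from collections import defaultdict


def hosts_by_ambari_group(instances, tag_key="Ambari_Group"):
    """Group instance hostnames by the value of their Ambari_Group tag."""
    groups = defaultdict(list)
    for instance in instances:
        group = instance.get("tags", {}).get(tag_key)
        if group:  # skip instances that do not carry the tag
            groups[group].append(instance["private_dns_name"])
    return {group: sorted(hosts) for group, hosts in groups.items()}


# Simplified, made-up instance records for illustration.
instances = [
    {"private_dns_name": "ip-10-0-0-12.ec2.internal", "tags": {"Ambari_Group": "slave"}},
    {"private_dns_name": "ip-10-0-0-10.ec2.internal", "tags": {"Ambari_Group": "master"}},
    {"private_dns_name": "ip-10-0-0-11.ec2.internal", "tags": {"Ambari_Group": "slave"}},
    {"private_dns_name": "ip-10-0-0-99.ec2.internal", "tags": {"Name": "bastion"}},
]

grouped = hosts_by_ambari_group(instances)
master_hosts = grouped.get("master", [])
slave_hosts = grouped.get("slave", [])
print(master_hosts, slave_hosts)
```

Because the host lists are derived from tags at run time, replacing or resizing instances does not require editing the cluster definition.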
Another advantage is that we included a module parameter to wait for a cluster request to complete. This moves the logic of waiting for Ambari to install, configure, and start all service components out of the Ansible playbook and into the Python module. This is valuable when, for example, you need all Spark services running before a handler starts a Spark job.
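The waiting logic amounts to a polling loop. The sketch below is not the module's code: `wait_for_request` and its parameters are hypothetical, and the status strings (`COMPLETED`, `FAILED`, `ABORTED`, `TIMEDOUT`) are assumed to match what Ambari reports for a request. The status lookup is passed in as a callable so the loop can be exercised without a live Ambari server.

```python
import time

# Assumed terminal failure states for an Ambari request.
FAILURE_STATES = {"FAILED", "ABORTED", "TIMEDOUT"}


def wait_for_request(get_status, timeout=1800, interval=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the request completes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "COMPLETED":
            return status
        if status in FAILURE_STATES:
            raise RuntimeError("cluster request ended with status " + status)
        if clock() >= deadline:
            raise TimeoutError("cluster request still " + status + " after timeout")
        sleep(interval)


# Simulated status sequence standing in for repeated REST calls.
simulated = iter(["PENDING", "IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
result = wait_for_request(lambda: next(simulated), sleep=lambda seconds: None)
print(result)
```

In a real module, `get_status` would wrap an HTTP GET against the request resource Ambari returns when a cluster operation is submitted; keeping it injectable keeps the loop easy to test.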
The module does not yet support injecting service or component configuration into Ambari, which is a pretty killer feature of blueprints. We have plans to add that in the future.
Introducing this module into our workflows has significantly reduced the amount of playbook logic and the number of template files in our codebase. It also lets us more quickly integrate systems that combine multiple tools from the ever-growing big data ecosystem.
Mark is a Partner at B23. As a data scientist and technical executive for B23, Mark develops data products to inform business and customer decisions. Mark has built and managed teams with specialized expertise in advanced analytics, distributed processing, DevOps, and mobility. Mark has led the implementation of big data business applications, including engineering some of the largest known implementations of latent semantic indexing with Apache Hadoop. Mark is the author of a patent application for applying machine learning to natural language, and he has contributed to the Apache Spark project. Mark earned a BA in Physics from Georgetown University and an MS in Computer Science from Johns Hopkins University.