Saturday, 31 October 2015

Ansible's for loops

Recently, my team and I hit a big limitation with Ansible that really took me by surprise: for loops can only iterate over a single task, not over roles or playbooks.  This post will discuss why it was a surprise, what we did to get around it and (briefly) what Ansible 2 introduces.

Pre Ansible 2


The original issue really took me by surprise since the documentation on for loops showed a huge number of possibilities (including conditionals, looping over arrays and so on) that suggested to me it was Turing complete and easily capable of meeting all our looping needs.  So when I needed to assign a role multiple times (dynamically from a list), I was shocked to find that it couldn't be done.

I'm not the only one who was caught out by this.  See here, here and here.

Why this wasn't supported really puzzled me.  Some on my team concluded that it must be due to the way Ansible works internally and would perhaps require a lot of re-writing to support the feature.  However, Michael DeHaan said here that there was a "fundamental reason" it couldn't be done:

"...the role must be applied in the same definition (but not the same variables) to each host in the host group. Ergo, you can't have different numbers of roles applied to different hosts in the same group based on some kind of inventory variance."

But I don't understand why you can't.  The hack here suggests to me that you can.

Should the docs have told me what's not possible?


I also wondered if the Ansible documentation on loops should have mentioned its limitations.  Funnily enough, Michael DeHaan wrote an article on Technical Documentation As Marketing which said "your documentation and front material is there to both be honest, communicate how to use something, and also sell what you are doing".  The docs certainly didn't lie to me, but they could have been more honest about limitations.

Workaround

We looked at a few options:

  1. Hack - way too ugly!
  2. Loop over each task within a role separately - not possible for us, since some tasks depended on output from previous tasks.
  3. Convert the role (or set of tasks) into a single task - no appetite to learn and code more Ansible-specific stuff at a time when we felt so let down by it!
  4. Move the logic away from Ansible into something external that Ansible calls (e.g. a bash script).
We opted for the least bad, which to us was option four; a rough sketch of that approach follows.
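As a hedged illustration (the script name, the way the list is passed and the per-item commands are all hypothetical, not what we actually ran), the looping moves into a plain bash script which a single Ansible task can then invoke via the script module:

#!/bin/bash
# provision_items.sh - hypothetical helper invoked by one Ansible task, e.g.
#   script: provision_items.sh {{ my_items | join(' ') }}
# Ansible no longer needs to loop; the script receives the dynamic list and loops itself.
set -euo pipefail

for item in "$@"; do
  echo "Provisioning ${item}..."
  # ...the commands the role would have applied for each item go here...
done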


Post Ansible 2

According to slide 15 here, in Ansible 2 you can loop over multiple tasks, which is really good to hear.  However, it seems you can only loop over tasks, and not roles.  I don't understand why you can't loop over roles, but I am really encouraged to see the feature included.  It's also great to hear that Ansible 2 is backwards compatible with version 1.  I look forward to using it.


Tuesday, 15 September 2015

A Service shouldn't know about its Environment


Recently I was involved with setting up a build and deploy system for a suite of microservices. Each service was bundled into its own Docker image and deployed by Ansible.  One of the biggest lessons I learnt is the subject of this post.

Background


  • Project encompasses around twenty microservices (this figure only set to increase).
  • All packaged and deployed as Docker images.
  • Deployment by Ansible.   
  • Some services need to be deployed to different machines on different networks, since some are public facing, some backend, some internal and so on.
  • Different environment variables need to be injected into different running Docker containers.
  • Each service lives in its own git repo.
  • Each service built by CI server which runs project tests and creates a Docker image. 
  • Big effort to minimise the amount of work/config required to create a new microservice.

The Problem


Where to store the Ansible deployment configuration for each service.

 

First idea: Each service owns its deployment configuration.


At first we had the idea of each service being in control of its own deployment destiny.  This involved each project storing its ansible config in its git repo.  I liked this since it meant that everything you needed to know about a service was in one place.  It was meant to be great for auditing and keeping track of changes since you only needed to worry about one project. 

It all sounded great in theory.  However, we soon discovered it was impractical for a number of reasons.

Making changes to deployment config is hard when it's defined in multiple repos.


The first sign this was a bad idea was that we found ourselves having to make changes in many different projects when we wanted to change the deployment mechanism.  For example, say we wanted to run a command after every deploy: this would require changing the common Ansible role responsible for deployment and then updating each service to use the new version of that role.  This would then require us to rebuild each project, with the pointless step of building and tagging a new Docker image that was identical to the last (since no changes to the Ansible deployment config affected the Docker image for a given service).  With the number of microservices set to increase, this problem was only forecast to get worse.

There's no central place which describes your environment.


Answering the question "What else runs on network/machine X?" becomes impossible, as you have to check an ever-growing number of services' repos.

Dependency is the wrong way around


If each project has its own deployment config, it is in effect describing a subset of the environment.  In our project, this seemed both impractical and inflexible.  In retrospect, a service knowing about its environment seems like a dependency the wrong way round.  To explain this, consider the following relationships of the entities composing a typical Dockerized Java web app.



The diagram above shows the relationships between various entities in two projects.  Each arrow can be read as "depends on".  This diagram also shows that the "Environment" can't be drawn as a single entity in its own right as it's distributed among multiple projects.

Looking at the above from right to left:
  • A library is imported by a webapp. 
  • A webapp is built into a Docker image.
  • The Ansible config deploys Docker Containers of specified Docker images.
  • The Ansible config describes a subset of the environment. 
The arrows that point from the Ansible config to the Environment feel out of place and wrong!  In fact, if any of the above left to right pointing arrows were reversed, we'd have a design problem:

  • If a library knew about the webapp using it, it wouldn't be reusable. 
  • If the webapp knew it was deployed in Docker, it would require re-work to deploy it outside of Docker. 
Since the Project has Ansible config that defines the Environment, it means that the Project is coupled to its environment and deployment implementation.  The same could be said for the Project's relationship with Docker.  The webapp is coupled to Docker via its corresponding Dockerfile, which means replacing Docker with an alternative could be arduous.  However, this coupling seemed OK to us at the time since the Environment was in much more of a state of flux than our Docker containers.

Second idea: Deployment config in its own project


With this idea, we have each project only responsible for defining its Docker image and not how it's deployed.  No Ansible config lives in any microservice repo; it all lives in another project which is used to deploy everything.

The diagram can now be re-drawn as follows:



Here we see the Environment as a fully fledged entity with its own git repo.  The project knows nothing about where or how it's deployed.  The project is only responsible for building Docker images.

After switching from the first idea to the second, we quickly found that this made more sense from a conceptual level and was easier practically.

Why this was better for us:



  • The Environment config could be understood in full by looking at one repo.
  • All Ansible roles that we had created were then moved into this one repo which also helped with exploring the config.
  • Making changes to the deployment config was far easier being in one place.
  • When changing the config, it was less likely to require effort that was proportional to the number of microservices. 
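To make the second idea concrete, here is a rough sketch of the kind of layout such a deployment repo can end up with (the directory and file names are illustrative, not our actual structure):

deployment-repo
├── inventories        (one static inventory per environment)
│   ├── test-inventory
│   └── prod-inventory
├── roles              (all shared Ansible roles moved here)
│   └── deploy-docker-container
├── group_vars         (environment variables injected into the containers)
└── site.yml           (playbook that deploys every microservice)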

Tuesday, 21 July 2015

Notes and thoughts from Ansible meetup - July 2015

Here are my notes and thoughts on the Ansible meetup hosted by Equal Experts on 16th July 2015.  I benefited hugely from the event since I got to hear about a technology that I thought I knew reasonably well being used in a completely different way than I had ever considered.

Talk 1: Debugging custom Ansible modules - Alex Leonhardt

 

Unfortunately I missed this as I left work late.
His slides can be found here: http://www.slideshare.net/aleonhardt/debugging-ansible-modules

Talk 2: AWS, CloudFormation, CodeDeploy, and some Ansible fun - Oskar Pearson


Oskar Pearson talked to us about how the deployment of the E-Petition's website works.  The entire deployment config has been open sourced on github here so you can work out exactly how it has been done.  Oskar's full presentation can be found here but a very brief summary is shown below so that my thoughts have some background context.

Problem Domain:
  • Busy Site.
  • No permanent Ops team.
  • Security & Uptime is important.
  • Cost is a factor.
  • Went with AWS.
Principles
  • AWS isn't a hosting provider [it's a platform] - so do things the AWS way to get the maximum benefit.
  • Disposable servers.
  • Tried to avoid building things themselves.
Architecture details
  • ELB in front of webservers.
  • Webservers in auto-scaling group.
  • S3 storage used for code deploys.
Deployment details:
  • Ansible is the glue that holds everything together...  It builds servers, configures services (e.g. NGINX) and sets up environment variables. 
  • Releases done by copying tar file to S3 and then CodeDeploy invoked to deploy it.

What I found most interesting... Ansible with Disposable Servers

 

I have been familiar with the concept of disposable servers (also known as phoenix servers or immutable servers) and love the benefits they bring.  I have worked in an environment with snowflake servers in the past and seen all of the terrible problems they introduce.  That is why I became such a huge fan of Docker when I started working with it.

However, what I didn't appreciate (until Oskar's talk) was that it is both possible and practical to use tools such as Ansible in this way if the following statement holds true: Ansible only ever runs once on a given VM.  If this always holds true, you get all of the awesome benefits of disposable servers.  In the case of E-Petition's setup, this statement does hold true.
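As a rough sketch of what "Ansible only ever runs once on a given VM" can mean day to day (this is not E-Petition's actual tooling and every helper name below is hypothetical), a deploy becomes "build a new server, configure it exactly once, swap it in, destroy the old one":

#!/bin/bash
# replace_server.sh - hypothetical sketch of a disposable-server style deploy.
set -euo pipefail

NEW_IP="$(./create_new_vm.sh)"               # hypothetical helper that boots a fresh VM

# Ansible runs against the new host exactly once; it is never re-run on this VM.
ansible-playbook site.yml -i "${NEW_IP},"

./swap_into_load_balancer.sh "${NEW_IP}"     # hypothetical helper
./destroy_old_vm.sh                          # the old, previously configured VM is thrown away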

Talk 3: Ask Ansible - Mark Phillips 

 

Mark Phillips then took the floor to field questions from people using Ansible.  Not only did Mark clearly have a huge amount of Ansible knowledge, the room was packed with other experienced Ansible users who also chipped in when needed.

Do you need tests for Ansible roles?

 

First question..."Puppet has things like beaker to help with testing.  How do you test your Ansible config?".  Mark's opinion was that testing is achieved by simply applying the config and seeing if it works.  Mark went on to argue that testing roles (or some other config in isolation) has limited value since you won't know if the entire setup works until you apply it for real, to a real (test) machine.

I can understand this view and definitely recommend that end to end testing is performed, but I'm of the opinion that more is sometimes needed.

If an Ansible role is responsible for just installing a service, it's tempting to agree with Mark... it either works or it doesn't, and if it doesn't it should error in an immediately obvious way.  However, I think that Ansible roles (like anything in IT) can sometimes fail in nasty subtle ways that can go undetected for a long time.  The earlier these issues are detected the less damage they cause.  Also, the quicker they are identified the easier they are to understand and fix.  You are more likely to associate an issue with something that you changed five minutes ago than one from two weeks ago.

Ansible roles don't always obviously fail - they can fail silently



Suppose you have added the following line to a task...
ignore_errors: yes
 
... when you shouldn't have (perhaps due to a copy-paste mistake).  In this situation a failure might not be detected immediately.  Or perhaps the ignore_errors was there for a certain expected error, but the task actually fails for a completely different reason?  This has happened to me!  It's fair to argue that my role was coded badly (as my colleague delighted in informing me!)...  However, some tests for my role might have prevented this issue from escaping our QA process and lurking in production silently.

In the absence of specific tests for an Ansible role, you have to rely on other tests catching your mistakes.  If these tests are reliable and honoured (i.e. no one deploys further up the pipeline if the tests are red) then I think that's OK.  However, I'd still argue for tests that target specific roles in order to quickly identify what is at fault.  Achieving this in practice is hard...

How to test ansible roles (if you wanted to that is)?


I actually wrote about this a while ago.  Essentially you can apply your Ansible role to a docker container.  This in theory should work fine.  Sadly I have found problems with this in practice.  Finding a docker container that mimics my target VM's distribution (CentOS 7) with systemd running is non-trivial.  To be honest I can't remember all the issues I had, but I remember pairing with a colleague and eventually (sadly) giving up.  But the theory's sound!  Perhaps the solution to my problem is to test my Ansible roles against something a little heavier than a docker container, like a Vagrant provisioned VM.
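For what it's worth, a minimal sketch of that heavier alternative might look like the following (the box configured in the Vagrantfile, the test.yml playbook name and the "default" host alias are assumptions):

#!/bin/bash
# Hypothetical sketch: apply a role to a throwaway Vagrant VM instead of a docker container.
set -euo pipefail

vagrant up                                    # boot the CentOS 7 box defined in the Vagrantfile
vagrant ssh-config > vagrant-ssh.config       # ssh settings for the VM; the host alias is "default"

ANSIBLE_SSH_ARGS='-F vagrant-ssh.config' ansible-playbook test.yml -i 'default,'

vagrant destroy -f                            # throw the VM away so every run starts clean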



Ansible with strict SSH host key checking?

 

I then asked Mark about the practicalities of using Ansible with strict SSH host key checking.  Essentially I wanted Ansible to verify the ssh public key of a target host from a pre-defined value.  I wanted to know if there was a better way than my current solution as detailed below....

  • Store the ssh public key for each host in the inventory (currently using flat file inventories) like so.... 
[web_host]
10.1.1.1 public_ssh_rsa_key="....the-key..."
  • Run the following bash script, which will generate a file named
    known_ansible_hosts....
#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

ANSIBLE_SSH_KNOWN_HOSTS_FILE="${HOME}/.ssh/known_ansible_hosts"
INVENTORY_DIR="./inventories"

# Extract the quoted public_ssh_rsa_key value from every inventory entry, one key per line
cat ${INVENTORY_DIR}/* | grep public_ssh_rsa_key | perl -ne 'm/"(.+)?"/ && print $1."\n" ' > "${ANSIBLE_SSH_KNOWN_HOSTS_FILE}"
  • Create an ansible_ssh.config file that references that file for its UserKnownHostsFile:
UserKnownHostsFile ~/.ssh/known_ansible_hosts
  • Create an ansible.cfg file which references the ansible_ssh.config file:
[ssh_connection]
ssh_args = -F ansible_ssh.config
 
...as you can see, it's a bit long-winded.

Answers to this question came fast... I'd guess about twenty brains were thinking of solutions and my one brain couldn't keep up with the proposals.

Oskar mentioned an ansible role on his github account which configures sshd on the destination server.  Although this is closely related, I don't think it's solving quite the same issue as my problem is getting the client to trust the server and not the other way around.  Anyone who does have a better solution... please feel free to comment!


Post Talk Conversations

Ansible's last job when setting up a VM - Disable SSH?  High security and the ultimate immutable server!


A small group of us got chatting with Oskar afterwards and (amongst other things) debated the possibility of Ansible disabling ssh as its last task.  This would in effect permanently restrict access to the VM and create the ultimate immutable server!  I think we both agreed that this would be awesome but would require a lot of faith in:

  • Monitoring - any live site issue would have to be debugged by your monitoring alone... no running telnet or curl from your live VM.
  • Build and deploy pipeline - You'll have to fix (or debug) the issue without manual changes and only using regular releases.
If disabling ssh sounds too risky for your setup (as it does for me currently) then it could be used as extra motivation to improve the above list.  The next time I ssh onto a VM and do something manually, I will try and ask myself "How can this be automated so I can switch off ssh access?"
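As a rough, hedged sketch of what that last job could look like (we didn't actually do this; the inventory path is hypothetical and on CentOS 7 the service is sshd), it might simply be an ad hoc command fired once provisioning has finished:

# Hypothetical final step after a successful provisioning run: stop sshd and prevent it
# starting on the next boot.  Established sessions survive a daemon stop, so the
# command itself can still complete and report back.
ansible all -i inventories/prod-inventory -m shell -a "systemctl stop sshd && systemctl disable sshd"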


All in all, this was a fantastic meetup! I learnt a lot and it really got me thinking about improving my deployments.  Thanks again to all involved.
 

Sunday, 3 May 2015

How to write end to end tests for NGINX using Docker and WireMock

A reverse proxy such as NGINX or Apache seems to be a common feature of many systems, yet it often lacks the tests associated with the business logic of the application(s) it proxies.  Here I will show how, with the help of Docker and WireMock, to write tests that verify your config in isolation.

Git project here.

The Problem

 

Without tests for a reverse proxy, bugs are only found on deployment to a real environment, when some other, unrelated suite of acceptance tests fails.  This has many drawbacks:

  • Bugs are found later (typically after committing, building and deploying), which lengthens your edit, compile, debug cycle.
  • It's impossible to approach your development in a TDD fashion.
  • Without tests in the same code base, you have to rely on comments or documentation to describe the intended behaviour.

The Solution 

 

Test your proxy in isolation, end to end, using WireMock and Docker.

Docker allows us to build and run the proxy in a reliable and repeatable way.  WireMock allows us to start a mocked application server that the proxy (NGINX in this example) forwards requests to.
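To make that concrete, here is a stripped-down sketch of what a script in the spirit of ./buildRunAndTest.sh can do.  It is not the actual project's script: the image names, ports and stub URL are assumptions, and the admin call shown is the WireMock 1.x style API as I remember it.

#!/bin/bash
# Hypothetical sketch: build the NGINX image, stand up a WireMock stub behind it, assert through the proxy.
set -euo pipefail

# Start WireMock and stub an endpoint that NGINX will proxy to.
docker run -d --name wiremock -p 8080:8080 my/wiremock-image       # image name is an assumption
sleep 2
curl -s -X POST http://localhost:8080/__admin/mappings/new \
     -d '{"request": {"method": "GET", "url": "/api/ping"}, "response": {"status": 200, "body": "pong"}}'

# Build and run the NGINX config under test, linked to the WireMock container.
docker build -t my/nginx-under-test .
docker run -d --name nginx-under-test --link wiremock:backend -p 8888:80 my/nginx-under-test
sleep 2

# The actual test: a request through NGINX should reach the stub and come back as "pong".
curl -sf http://localhost:8888/api/ping | grep -q pong && echo "PASS"

# Clean up so the next run starts from scratch.
docker rm -f wiremock nginx-under-test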


 

Benefits

  • Can verify config fast.  The ./buildRunAndTest.sh script takes from eight to twelve seconds on my machine.  I'm sure with more effort this could be optimized.
  • Tests can be run by CI server such as Jenkins or TeamCity on every commit.
  • Bugs found earlier.

Saturday, 4 April 2015

How to test ansible roles with docker

Ansible roles can be difficult to write tests for.  They change the state of a machine and may not be able to run on the Linux distribution that your development or CI machine runs.  Docker solves both of these problems very nicely.  Here I will discuss how, with an example project on github.

Background

Before adopting this strategy I found myself testing ansible roles on the actual machine they were being deployed to.  This is bad because:

  • It's error prone (previous failed attempts can leave the machine you're testing on in an invalid state).
  • You have to write more code to roll back what Ansible has done so you can test again.
  • Not being able to run locally lengthens your edit, compile, debug cycle.
  • In order to automate this for CI you need to set aside a machine/VM.

How to test an ansible role

Jeff Geerling has a great blog post on testing ansible roles with travis ci which was adapted from his book Ansible for DevOps.  His post shows you how to test that:

  • role syntax is ok.
  • role can be applied to localhost.
  • role is idempotent (i.e. can be applied again to localhost resulting in zero changes).

This is a great approach.  However it doesn't quite work for me since:

  • I'm using a different Linux distribution on my dev machine from the production machines that my roles are designed for.
  • Even if I could run them locally, I don't want to, as I'd have to undo the actions in order to run the tests again.
  • I'm not using Travis CI.

Testing ansible role with docker

See my sample project on github.  In short, this involves the exact same steps as above but just replacing localhost with a docker container.

Diagram showing flow of bash script calling ansible which applies the role to a docker container


This diagram does not show the complete steps which are as follows:

  • Test role syntax locally.
  • Test role can be applied to docker container via ssh.
  • Re-apply the role to the docker container and check it results in zero changes.
  • Check that the docker container is in the correct state.  

All I really needed was to write the bash script and find a docker image that was "red hat like" that came with ssh.  I found the tutum centos docker image which was perfect.
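A condensed, hedged sketch of that kind of script is below.  It is not the actual script from the sample project: the playbook name and ports are made up, the ROOT_PASS handling is from memory of the tutum image's README, and it assumes sshpass is installed locally so Ansible can use password authentication.

#!/bin/bash
# Hypothetical sketch: apply a role to a throwaway docker container over ssh and check idempotence.
set -euo pipefail

# 1. Start a "red hat like" container with sshd listening, mapped to local port 5555.
docker run -d --name role-test -p 5555:22 -e ROOT_PASS=testpass tutum/centos
sleep 5                                   # crude wait for sshd to come up

# 2. Point a throwaway inventory at the container.
cat > test-inventory <<EOF
role-test ansible_ssh_host=127.0.0.1 ansible_ssh_port=5555 ansible_ssh_user=root ansible_ssh_pass=testpass
EOF

# 3. Syntax check, apply, then re-apply and fail if the second run changed anything.
ansible-playbook test.yml -i test-inventory --syntax-check
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook test.yml -i test-inventory
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook test.yml -i test-inventory | tee second-run.log
grep -q 'changed=0' second-run.log || { echo "Role is not idempotent"; exit 1; }

# 4. Assertions about the container's state would go here, then throw it away.
docker rm -f role-test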

This testing approach has a number of benefits:
  • Can apply the role to a different Linux distribution than your localhost.  For example, if you use a Debian Linux distribution locally but want to test your role against a CentOS distribution, this is possible by using a docker image that mimics the distribution of your choice.  This can have the effect of simplifying your role as you only have to worry about one platform (e.g. not having to use the Ansible apt and yum modules with when conditions as per this article).
  • Since the docker container is destroyed on the next run of the test, it can be run multiple times without prior runs affecting future runs.
  • Testing via ssh and not localhost.  Ansible should behave the same way when applying a role to localhost as over ssh, but testing via ssh is likely a closer match to your real-world setup.
  • Test can be extended to run against different docker containers.  Perhaps you have different machines in different states which your role will need to handle, this can be tested by applying your role to different docker containers.
The drawbacks:
  • You need docker installed.
  • Finding a docker image that is in the correct state can be hard and might require you to create one yourself.
  • My bash script is a little complicated and might be hard to maintain!  

 

Monday, 30 March 2015

Nagios vs Sensu vs Icinga2

Choosing a suitable monitoring framework for your system is important.  If you get it wrong you might find yourself having to re-write your checks and set up something different, most likely at great cost.  Recently I looked into a few monitoring frameworks for a system and came to a few conclusions which I'll share below.

Background

The System
At the time of this investigation, the system (which has a microservices architecture) was in the process of being "productionised".  It had no monitoring in place and had never been supported in production.  The plan was to introduce monitoring so that it could be supported and monitored 24x7 with the hope of achieving minimal downtime.

Warning - I am biased
Before we get started, I have to acknowledge a few biases I have.  I have worked with nagios in the past and found it to be a bit of a pain.  However, this was probably due to the fact we created our checks in puppet, which added an extra layer of complexity to an already high learning curve.  I decided to re-evaluate nagios because (a) we'd be creating our monitoring checks directly and (b) nagios has moved on since.
I think I might also be biased to favour newer technologies over older ones, for no better reason than that I'm currently at a startup working with a lot of new technologies.
  

Requirements

As follows:
  1. Highly scalable in terms of:
    1. handling complexity (presenting a large number of checks in a way that's easy to understand).
    2. handling load (can support lots of hosts with lots of checks).
  2. Secure.
  3. Good UI with:
    1. Access to historical alerts.
    2. Able to switch off alerts temporarily - possibly with comments e.g. “Ignoring unused failing web node - not causing an issue".
  4. Easy to extend/change.
    1. Ability to define custom checks.
    2. Ability to add descriptive text to an alert e.g. “If this check fails it means users of our site won't be able to...”
    3. Easy to adjust alarm thresholds.   
  5. Good support for check dependencies.
    1. This is related to requirement 1.1 - ideally the monitoring system will be able to help the user separate cause from effect.  When you have 100s of alerts firing it becomes hard to establish the underlying cause (see earlier post on this).  Without alert dependencies, the more alerts you add the more you increase the risk of confusion during an incident.  This is hugely important for your users when it comes to fixing a problem in 5 minutes instead of 30!

Nagios

Nagios is very popular, can do everything, but comes with several drawbacks.  For my proof of concept I extended Brian Goff's docker-nagios image. 

Pros:
  • Very popular so lots of support.
  • Huge number of features.
  • Good documentation.
Cons:
  • High learning curve due to its number of features.  This applies to both navigating the UI and writing checks.
  • Creating check dependencies is cumbersome as you have to reference checks via their service_description field.  This means you either use the description like an ID (i.e. not a description) or you duplicate your description in all the places you reference (depend on) your checks.
  • Creating checks that run more frequently than the default 60-second interval involves a "proceed at your own risk" disclaimer: "I have not really tested other values for this variable, so proceed at your own risk if you decide to do so!"  See here.
  • The UI feels old (at least it does to me).  I remember being very frustrated with the use of frames for the dashboard, which means it's hard to send people links, and if you hit F5 to refresh it takes you back to the homepage.

 

Sensu 

Sensu is a lightweight framework that's simple to extend and use.  I used Hiroaki Sano's sensu docker image to get my proof of concept up and running.

Pros:
Cons:
  • The UI has a feature called stash which I don't understand and which doesn't seem to be documented.
  • Dependencies can be configured, however they seem to have little effect on the dashboard... see the issue I raised on github here.
  • Documentation is great as a beginner's guide and walkthrough.  I learnt the basics very quickly.  However I quickly found the need for a specification page which detailed exactly what the JSON could/could not contain.  This will be fixed soon hopefully: https://github.com/sensu/sensu-docs/issues/192
  • Could not associate descriptive text with my checks e.g. "This checks for connectivity to the database which is required for..."

Icinga2

Originally a fork of the nagios project (and now a complete re-write), this framework has a huge number of features and a good looking dashboard.  I used Jordan Jethwa's icinga2 docker image.

Pros:

Cons:
  • Found the documentation hard to understand at first - this is related to the high learning curve.
  • Can assign notes / free text to alerts, but the dashboard seems to only present it at quite a low level.  I couldn't find a way of customising the dashboard to display my "notes", though perhaps this can be done via some feature I missed or that is undocumented.
 

Chosen Framework: icinga2

Sensu was sadly discarded because we felt there was a risk it would not scale in terms of handling the future complexity of the system.  This was mainly due to the apparent lack of support for dependencies.  The gaps in the documentation also made it feel like it wasn't quite ready to be adopted.  I'm definitely hopeful for sensu's future and look forward to seeing how it develops... it's definitely one to watch.

Nagios vs Icinga2.
They both have:
  • compatibility with the same set of plugins.
  • lots of features, at the cost of a high learning curve.
  • support for check dependencies (so both should scale well in terms of complexity).
The differences:
  • icinga2 has a nicer UI - it feels more responsive.
  • icinga2 lets you dynamically create objects and their relationships with conditionals, which (I think) should result in less boilerplate and copy-pasted code than I have seen with nagios in the past.
Hopefully this will prove a good decision for the project!

Docker is great for these investigations

Being able to run someone else's docker image which gets an entire monitoring framework up, running and accessible in your browser in less than a minute is amazing.  Not only that, but you can also easily add/edit where necessary to get a feel for development.  It removes all the unwanted complexity in following setup guides.  Thanks to Brian Goff, Jordan Jethwa and Hiroaki Sano for their docker images.
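The pattern is usually a single command per framework; the image name and port mapping below are indicative only (check each image's README for the real values):

# Indicative only - exposed ports and image names vary between the images mentioned above.
docker run -d -p 8080:80 cpuguy83/nagios
# ...then browse to http://localhost:8080 and start adding or editing checks.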

If I missed anything...

Let me know.  I only had limited time and may well have missed some killer feature or another monitoring framework entirely that's way better than all of the above.

Thursday, 19 March 2015

Ansible - Sharing Inventories

Recently I needed to share an Ansible inventory with multiple playbooks in different code repos.  I couldn't find a way to do this nicely without writing some new code which I'll share here.

Trisha Gee sums up how I feel about this...  "I'm doing something that no-one else seems to be doing (or talking about). That either means I'm doing something cool, or something stupid."  Perhaps I missed an Ansible feature; if so, please let me know!  I am new to it.  If not, perhaps Ansible could add support for this.

EDIT 19/7/2015

I'm now of the opinion that this is a bad idea.  Regarding the Trisha Gee quote above - I now know what it was I was doing (hint... not cool).  

I am no longer sharing inventories, as now all of our Ansible deployment config (except a few roles) is in one git repo.  This same repo holds our inventories, so there is no need for sharing.  If I did have to share them, I'd use git submodules.
The solution detailed here is not a good idea, but I will leave it here anyway!

EDIT 30/3/2015

Git submodules offer a better alternative... http://git-scm.com/book/en/v2/Git-Tools-Submodules
I'm not entirely sure (I've only just learnt about this) but thought I'd share.
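A minimal sketch of the submodule approach (the repo URL and paths are hypothetical):

# In each playbook repo that needs the shared inventories:
git submodule add ssh://git@github.com/company/shared-inventories.git inventories
git commit -m "Pin shared inventories as a submodule"

# After cloning such a playbook repo (or to pull the pinned version again):
git submodule update --init

# Then run the playbook against the shared inventory file:
ansible-playbook site.yml -i inventories/test-inventory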


Why do I need to share inventories?

 

We currently describe our architecture using static inventories, one file per environment.  As per the examples in the Ansible docs, we define things like [databases], [webservers] and [applications].  I want all of this in one file and one code repo so that it's all defined in one place.  The project that builds and deploys the web layer is separate from the application layer.  I did consider separating inventories by environment AND layer, e.g. test-web-inventory and test-application-inventory.  However, because the web nodes need to know about the application nodes, and likewise application to database, that would lead to duplicating data.

Why not use a dynamic inventory?

 

It feels too much too soon.  The architecture is currently quite simple and I don't want to introduce more complexity (and dependencies) into the build and deploy pipeline until I have to.  A flat file is reassuringly simple.  Once the architecture grows and the flat file becomes untenable, then I think it will be time for a dynamic inventory.

How I shared inventories

 

Summary

 

I put the inventory files within a shared role, that gets assigned to localhost only by the specific playbooks that need it.  That way, once we have installed the playbook's requirements (using ansible-galaxy install), we have the specific version of the inventory files on the file system.  Then we can run our playbooks referencing the freshly retrieved files.

That might sound ugly, but it was that or custom bash scripts that check things out of git.

In more detail

 

The shared-inventory-role I created looks like this:

├── files
│   ├── dev-inventory
│   ├── test-inventory
│   └── prod-inventory
├── meta
│   └── main.yml (standard for describing role)
└── tasks
    └── main.yml (do nothing task - maybe not required)

Each -inventory file contains the full inventory that describes that environment.  This role is in its own git repo.

The playbooks that then import this role (e.g. the playbook for the application servers) do so by referencing it in their requirements.yml file as follows:

- src: ssh://git@github.com/company/ansible-shared-inventory-role.git
  scm: git
  version: ac1c49302dffb8b7d261df1c9199815a9590c480
  path: build
  name: shared-inventory

(note that I can reference a specific version)

Then, when we come to running the playbook, we simply install the playbook's requirements first as follows:

ansible-galaxy install --force -r requirements.yml

This forcefully installs the requirements (i.e. overwrites whatever is currently there) into the build directory.

Then we can apply our playbook with our inventory as follows:

ansible-playbook site.yml -i build/shared-inventory/files/test-env


Thoughts?

  • Have I missed a simpler way to do this?
  • Is this a feature that could/should be added to ansible?




Monday, 16 February 2015

How to connect your docker container to a service on the parent host

Recently I needed a docker container to access a service running on the parent host.  A lot of the examples on the internet were very close to what I needed, but not quite.

First off, docker's documentation provided an example that (as far as I can see) doesn't actually work.  The following (I think) is meant to inject the src ip address of the parent host's default interface:

$ alias hostip="ip route show 0.0.0.0/0 | grep -Eo 'via \S+' | awk '{ print \$2 }'"
$ docker run  --add-host=docker:$(hostip) --rm -it debian


What it actually does is inject the ip of the default gateway on the parent host which doesn't help us.

Michael Hamrah's blog details the "gateway approach" of getting the parent's host ip from within the docker container.  This works great, although I wanted to inject the ip on container run and not have that logic within my docker image.

Inject the src ip of the docker interface


I think the best way is to get the source ip address of the docker0 interface which as described here is a "virtual Ethernet bridge that... lets containers communicate both with the host machine and with each other."  This interface has been created by docker for this purpose so it seems a good choice.  This does however assume that you are using the default docker0 interface and have not configured docker to use your own bridge.

To get the src IP of the docker0 interface, run the following in bash:

$ ip route show | grep docker0 | awk '{print $9}'
172.17.42.1


So, we run our docker image as follows:

$ docker run -d --add-host=parent-host:`ip route show | grep docker0 | awk '{print \$9}'` myusername/mydockerimage


Now, in your docker container, an entry in the /etc/hosts file will appear as follows:

172.17.42.1    parent-host

Tell docker not to containerise your networking



An alternative approach is to use --net="host", which works fine.  You can then access your service running on the parent host using localhost.  However this option does not allow you to also --link containers, as you get the following error:

Conflicting options: --net=host can't be used with links. This would result in undefined behavior.
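For completeness, a small hedged example of that alternative (the port is an assumption; substitute whatever your parent-host service listens on):

# Start an interactive container that shares the parent host's network stack.
docker run --rm -it --net=host debian bash

# Inside the container, a service listening on port 8000 on the parent host is now
# reachable directly as localhost:8000 (e.g. curl http://localhost:8000/ once curl
# is installed) - no --add-host entry is needed.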

Further Reading


Networking is not my strong point.  I have learnt a lot about networking in Linux recently from Martin A. Brown's awesome Guide to IP Layer Network Administration with Linux.

I also came across a project called Pipework from Jérôme Petazzoni (creator of docker in docker) that allows even more network customisation than docker allows out of the box.

Monday, 2 February 2015

How to test for connection leaks

Connection leaks are easy to introduce, hard to spot, can go undetected for a long time and have disastrous consequences.

It's not enough to fix a connection leak or code against them, you must write tests that would fail in their presence.  Here I will discuss a number of techniques that could apply to many different languages and frameworks.  I have also provided some concrete examples on github which make use of Dropwizard, JerseyClient and WireMock.

What is a connection leak?


Each time system A talks to system B over a protocol like http, it requires a TCP/IP connection to send and receive data.  Connections can be re-used for different requests, via a connection pool, or a new connection can be established for each request.  Once system A has finished with the connection, it needs to either return the connection to the pool, or close it.  If this doesn't happen you have a connection leak.  

Given a connection leak, everything will continue to work just fine until a new connection cannot be created.  In the connection pool scenario, this happens when the number of active connections reaches the maximum permitted for the pool.  In the scenario of not using a pool (i.e. creating a new connection each time) this will go on until either system B refuses to maintain any more active connections or the underlying OS of system A refuses to allocate any more connections to your code.  

Connection leaks are ticking time bombs!  

  • How quick the fuse burns depends on how many times the leaky code is invoked.
  • How long the fuse is, depends on how many connections can be maintained.
  • How much destruction it causes depends on the architecture and how quick you can restart your app!   
As if that wasn't enough, a connection leak will manifest itself in different ways depending on your setup.  Threads could hang, and different exceptions can get thrown, which just increases confusion, especially during the panic of a live site issue.

How to test for connection leaks...

Put another way, how do you write a test that will fail due to a connection leak?  There are a few different ways this can be achieved.  The following are ordered by my preference.



Examine the connection pool after your test and ensure that there are no leased connections

Assuming you can inject a http client into your code which uses a connection pool you can examine, this allows you to test for leaks easily.  Once your test has completed, all connections should have been returned to the pool.  If not, your code has failed to release a connection and you have a leak.


Concrete example here:

IntegrationTestThatExaminesConnectionPoolBeforeAndAfterRun.java

In the example above, we setup WireMock to stub the service that our code will call (see the diagram here).  Then, we create an instance of Jersey Client's ApacheHttpClient4 class that allows an underlying instance of HttpClient to be used.  This allows us to create a connection pool that gets used by our code and can later be examined.  Once our test has completed, we simply assert that there are no leased connections:

@After
public void checkLeasedConnectionsIsZero() {
    assertThat(poolingHttpClientConnectionManager.getTotalStats().getLeased(), is(0));
}

Benefits of this approach:
  • We test the full http stack.
  • We add very little time to the build.
  • If there is a connection leak we narrow it down very quickly to a specific test.
  • You don't even have to use a connection pool in your production code (if you don't want to for whatever reason).



Acceptance test that examines connection pool before and after your code runs

As per the above technique, this relies on examining the connection pool for leased connections.  The difference here is that this is an acceptance test which allows us to test our application end to end.  Given we are looking at the real application, we must expose the connection pool somehow.

Concrete example here:

AcceptanceTestThatExaminesConnectionPoolBeforeAndAfterRun.java 

The Dropwizard framework makes this really easy to achieve.  If you create your instance of JerseyClient using the io.dropwizard.client.JerseyClientBuilder (which is great as it sets sensible defaults on all timeouts) you get useful metrics on the connection pool (which is used by the JerseyClient instance) for free.  These metrics are exposed by Dropwizard on the metrics page of your application.  Examining the connection pool's status can then be performed trivially during acceptance tests.

Benefits of this approach:

  • Could be used for manual testing.  If you know that your application is not doing anything, all connection pools should have zero leased connections.  If you are unlucky enough to find that there is a leased connection, you know you have a connection leak, but you might not know where!

Drawback of this approach:
  • Not thread safe.  Running tests in parallel could cause tests to fail incorrectly.  It could be that two connection-leak-free clients are calling your code at the same time. 
  • Forces you to use a connection pool that you expose metrics on.  Perhaps you don't want to do this, but it can be a huge benefit in diagnosing live site issues.  Monitoring tools (such as Graphite) can alert you to issues before they impact your users.  


Unit test your code with mocked dependencies

Connection leaks occur when you forget to call a method, so you can write a test which verifies the correct method gets called.

Concrete example here:
UnitTestThatShowsUnderlyingOSRefusingToAllocateMoreConnections.java

Drawbacks of this approach:
  • Only verifies that you have called the methods you think you need to call to avoid a connection leak.  Working with high level libraries is great as it abstracts the complexities of low level protocols like TCP/IP.  However, verifying that you don't have a connection leak whilst staying at that higher level of abstraction (without looking under the hood at connections and pools) could lead to more incorrect assumptions.

 

Set the connection pool's max size to one, then run your test twice.

This is a viable option but has a lot of drawbacks.  Here we start our application with a deliberately small connection pool so that a connection leak will become evident after two runs.  If a connection leak is present, the test fails when the client code attempts to access a connection from the pool.  In other words, the client doesn't even attempt to open a connection on the 2nd run since it's only allowed to have a maximum of one open at any point in time.

Concrete example here: 

AcceptanceTestThatRunsMoreTimesThanConnectionsInPool.java

This test makes use of a JUnit rule I have lovingly copy-pasted from Frank Appel that I found here.  Dropwizard makes this easy since we just start the application with a different yml config file that has been deliberately named to scare people using it in real environments (i.e. config-with-connection-pool-of-size-one.yml).  In this example, our leaky code will make this test fail with an org.apache.http.conn.ConnectionPoolTimeoutException ("Timeout waiting for connection from pool").

Benefits
  • Don't have to expose your underlying connection pool to the test code.  
  • Assuming that connection pool timeouts have been set, this approach should allow parallel running of tests without false failures.  In other words, two connection-leak-free clients should still eventually pass if they happen to use the same pool at the same time.
Drawbacks:
  • Adds precious time to your build.  I agree with Dan Bodart that builds should be crazy fast.  Doubling up your tests is not a good idea.
  • Easy to get wrong.  If you don't repeat the tests that exercise your code, the tests will still pass.  The risk of accidentally running with a normal connection pool (i.e. a max size greater than one) is mitigated by writing a test that asserts the connection pool is as expected; see the test givenApplicationStartedWithPoolSizeOne_whenMetricsViewed_ConnectionPoolIsOfSizeOne.


 

Configure the server to only allow one open connection to be processed at a time and run your test twice

Again, like the above, this is a viable option but has a lot of drawbacks.  Here we start WireMock (which uses Jetty under the hood) with a deliberately tuned-down number of Jetty acceptors and container threads, forcing it to process only one connection at a time.

Concrete example here:
AcceptanceTestThatRunsMoreTimesThanServerCanTakeConnectionsSimultaneously.java 

I first thought that this test would fail with a ConnectException when it tried to establish a second connection.  Not so.  It actually succeeds in establishing the second connection but then fails with a java.net.SocketTimeoutException.  This is because there aren't any threads free to process the request.


Benefits:
  • Don't have to expose connection pool to tests.
  • Works if you are not using a connection pool.
Drawbacks:
  • As per previous example, adds precious time to your build.

Final thought

My default preference is for the first option "Examine the connection pool after your test and ensure that there are no leased connections" because it adds a negligible amount of time to the build and actually looks at what's going on behind the scenes.  Your choice may depend on project specific factors or something I have not considered. 
Keep in mind that testing can only ever prove the presence of bugs, not their absence.  While these techniques should increase your confidence in connection-leak-free code, they can't prove it.  Given that these issues can still get to production, finding them before they affect your users is the name of the game.  Exposing metrics on your connection pools is made very easy by Dropwizard and should definitely be encouraged.  Running soak tests and load tests can also help identify these issues, but obviously comes at the cost of adding time to your build pipeline.

Any comments, or alternative suggestions are welcome.  Many thanks to Rob Elliot, Tom Akehurst and Anurag Kapur for their input to this post.