Friday, 26 December 2014

Docker - Things I'd like to see fixed

Recently I've been lucky enough to spend some time with Docker.  I've really enjoyed it and would definitely consider myself a fan.  However I think a couple of issues really need fixing.

Docker registries should treat images as immutable entities

Currently Docker registries (including Docker Hub) allow you to overwrite an image with the same tag.  This means you can't be sure that an image hasn't changed since you last pulled it.  This is a nightmare from a build and deploy point of view.  I can imagine a sequence of events whereby a small, low-risk change is pushed through the environments with plenty of testing, only to fail in production when suddenly a modified and incompatible dependent image is pulled and deployed for the first time.  Sadly it seems that this issue has been closed - presumably without a fix.


Registry locations and Image tags need to be separate entities

Uniquely identifying a specific Docker image and specifying a source/destination registry should be two different concepts, but with Docker they are combined and conflated.

Docker images are specified by a tag which consists of user/image:version so running: 

docker pull someone/something

...will pull the latest something image as created by user someone from registry.hub.docker.com.  However, the user part also allows us to specify a different registry, perhaps our own locally available one.  So this means that

docker pull my.local.reg/something

...will pull something from my.local.reg.  This means that in order to push an image to a local registry you need to re-tag it to include the FQDN of my.local.reg as shown below...

docker tag someone/something my.local.reg/something

and then...

docker push my.local.reg/something

...now we've lost the user someone altogether.  This seems strange to me for a number of reasons:

  • It implies that images are different if they have come from different registries which should not be the case.  
  • Your source code references your local registry's FQDN.  This is because Dockerfiles that are declared to be FROM an image need to specify the parent image in the same way.  
  • Your FQDN can never be a single word such as localhost, as Docker will assume you are specifying the username localhost on registry.hub.docker.com.
I'd prefer to keep these things separate, as with Maven's GAV coordinates, where the registry (Maven repo) is specified in an external settings file.

Sunday, 30 November 2014

Does a long notice period hold back your organisation?

My last role had a long notice period of three months.  Although long notice periods are intended to protect an organisation, I think they can act as a detriment.

I remember being concerned when I took on the role that a three month notice period could make job hunting tricky as and when I wanted to move on.  I could easily imagine this being a deciding factor for a team that wanted someone to start as soon as possible, no matter how suitable I was.

"Don't worry, you can negotiate it down when you want to leave" ... I was told.  This didn't really console me as broaching this subject would reveal my intentions of leaving before the new role was agreed.  In any event, I signed the contract and agreed to the notice period.

When it was time for me to move on (after four great years at said company) I decided to hand in my notice before getting my new role agreed.  I also elected to work my notice period in full and did not ask to leave early.

I thought I could make the long notice period work in my favour and also gain more of a sense of control by setting my last day, as opposed to it being up for debate.  Any extra feeling of control to help with the jump into the unknown of a new job felt good.  It was also good not having to sneak out for interviews, and being able to talk openly with colleagues about my job hunt.  The other contributing factor was that my new role was to be freelance, which typically involves getting work at much shorter notice.  To some extent my hands were tied here.

After completing my notice period and handing over my responsibilities, I've come to a few conclusions.

If the handover takes a long time - something is wrong that can't be fixed with a long notice period.

If it takes you more than a month to hand over your responsibilities to other members of the team, it suggests to me that you might be a key man dependency.  This should be addressed with continuous knowledge sharing and proper collaboration... not long handovers.  Don't rely on the long notice period to fix this important issue.

Short notice period

Looking back, I'd argue for a short (one month) notice period for the following reasons:
  • It forces you to address the issue of key man dependencies because you can't fall back on the "get out of jail free card" of long multi-month handover sessions.  
  • A team member who wants to leave will never be as committed as someone who wants to stay.  Why keep them longer than absolutely necessary?
  • It might stop potential candidates from joining your organisation.
Discuss...

Friday, 17 October 2014

Change the characteristics of your complexity

I was lucky enough to attend a talk by John Turner from Paddy Power as part of the London Continuous Delivery meetup.  The talk was very informative and gave the audience some great insight into rolling out PaaS and continuous delivery in the real world.

In order to innovate faster, Paddy Power made a number of changes to "re-orient" the company and put the engineering team first.  To help speed up engineering, decisions on architecture were, where possible, removed and replaced with convention.

Pipeline

An abstract (no tools specified) pipeline was defined with the following stages:

1. Code commit

Trunk based commits with feature switches.

2. Build

Component and Integration tests
Code quality rules (enforced by sonar)

3. Acceptance Tests

Run against the application/service as deployed in a VM provisioned by GigaSpaces' Cloudify product.  Config for the VM's dependencies (e.g. Java installation) is defined in a blueprint.  Acceptance tests are in BDD style with JBehave and Thucydides.

4. Performance Test

The goal here is to see if there has been a significant degradation in performance; the goal is not to try and measure absolute performance.  This is performed with JMeter.

5. Stability Test

At this stage, the acceptance tests are run with some deliberate random failures added to various dependencies.  This has been achieved by an in-house implementation of Netflix's SimianArmy.

6. Manual Test

This is sometimes required for new features, legal issues and also, as John put it, "that warm fuzzy feeling".

7. Production deploy


No Rollbacks

I have debated about this a few times with various people (see here for my thoughts) and got into the same debate with John after his talk.  John had a great analogy... rollbacks are like car airbags, they encourage drivers to take risks.  On the other hand, a spike on the steering wheel is your roll-forward only approach and encourages a lot more care :)

Other snippets

"You can't remove complexity from an inherently complex system.  You can however change the characteristics of the complexity, by distributing it" - This is the goal of moving to a microservices architecture.

Getting your organisation to change can be hard.  A technique John mentioned was to trial the change on a small scale, prove it works, and then try to convince others... sound advice!

Tuesday, 26 August 2014

Pull Requests encourage critique, even in small teams

A member of our team recently suggested that we use pull requests in our development workflow.  Many points were made for and against in our debate.  After only a short time of adopting this change, the major benefit I have seen is the increased critique of the team's output.

The debate

The decision to use pull requests was certainly not a given and gave rise to an interesting debate.  Many arguments were made for and against which I've tried to summarise below.

For pull requests

  • Increased knowledge sharing - Non-reviewers can view pull requests just as easily.
  • Simpler commit messages - Each pull request is (hopefully) created "with the reviewer in mind".
  • Provides a forum for commenting on work.
  • Increase in code quality


Against pull requests


"It's too much process and will slow things down"

As a small team I think we take pride in our lightweight process of pair programming and informal ad-hoc code reviews.  Adding extra checks and process to the deployment pipeline feels almost like a step backwards as we try to cut back on red-tape.  

Counter argument: No other rule would be introduced alongside using pull requests.  This means anyone in the team (except the author of the change) can accept anyone else's request.  Where possible, we'd aim for one pull request per user story.  Perhaps this argument/concern was a fear of pull requests just being the start of a new heavyweight approval process?


"Pull requests lead to bigger and later code merges than the smaller earlier merges of committing straight to master"  

Given that a pull request should encapsulate all changes for a single user-story (or task), this is somewhat inevitable. 

Counter argument: This issue is mitigated by staying focused on the task/story in hand.  If the developer(s) happened to get sidetracked and begin re-factoring, this should be split into a separate pull request.  If a pull request does encapsulate a single user story, task or refactor it becomes easier to review and understand changes made to the code base.


"We pair program... we don't need a extra reviewer"

Counter argument: A fresh pair of eyes can spot things that even a diligent pair might miss.


Benefits I have seen since adopting pull requests


They encourage critique in a way I didn't expect

Pull requests seem to frame a change as a question and not a statement.  My impression is that:

  • Committing to master states: "I am done with this".
  • Submitting a pull request asks: "What do you think?"

It is so much easier to make a comment or suggestion when someone asks for your opinion than when you initiate the dialogue yourself.  The more that comments and suggestions are encouraged, the better the team's output will get.





Tuesday, 29 July 2014

Alerts should treat your Ops team like the police... only contact them in real emergencies!

Every time an alert notifies your Ops team, there should be a real problem.  If there isn't a real problem, you're wasting their time, adding confusion and making it harder for them to respond to real incidents.

The problem we have: Too many alerts for non-issues and non-critical issues.

I sat with a member of our Ops team recently and was horrified to see how many notifications they received from our monitoring systems.  At times, it was almost impossible to make any sense of them due to the sheer volume of emails filling their inbox.

Why is this such a problem?

Multiple reasons:

  • Each notification comes at the cost of lessening the impact of all other notifications.  Taken to the extreme, where Ops receive hundreds per day, the impact of any single alert can be almost zero.
  • A false alarm is a distraction and can waste valuable minutes in debugging real issues.


How did we get here?

The short answer is: by diligently adding more alerts but not diligently reviewing their behaviour.  We (myself included) are guilty of the crime... Wasting Ops Time!  


Addressing the problem

This is not an easy problem to solve and will take time and effort to fix.  

Our Ops team is now (rightfully) putting pressure on the development teams to clear up the worst offending alerts.  They routinely send emails of the top ten "flapping" alerts (i.e. constantly cycling between problem and recovery) in an effort to reduce the spam.  Exposing the problem is definitely the first step towards solving it.

Review a week's worth of your notifications... for each one ask yourself:

  • Does this alert reflect a real issue with our service that needs to be fixed?
  • Was this an issue that Ops really needed to know about?
  • What is the worst case scenario to end users given the alert?
  • Could the alert be safely set to only warn? (e.g. never notify, and only display a yellow/amber as opposed to red on your dashboard).

Some alerts will be a lot easier to fix than others, but you can easily prioritise by the number of notifications sent in a given period.  

Final point: Get included on your system's alerts (if not already), remove your email filters so they come straight to your inbox, and feel Ops' pain.  Filters offer an all too convenient way of burying your head in the sand.

Wednesday, 25 June 2014

Why run browser based acceptance tests as monitoring checks?

So far the services I have been involved with have had acceptance tests and monitoring checks.  For various reasons the tests and checks have been designed, developed and run in separate worlds.  This approach has led to issues falling through the cracks and remaining undetected in live for too long.

Here I will explain our first pass at joining these two separate suites... running browser based acceptance tests as monitoring checks.

What we've done in the past... Separate tests and checks


Acceptance tests


A good suite of acceptance tests should test your service end to end, verify external code quality and run against real external dependencies.  Some acceptance test suites can tick all of these boxes without relying on browser based tests, this is great as much complexity is removed.  For other test suites, a browser is essential.  In past projects, the browser based acceptance tests have run against various non-production test/integration environments.  One project had them running in production however they were separate from the monitoring checks and not under the careful watch of the operations team.  The main reasons for this include:
  • Not having a virtual machine with full production set-up (i.e. highly available) that is capable of running a web browser.
  • Some test suites included tests that were deemed "impossible" to run in live.  This then banished the entire suite to non-live environments.


Monitoring checks


Good monitoring should be able to verify whether the service is working and reveal the underlying causes of issues when it is not.  The monitoring checks we created in the past have used Nagios and typically test the service's external dependencies at a low level (e.g. network connectivity, DNS) and at a high level (e.g. sending an HTTP request to an API).  Where possible we also performed some simple acceptance tests, for example testing that the application returns an expected HTTP status code when requesting a given URL.  Verifying anything more than this can quickly become complicated, especially when user journeys over multiple pages with JavaScript are involved.
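As a rough illustration of this kind of simple check (a hedged sketch only - the URL and expected status are illustrative, and it follows the Nagios convention of exit 0 = OK, exit 2 = CRITICAL):

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal Nagios-style check: exit 0 (OK) if the URL returns the expected
// status code, exit 2 (CRITICAL) otherwise.
public class HttpStatusCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/login");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        int status = connection.getResponseCode();
        if (status == 200) {
            System.out.println("OK - " + url + " returned " + status);
            System.exit(0);
        } else {
            System.out.println("CRITICAL - " + url + " returned " + status);
            System.exit(2);
        }
    }
}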

The cracks between Acceptance Tests and Monitoring checks... 


With the monitoring checks and acceptance tests in place as described above, you're reasonably well covered.  However, take this as an example... what if a JavaScript bug in your user flow prevents users from completing their journey?  Your external dependencies are fine - no monitoring checks will fire.  Your acceptance tests won't save you if they aren't running, or being monitored, in production.

We have got around these cracks in the past by displaying current user traffic on a dashboard, the idea being that if user traffic suddenly drops to zero we'd know something is wrong.  However, with some services it's very hard to define a threshold of user traffic which represents activity that is clearly abnormal.  If nobody has completed a given flow in an hour... is it broken or is it Christmas morning?  Setting the thresholds too aggressively can mean spamming your Ops team during low traffic; setting them too conservatively means it takes a long time to recognise an issue, which reduces the value of having the check at all.

The New Approach


Joining up the two suites and running your acceptance tests as monitoring checks gives you far greater confidence that your service is working at any point in time.  I will create a separate post with the implementation details of this approach soon.
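In the meantime, to give a rough flavour of the idea (a hedged sketch rather than our actual implementation - it assumes Selenium WebDriver, and the URL, element ids and expected text are all illustrative), a browser-based journey wrapped up as a Nagios-style check might be shaped something like this:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Sketch: run a browser-based user journey and report the outcome as a
// monitoring check (exit 0 = OK, exit 2 = CRITICAL).
public class LoginJourneyCheck {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("https://example.com/login");
            driver.findElement(By.id("username")).sendKeys("monitoring-user");
            driver.findElement(By.id("password")).sendKeys("not-a-real-password");
            driver.findElement(By.id("submit")).click();
            if (driver.getPageSource().contains("Welcome")) {
                System.out.println("OK - login journey completed");
                System.exit(0);
            } else {
                System.out.println("CRITICAL - login journey did not complete");
                System.exit(2);
            }
        } catch (Exception e) {
            System.out.println("CRITICAL - login journey failed: " + e.getMessage());
            System.exit(2);
        } finally {
            driver.quit();
        }
    }
}

The same class could, in principle, be run by your acceptance test suite and scheduled by your monitoring system.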


Disclaimer... 


I'll be honest, it's still early days for us with this approach and while we do have our tests running against production, we are currently only soft-live.  We haven't got Ops monitoring this yet.  However, I am confident this approach is viable and will have many benefits.  Hopefully I can update this blog with positive news and the team's learnings as we progress to hard-live.


Monday, 26 May 2014

Maven's great, before you get annoyed with a feature, find out why it was implemented.

I have always liked Apache Maven, however I recently found a few issues with it that irritated me.  This post was going to moan about them, but after further reading I have come to realise that these issues are not as simple as I first thought.  They are in fact conscious decisions associated with a very complicated problem domain.

First off, why is Maven great?


The best thing about Maven is the standards it imposes on you.  This has the obvious downside of feeling restrictive at times.  However, the benefits are huge.  The standard directory layout, standard build phases and of course the standard dependency management have been around so long now that we tend to take them for granted.  Unless a project uses custom or obscure plugins or overrides many default settings, each project looks and builds in a familiar way.  As mentioned on maven.apache.org, "... [it's] only necessary to learn a small set of commands to build any Maven project".

It almost feels redundant to mention (as it's been the case for so long) but these standards help so much when learning a new code base.  The sight of a pom.xml almost comforts me when browsing a new code base!

Issues/Problems I have had, but have now come to understand

 

Sometimes an artifact in your local repo is ignored - this can be confusing

A while ago I thought I was losing my mind when Maven complained that it could not find an artifact despite it clearly being in my local repository.  It turns out that since I was now using a different Maven repo, it no longer trusted the artifact from the previous repo.  I thought this was very strange, but as Jason van Zyl states: "Artifact A downloaded from X is not the same thing to Maven 3 as A downloaded from Y".  This extra complexity seemed so ridiculous at first, however non-deterministic builds are surely a lot worse.

Arguments against this behaviour:
  • Adds confusion. 

Arguments for this behaviour:
  • To avoid non-deterministic builds.


While pros and cons of this feature can be discussed (I do wonder how often "Artifact A" actually differs in repos "X and Y") one thing is clear... a better error message is definitely required, see here.

Transitive dependencies are the wrong scope - your project can start depending on libraries without you realising it

I didn't realise this until a colleague pointed it out recently: a compile (default) scoped dependency should bring in its own dependencies (transitive dependencies) with runtime scope and NOT compile scope.  See the asterisk in the dependency scope table.  If this were the case, you wouldn't find your code silently increasing its dependencies (i.e. dependencies not explicitly listed in your pom.xml) during development.  This can lead to a bit of a mess that is easy to get into, given that IDEs obviously obey Maven's rules when showing you what you can import.  If you can import it and want to, you do so without realising your code is now tied to another library that it wasn't tied to before.
I now realise that I have lived with this "feature" for so long that I have come to accept that dependencies can become unwieldy without an extremely close eye.

Arguments against this behaviour:
  • Your project can silently start depending on libraries you never explicitly declared.

Arguments for this behaviour:
  • To make importing libraries easier, i.e. to stop you having to explicitly list transitive dependencies when you extend a class that in turn extends a class from another library.

Wednesday, 30 April 2014

SAML feels like a missed opportunity

"The nice thing about standards is that you have so many to choose from" - Andrew S. Tanenbaum

This quote is very appropriate for Single-Sign-On and specifically SAML.  Here I will discuss why SAML is a great protocol for point to point integrations, but can get very complicated very quickly once you take it beyond that.

Single Sign On - Why is it so hard?

Single Sign On (or SSO) can be described very simply; to quote Wikipedia, "...user logs in once and gains access to all systems without being prompted to log in again at each of them".  This boils down to three different entities who trust each other directly and indirectly.  A user enters a password (or some other authentication method) with their identity provider (IdP) in order to gain access to a service provider (SP).  The user trusts the IdP and the SP trusts the IdP, so the SP can in turn trust the user.
This seems so simple, however if you are a service provider and want to integrate with many IdPs (e.g. Twitter, Facebook, educational institutions, corporate institutions, etc.) the number of standards, protocols, frameworks and custom implementations involved makes this very complicated.

I agree with Adrian Cockcroft - you should favour open source where possible as it involves no procurement overhead.  With that in mind, we looked into open source options.  However, the problem here is that even the very best open source libraries are going to be very complicated when the problem domain is so complicated.

Companies that provide SSO solutions have a vested interest in emphasising the complexities of integrating with many IdPs.  Before looking at open source options I was sceptical of these complexities, however having looked into the open source alternatives I now find myself agreeing with their sales pitch.

Protocols Standards and Frameworks

 

OAuth feels as new and shiny to me as the social media IdPs that use it.  However it's not a protocol, it's a standard.  Eran Hammer describes OAuth 2.0 as a "designed-by-committee patchwork of compromises".  All of this means "it's not possible to have a generic OAuth sign-in library that will work with any arbitrary site".

OpenID seems like a great idea, but it hasn't been adopted nearly enough to make it a success.  In fact, I only became aware of OpenID through writing this post - admittedly this might be more down to my poor background knowledge on this subject.

SAML on the other hand has a lot going for it...


  • It's a protocol - This obviously means that you can have a generic library that will work with any SAML compliant SP/IdP.
  • Federations make integrating with more than one entity easier - The UKAMF, for example, provides metadata for each member institution which standardises (and makes discoverable) things like URLs, certificates and which specific SAML variants are supported.
  • It has been very widely adopted - Many educational institutions use it (almost all UK universities, for example).  It is also supported by the likes of Google and Salesforce.

 

SAML's biggest problem - it's too flexible


The User Flow is flexible.

A user can start his/her journey at the SP, where the user is redirected to the correct IdP.  This sounds as simple as a 302 redirect, but it is also necessary to instruct the IdP which SP to take the user to after authentication; this is achieved by including an AuthnRequest in the request to the IdP.
A user can also start his/her journey at the IdP (bypassing the AuthnRequest detailed above), which results in an unsolicited SAML assertion being sent to the SP.  Why do we have to support both of these?

User Data can be sent in different ways

SAML Assertions (user specific data) can be sent from the IdP to the SP via the user's browser in three different ways called Bindings.  
  • HTTP POST Binding - The SAML assertion being submitted in the body of the request.
  • HTTP Redirect Binding - The SAML assertion is encoded as a URL parameter.
  • HTTP Artifact Binding - The user submits a reference to an artifact (via POST or GET) which is then resolved via a back channel.
Clearly, it would be far simpler to just support one of these (preferably not the HTTP Artifact Binding).

SAML assertions can be verified in different ways

An IdP can choose to sign their SAML assertions using Public Key Infrastructure (PKI) or by the direct key model.  The direct key model assumes that you have obtained the IdP's public key through a trusted means.  The UKAMF is currently in the process of dropping support for PKI which is certainly a step in the right direction.

What I'd like to see in future SAML specs

Less.  All of the flexibility included in the SAML spec makes it more complex, harder to implement and reduces its chances of becoming the de-facto standard for SSO.

Sunday, 16 March 2014

Notes & Learnings from Q Con London 2014 - Day 3

Gunter Dueck - The World after Cloud Computing & Big Data

Gunter is a funny and intelligent man with a delivery style I would compare to that of a stand-up comedian.  Some of his content was on dangerous ground, but he also made some very interesting points.


Gunter showed a diagram illustrating the choices we face when creating an IT solution, which I'm inclined to agree with.

He also showed another diagram which I'll re-create in list form.  

  1. Creative.
  2. Skilled.
  3. Rote work.
  4. Robotic work.
Gunter made the point that work starts off at the top of this list and gradually works its way down until eventually it's fully automated.

What I took away... Make sure your work is as close to the top of the list as possible.


Akmal B Chaudri - Next Gen Hadoop: Gather around the campfire and I will tell you a good YARN

This talk was aimed at Hadoop novices, which was perfect for me and also the reason why I didn't "get" the joke in the title.  Akmal is from Hortonworks, a company (similar to Cloudera) that offers professional services to support the Hadoop framework.

Big Data 
It now takes only two days for the world to store as much data as it stored from the beginning of time until 2003.  The three Vs of big Data:

  • Variety - unstructured and not fit for a relational DB.
  • Volume - also keeping for a longer duration.
  • Velocity - generated at very high speed.
All of this data, in its many different forms (Sentiment, Clickstream, Sensor/Machine, Geographic, Server Logs, Text, etc.) is useful to us.

Hadoop
Hadoop is a framework for data-intensive processing.  It's fast for large jobs but not small ones.  Hadoop can be integrated with many of your existing data stores and doesn't have to replace anything to add value.

Hadoop 2.0 consists of the following modules:

  • Hadoop Common
  • HDFS - The Hadoop Distributed File system.
  • YARN - A framework for job scheduling and cluster resource management.
  • Map Reduce - For processing large data sets in parallel.
Pig - An extension of Hadoop that simplifies querying large data sets on HDFS.
Pig Latin - A way to query the data.  Good for people performing ETL.
Hive & HiveQL - Like SQL so less of a learning curve.  This gives a view of the data that is like a relational DB.

HDFS Name Node - Keeps track of where the data is.
HDFS Data Node - Stores the data.

The exact same data can be stored on multiple nodes for redundancy.  This means losing one node will not lose your data.

What I took away... Hadoop is not a single framework to learn, it's many related frameworks that seem to be changing rapidly.

Joakim Recht - The Mean & Lean Pipeline

Joakim talked about how Tradeshift has changed as a company and how its build pipeline changed with it.  He summarized the company's early days (when it was a startup) as:

  • Not enough time.
  • Not enough people.
  • Not enough money.
  • Too many crazy ideas.
  • Too many requirements.

Joakim said that the perfect build pipeline should have the following characteristics:

  • Stops bad code going live.
  • Fast.
  • Ensures consistency
  • Fully automated
  • Creates nice screens and buttons to click.
  • Allows you to talk about it on conferences

...however, setting all of this up takes time and effort!

The development/infrastructure/build-pipeline looked like this in 2010:


  • Java backend.
  • All on Amazon.
  • Drupal front end
  • Subversion - no branches - everything on HEAD.
  • Hudson to run tests.
  • Deploys to prod - semi automatic.
  • Little docs but Lots of tests (which define behaviour)
Moved from Subversion to git.

In an effort to improve UI tests, they migrated away from "fragile" Selenium to Geb and Spock (Groovy based BDD framework)

By mid-2011 the number of teams had grown, which made keeping track of build status on branches hard.  They custom-made a nice build screen which collected build info.  The code for this is not in an open-sourceable state!

New Office in San Francisco
Starting the new office could have gone better for Tradeshift.  I say this because the "new office" slide featured an animated gif of someone putting a knife in a toaster and then being showered by burning hot components in the ensuing explosion.  Having distributed teams working on the same code base sounded like a big cultural change which demanded technical changes to their build pipeline.  More consistency and rules were enforced with Jenkins.  Along with this came mandatory code reviews on github (some requiring a specific group of reviewers, e.g. DB changes).

Other recent issues can be found in this blog post.

What I took away... The better your build pipeline, the more you have to invest in it up-front.  As teams increase in size or become more distributed, the better the build pipeline has to be.


Glen Ford - Lean Under Pressure

Glen (chief architect at Zeebox) clarified the meaning of lean and pressure:

Pressure "The use of persuasion, influence or intimidation to make someone do something."

Lean (in software) -

  • Eliminate waste
  • amplify learning
  • decide as late as possible
  • deliver as fast as possible
  • empower the team

All of the factors above can create tensions within the team as that is where the compromises on the above have to be made.

Tensions within a Startup (I personally think this can be extended to any work) - What you think you need vs What you can get away with.

The Story

  1. Two founders have an idea.
  2. They create a proof of concept.
  3. Investment gained.
  4. To create the idea, they hire (brings challenges).
  5. To deliver as quick as possible, they split the big team up (divide and conquer). 
  6. Inconsistencies creep in with "the idea".
  7. They then slimmed down (removed some management).
  8. Set a goal which was unattainable - Managers like doing this generally (according to Glen), but in a startup there is a notion that you work very hard.
  9. The unattainable goal inevitably results in failure.
  10. Team demoralised - some burnt out.

They go back to the drawing board and try again - this time with better results.


  1. Teams re-organised - cross functional product teams to deliver vertical slices of the idea.
  2. Fostered a culture of urgency but NOT panic (attainable goals).
  3. Increased learning and knowledge sharing with things like lightning talks.
Other things which helped:

  • You build it, you run it (no ops team).
  • Encourages delivering small improvements - Understanding operational costs and being open with them.
  • Garden architecture - However, you must accept the need to prune and shape as you go along.


Final point: Don't be disheartened.  Learn and adapt.  Enjoy the journey as it never ends.

What I took away... Tensions in a team are inevitable as it's where the many compromises have to be made. Cross functional product teams seem to work everywhere.


Volker Pacher & Sam Phillips - How Shutl delivers even faster using the Neo4J, the Graph Database

A very interesting talk on the architecture of Shutl.  Their business provides a point-to-point delivery service which is an alternative to the standard "Hub & Spoke" model (i.e. big distribution centres where packages are sent to and dispatched from) offered by the big companies such as Royal Mail.

Generating a quote for a user involves generating many

Russell Miles - PaaS - Epic search for Truth

Russell explained that a Platform as a Service can promise (and deliver) a huge number of services for you to take advantage of.  PaaS can offer the following:-

  • Database
  • Integration
  • User Management
  • Development workflow
  • Build services
  • Health Monitoring
  • Multitenancy
  • etc etc etc
All of this looks and sounds so great that... "Anyone on the golf course will buy it!"

The risk of PaaS
Libraries will do what you want but frameworks will say "work my way".  A PaaS is a framework which inevitably reduces your flexibility.  A PaaS needs to be as changeable as the software you are running on it. 

How to stay flexible with PaaS?
  • Ensure that the barrier to entry for your PaaS is low.
  • FaaS - Fail as a Service - This means apply Darwinian evolution.  Experiment with multiple PaaS as opposed to just using one for everything.

What I took away... PaaS offers many benefits, but beware of the hidden coupling to your software.

Sunday, 9 March 2014

Notes & Learnings from Q Con London 2014 - Day 2

Tim Lister - Forty Years of Teams

Or... "Forty years of playing well with others".  A really uplifting talk by Tim introduced us to day two, where we looked at Tim's very interesting career.  The alternative title was given because "one person cannot do anything substantial" and even if you could it's more satisfying saying "we did this!"

His IT career started by chance, when he found himself living with Fred Wang at university.  Fred's father, An Wang, was the founder of Wang Laboratories, which meant he and Fred had access to more computing power than his entire university.  They used this to analyse data from their fruit fly breeding as part of his Biology degree.

Tim was also lucky enough to be mentored by computer scientist Michael Jackson who was a "software poet" and produced code so beautiful and lucid, it was as if it were written by aliens.

The term "Best practices" can undermine the intellectual work.  I interpret this point as not blindly following rules regardless of circumstances e.g. "all code released to production must be performance tested".

Tim (with his friend Tom DeMarco) wrote the very popular book Peopleware, now in its third edition.  On the subject of writing: "I think I am a sharper thinker when I write and then re-read it".  I couldn't agree more.

What I took away...Writing helps you gather your thoughts.  If you don't get innate pleasure from IT, move on.

Michael Nygard - Manoeuvrable Architecture

I love aviation so was pleased to see Michael relate the aircraft design that wins dogfights to the architecture that succeeds in business.

  • A plane that can fly fastest and highest will not necessarily win a dog fight - According to military strategist John Boyd.
  • The plane most likely to win is the one that can change its speed and direction the quickest.
  • In military strategy, you need to set the tempo of engagement so your enemy adjusts to you.
  • In IT, this means you need to be able to scale up and down fast.  
  • The F16 was almost given a ladder which would have hampered its ability to accelerate/decelerate -   Make sure your projects don't fail due to incremental compromises.
One of Boyd's key concepts is the OODA loop... Observe, Orient, Decide, Act.  The quicker you can do this, the higher the chance of success.

The concepts Michael discussed:

Plurality - It's ok to have multiple services in multiple styles with multiple versions existing at the same time.  Darwinian evolution can be applied to determine what survives.

Use URIs with abandon - This has many advantages.  For example, the terms and conditions on any website frequently change.  If a user registers on a given date, the T's and C's URI can be stored against that user keeping a record of what they agreed to.  Also, the URI stores the reference and complete context (as opposed to a mere id).

I spoke to Michael at the end of his talk as I was keen to get a book recommendation on some of his subjects, his response was "I really should write one".

Dan North - Deliberate advice from an accidental career

Contributor to 97 Things Every Programmer Should Know and creator of first ever BDD framework JBehave.  Much like the day's first talk, Dan reflected on his career so far.  His talk was divided into the following points.

  • Today we are going to learn about DB restores - I missed the start (was talking to Michael, see above) but gathered that Dan once shut down a production DB.  What was memorable about this was his manager's reaction... calmly saying "today is the day we learn about database restores".
  • "It's a fantastic feeling to have someone believe in you" - This sounds corny but I can definitely point to a few things people have said to me that have had a huge impact, the same goes for Dan.
  • They're not pairing, they're just helping each other... at their own desks.  Some developers will resist collaboration (pair programming), but when people are prepared to meet each other half way, collaboration can flow.
  • The stacktrace - Code for the person who supports it... always!  A colleague tried to interpret the root cause of a stacktrace and found it hard despite being told by Dan that it was work in progress.  
  • The Product - Think about the value you add to the product when you code.  Ugly code in live could be giving the business value, beautiful code in development offers none.  
  • The Leader - "You'll have to teach me" - A culture of not being afraid to admit you don't know something is very healthy.  Put the organisation ahead of your ego.
  • Don't hire yourself - We are not rational people and we are biased to like people that are similar to ourselves.  Hiring is one of the most important things in business so read Peopleware (see above).
  • You are what you do, not what you say.  This is especially relevant when under pressure.

Trisha Gee - HTML5, Angular.js, Groovy, Java, MongoDB all together - what could possibly go wrong?

Trisha was able to create all tiers of a web application in-front of a few hundred IT professionals.  She was brave to take this on and skilled to pull it off.

What I took away...Get your IDE keyboard shortcuts sorted!  Trisha was so quick at using hers I could hardly keep up at times. AngularJS looks awesome.

Adrian Cockcroft - Migrating to Microservices

Probably one of the best talks of QCon.  I just wished I could press the pause button as keeping up was hard at times.  Warning - there is a lot of relevant info missing from this.

A brief timeline of Netflix's architecture:
  • 2009 - People said Netflix were crazy.
  • 2010 - What Netflix is doing won't work.
  • 2011 - It only works for unicorns like Netflix (This explains Adrian's twitter avatar)
  • 2012 - OK but we can't do that.
  • 2013 - We're on our way using Netflix OSS code.

A few principles Adrian shared with us:

  • Speed to marketplace is paramount - Netflix learns fast.
  • High trust low process facilitates this.
  • Don't create your own, when something already created will do the job.  Don't create your own voltage regulator for your servers...
  • Favour open source - There is no procurement overhead.
  • Reverse Conway's law - Decide on the architecture you want and then create/structure the teams.
Microservices:

  • Additive model - don't edit or drop an existing service, just create a new version of it.
  • Roll out your changes gradually with small sections of users.  Netflix use the EU as their testing ground!  


Further reading:
Adrian recommended the book Lean Enterprise which is due to be released in May of this year.  To convert from your old architecture read Refactoring Databases.  To look at how some of these new data stores really perform, have a look here.

Bodil Stokke - What Every Hipster Should Know About Functional Reactive Programming

Bodil created an interactive game using the RxJS set of libraries during the talk... very impressive.  Source code available here.  

Linda Rising - Organizational Change Myths

The author of Fearless Change spoke to us about the following myths....

  • "Smart people are rational" - They are definitely not as rational as you'd think.  See Thinking fast and slow
  • "Good always triumphs evil" - Just because your idea is good, doesn't mean it will succeed.  You have to fight for your idea.  Linda recommends giving people food in meetings to convince them.
  • "If I just had enough power I could make people change" - If you had the power, all you can hope for is getting compliance, this is not change.  For change, people need to be convinced.
  • "Skeptics, cynics and resistors must be stupid or bad" - Treat these people as valued partners in the change effort, not enemies.
  • "I'm a smart person, I don't need help from others.  After all it's my idea." - The change must not be all about you.
A lot more of this good stuff available in Linda's book Fearless Change.

Thursday, 6 March 2014

Notes & Learnings from Q Con London 2014 - Day 1

I was lucky enough to go to Q Con London 2014.  Here is an update of my experience from day 1.

Damian Conway - Life, The Universe and Everything

Damian is an amazing speaker and made his already fun, interesting, geeky subject even more fun, interesting and geeky with his great presentational skills.  He showed us Perl code (and later Klingonscript) that not only implemented the Game of Life but also (kind of) disproved Maxwell's Demon.  We were also shown an example of a Turing complete machine, built by Paul Rendell, which was implemented using the Game of Life.  The video is also on YouTube.

This talk reminded me of another cellular automaton I encountered many years ago whilst studying Genetic Algorithms at university.  I am pleased to say that the creator (David Eck) has converted the original Java applet version of his program Eaters, which so impressed me, into JavaScript, available here.

What I took away...Coding can and should be fun.

Daniel Schauenberg - Development, Deployment & Collaboration at Etsy

Etsy have around 150 engineers working on a monolithic PHP application which forms etsy.com.  Although this sounds like an old-fashioned architecture, their delivery pipeline is very impressive.  They manage 50 deploys a day.  Other points:-
  • They favour small changes.
  • 8 people can deploy different sets of changes in one build (that's their magic number).
  • The deploy process, the "Deployinator", has only two buttons - no ambiguity, everything automated.
  • They use config to switch features on after the code is safely in production (i.e. soft live releases).
  • Developer VMs can run the entire stack and closely mimic production.
  • They make use of Linux Containers (LXC).
  • On day 1 of working for Etsy you deploy a change - this seems like a great idea to me.
  • Tons of dashboards which are monitored after releases.

I asked Daniel at what point Etsy decide to roll a build back and stop debugging an issue; he replied that there is no rollback (the closest thing to a rollback is to revert commits and create a new build).  This doesn't seem like a good idea to me.  I think each build should be rollbackable.  Debugging an issue calmly after you have restored service with a rollback is surely better than a forward-only strategy.

What I took away...Continuous delivery can be achieved without a perfect architecture if you have the right processes and culture in place.  Etsy should be applauded.

Uwe Friedrichsen - Fault Tolerance & Recovery

Uwe explained a lot of common fault tolerance patterns, including...
  • Timeouts.
  • Circuit breaker.
  • Shedding load.
Uwe also explained the Netflix library Hystrix.  

What I took away... There are a lot of good patterns out there for fault tolerance, a lot of them are discussed in Release It!

Graham Steel - How I learned to stop worrying and trust crypto again

Graham started his talk by saying that "the paranoids were right" - the NSA has been interfering with crypto standards.  However, all is not lost, as properly implemented crypto systems can be relied on.

Here's what makes a good crypto API:-


  • Open and subject to review - More likely to have bug fixes and patches applied.
  • Supports modern cryptographic primitives.
  • Good key management - This can be the Achilles heel of a crypto API.
  • Mistake resistant - Some think people should be free to make mistakes.  I don't!
  • Interoperable.
Examples of crypto APIs.... My notes are sparse here...

  • Big success, widely used.
  • Old - Current version 2.4 was released in 2004.
  • Implementation is closed, but the standard is open.
  • The API is closed (under Oracle's control), however some providers, like Bouncy Castle are open source.
  • Has around fifty "Fix me" comments in it!
  • Contains the NSA backdoor random number generator which you are free to use or not.
  • Work in progress to provide encryption in clientside javascript - Lots of challenges here!
Lessons
  • If you roll your own cryptography implementation, you are guaranteed to have a flaw in it.
  • TLS has been around for years and they found issues just 3 days prior to this talk!
  • Applied Cryptography is a dangerous book as it encourages you to roll your own cryptography.
What I took away...  A new understanding of how much I don't know about security and cryptography.

Dave Farley - Continuous Delivery

According to Dave... "Our highest priority is to satisfy the customer through early and continuous deliveries of valuable software."


  • Finished means in production.
  • Break down silos, Business, Testers, Developers, Operations, all need to collaborate more to make it work.

Feedback loops - starting with smallest and quickest:
  1. Unit Test - Code.
  2. Executable - Build.
  3. Idea - Release.
Principles:
  • Keep everything in version control - I agree.  Any tool that is UI-only is to be avoided in my opinion.
  • Automate everything.
  • Build quality in.
  • If something is painful, do it more often.  If you release every 6 months because it's painful, the lesson is release every 2 months, then month, then week etc etc.  Don't start releasing every year, it will get worse.
  • Manual testers are far better at exploratory creative testing than repetitive regression.

Accidental benefits:
  • When debugging a long-standing defect, if version control and deploy are fast and reliable, you can determine which build introduced it.  Binary chop through a list of builds until you have narrowed it down.
  • Automating everything can introduce auditing for free.
What I took away... Dave Farley knows this subject inside and out and his book, Continuous Delivery, is highly recommended.


Paul Simmonds - Identity is the new currency

Data is going from: 
  1. Internal.
  2. De-perimeterized.
  3. External Collaborations.
  4. Secured Cloud.
Interested or working with security on the cloud... you must read:

Friday, 28 February 2014

Client Side vs Server Side Session

We recently looked at replacing our legacy session management system with a new one.  During this analysis, we came close to choosing a client side session but eventually concluded server side was better for us.  Here's why...

Client side session


In this model, all session state is stored in the client in a cookie.  The benefits of this are that you don't need to worry about persisting and replicating state across nodes, and session validation is lightning fast since you don't need to query any data store, which makes it super scalable.  The session cookie must obviously be tamper-proof (to prevent people creating a session of their choice), which is achieved by signing the cookie using asymmetric cryptography.

The signing of a cookie value uses the private key; the validation uses the corresponding public key.  Our idea was to keep the private key as private as possible by storing it in memory only.  Each node would create a new private and public key pair on startup, with the public key being persisted in a replicated data store.  This means (providing replication is working) any node is able to validate a session created (signed) by any other node.
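By way of illustration only (this is a minimal sketch, not our production code - it assumes Java 8's java.util.Base64, an RSA key pair and a "payload.signature" cookie layout, all of which are choices for the example rather than requirements):

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

// Sketch of signing and validating a client-side session value.
// Each node generates its own key pair on startup and would publish the
// public key to the replicated data store (not shown here).
public class SessionSigner {

    private final KeyPair keyPair;

    public SessionSigner() throws Exception {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(1024); // key size is illustrative
        this.keyPair = generator.generateKeyPair();
    }

    // Returns "payload.signature", both Base64 URL-safe encoded without padding.
    public String sign(String sessionJson) throws Exception {
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keyPair.getPrivate());
        signer.update(sessionJson.getBytes("UTF-8"));
        Base64.Encoder encoder = Base64.getUrlEncoder().withoutPadding();
        return encoder.encodeToString(sessionJson.getBytes("UTF-8"))
                + "." + encoder.encodeToString(signer.sign());
    }

    // Any node can validate a cookie using the signing node's public key.
    public boolean isValid(String cookieValue, PublicKey publicKey) throws Exception {
        String[] parts = cookieValue.split("\\.");
        byte[] payload = Base64.getUrlDecoder().decode(parts[0]);
        byte[] signature = Base64.getUrlDecoder().decode(parts[1]);
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(publicKey);
        verifier.update(payload);
        return verifier.verify(signature);
    }

    public PublicKey publicKey() {
        return keyPair.getPublic();
    }
}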


Benefits of a client side session:

  • Low Latency - Validating and creating sessions is lightning fast as it doesn't need to hit the data-store.
  • Fewer points of failure - DB is only required on startup of each node.
  • Highly scalable since it's stateless (slight caveat of replicated public keys).
  • Possibility of distributing validation away from application nodes.  If the public keys are exposed via an API (provided by the session nodes), any other system can validate a session without having to make a synchronous call to them.  This makes session validation even quicker in a distributed system.

Cons of a client side session:

  • Sessions cannot be terminated (there are workarounds to this which I'll get to).
  • Logout action is not fully implemented - The session cookie can be dropped from the browser, but it would still work if resubmitted.
  • Implementation and user details are exposed since everything is stored in a cookie.
  • Cookie size is greater - Sounds trivial but can cause huge headaches (see below).
  • No 3rd party frameworks available (none in Java we could find anyway).

Terminating a client side session
This can be achieved by setting an absolute timeout on session create to a long period of time (say 1-2 days) which, when exceeded, will cause the session to be invalid and force the user to re-authenticate. 

Keeping cookie size to a minimum
This is a trade-off between complexity (in terms of parsing and maintaining sessions), security (strength of signature) and cookie size.  Maintainability is important given that your state will be living "in the wild" in your many clients' browsers.  If you ever want to add, remove or change an attribute in your users' sessions keep in mind you will need to support the two versions of your cookie as you deploy your changes.



The simplest thing to do with your data is to Base64 encode it (be careful of the padding character - the default is a non-cookie-friendly equals sign) using a format such as JSON or XML.  Using such a format comes with the benefit of readability and maintainability.
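For example (an illustrative snippet, assuming Java 8's java.util.Base64), the URL-safe encoder can drop the troublesome padding character entirely:

import java.util.Base64;

public class CookieEncoding {
    public static void main(String[] args) {
        String session = "{\"userId\":42,\"expires\":1393545600}"; // illustrative payload
        // The standard encoder pads with '=' and can emit '+' and '/'.
        System.out.println(Base64.getEncoder().encodeToString(session.getBytes()));
        // The URL-safe encoder without padding avoids '=', '+' and '/' altogether.
        System.out.println(Base64.getUrlEncoder().withoutPadding().encodeToString(session.getBytes()));
    }
}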

To get the cookie size down to a minimum, a list with a cookie-safe delimiter (e.g. a dot), effectively encoded/compressed, could be the way to go.

The more secure the signature, the more space it will consume in the cookie.  For example, signing 46 bytes of ASCII characters gives:

87 byte signature using a 512 bit key to sign.
172 byte signature using a 1024 bit key to sign.


Server Side Session


This is the traditional model which relies on storing all state in a replicated server side data-store.  The cookie is tiny as it only stores a reference to data stored on your server.

Benefits of a server side session:
  • Full control of session - Can terminate a session on demand instantly.
  • Existing frameworks can reduce amount of custom code to develop/support - See Apache Shiro
  • Cookie size is much smaller.
  • Implementation and user details are not exposed since only reference to session data is stored in the cookie.
  • Can store as much session-related data as you like without fear of increasing cookie size.
  • Implementation of session management can change easily since we store the session state ourselves and it is not “in the wild” (i.e. not stored on many different browsers).

Cons of a server side session:
  • More points of failure - If the DB is unavailable, no sessions can be created, touched or validated.
  • More overhead in creating and touching session (a DB read, write and replication is required). This can be mitigated by applying the DB write asynchronously.
  • No future potential for other applications to validate a session without having to call the session nodes.

The points that swung the balance to server side


Session Timeout.  We decided that after 30 minutes of a user's inactivity, we are obliged to log them out if they have elected not to "Remember me".  Relying on the user closing the browser, we decided, is not enough.  This means you need to store a rolling timeout which gets updated on each page view (this is in addition to the hard timeout discussed above regarding session termination).  This requirement makes the client side session a lot more complicated since, each time a page is viewed, the session must be updated, re-signed and dropped (Set-Cookie) back to the user.

Being able to kill sessions can be useful.
There are rare occasions when we want to kill a user's session.  Sometimes this is for a business reason, for example if we suspect that a user has shared their credentials with other people.  It is also reassuring to know we could terminate user sessions immediately in case of a security breach, e.g. hijacked accounts.  This just isn't possible with a client side session.  When you start trying to solve this issue with server side state, it is my opinion that you end up with the worst of both worlds (i.e. the complexity of client side signing and server side replicated state).

Cookie size
If cookies get too big the impact can be huge, worst case your users see a blank page with something in your stack returning a 431 status code.  This is due to the request headers exceeding a certain limit.  How do you know when you'll exceed the limit?  How do you know what your limit is?  In a distributed system with multiple entry points, your maximum http header size might vary.  This could cause (seemingly) random pieces of functionality to fail for users with big cookies.  


On balance, the server side session is the solution for us.  However, if your organisation has different requirements, the client side session is worth a look.

Sunday, 9 February 2014

Lessons learned from a connection leak in production

We recently encountered a connection leak with one of our services in production.  Here's what happened, what we did and the lessons learned...


Tests and Monitoring detected the issue (silently)

Our automated tests (which run in production for this service) started failing soon after the incident occurred.  We didn't immediately realise, since they just went red on the dashboard and we carried on with our business.  Our monitoring detected the issue immediately and also went red, but crucially didn't email us due to an environment-specific config issue.

The alarm is raised

A tester on the team realised that the tests had been failing for some time (some time being 4 hours... eek!) and got us to start investigating.

Myself and another dev quickly looked into what was going on with our poorly dropwizard app.  We could see from the monitoring dashboard that the application's health check page was in error.  Viewing the page showed us that a downstream service (being called over http) was resulting in the following exception...

com.sun.jersey.api.client.ClientHandlerException:
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool


My gut reaction was that this was a timeout issue (since it usually is) and connections were not being released back to the connection pool after either slow connections or responses.  We saved the metrics info from the application and performed the classic java kill -3 to get a thread dump for later reading.  We tried checking connectivity from the application's host to the downstream service with telnet which revealed no issue.  We also tried simulating a http request with wget which (after a bit of confusion trying to suppress ssl certificate checking) also revealed no issues.  With no more debug ideas immediately coming to mind, restoring service had to be the priority.  A restart was performed and everything went green again.

Looking in the code revealed no obvious problems.  We had set our timeouts like good boy scouts and just couldn't figure out what was wrong.  It clearly needed more investigation.

We see the same thing happening in our Test environment... Great!

With a lot more time to debug (what seemed like) the exact same issue we came up with the idea of looking at the current connections on the box....

netstat -a 

This revealed a lot of connections in the CLOSE_WAIT state.

A simple word count....

netstat -a  | grep CLOSE_WAIT | wc -l 

...gave us 1024, which just so happened to be the exact maximum size of our connection pool.  The connection pool was exhausted.  Restarting the application brought this down to zero and restored service, just as it did in production.

Now we can monitor our leak

With the command above, we were able to establish how many connections in the pool were in the CLOSE_WAIT state.  The closer it gets to 1024, the closer we are to another outage.  We then had the great idea of piping the output of this command into mailx and spamming the team with the number every few hours.

Until we fix the connection leak, we need to keep an eye on these emails and restart the application before the bucket overfills (i.e. the connection pool is exhausted).

What caused the leak?

We were using Jersey, which in turn was using HttpComponents 4, to send an http request to another service.  The service being called was very simple and did not return any response body, just a status code, which was all we were interested in.  If the status code was 200, the user was logged in; anything else we treated as unauthorised.

The code below shows our mistake....

public boolean isSessionActive(HttpServletRequest httpRequest) {
   ClientResponse response = jerseyClient
      .resource(url)
      .type(APPLICATION_JSON_TYPE)
      .header("Cookie", httpRequest.getHeader("Cookie"))
      .get(ClientResponse.class);
  // Note: the response is never closed here - that's the mistake.
  return response.getStatus() == 200;
}


We were not calling close() on the ClientResponse object, which meant the response's input stream was never closed, which in turn meant the connection was never released back into the pool.  This was a silly oversight on our part.  However, there's a twist...


We had run performance tests which in turn executed this method a million times (far more invocations than connections in the pool) without us ever hitting any issues.  The question was, if the code was wrong, why was it not failing all the time? 

For every bug, there's a missing test

The task of creating a test which would fail reliably each time due to this issue was 99% of the effort in understanding what was going on.  Before we identified the above culprit, a tester on our team, amazingly, found the exact scenario which caused the leased connections to increase.  It seemed that each time the service responded with a 403, the number of leased connections would increase by one.

Our component tests mocked the service being called in the above code using a great tool called wiremock, which lets you stand up a web server under your control and have it respond to http requests exactly as you specify.  When we examined the number of leased connections after the mocked 403 response, frustratingly it did not increase.  There was something different between our mocked 403 response and the actual 403 response.

The difference was that, in a deployed environment, we called the webservice through an http load balancer that was configured to return an html error page for 4xx responses.  That error body gave the ClientResponse object an input stream which needed to be closed.
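
For illustration (the url and html below are invented, not our real config), a wiremock stub along these lines reproduces what the load balancer was doing - a 403 that carries an html error body:

import static com.github.tomakehurst.wiremock.client.WireMock.*;

// Hypothetical stub: a 403 with an html body, like the load balancer's error page.
// Stubbing the 403 with no body did not leak; adding the body reproduced the leak.
stubFor(get(urlEqualTo("/session/check"))
   .willReturn(aResponse()
      .withStatus(403)
      .withHeader("Content-Type", "text/html")
      .withBody("<html><body>Forbidden</body></html>")));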

As soon as we adjusted our wiremock config to return a response body, we observed the same connection leak as in our deployed environment.  Now we could write a test that would fail due to the bug.  Rather than stop there, we concluded that there was no point checking the leased connection count after that one test alone, as next time the leak could appear in a different place.  We added the following to our test base class:

// Field on the test base class holding the count captured before each test.
private int leasedConnectionCounts;

@Before
public void initialiseLeasedConnectionsCount() throws Exception{
   leasedConnectionCounts = getLeasedConnections();
}

@After
public void checkLeasedConnectionsHasNotIncremented() throws Exception{
   int leasedConnectionsAfter = getLeasedConnections();
   if (leasedConnectionCounts != leasedConnectionsAfter){
      fail("Expected to see leasedConnections stay the same but increased from: " + leasedConnectionCounts + " to: " + leasedConnectionsAfter);
   }
}
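
The getLeasedConnections() helper isn't shown above.  One possible implementation, assuming the test can get hold of the Apache http connection manager sitting behind the Jersey client (the connectionManager field here is illustrative), would be:

private int getLeasedConnections() {
   // PoolStats reports the leased, available and pending connection counts for the whole pool.
   return connectionManager.getTotalStats().getLeased();
}

Anything that reads the leased count would do just as well, e.g. scraping it from the metrics page on the admin port; the important part is comparing the number before and after every single test.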

If we hit another connection leak, we'll know the exact scenario and it shouldn't get anywhere near production this time.  Now we have a failing test, let's make it pass by fixing the connection leak....

public boolean isSessionActive(HttpServletRequest httpRequest) {
   ClientResponse response = null;
   try{
      response = jerseyClient
         .resource(url)
         .type(APPLICATION_JSON_TYPE)
         .header("Cookie", httpRequest.getHeader("Cookie"))
         .get(ClientResponse.class);
      return response.getStatus() == SC_OK;
   }
   finally {
      // Always close the response so the underlying connection is released back to the pool,
      // even though we never read the body.
      if (response != null){
         response.close();
      }
   }
}

Lessons Learned


Make sure your alerts alert

Until you have proven you are actually getting emails to your inbox, all your wonderfully clever monitoring/alerting may be useless.  We lost 4 hours of availability this way!

I think this is so important that I'm tempted to say it's worth deliberately causing an incident, when everyone is paying attention, just to prove you have all the bases covered.  Netflix do this routinely, and at random times too.


A connection pool almost exhausted should fire an alert

If we had received an alert when the connection pool was at around 75% capacity, we would have investigated our connection leak before it caused an outage.  This has also been mentioned by bitly here.  I will endeavour to get this in place ASAP.
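
As a sketch of what this could look like in dropwizard (the class name, threshold and wiring below are illustrative, and you would still need something watching the health check page or a metrics gauge to actually send the alert):

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.pool.PoolStats;
import com.codahale.metrics.health.HealthCheck;

public class ConnectionPoolHealthCheck extends HealthCheck {

   private final PoolingHttpClientConnectionManager connectionManager;

   public ConnectionPoolHealthCheck(PoolingHttpClientConnectionManager connectionManager) {
      this.connectionManager = connectionManager;
   }

   @Override
   protected Result check() {
      PoolStats stats = connectionManager.getTotalStats();
      // Report unhealthy once 75% of the pool is leased - well before total exhaustion.
      if (stats.getLeased() >= stats.getMax() * 0.75) {
         return Result.unhealthy("Connection pool nearly exhausted: " + stats.getLeased() + "/" + stats.getMax() + " leased");
      }
      return Result.healthy();
   }
}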


Learn your tools

Getting a thread dump by sending your java process a SIGQUIT (i.e. kill -3 [java process id]) is fine, but it relies on you having access to the box and permission to run it.  In our case (using dropwizard and metrics) all I needed to do was hit /metrics/threads on the admin port to get exactly the same data formatted as json.

Other Lessons?

If you can spot any other lessons learned from this, feel free to comment.

EDIT: See here for a more detailed explanation of how to test for connection leaks, this time with all the code!

Friday, 31 January 2014

What I learnt from the Phoenix Project

Having just read this book - The Phoenix Project - A Novel About IT, DevOps and Helping your Business Win - Gene Kim, Kevin Behr, George Spafford -  I'm keen to express exactly what I learnt so I don't forget its important messages.

The book follows Bill, a "Director of Midrange Technology Operations" who is forced into the more senior role of "VP of IT Operations".  As soon as he is given this job he's on the back foot, trying to work out why stuff keeps breaking.  As the book goes on, we get the impression of an IT department stumbling from disaster to disaster in complete and utter disarray.  Bill, guided by his khaki-pants guru Erik, turns the place around.

Here's what I learnt...

Find your bottlenecks - The book's message here is that any improvement made anywhere besides the bottleneck is wasted.  This makes perfect sense and comes from comparing IT to a manufacturing pipeline and applying the Theory of Constraints.  Not only does it make perfect sense, it's easy to remember.  The really hard part is identifying the bottlenecks in the first place.  I think I now realise why our project manager is so insistent that we keep our project management view (Rally) up to date with the status of stories.  If you don't reliably track where things are and for how long, how do you identify your bottlenecks?  In the absence of proper stats and figures, I would guess it would come down to (unreliable) anecdotal evidence.  Fixing them, and not creating more as you go, is another story.

Prioritize - This comes down to saying no (or at least "not now") and actually choosing between competing deliverables.  If you never decide or prioritize, the ultimate extreme manifests as a huge amount of WIP, which creates overhead (in managing all the stories), inefficiencies (from switching from one thing to another) and a feeling of stress from never being on top of things.

Is this work absolutely necessary? - A huge portion of the work on the IT department's backlog was to fix auditing issues.  It emerges later that they don't need to be fixed, since manual actions already in place prevent them from becoming real issues.  I think the lesson here is... don't dictate the solution to a problem.  As soon as you dictate the solution, that's what will be built.

Respect change control - In my IT career I have always found the form filling and approval process for changes/releases a pain.  However, the chaos depicted in this book, of no change control at all, makes me happy to suffer this pain and make sure all the details submitted are accurate.  This was illustrated very well in the book, where an incident was blamed on one change (which, when rolled back, made things even worse) when the actual culprit was a change no one knew about.

Big bang releases are to be avoided at all costs as they are risky and don't release business value quickly enough.

The other lessons from the book were equally relevant but perhaps less profound to me.

  • Key man dependencies are bad as they introduce bottlenecks.
  • Blame culture slows the fixing and learning processes.
  • Us and them culture between IT and business obviously isn't good.

If you haven't read it, do.  Aside from its important lessons, it's a lot of fun reading about massive IT catastrophes with the comfort of not being involved.

Wednesday, 1 January 2014

The more people contribute to a team's output, the closer they should be to the team.

Everyone contributing to an IT project ideally needs to be on the team responsible for delivery.  When this isn't possible, the more they contribute, the closer they need to be to the team.

The above point probably seems obvious to the point of being redundant to most people.  I'm sure that many teams (like mine) are structured with a healthy enough mix of skills that they can be mostly self-sufficient.  With our team of testers, developers, infrastructure people and so on, I too would have thought we had this base more than covered.

However, beware that contributors to a team's output come in many different guises.  I feel that we recently suffered on a project due to a key contributor being very distant from the team.

The project involved adding a new user journey to ft.com.  The UX team had designed a user journey, in isolation from the development team, which we were to implement.  The user journey that had been created was great; however, the team had another idea which involved less technical effort and might get us into production (with that crucial real user feedback) sooner.

Thankfully, the product owner, the project manager (and I think the UX team) were all open-minded about considering the alternative user journey the team proposed.  After all, the original flow could be added later as an enhancement after go-live.  The team set off in development with the aim of keeping both options open for a while, but the path of least resistance was clearly the original option and it was up to the team to persuade people otherwise.

Sadly, mockups of the alternative flow were never created.  I suspect that, as a team, we felt out of place creating them ourselves as none of us wore the UX badge.  We could have asked the UX team ourselves, but we didn't know them very well, they sat on a different floor and they were outside of technology; I suspect there were too many awkward hurdles for us to overcome.  Or, to reference Conway, the "organization's communication structure" did not (easily) allow it.

Eventually a decision had to be made and the original option was chosen.  After all, the original option was more "known" to people as it had been presented numerous times with mockups. 

As mentioned previously, the original (and chosen) user journey was good and I'm very pleased to say that it's currently live and being used well.  But I think we missed a trick by not giving the alternative user journey a fair chance.  This would have been more likely to happen had we been closer to the UX team.