How to test for connection leaks

Connection leaks are easy to introduce and hard to spot.  They can go undetected for a long time and then have disastrous consequences.

It's not enough to fix a connection leak or to code defensively against them; you must write tests that would fail in their presence.  Here I will discuss a number of techniques that could apply to many different languages and frameworks.  I have also provided some concrete examples on GitHub which make use of Dropwizard, JerseyClient and WireMock.

What is a connection leak?


Each time system A talks to system B over a protocol like HTTP, it requires a TCP/IP connection to send and receive data.  Connections can be reused for different requests via a connection pool, or a new connection can be established for each request.  Once system A has finished with the connection, it needs to either return the connection to the pool or close it.  If this doesn't happen, you have a connection leak.
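To make the definition concrete, here is a minimal sketch of how a leak typically creeps in.  It does not use a real HTTP client: FakeConnection, leakyCall and safeCall are hypothetical names invented for illustration, and the "connection" is just a counter of how many are currently open.  The leaky version only forgets to close on the failure path, which is exactly why such bugs tend to survive happy-path testing.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LeakDemo {

    // Hypothetical stand-in for a real connection; it only tracks how many are open.
    static class FakeConnection implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        FakeConnection() {
            OPEN.incrementAndGet();
        }

        String read(boolean fail) {
            if (fail) {
                throw new IllegalStateException("simulated I/O failure");
            }
            return "response body";
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    // Leaky: close() is skipped whenever read() throws.
    static String leakyCall(boolean fail) {
        FakeConnection connection = new FakeConnection();
        String body = connection.read(fail);
        connection.close();
        return body;
    }

    // Safe: try-with-resources closes the connection on every path.
    static String safeCall(boolean fail) {
        try (FakeConnection connection = new FakeConnection()) {
            return connection.read(fail);
        }
    }

    public static void main(String[] args) {
        try {
            leakyCall(true);
        } catch (IllegalStateException expected) {
            // the failure itself is expected; the interesting part is what stays open
        }
        System.out.println("open after failing leaky call: " + FakeConnection.OPEN.get());

        FakeConnection.OPEN.set(0);
        try {
            safeCall(true);
        } catch (IllegalStateException expected) {
            // same failure, but the connection has been closed
        }
        System.out.println("open after failing safe call: " + FakeConnection.OPEN.get());
    }
}
```

Note that leakyCall(false) leaves nothing open, so a test that only exercises the successful path would never notice the problem.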

Given a connection leak, everything will continue to work just fine until a new connection cannot be created.  In the connection pool scenario, this happens when the number of active connections reaches the maximum permitted for the pool.  In the scenario of not using a pool (i.e. creating a new connection each time) this will go on until either system B refuses to maintain any more active connections or the underlying OS of system A refuses to allocate any more connections to your code.  

Connection leaks are ticking time bombs!  

  • How quickly the fuse burns depends on how many times the leaky code is invoked.
  • How long the fuse is depends on how many connections can be maintained.
  • How much destruction it causes depends on the architecture and how quickly you can restart your app!

As if that wasn't enough, a connection leak will manifest itself in different ways depending on your setup.  Threads could hang, and different exceptions can get thrown, which just increases confusion, especially during the panic of a live site issue.

How to test for connection leaks...

Put another way, how do you write a test that will fail due to a connection leak?  There are a few different ways this can be achieved.  The following are ordered by my preference.



Examine the connection pool after your test and ensure that there are no leased connections

Assuming you can inject an HTTP client into your code, and that client uses a connection pool you can examine, you can test for leaks easily.  Once your test has completed, all connections should have been returned to the pool.  If not, your code has failed to release a connection and you have a leak.


Concrete example here:

IntegrationTestThatExaminesConnectionPoolBeforeAndAfterRun.java

In the example above, we set up WireMock to stub the service that our code will call (see the diagram here).  Then we create an instance of Jersey Client's ApacheHttpClient4 class, which allows an underlying instance of HttpClient to be used.  This lets us create a connection pool that is used by our code and that we can examine later.  Once our test has completed, we simply assert that there are no leased connections:

@After
public void checkLeasedConnectionsIsZero() {
    // Every connection should be back in the pool once the test has finished.
    assertThat(poolingHttpClientConnectionManager.getTotalStats().getLeased(), is(0));
}

Benefits of this approach:
  • We test the full HTTP stack.
  • We add very little time to the build.
  • If there is a connection leak we narrow it down very quickly to a specific test.
  • You don't even have to use a connection pool in your production code (if you don't want to for whatever reason).



Acceptance test that examines connection pool before and after your code runs

As per the above technique, this relies on examining the connection pool for leased connections.  The difference here is that this is an acceptance test which allows us to test our application end to end.  Given we are looking at the real application, we must expose the connection pool somehow.

Concrete example here:

AcceptanceTestThatExaminesConnectionPoolBeforeAndAfterRun.java 

The Dropwizard framework makes this really easy to achieve.  If you create your instance of JerseyClient using io.dropwizard.client.JerseyClientBuilder (which is great, as it sets sensible defaults on all timeouts), you get useful metrics on the client's connection pool for free.  These metrics are exposed by Dropwizard on the metrics page of your application.  Examining the connection pool's status can then be performed trivially during acceptance tests.
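As a rough sketch of what such a check could look like, the class below fetches a metrics page and pulls a single gauge out of the JSON.  Several things here are assumptions rather than part of the linked example: the class and method names are hypothetical, the metrics URL and gauge name are passed in by the caller, and the gauge is assumed to appear in the usual `"name":{"value":N}` shape.  To my understanding Dropwizard names the gauge something like org.apache.http.conn.HttpClientConnectionManager.<client-name>.leased-connections, but check your own metrics page rather than trusting that.  A regular expression keeps the sketch dependency-free; a real test would more sensibly use a JSON parser.  It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetricsLeakCheck {

    // Downloads the metrics page as a string (and releases its own connection afterwards).
    static String fetch(String metricsUrl) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(metricsUrl).toURL().openConnection();
            try (InputStream in = connection.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Extracts the integer value of a gauge that is rendered as "name":{"value":N}.
    static int gaugeValue(String metricsJson, String gaugeName) {
        Pattern pattern = Pattern.compile(
                Pattern.quote("\"" + gaugeName + "\"") + "\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(\\d+)");
        Matcher matcher = pattern.matcher(metricsJson);
        if (!matcher.find()) {
            throw new IllegalArgumentException("gauge not found: " + gaugeName);
        }
        return Integer.parseInt(matcher.group(1));
    }

    // Usage: java MetricsLeakCheck.java <metrics-url> <gauge-name>
    public static void main(String[] args) {
        int leased = gaugeValue(fetch(args[0]), args[1]);
        if (leased != 0) {
            throw new AssertionError(leased + " connection(s) still leased");
        }
        System.out.println("no leased connections");
    }
}
```

An acceptance test would call gaugeValue before and after exercising the application and assert that the leased count is zero both times.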

Benefits of this approach:

  • Could be used for manual testing.  If you know that your application is not doing anything, all connection pools should have zero leased connections.  If you are unlucky enough to find a leased connection, you know you have a connection leak, but you might not know where!

Drawbacks of this approach:
  • Not thread safe.  Running tests in parallel could cause tests to fail incorrectly.  It could be that two connection-leak-free clients are calling your code at the same time. 
  • Forces you to use a connection pool that you expose metrics on.  Perhaps you don't want to do this, but it can be a huge benefit in diagnosing live site issues.  Monitoring tools (such as Graphite) can alert you to issues before they impact your users.  


Unit test your code with mocked dependencies

Connection leaks occur when you forget to call a method, so you can write a test which verifies the correct method gets called.

Concrete example here:
UnitTestThatShowsUnderlyingOSRefusingToAllocateMoreConnections.java
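The idea can be sketched without any mocking library.  In the snippet below, RecordingResponse is a hypothetical hand-rolled mock that simply remembers whether close() was called, and leakyFetch and cleanFetch are hypothetical stand-ins for the code under test.  This is not the linked example, just an illustration of verifying that the releasing method gets invoked.

```java
import java.io.Closeable;
import java.util.function.Supplier;

class CloseVerificationDemo {

    // Hand-rolled mock (hypothetical): records whether close() was called.
    static class RecordingResponse implements Closeable {
        boolean closed;

        String readBody() {
            return "stubbed body";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Code under test, leaky version: reads the body but never closes the response.
    static String leakyFetch(Supplier<RecordingResponse> client) {
        RecordingResponse response = client.get();
        return response.readBody();
    }

    // Code under test, fixed version: the response is always closed.
    static String cleanFetch(Supplier<RecordingResponse> client) {
        try (RecordingResponse response = client.get()) {
            return response.readBody();
        }
    }

    public static void main(String[] args) {
        RecordingResponse forLeaky = new RecordingResponse();
        leakyFetch(() -> forLeaky);
        System.out.println("leakyFetch closed the response: " + forLeaky.closed);

        RecordingResponse forClean = new RecordingResponse();
        cleanFetch(() -> forClean);
        System.out.println("cleanFetch closed the response: " + forClean.closed);
    }
}
```

With a mocking library the same check becomes a single verification that close() was called on the mocked response.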

Drawbacks of this approach:
  • Only verifies that you have called the methods you think you need to call to avoid a connection leak.  Working with high level libraries is great as it abstracts the complexities of low level protocols like TCP/IP.  However, verifying that you don't have a connection leak whilst staying at that higher level of abstraction (without looking under the hood at connections and pools) could lead to more incorrect assumptions.

 

Set the connection pool's max size to one, then run your test twice.

This is a viable option but has a lot of drawbacks.  Here we start our application with a deliberately small connection pool so that a connection leak will become evident after two runs.  If a connection leak is present, the test fails when the client code attempts to obtain a connection from the pool.  In other words, the client doesn't even attempt to open a connection on the second run, since it's only allowed to have a maximum of one open at any point in time.

Concrete example here: 

AcceptanceTestThatRunsMoreTimesThanConnectionsInPool.java

This test makes use of a JUnit rule I have lovingly copy-pasted from Frank Appel, which I found here.  Dropwizard makes this easy since we just start the application with a different yml config file that has been deliberately named to scare people away from using it in real environments (i.e. config-with-connection-pool-of-size-one.yml).  In this example, our leaky code will make this test fail with an org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool.
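The mechanism behind this technique can be shown with a toy model.  In the sketch below a semaphore stands in for the real connection pool; TinyPool, leakyCall, cleanCall and passesWhenRunTwice are all hypothetical names, and the 200 ms lease timeout is an arbitrary choice.  It is not the linked acceptance test, only an illustration of why two runs against a pool of size one are enough to expose a leak.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class PoolOfSizeOneDemo {

    // Toy pool (hypothetical): each permit represents one connection.
    static class TinyPool {
        private final Semaphore permits;
        private final long leaseTimeoutMillis;

        TinyPool(int maxSize, long leaseTimeoutMillis) {
            this.permits = new Semaphore(maxSize);
            this.leaseTimeoutMillis = leaseTimeoutMillis;
        }

        // Waits for a free connection and fails, like a pool timeout, if none is returned in time.
        void lease() {
            try {
                if (!permits.tryAcquire(leaseTimeoutMillis, TimeUnit.MILLISECONDS)) {
                    throw new IllegalStateException("Timeout waiting for connection from pool");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for connection", e);
            }
        }

        void release() {
            permits.release();
        }
    }

    interface CodeUnderTest {
        void call(TinyPool pool);
    }

    // Leaky: takes a connection and never gives it back.
    static void leakyCall(TinyPool pool) {
        pool.lease();
    }

    // Clean: always returns the connection.
    static void cleanCall(TinyPool pool) {
        pool.lease();
        try {
            // use the connection here
        } finally {
            pool.release();
        }
    }

    // Runs the code twice against a pool of size one; the second run only works if the first released.
    static boolean passesWhenRunTwice(CodeUnderTest code) {
        TinyPool pool = new TinyPool(1, 200);
        try {
            code.call(pool);
            code.call(pool);
            return true;
        } catch (IllegalStateException poolExhausted) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("leaky code passes: " + passesWhenRunTwice(PoolOfSizeOneDemo::leakyCall));
        System.out.println("clean code passes: " + passesWhenRunTwice(PoolOfSizeOneDemo::cleanCall));
    }
}
```

The lease timeout matters: without one, the second run of the leaky code would hang forever instead of failing.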

Benefits:
  • Don't have to expose your underlying connection pool to the test code.  
  • Assuming that connection pool timeouts have been set, this approach should allow parallel running of tests without false failures.  In other words, two connection-leak-free clients should still eventually pass if they happen to use the same pool at the same time.
Drawbacks:
  • Adds precious time to your build.  I agree with Dan Bodart that builds should be crazy fast.  Doubling up your tests is not a good idea.
  • Easy to get wrong.  If you don't repeat the tests that exercise your code, the tests will still pass.  The risk of accidentally running with a normal connection pool (i.e. a max size greater than 1) is mitigated by writing a test that asserts the connection pool is as expected (see the test givenApplicationStartedWithPoolSizeOne_whenMetricsViewed_ConnectionPoolIsOfSizeOne).


 

Configure the server to only allow one open connection to be processed at a time and run your test twice

Again, like the above, this is a viable option but has a lot of drawbacks.  Here we start WireMock (which uses Jetty under the hood) with a deliberately tuned down number of JettyAcceptors and container threads forcing it to only allow one connection to be processed at one point in time.

Concrete example here:
AcceptanceTestThatRunsMoreTimesThanServerCanTakeConnectionsSimultaneously.java 

I first thought that this test would fail with a ConnectException when it tried to establish a second connection.  Not so.  It actually succeeds in establishing the second connection but then fails with a java.net.SocketTimeoutException.  This is because there aren't any threads free to process the request.
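This behaviour can be reproduced with plain sockets.  The sketch below is not WireMock or Jetty: it uses a hypothetical single-threaded echo server built on java.net.ServerSocket as a stand-in, with an arbitrary 500 ms read timeout.  While a leaked first connection keeps the only worker thread busy, the operating system still completes the TCP handshake for the second connection from its backlog, so connecting succeeds and it is the read that times out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SingleThreadedServerDemo {

    // Stand-in server (hypothetical): one thread, so only one connection is processed at a time.
    static ServerSocket startServer() throws IOException {
        ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread worker = new Thread(() -> {
            while (!server.isClosed()) {
                try (Socket socket = server.accept()) {
                    InputStream in = socket.getInputStream();
                    OutputStream out = socket.getOutputStream();
                    int b;
                    // Echo bytes back until the client closes its end of the connection.
                    while ((b = in.read()) != -1) {
                        out.write(b);
                        out.flush();
                    }
                } catch (IOException e) {
                    // server closed or client went away; the loop re-checks isClosed()
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
        return server;
    }

    // Sends one byte and waits for the echo; returns the still-open socket to the caller.
    static Socket call(int port, int readTimeoutMillis) throws IOException {
        Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
        try {
            socket.setSoTimeout(readTimeoutMillis);
            socket.getOutputStream().write('x');
            socket.getOutputStream().flush();
            if (socket.getInputStream().read() != 'x') {
                throw new IOException("unexpected reply");
            }
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    // Makes two calls; the first socket is either closed or deliberately leaked before the second.
    static boolean secondCallTimesOut(boolean closeFirst) {
        try (ServerSocket server = startServer()) {
            Socket first = call(server.getLocalPort(), 500);
            if (closeFirst) {
                first.close();
            }
            try (Socket second = call(server.getLocalPort(), 500)) {
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                first.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("leaked first connection, second call timed out: " + secondCallTimesOut(false));
        System.out.println("closed first connection, second call timed out: " + secondCallTimesOut(true));
    }
}
```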


Benefits:
  • Don't have to expose connection pool to tests.
  • Works if you are not using a connection pool.
Drawbacks:
  • As per previous example, adds precious time to your build.

Final thought

My default preference is for the first option, "Examine the connection pool after your test and ensure that there are no leased connections", because it adds a negligible amount of time to the build and actually looks at what's going on behind the scenes.  Your choice may depend on project-specific factors or something I have not considered.

Keep in mind that testing can only ever prove the presence of bugs, never their absence.  While these techniques should increase your confidence that your code is free of connection leaks, they can't prove it.  Given that these issues can still get to production, finding them before they affect your users is the name of the game.  Exposing metrics on your connection pools is made very easy by Dropwizard and should definitely be encouraged.  Running soak tests and load tests can also help identify these issues, but they obviously come at the cost of adding time to your build pipeline.

Any comments, or alternative suggestions are welcome.  Many thanks to Rob Elliot, Tom Akehurst and Anurag Kapur for their input to this post.

Comments

  1. Hey Phill, this is good stuff. Nice to see a thorough exploration of this problem as it's far too common and usually neglected until it causes production issues.

    I did a bit of thinking about this a little while back, and came to a similar conclusion to you i.e. that a before/after comparison of the pool state is the way forward. However, my approach was to generate a short burst of load rather than a single request, and then apply a ranged assertion at the end. I implemented this here if you want to take a look (uses Saboteur to create faults that trigger the leak):
    https://github.com/tomakehurst/crash-lab/blob/master/src/test/java/com/tomakehurst/crashlab/ExampleScenarios.java#L108
