Monday, January 17, 2011

Why a good Health Check is so important in your Load Balancer

Consider the following scenario:

You have 15 web servers, each serving as a node of a SOAP web service.
All nodes sit behind a load balancer.

Everything works fine.

Suddenly your monitoring shows that no external requests are being served successfully.
OK, not quite none: just 99% of them fail…

You check the status of the nodes in the load balancer’s pool: all green.
You check 4-5 instances by calling them directly (not via the load balancer): all good.

How could it be?

Easily:

  • The selected load balancing algorithm was “Least Connections”, which means the balancer sends each incoming request to the node with the fewest open connections.
  • The health check that decides whether an instance is “alive” was a simple HTTP GET that counted any HTTP response as success (as long as something answered on port 80).
  • One of the 15 instances was running short of resources, which caused it to fail every request it received.

Now comes the interesting part:

  • The instance that failed every request did so very fast.
    The other instances responded much more slowly (they actually did some work).

The result: the failing instance closed its connections almost immediately, so it nearly always had the fewest open connections. The load balancer therefore kept sending almost all requests to the faulty instance, and they all failed.
That was the problem.
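
To see why one fast-failing node can absorb nearly all the traffic under Least Connections, here is a minimal simulation sketch in Python. It does not model any particular load balancer; the function name `simulate`, the timings (1000 ms for a healthy response, 1 ms for a failure, one request every 5 ms) and the tie-breaking rule (lowest index wins) are all assumptions chosen for illustration.

```python
import heapq

def simulate(num_nodes=15, faulty=14, slow_ms=1000, fail_ms=1,
             interval_ms=5, total_requests=2000):
    """Route requests with Least Connections; return requests per node."""
    open_conns = [0] * num_nodes   # currently open connections per node
    served = [0] * num_nodes       # total requests routed to each node
    in_flight = []                 # min-heap of (finish_time_ms, node)
    now = 0
    for _ in range(total_requests):
        # Close every connection that has finished by this arrival time.
        while in_flight and in_flight[0][0] <= now:
            _, node = heapq.heappop(in_flight)
            open_conns[node] -= 1
        # Least Connections: pick the node with the fewest open connections
        # (ties go to the lowest index, an assumption of this sketch).
        target = min(range(num_nodes), key=lambda n: open_conns[n])
        open_conns[target] += 1
        served[target] += 1
        # The faulty node fails at once; healthy nodes do real work.
        duration = fail_ms if target == faulty else slow_ms
        heapq.heappush(in_flight, (now + duration, target))
        now += interval_ms
    return served

served = simulate()
print(f"faulty node share: {served[14] / sum(served):.0%}")
```

With these assumed numbers, a healthy node is picked only when it is completely idle, which happens about once per second, while the faulty node is idle at almost every arrival. That is enough for it to receive the large majority of requests.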

How do you solve the issue?

Immediate solution:

Identify the faulty instance (by querying all instances, one by one) and disable it.
The balancer then automatically goes back to distributing the load among the healthy nodes.
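
Querying every instance directly is easy to script, and checking all of them matters: sampling 4-5 out of 15 can easily miss the single bad one. The following is a rough sketch only. The node addresses are hypothetical, and the plain GET in `probe` is an assumption; for a SOAP service you would more likely send a real request and inspect the reply.

```python
import urllib.request

def probe(url, timeout=5.0):
    """Return True only if the node answers the request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return False

def find_faulty(nodes, probe=probe):
    """Call every node directly, one by one, and list those that fail."""
    return [url for url in nodes if not probe(url)]

# Hypothetical direct addresses of the 15 nodes, bypassing the balancer.
NODES = [f"http://node{i:02d}.example.internal/service" for i in range(1, 16)]

# Usage (commented out because it needs network access to real nodes):
# print(find_faulty(NODES))
```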

Real Solution:

Create a good Health Check.

In our case, we added an aspx page that runs the business logic of a request and returns “OK” as its output if it succeeds. This page and its result became the new health check.
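
Our page was an aspx file, but the idea is the same in any stack. Here is a minimal Python sketch of it, not our actual code: `run_business_check` is a hypothetical stand-in for the real business logic, and the 503 status and "FAIL" body for the failure case are assumptions. The last two functions contrast the old balancer-side rule (any response counts) with the new one (only a 200 with body "OK" counts).

```python
from http.server import BaseHTTPRequestHandler

def run_business_check():
    """Hypothetical stand-in: exercise the same code path a real request
    uses, and raise an exception if it cannot complete."""
    result = sum([1, 2, 3])  # placeholder for the real business logic
    if result != 6:
        raise RuntimeError("unexpected result")

def health_response(check=run_business_check):
    """Return (status_code, body): 200 and "OK" only if the check succeeds."""
    try:
        check()
    except Exception:
        return 503, "FAIL"
    return 200, "OK"

class HealthHandler(BaseHTTPRequestHandler):
    """Serves the health check result on any GET request."""
    def do_GET(self):
        status, body = health_response()
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def naive_is_alive(status, body):
    """The old rule: any HTTP response at all counts as alive."""
    return status is not None

def strict_is_alive(status, body):
    """The new rule: require HTTP 200 and the exact body "OK"."""
    return status == 200 and body.strip() == "OK"

# To serve it, for example on port 8080 (an arbitrary choice):
# from http.server import HTTPServer
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

A node that answers every request with an error page passes `naive_is_alive` but fails `strict_is_alive`, which is exactly the difference that would have taken our faulty instance out of the pool.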

Conclusion:

Make sure your health check actually checks the specific functionality you expect from the nodes.
That way you ensure your load balancer will not send requests to faulty nodes, especially not to ones that fail quickly.