Tuesday, June 13, 2017
SQL Cluster connection problem after removing a node
Investigation revealed a ping timeout exactly every 5 minutes.
Five minutes is the default timeout of ARP cache entries.
Checking the ARP cache during the ping timeouts revealed the MAC address of the removed SQL node.
Shutting the node down solved the issue.
IP Conflicts…
Tuesday, July 05, 2011
ASP.NET Deadlock on WCF service hosted in IIS
One of our most important core business services is a WCF soap web service for selling our products to other companies. As we have lots of requests to those services, we must have multiple instances of this service running on several IIS servers - all behind a load balancer.
Problem begins:
Our service was running happily in production for quite a long time.
Then one day, our monitoring team noticed all instances went red in the load balancer health monitor.
EMERGENCY!!!
We immediately started a conference call with all the needed people, and as a first action we decided to recycle some instances and see if they came up OK.
They did. We were back online.
Investigation:
When we looked at the Event Viewer, we saw that IIS had started to recycle some of the instances that we hadn't, because of the following error:
ISAPI 'c:\windows\microsoft.net\framework\v2.0.50727\aspnet_isapi.dll' reported itself as unhealthy for the following reason: 'Deadlock detected'.
This error pointed us to this article:
In short, the article explains that this error can be caused by hitting the limit on the number of threads .NET allows to run simultaneously: all the allowed threads are in use, and each is waiting for something that itself needs a thread.
So we started to think about whether we might have such a scenario in our code. We couldn't come up with this exact scenario, but we thought maybe one of our internal services was blocked and caused all the waiting threads.
Another option we considered was a DoS attack, but we hadn't seen any increase in calls per second.
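The article's scenario can be sketched with a toy thread pool. This is a Python stand-in for illustration only, not the real ASP.NET pool: fill every worker with tasks that each wait for another task queued on the same pool, and nothing can ever run.

```python
import concurrent.futures as cf

pool = cf.ThreadPoolExecutor(max_workers=2)  # a deliberately tiny pool

def outer():
    # Each running task needs *another* pool thread to finish --
    # the "waiting for something which needs a thread" pattern.
    inner = pool.submit(lambda: "done")
    return inner.result()

# Fill every worker with outer tasks; the inner lambdas queue up
# behind them, so no thread is ever free to run them.
futures = [pool.submit(outer) for _ in range(2)]

try:
    futures[0].result(timeout=2)
    starved = False
except cf.TimeoutError:
    starved = True

print("deadlocked:", starved)
# Cancel the queued inner tasks so the interpreter can exit cleanly.
pool.shutdown(wait=False, cancel_futures=True)
```

The same starvation happens regardless of pool size: once request volume reaches the cap, every thread is parked waiting on work that can never be scheduled.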
Anyway – we decided to follow Microsoft's recommendations for the runtime configuration settings, like:
- maxWorkerThreads
- maxIoThreads
- minWorkerThreads
- minFreeThreads
- minLocalRequestFreeThreads
And also the WCF throttling settings:
- MaxConcurrentCalls
- MaxConcurrentInstances
- MaxConcurrentSessions
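The shape of those settings is roughly the following (processModel lives in machine.config under IIS 6, serviceThrottling under the service behavior in web.config; the values here are illustrative, not our production numbers):

```xml
<!-- machine.config: ASP.NET thread limits (illustrative values) -->
<system.web>
  <processModel autoConfig="false"
                maxWorkerThreads="100"
                maxIoThreads="100"
                minWorkerThreads="50" />
  <httpRuntime minFreeThreads="88"
               minLocalRequestFreeThreads="76" />
</system.web>

<!-- web.config: WCF throttling, inside the service's behavior element -->
<serviceThrottling maxConcurrentCalls="200"
                   maxConcurrentInstances="200"
                   maxConcurrentSessions="200" />
```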
It didn’t help – we got the same crash scenario a few days later.
Investigating Dumps
We decided to follow the instructions in “How to generate a dump file when ASP.NET deadlocks in IIS 6.0”, and created a dump file the next time the deadlock happened.
The dump file didn’t show any exception at the time of the deadlock; however, the thread counts gave us our first clue:
- Hundreds of ASP.NET HTTP Request threads (System.ServiceModel.Activation.HttpModule.ProcessRequest)
- Zero WCF executing threads (System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin)
Logs to the rescue
We collected the logs from IIS, the Event Viewer, the load balancer, the data warehouse DB, and the application logs. The picture we saw was:
19:52:51 – Last successful WCF request (IIS log)
19:53:49 – 1 connection_Dropped (HTTPERR log)
19:57:45 – Last successful .aspx page request (IIS log)
20:00:24 - Deadlock Detected (W3SVC-WP Event Viewer)
20:03:35 - 164 * Connection_Abandoned_By_AppPool – most of WCF requests, few aspx pages (HTTPERR)
This matches the picture from the dump file:
Somehow, for almost 5 minutes, .aspx pages worked perfectly, but WCF requests were never returned to the clients (instead they were stuck in the ASP.NET threads, as there weren't any WCF threads to handle them).
All those WCF requests ate up all the available threads in the thread pool, and once they were all consumed (and the configured responseDeadlockInterval passed with no request being responded to), a deadlock was declared.
Once the process recycled, IIS abandoned all those queued requests.
What happened to the WCF?
As you can see, the dump we used was created way too late to find the cause of the WCF problem.
But here the logs came to the rescue again.
We identified (in the IIS logs) a unique User-Agent that repeated each time just before the service crashed, so we followed this user, and using a network sniffer (Wireshark) we finally arrived at the source of the problem:
This client application was sending us requests with a SOAPAction that didn’t match any operation in the WCF service.
The Bug
Our WCF service has an attribute – “ErrorHandledWebService” – which inherits from “Attribute” and implements the “IServiceBehavior” and “IErrorHandler” interfaces. This attribute tells WCF to bind our error-handling code into the WCF flow.
When an exception happens during a WCF request, WCF triggers the “ProvideFault” method of the “IErrorHandler” interface, which we implemented to write the original request to our log.
The problem starts when a request arrives that doesn’t match any operation in the service: WCF throws an exception.
In this case - OperationContext.Current is null.
Our code used OperationContext.Current.RequestContext.RequestMessage.State, which caused a NullReferenceException that our code did not handle.
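The fix is a null guard before touching the context. Here is a hypothetical sketch of the pattern in Python (the real code is C#, and none of these names are the actual WCF API):

```python
faults = []

def provide_fault(error, operation_context):
    # operation_context stands in for WCF's OperationContext.Current,
    # which is null when the request matched no operation
    # (e.g. an unknown SOAPAction).
    if operation_context is None:
        faults.append("fault with no operation context: %s" % error)
        return
    # Only dereference the request when a context actually exists.
    faults.append("fault for request: %s" % operation_context.request_message)

provide_fault("unknown SOAPAction", None)  # no longer raises
print(faults[0])
```

The key point: an error handler must assume it can be invoked for requests that never reached any operation, so every context object it touches may be null.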
Microsoft’s WCF Bug?
This unhandled exception inside ProvideFault can cause one of the following:
- The process terminates.
- The process stops responding to WCF requests.
Which behavior you get depends on the value of this setting:
<legacyUnhandledExceptionPolicy enabled="true/false" />
If this flag is false, the process terminates because of the unhandled exception.
If it is true, WCF is dead, but the rest of the functionality (like .aspx pages) continues to work.
As long as you have the flag set to false, Microsoft's behavior is just fine – crashing the process.
But if the flag is true (and I know Microsoft recommends going without the legacy mode), I consider it a bug in the framework: WCF shouldn’t leave the process in an unstable state, or at the very least it should write a proper error message to the event log explaining that WCF is dead because of this.
What do you think?
Monday, January 17, 2011
Why a good Health Check is so important in your Load Balancer
Consider the following scenario:
You have 15 web servers – each serving as a node of a SOAP web service.
All services are behind a Load Balancer.
All work ok.
Suddenly your monitor shows no external request successfully served.
OK, not all – just 99% fail…
You check the status of the nodes in the load balancer's pool – all green.
You check 4-5 instances by calling them directly (not via the load balancer) – all good.
How could it be?
Easily:
- The selected load balancing algorithm was “Least Connections” – meaning the balancer redirects each incoming request to the node with the smallest number of open connections.
- The health check which decided whether or not an instance was “alive” was a simple HTTP GET that counted any HTTP response as healthy (as long as there was a response on port 80).
- One of the 15 instances was suffering from insufficient resources, which caused it to fail every request it got.
Now comes the interesting part:
- The instance that failed every request did it really fast…
The other instances responded a lot slower (they actually did some work).
The result was that the load balancer saw one instance that served very fast – so almost all the requests ended up being redirected to the faulty instance – hence, almost all failed.
This was the problem.
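This dynamic can be shown with a toy simulation (assumptions: one request per tick, healthy nodes hold a connection for 10 ticks, the broken node "answers" instantly):

```python
open_conns = {0: [], 1: [], 2: []}   # per-node list of connection end times
served = {0: 0, 1: 0, 2: 0}
HOLD = {0: 1, 1: 10, 2: 10}          # node 0 fails (and frees up) in 1 tick

for t in range(100):                 # one incoming request per tick
    for n in open_conns:             # drop connections that have finished
        open_conns[n] = [d for d in open_conns[n] if d > t]
    n = min(open_conns, key=lambda k: len(open_conns[k]))  # least connections
    open_conns[n].append(t + HOLD[n])
    served[n] += 1

print(served)  # the fast-failing node absorbs essentially all the traffic
```

Because the broken node drops its connections immediately, it always looks like the least-loaded node, and “Least Connections” keeps feeding it.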
What to do to solve the issue?
Immediate solution:
Identify the faulty instance (by querying all the instances, one by one) and disable it.
The balancer automatically goes back to distributing the load between the remaining valid nodes.
Real Solution:
Create a good Health Check.
In our case, we added an .aspx page that runs the business logic of a real request and returns “OK” as its output if it succeeds. This page and its result became the new health check.
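A minimal sketch of the idea (hypothetical names; the original was an .aspx page): the probe runs a real piece of business logic and only reports “OK” when it succeeds.

```python
def business_logic_probe():
    # Stand-in for executing one real business request end to end,
    # touching the same code path production traffic uses.
    return {"price": 42}  # pretend this came from the real pipeline

def health_check():
    try:
        result = business_logic_probe()
        return "OK" if result else "FAIL"
    except Exception:
        return "FAIL"   # any error marks the node unhealthy

print(health_check())  # the balancer polls this and expects "OK"
```

A node that fails fast now also fails the health check fast, so the balancer pulls it out instead of flooding it.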
Conclusion:
Make sure your health check actually checks the specific functionality you expect to get from the nodes.
This way you ensure your load balancer will not send requests to faulty nodes – especially not to suspiciously fast ones.