One of our most important core business services is a WCF soap web service for selling our products to other companies. As we have lots of requests to those services, we must have multiple instances of this service running on several IIS servers - all behind a load balancer.
Problem begins:
Our service was running happily in production for quite a long time.
Than one day, our monitoring team noticed all instances went red in the load balancer health monitor.
EMERGENCY!!!
We immediately started a conference call with all the needed people, and as a first action we decide to recycle some instances and see if they come up ok.
They do. We were back online.
Investigation:
when we look at the event viewer, we see IIS has started to recycle some of the instances we haven’t, because of the following error:
ISAPI 'c:\windows\microsoft.net\framework\v2.0.50727\aspnet_isapi.dll' reported itself as unhealthy for the following reason: 'Deadlock detected'.
This error pointed us to this article:
In short, the article explains that this error might be caused by the limit of the threads .Net allows to open Simultaneously, where all the allowed thread are used and they waits for something which needs thread as well.
So we have started to think if we may have such scenario in our code, and couldn’t come with this exact scenario, but we thought maybe one of our internal services was blocked and caused all the waiting thread.
Another option which we thought about was DoS attack, but we haven’t seen any increase in calls per seconds.
Anyway – we decided to follow Microsoft recommendations for the runtime configuration settings, like:
- maxWorkerThreads
- maxIoThreads
- minWorkerThreads
- minFreeThreads
- minLocalRequestFreeThreads
And also WCF settings
- MaxConcurrentCalls
- MaxConcurrentInstances
- MaxConcurrentSessions
It didn’t help – we got the same crash scenario few days later.
Investigating Dumps
We have decided to follow the instructions in “How to generate a dump file when ASP.NET deadlocks in IIS 6.0”, and created dump file the next deadlock happened.
The Dump file hasn’t shown any exception at the deadlock time, however the threads counts gave us the first clue:
- Hundreds of ASP.NET HTTP Request threads (System.ServiceModel.Activation.HttpModule.ProcessRequest)
- Zero WCF executing threads
(System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin)
Logs to the help
We have collected the logs from IIS, Event Viewer, Load Balancer, Data warehouse DB, and application logs. The picture we saw was:
19:52:51 – Last successful WCF request (IIS log)
19:53:49 – 1 connection_Dropped (HTTPERR log)
19:57:45 – Last successful .aspx page request (IIS log)
20:00:24 - Deadlock Detected (W3SVC-WP Event Viewer)
20:03:35 - 164 * Connection_Abandoned_By_AppPool – most of WCF requests, few aspx pages (HTTPERR)
This gives the same picture as the Dump file:
somehow for almost 5 minutes – aspx files worked perfectly, but WCF requests were not returned to the clients (instead they were stack in the asp.net threads as there weren't any WCF threads to handle them).
All those WCF requests eat up all the available thread from the thread pool, and once finish them all (after the declared ResponseDeadlockInterval passed with no request responded) – Deadlock was declared.
Once the Process recycled – IIS abandoned all those queued requests.
What happened to the WCF?
As you can see, the dump we used was created way too late to find out the cause for the WCF problem.
But, Here logs came to the help again.
we have identified (in the IIS logs) a unique User-Agent which repeats each time just before the services crashed, so we followed this user, and using Network Monitor (Wireshark) we finally arrived to the source of the problem:
This client application has sent us requests with SOAPAction which didn’t fit to any operation in the WCF service
The Bug
We have in our WCF service an attribute - “ErrorHandledWebService” – which inherit from “Attribute” and implemets “IServiceBehavior” & “IErrorHandler” interfaces. This attribute tells WCF to bind our Error handling code the WCF flow.
When exception happens on the WCF request it triggers our “ProvideFault” method of the “IErrorHandler” interface – which we implement to write the original request to our log.
The problem starts when a request comes to the service which doesn’t fit any operation in the service an exception is thrown by WCF.
In this case - OperationContext.Current is null.
Our code used OperationContext.Current.RequestContext.RequestMessage.State which caused a Null Reference Exception which was not handled by our code.
Microsoft’s WCF Bug?
This unhandled exception inside the ProvideFault can cause one of the following:
- The Process terminate itself.
- The Process stop responding to WCF requests.
those behaviors depends on the value of this parameter:
<legacyUnhandledExceptionPolicy enabled="true/false" />
if this flag is false – the process terminate because of the unhandled exception.
if it is true – WCF is dead, but the rest functionalities (like aspx) continue to work.
as long as you the flag set to false – Microsoft behavior is just fine – crashing the process.
But if the flag is true (also I know Microsoft recommend to go without the legacy mode) - I think it as a bug in the framework ,as WCF shouldn’t leave the process in an unstable state, or at least, it should write proper error message to the event log - explaining WCF is dead because of this.
What do you think?