Wednesday, May 10, 2017

Public Cloud Only strategy - Uncompetitive

I watched Michael Dell's keynote at Dell EMC World 2017, where he said:
Anyone with a Cloud First strategy that is Public Cloud Only - I think will find themselves uncompetitive for workloads that have predictable characteristics (see the video here).
Maybe this will change over time, but for now I fully agree with this.

From our experience, even for QA environments that we can take down during the nights - we still pay more in the cloud than on-premises.

Monday, May 08, 2017

Crash on my next step

After seeing a few human mistakes take down a fully redundant system - I have added a new tool to my belt.

Simply tell yourself the following, before doing any configuration change:
Doing the next planned step - I need to crash the whole system - How can I make it happen?
Now you'll look for ways that a crash can really happen, instead of being sure you know it won't.

Good luck, and break a leg - not the system.

Friday, November 02, 2012

"Hurricane Sandy’s storm" or "Building Cloud based Disaster Recovery site in record time"

I was planning to write a post on how we have created a DR site from scratch in less than 3 days,
But Lahav Savir - Architect & CEO of Emind Systems (now AllCloud) - was not only working fast in creating this DR site with us, but also writing fast about this :-)

So here is a link to his report:
Disaster Recovery site Over Night, yes it happened!

Sunday, June 17, 2012

PowerShell - Memory usage per user

From time to time you might need to find out which user to log off from a server because of low memory.

Instead of looking at Task Manager and trying to find who is using the most, you can use this PowerShell script:

$h = @{}
Get-WmiObject Win32_Process | ForEach-Object {
    $u = $_.GetOwner().User
    if ($u -ne $null) {
        # First process for this user - start the counter at zero
        if (!$h.ContainsKey($u)) {
            $h.Add($u, 0)
        }
        # Accumulate the working set (bytes) per user
        $h[$u] += $_.WS
    }
}
$h.GetEnumerator() | Sort-Object Value -Descending
Notice that this gives you the working set - the actual RAM usage - which might not be what you are looking for, as recently active users probably use more RAM than idle ones.

If you want to know the total usage of Virtual Memory - replace "WS" with "VM".

Tuesday, July 05, 2011

ASP.NET Deadlock on WCF service hosted in IIS

One of our most important core business services is a WCF soap web service for selling our products to other companies. As we have lots of requests to those services, we must have multiple instances of this service running on several IIS servers - all behind a load balancer.

Problem begins:

Our service was running happily in production for quite a long time.
Then one day, our monitoring team noticed all instances went red in the load balancer health monitor.


We immediately started a conference call with all the needed people, and as a first action we decided to recycle some instances and see if they came up OK.

They did. We were back online.


When we looked at the Event Viewer, we saw that IIS had started to recycle some of the instances we hadn't, because of the following error:

ISAPI 'c:\windows\microsoft.net\framework\v2.0.50727\aspnet_isapi.dll' reported itself as unhealthy for the following reason: 'Deadlock detected'.

This error pointed us to this article:

Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications

In short, the article explains that this error might be caused by the limit on the number of threads .NET allows to open simultaneously: all the allowed threads are in use, and each is waiting for something that itself needs a thread.

So we started to think whether we might have such a scenario in our code. We couldn't come up with this exact scenario, but we thought maybe one of our internal services was blocked and caused all the waiting threads.

Another option we thought about was a DoS attack, but we hadn't seen any increase in calls per second.

Anyway – we decided to follow Microsoft's recommendations for the runtime configuration settings, like:

  • maxWorkerThreads
  • maxIoThreads
  • minWorkerThreads
  • minFreeThreads
  • minLocalRequestFreeThreads

And also the WCF settings:

  • MaxConcurrentCalls
  • MaxConcurrentInstances
  • MaxConcurrentSessions
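For reference, here is a sketch of where those knobs live. The values are illustrative only - not our production numbers and not a recommendation; on .NET 2.0 the processModel section goes in machine.config, and autoConfig must be false for the manual thread limits to take effect:

```xml
<!-- machine.config - ASP.NET thread limits (illustrative values) -->
<system.web>
  <processModel autoConfig="false"
                maxWorkerThreads="100"
                maxIoThreads="100"
                minWorkerThreads="50" />
  <httpRuntime minFreeThreads="88"
               minLocalRequestFreeThreads="76" />
</system.web>

<!-- web.config - WCF throttling behavior (illustrative values) -->
<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior name="Throttled">
        <serviceThrottling maxConcurrentCalls="64"
                           maxConcurrentInstances="128"
                           maxConcurrentSessions="64" />
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>
```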

It didn’t help – we got the same crash scenario a few days later.

Investigating Dumps

We decided to follow the instructions in “How to generate a dump file when ASP.NET deadlocks in IIS 6.0”, and created a dump file the next time the deadlock happened.

The dump file didn't show any exception at the deadlock time; however, the thread counts gave us the first clue:

  • Hundreds of ASP.NET HTTP Request threads (System.ServiceModel.Activation.HttpModule.ProcessRequest)
  • Zero WCF executing threads

Logs to the rescue

We have collected the logs from IIS, Event Viewer, Load Balancer, Data warehouse DB, and application logs. The picture we saw was:

19:52:51 – Last successful WCF request (IIS log)
19:53:49 – 1 connection_Dropped (HTTPERR log)
19:57:45 – Last successful .aspx page request (IIS log)
20:00:24 - Deadlock Detected (W3SVC-WP Event Viewer)
20:03:35 - 164 * Connection_Abandoned_By_AppPool – most of WCF requests, few aspx pages (HTTPERR)

This gives the same picture as the Dump file:

Somehow, for almost 5 minutes – aspx pages worked perfectly, but WCF requests were not returned to the clients (instead they were stuck in the threads, as there weren't any WCF threads to handle them).
All those WCF requests ate up all the available threads from the thread pool, and once they were all consumed (and the declared ResponseDeadlockInterval passed with no request responded to) – a deadlock was declared.

Once the process was recycled – IIS abandoned all those queued requests.

What happened to the WCF?

As you can see, the dump we used was created way too late to find the cause of the WCF problem.
But here, the logs came to the rescue again.

We identified (in the IIS logs) a unique User-Agent which repeated each time just before the services crashed. We followed this user, and using Network Monitor (Wireshark), we finally arrived at the source of the problem:

This client application had sent us requests with a SOAPAction which didn’t fit any operation in the WCF service.

The Bug

We have in our WCF service an attribute – “ErrorHandledWebService” – which inherits from “Attribute” and implements the “IServiceBehavior” & “IErrorHandler” interfaces. This attribute tells WCF to bind our error handling code into the WCF flow.

When an exception happens on a WCF request, it triggers the “ProvideFault” method of the “IErrorHandler” interface – which we implement to write the original request to our log.

The problem starts when a request comes in that doesn’t fit any operation in the service: WCF throws an exception.

In this case - OperationContext.Current is null.

Our code used OperationContext.Current.RequestContext.RequestMessage.State, which caused a NullReferenceException that was not handled by our code.
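A null-safe version of that logging code might look like the sketch below. The shape follows the post's description; LogOriginalRequest and LogError are hypothetical stand-ins for whatever logging we actually call:

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

// Inside the IErrorHandler implementation - a sketch of a
// ProvideFault that survives a missing OperationContext.
public void ProvideFault(Exception error, MessageVersion version, ref Message fault)
{
    try
    {
        // OperationContext.Current is null when WCF throws before
        // dispatching (e.g. an unknown SOAPAction) - guard every step.
        OperationContext ctx = OperationContext.Current;
        if (ctx != null && ctx.RequestContext != null
                        && ctx.RequestContext.RequestMessage != null)
        {
            LogOriginalRequest(ctx.RequestContext.RequestMessage); // hypothetical logger
        }
        else
        {
            LogError(error); // hypothetical: log the exception alone
        }
    }
    catch (Exception)
    {
        // Never let the error handler itself throw - that is exactly
        // what left the process in the half-dead state described here.
    }
}
```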

Microsoft’s WCF Bug?

This unhandled exception inside the ProvideFault can cause one of the following:

  1. The process terminates itself.
  2. The process stops responding to WCF requests.

Which behavior you get depends on the value of this parameter:

<legacyUnhandledExceptionPolicy enabled="true/false" />

If this flag is false – the process terminates because of the unhandled exception.
If it is true – WCF is dead, but the rest of the functionality (like aspx pages) continues to work.
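For reference, this flag sits under the runtime element of the configuration file (for an ASP.NET worker process it is typically set in Aspnet.config):

```xml
<configuration>
  <runtime>
    <!-- true = legacy policy: an unhandled exception on a worker
         thread does not terminate the process -->
    <legacyUnhandledExceptionPolicy enabled="true" />
  </runtime>
</configuration>
```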

As long as you have the flag set to false – Microsoft's behavior is just fine – crashing the process :-)

But if the flag is true (although I know Microsoft recommends going without the legacy mode) – I consider it a bug in the framework, as WCF shouldn’t leave the process in an unstable state; or at least, it should write a proper error message to the event log explaining that WCF is dead because of this.

What do you think?

Tuesday, March 15, 2011

service has zero application (non-infrastructure) endpoints

I got this message today after copying a working web application from one server to another.
The web application was configured as virtual directory on both places.
Same exact web.config and same DLLs.

It turned out to be the website folder itself:

On the original working machine the website folder was valid and empty, while on the problematic server – the website was pointing to a non-existent folder.

I guess that when WCF looks for the web.config – it has a rule to go up the hierarchy and check for any web.config above – and it failed on the non-existent folder.

Hope it will help someone.

Monday, January 17, 2011

Why a good Health Check is so important in your Load Balancer

Consider the following scenario:

You have 15 Web Servers – each serves as a node of SOAP Web Service.
All services are behind a Load Balancer.

All work ok.

Suddenly your monitoring shows no external requests being served successfully.
OK, not all. Just 99% fail…

You check the status of the nodes in the load balancer’s pool – all green.
You check 4-5 instances by calling them directly (not via the load balancer) – all good.

How could it be?


  • The selected load balancing algorithm was “Least Connections” – which means the balancer redirects each incoming request to the node with the smallest number of open connections.
  • The health check which decides whether or not an instance is “alive” was a simple HTTP GET which returned true for any HTTP response (as long as there was a response on port 80).
  • One of the 15 instances was suffering from insufficient resources – which caused it to fail every request it got.

Now comes the interesting part:

  • The instance which failed every request – did it really fast.
    The other instances were responding a lot slower (they actually did some work).

The result was that the Load Balancer saw one instance which served very fast – so almost all the requests ended up being redirected to the faulty instance – hence – all failed.
This was the problem.

What to do to solve the issue?

Immediate solution:

Identify the faulty instance (by querying all instances, one by one) – and disable it.
The balancer then automatically returns to distributing the load between all the valid nodes.

Real Solution:

Create a good Health Check.

In our case – we added an aspx page which runs the business logic of a request and returns “OK” as the output if it succeeds. This page and its result became the new health check.
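As a sketch of the idea (names and the RunBusinessLogicProbe call are hypothetical, not our actual code), such a functional health check might look like:

```csharp
using System;
using System.Web;

// HealthCheck.ashx - a minimal sketch of a health check that
// exercises the real request path instead of just answering on port 80.
public class HealthCheck : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        try
        {
            if (RunBusinessLogicProbe())
            {
                context.Response.StatusCode = 200;
                context.Response.Write("OK");   // the load balancer matches on this body
            }
            else
            {
                context.Response.StatusCode = 500;
                context.Response.Write("FAIL");
            }
        }
        catch (Exception)
        {
            // A probe that throws means the node is not healthy
            context.Response.StatusCode = 500;
            context.Response.Write("FAIL");
        }
    }

    private bool RunBusinessLogicProbe()
    {
        // Hypothetical: run a representative business request against
        // this node and validate the result end to end.
        return true;
    }

    public bool IsReusable { get { return true; } }
}
```

The key design point: the balancer should mark a node down unless it gets both a 200 status and the expected body, so a fast-failing node can no longer look "healthy".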


Make sure your health check actually checks the specific functionality you expect to get from the nodes.
This way you ensure your Load Balancer will not send requests to faulty nodes, especially not suspiciously quick ones.