Tuesday, June 13, 2017

SQL Cluster connection problem after removing a node

After removing a node from a SQL cluster, some machines had trouble accessing the DB.
Investigation revealed a ping timeout exactly every 5 minutes.
5 minutes is the default timeout of ARP entries.
Checking the ARP table during the ping timeouts showed the MAC address of the removed SQL node.
Shutting the node down solved the issue.
IP Conflicts…
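
The ARP check can be scripted so that it is easy to repeat while the timeouts are happening. This is only a minimal sketch, assuming Windows-style `arp -a` output; the helper name Get-ArpMac and the address 10.0.0.50 are placeholders I made up for illustration, not values from the actual incident:

```powershell
# Hypothetical helper: return the MAC address that `arp -a` reports for one IP,
# or $null if the IP is not in the ARP table.
function Get-ArpMac {
    param(
        [string[]]$ArpOutput,   # lines as printed by `arp -a`
        [string]$IPAddress      # the IP to look up
    )
    foreach ($line in $ArpOutput) {
        # Entry lines look like: "  10.0.0.50    00-15-5d-01-02-03    dynamic"
        $fields = $line.Trim() -split '\s+'
        if ($fields.Count -ge 2 -and $fields[0] -eq $IPAddress) {
            return $fields[1].ToLower()
        }
    }
    return $null
}

# Usage on an affected client (placeholder IP):
# Get-ArpMac -ArpOutput (arp -a) -IPAddress '10.0.0.50'
```

Comparing the result with the MAC addresses of the nodes that should own the IP shows quickly whether a removed machine is still answering for it.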

Tuesday, June 06, 2017

Connect to a Service Fabric Cluster Secured by Windows Security - from a non-trusted domain

If you have a Service Fabric cluster secured with Windows security, you need to use the following PowerShell command to connect to it:
Connect-ServiceFabricCluster -ConnectionEndpoint "ServerName:19000" -WindowsCredential
This, however, requires you to be logged in as a member of the cluster's domain or of a trusted domain.
If this is not the case, the login will fail (after a long wait, in my case).

If you need to connect from a different domain, you can use this workaround:
(Note: you must run this elevated, as Administrator)
runas /netonly /user:Domain\Username powershell
Enter the password when prompted, and a new PowerShell window will open where you can run the Connect command successfully.

The other way is to add this Windows Credential to your Credential Manager in Control Panel:

Then you can simply call the connect command, and it will connect without any problem.
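
The same credential can most likely also be stored from the command line with cmdkey, which writes to the Credential Manager store. This is a sketch only, under the assumption that the target name should be the server name used in the connection endpoint; ServerName and Domain\Username are the same placeholders as above:

```powershell
# Store a Windows credential for the cluster host.
# /pass without a value makes cmdkey prompt for the password.
cmdkey /add:ServerName /user:Domain\Username /pass

# List the stored entry to confirm it was saved.
cmdkey /list:ServerName
```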

Azure Service Fabric - Windows security using gMSA - Details

On your Domain Controller:

Check if you already have a Kds Key:
Get-KdsRootKey
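If not, run the next line (backdating the effective time by 10 hours makes the key usable right away instead of after the default 10-hour wait):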
Add-KdsRootKey -EffectiveTime ((get-date).addhours(-10))
Validate it was created:
Get-KdsRootKey
Create the gMSA (in this sample it is named gMSA-SF-1 and is used by 4 machines: SF1, SF2, SF3, SF4):
New-ADServiceAccount -Name gMSA-SF-1 -DNSHostName gMSA-SF-1.myDomain.local -PrincipalsAllowedToRetrieveManagedPassword SF1$,SF2$,SF3$,SF4$ -ServicePrincipalNames ServiceFabric/gMSA-SF-1.myDomain.local
If you later need to add or remove nodes, pass the full updated list:
Set-ADServiceAccount -Identity gMSA-SF-1 -PrincipalsAllowedToRetrieveManagedPassword SF1$, SF2$, SF3$, SF4$, SF5$
Don't forget to create a domain group containing all the users who should get Admin rights on the cluster through the UI. In the example below this group is named "SFAdmins".
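
For the add/remove step above: as far as I know, -PrincipalsAllowedToRetrieveManagedPassword sets the whole list rather than appending to it, so every node has to be listed each time. A small sketch that builds the list from plain host names; the helper name Get-GmsaPrincipalNames is hypothetical, and the Set-ADServiceAccount call is left commented out because it changes AD:

```powershell
# Hypothetical helper: turn node host names into computer-account names
# (host name plus a trailing "$"), as used in the commands above.
function Get-GmsaPrincipalNames {
    param([string[]]$NodeNames)
    foreach ($name in $NodeNames) {
        # TrimEnd avoids a double "$" if a name already has one.
        $name.TrimEnd('$') + '$'
    }
}

$nodes = 'SF1', 'SF2', 'SF3', 'SF4', 'SF5'
$principals = Get-GmsaPrincipalNames -NodeNames $nodes
$principals -join ','   # SF1$,SF2$,SF3$,SF4$,SF5$

# On the Domain Controller:
# Set-ADServiceAccount -Identity gMSA-SF-1 -PrincipalsAllowedToRetrieveManagedPassword $principals
```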

On each of the cluster machines (before deploying the Service Fabric Cluster):

Add the PowerShell module for managing AD:
Add-WindowsFeature RSAT-AD-PowerShell
Install the gMSA:
Install-AdServiceAccount gMSA-SF-1
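To confirm that the machine can really use the account before deploying the cluster, a quick check with Test-ADServiceAccount (from the same AD module) should be enough. This is a suggestion on my side, not part of the original steps:

```powershell
# Returns True when this machine can retrieve and use the gMSA's managed password.
Test-ADServiceAccount -Identity gMSA-SF-1
```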
Configure the security section of the cluster configuration file:

"security": {
            "ServerCredentialType": "Windows",
            "WindowsIdentities": {
                "ClustergMSAIdentity": "mydomain.local\\gMSA-SF-1",
                "ClusterSPN": "ServiceFabric/gMSA-SF-1.mydomain.local",
                "ClientIdentities": [
                    {
                        "Identity": "mydomain.local\\SFAdmins",
                        "IsAdmin": true
                    }
                ]
            }
        },

Azure Service Fabric - Windows security on Standalone - Timeout

While trying to deploy a new secured Service Fabric cluster (using Windows Security), I got the following error for all the nodes:
Timed out waiting for Installer Service to complete for machine SF1.mydomain.local. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
When trying the same configuration file without the security section, or with certificate-based security, everything worked perfectly.

A Microsoft engineer gave me a solution:
Replace the FQDN in each node's configuration (the "iPAddress" property) with the NetBIOS host name.
This solved the issue.
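
With more than a few nodes, the replacement can be scripted. The following is a rough sketch, assuming the NetBIOS name is simply the first label of the FQDN (usually true, but not guaranteed); the function name ConvertTo-NetBiosName and the sample JSON are made up for illustration, and only the "nodes" and "iPAddress" names come from the real configuration file:

```powershell
# Hypothetical helper: strip the DNS suffix from a host name.
# Assumption: the NetBIOS name equals the first label of the FQDN.
function ConvertTo-NetBiosName {
    param([string]$HostName)
    # Leave IPv4 addresses untouched.
    if ($HostName -match '^\d{1,3}(\.\d{1,3}){3}$') { return $HostName }
    return ($HostName -split '\.')[0]
}

# Sample fragment standing in for the nodes section of the cluster configuration.
$json = '{ "nodes": [ { "nodeName": "vm0", "iPAddress": "SF1.mydomain.local" },
                      { "nodeName": "vm1", "iPAddress": "SF2.mydomain.local" } ] }'

$config = $json | ConvertFrom-Json
foreach ($node in $config.nodes) {
    $node.iPAddress = ConvertTo-NetBiosName $node.iPAddress
}
$config.nodes | ForEach-Object { $_.iPAddress }   # SF1, SF2
```

For a real file, the JSON would be read with Get-Content -Raw and written back with ConvertTo-Json using a -Depth large enough for the nested sections.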

Wednesday, May 10, 2017

Public Cloud Only strategy - Uncompetitive

I watched Michael Dell's keynote at Dell EMC World 2017, and he said:
Anyone with a Cloud First strategy that is Public Cloud Only - I think will find himself uncompetitive for workloads that have predictable characteristics (see the video here).
Maybe this will change over time, but for now I fully agree with this.

From our experience, even for QA environments that we can take down at night, we still pay more in the cloud than on-premises.

Monday, May 08, 2017

Crash on my next step

After seeing a few human mistakes take down a fully redundant system, I have added a new tool to my belt.

Simply tell yourself the following, before doing any configuration change:
Doing the next planned step - I need to crash the whole system - How can I make it happen?
Now you'll look for ways that a crash can really happen, instead of being sure you know it won't.

Good luck, and break a leg - not the system.

Friday, November 02, 2012

"Hurricane Sandy’s storm" or "Building Cloud based Disaster Recovery site in record time"

I was planning to write a post on how we created a DR site from scratch in less than 3 days,
but Lahav Savir, Architect & CEO of Emind Systems (now AllCloud), not only worked fast on creating this DR site with us, but also wrote about it fast :-)

So here is a link to his report:
Disaster Recovery site Over Night, yes it happened!