After removing a node from a SQL cluster, some machines had trouble accessing the DB.
Investigation discovered a ping timeout exactly every 5 minutes.
5 minutes is the default timeout of ARP cache entries.
Checking the ARP table during the ping timeouts revealed the MAC address of the removed SQL node.
Shutting the node down solved the issue.
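As a sketch of the diagnosis (the IP below is a placeholder for the removed node's address), the stale mapping can be seen and flushed with the standard Windows arp tool:

```powershell
# Dump the ARP cache; during an outage the removed node's IP
# still shows up mapped to its old MAC address.
arp -a

# Flush the stale entry instead of waiting ~5 minutes for it to expire
# (requires an elevated prompt).
arp -d 10.0.0.5
```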
IP Conflicts…
ReadCommit
Tuesday, June 13, 2017
Tuesday, June 06, 2017
Connect Secured Service Fabric Cluster by Windows Security - from a non-trusted domain
If you have a Service Fabric cluster which you secured using Windows security, you need to use the following PowerShell command to connect to it:
Connect-ServiceFabricCluster -ConnectionEndpoint "ServerName:19000" -WindowsCredential

This, however, requires you to be logged in as a member of the domain of the cluster or a trusted domain.
If this is not the case, you will fail to log in (after a long wait, in my case).
If you need to connect from a different domain, you can use this workaround:
(Notice: you must run this elevated as Administrator)
runas /netonly /user:Domain\Username powershell

Give the password when asked, and this will open a new PowerShell where you can run the Connect command successfully.
The other way is to add this Windows credential to your Credential Manager in Control Panel:
Then you can simply call the connect command - it will connect without any problem.
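The Credential Manager step can also be scripted with the built-in cmdkey tool instead of the Control Panel UI (server name, domain, user and password below are placeholders):

```powershell
# Store a Windows credential for the cluster endpoint in the current
# user's Credential Manager vault.
cmdkey /add:ServerName /user:OtherDomain\Username /pass:P@ssw0rd

# The connect command then picks up the stored credential:
Connect-ServiceFabricCluster -ConnectionEndpoint "ServerName:19000" -WindowsCredential
```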
Azure Service Fabric - Windows security using gMSA - Details
On your Domain Controller:
Check if you already have a Kds Key:
Get-KdsRootKey
If not run the next line:
Add-KdsRootKey -EffectiveTime ((get-date).addhours(-10))
Validate it was created:
Get-KdsRootKey
Create the gMSA (its name in this sample is gMSA-SF-1, and the cluster has 4 machines: SF1, SF2, SF3, SF4):
New-ADServiceAccount -Name gMSA-SF-1 -DNSHostName gMSA-SF-1.myDomain.local -PrincipalsAllowedToRetrieveManagedPassword SF1$,SF2$,SF3$,SF4$ -ServicePrincipalNames ServiceFabric/gMSA-SF-1.myDomain.local
If later on, you need to add/remove nodes:

Set-ADServiceAccount -Identity gMSA-SF-1 -PrincipalsAllowedToRetrieveManagedPassword SF1$, SF2$, SF3$, SF4$, SF5$

Don't forget to create a Domain group with all users that should get Admin rights on the cluster using the UI - in the example below the name of this group is "SFAdmins".
On each of the cluster machines (before deploying the Service Fabric Cluster):
Add the Powershell support to manage AD:
Add-WindowsFeature RSAT-AD-PowerShell

Install the gMSA:
Install-AdServiceAccount gMSA-SF-1
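To verify that a machine can actually retrieve the gMSA's managed password before deploying the cluster, Test-ADServiceAccount can be run on each node (same account name as above):

```powershell
# Returns True if this machine is allowed to retrieve the managed
# password of the gMSA; False or an error otherwise.
Test-ADServiceAccount -Identity gMSA-SF-1
```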
"security": {
"ServerCredentialType": "Windows",
"WindowsIdentities": {
"ClustergMSAIdentity": "mydomain.local\\gMSA-SF-1",
"ClusterSPN": "ServiceFabric/gMSA-SF-1.mydomain.local",
"ClientIdentities": [
{
"Identity": "mydomain.local\\SFAdmins",
"IsAdmin": true
}
]
}
},
Azure Service Fabric - Windows security on Standalone - Timeout
While trying to deploy a new secured Service Fabric cluster (using Windows Security), I got the following error for all the nodes:

Timed out waiting for Installer Service to complete for machine SF1.mydomain.local. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric

When trying the same configuration file, but without the security section, or with Certificate-based security, everything worked perfectly.

Microsoft's engineer gave me a solution:

Replace the FQDN in the nodes configuration (the "iPAddress" property) with the NetBIOS host name.

This solved the issue.
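For context, here is a sketch of the relevant "nodes" entry in the standalone ClusterConfig.json after the change (field values other than "iPAddress" are illustrative):

```json
"nodes": [
    {
        "nodeName": "vm0",
        "iPAddress": "SF1",
        "nodeTypeRef": "NodeType0",
        "faultDomain": "fd:/dc1/r0",
        "upgradeDomain": "UD0"
    }
]
```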
Wednesday, May 10, 2017
Public Cloud Only strategy - Uncompetitive
I have watched Michael Dell's keynote on DELL EMC World 2017, and he said:
Anyone with a Cloud First strategy that is Public Cloud Only - I think will find himself uncompetitive for workloads that have predictable characteristics (see the video here).

Maybe this will change over time, but for now I fully agree with this.

From our experience, even for QA environments that we can take down during the nights, we still pay more on the cloud than On-Premise.
Monday, May 08, 2017
Crash on my next step
After seeing a few human mistakes take down a fully redundant system, I have added a new tool to my belt.
Simply tell yourself the following, before doing any configuration change:

Doing the next planned step, I need to crash the whole system - how can I make it happen?

Now you'll look for ways that a crash can really happen, instead of being sure you know it won't.
Good luck, and break a leg - not the system.
Friday, November 02, 2012
"Hurricane Sandy’s storm" or "Building Cloud based Disaster Recovery site in record time"
I was planning to write a post on how we have created a DR site from scratch in less than 3 days,
But Lahav Savir - Architect & CEO of Emind Systems (now AllCloud) - was not only fast in creating this DR site with us, but also fast in writing about it :-)
So here is a link to his report:
Disaster Recovery site Over Night, yes it happened!
Labels:
Amazon,
Cloud,
Disaster Recovery,
DR,
EC2,
Emind,
Hurricane,
Hurricane Sandy,
Sandy