Availability Group Seeding

First off, I had some fun with the AI-generated images for this one. I think silly images are the way to go.

Automatic seeding for Availability Groups is one of those features that’s fantastic when it works and incredibly frustrating when it doesn’t. When seeding is healthy, databases just show up on the secondary and life is good. When it’s not, you’re left staring at vague status messages, wondering whether anything is actually happening at all. I really hate how the GUI handles this, because whether seeding is working or not, you get no feedback whatsoever until it’s basically done.

Luckily, there are scripts to help here, but if you don’t have them handy, you aren’t getting any information.

The first place I always check is:

SELECT * FROM sys.dm_hadr_automatic_seeding;

This DMV tells you whether seeding started, whether it failed, how many retries have occurred, and whether SQL Server recorded an error code. If seeding failed outright, this is usually where you’ll see the first clue as to why.
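
If you want something more readable than SELECT *, a narrowed-down version like this is what I usually run. Treat it as a sketch; the column and join names are from memory, so verify them against your instance:

-- Resolve database names by joining to sys.availability_databases_cluster
SELECT adc.database_name,
       seed.current_state,
       seed.performed_seeding,
       seed.failure_state_desc,
       seed.error_code,
       seed.number_of_attempts,
       seed.start_time,
       seed.completion_time
FROM sys.dm_hadr_automatic_seeding AS seed
JOIN sys.availability_databases_cluster AS adc
    ON seed.ag_db_id = adc.group_database_id
ORDER BY seed.start_time DESC;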

Next:

SELECT * FROM sys.dm_hadr_physical_seeding_stats;

When things are healthy, this view can show progress and estimated completion. When things are not healthy, it can be empty, partially populated, or frozen in a state that never changes. Use that to your advantage: if seeding is supposedly “in progress” but this DMV isn’t showing anything, something is wrong.
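
When the view is populated, a quick way to eyeball progress is to compute a rough percent complete from the transferred and total sizes. Again, a sketch only; double-check the column names before relying on it:

-- Rough percent complete and throughput per seeding session
SELECT local_database_name,
       role_desc,
       internal_state_desc,
       transfer_rate_bytes_per_second,
       transferred_size_bytes * 100.0 / NULLIF(database_size_bytes, 0) AS percent_complete,
       start_time_utc,
       estimate_time_complete_utc
FROM sys.dm_hadr_physical_seeding_stats;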

Check Whether Data Is Actually Moving (Secondary)

Which brings us to the gnawing question: is anything actually happening right now?

On the secondary replica, I use performance counters to answer that question. This script samples backup/restore throughput over a short window to see if seeding activity is occurring:

----- RUN ON SECONDARY ------
-- Test if there are processes for the seeding occurring right now

IF OBJECT_ID('tempdb..#Seeding') IS NOT NULL DROP TABLE #Seeding;

SELECT GETDATE() AS CollectionTime,
       instance_name,
       cntr_value
INTO #Seeding
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Backup/Restore Throughput/sec';

WAITFOR DELAY '00:00:05';

SELECT LTRIM(RTRIM(p2.instance_name)) AS [DatabaseName],
       (p2.cntr_value - p1.cntr_value)
           / DATEDIFF(SECOND, p1.CollectionTime, GETDATE()) AS ThroughputBytesSec
FROM sys.dm_os_performance_counters AS p2
INNER JOIN #Seeding AS p1
    ON p2.instance_name = p1.instance_name
WHERE p2.counter_name LIKE 'Backup/Restore Throughput/sec%'
ORDER BY ThroughputBytesSec DESC;

If you see throughput here, seeding is still moving data, even if the DMVs look suspicious. If you see nothing, seeding is probably broken.

Restarting Seeding (Without Restarting SQL)

When seeding is stuck, sometimes the fastest path forward is to effectively “kick” the process. On the primary replica, toggling the seeding mode can force SQL Server to restart the automatic seeding workflow:

-----*** RUN ON PRIMARY ******-----
-- Change to your AG name and server names

ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServer1'
WITH (SEEDING_MODE = AUTOMATIC);

ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServer2'
WITH (SEEDING_MODE = AUTOMATIC);
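
If simply re-applying AUTOMATIC doesn’t kick anything off, the more literal version of the toggle is flipping the replica to MANUAL and back again. Same caveat: substitute your own AG and replica names.

ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServer1'
WITH (SEEDING_MODE = MANUAL);

ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServer1'
WITH (SEEDING_MODE = AUTOMATIC);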

This isn’t magic, and it doesn’t fix underlying problems like permissions, disk space, or network throughput, but it often clears up cases where seeding simply stopped progressing for no obvious reason. I use these scripts all the time to verify that there is data movement happening on an AG that stopped syncing overnight or after a patch.
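
One underlying problem worth ruling out while you’re in there: the secondary has to be allowed to create the seeded databases, and if that grant never happened, no amount of toggling will help. This is the standard grant, run on the secondary replica with your AG name substituted:

-- Run on the secondary replica
ALTER AVAILABILITY GROUP [YourAGName] GRANT CREATE ANY DATABASE;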

A Cautionary Tale About “Helpful” AI

I like to test AI to see what it suggests on problems I’m troubleshooting. Lots of times it tells me what I already know, but one time an AI tool confidently suggested a SQL command that would “restart AG data movement.”

That sounded amazing. I got excited. Maybe this was a new command from a recent release that I just hadn’t heard about?

No…It didn’t exist. It was just a hallucination.

AI can be a great accelerator, but you still need to verify everything against reality. Especially when something sounds too good to be true.

Final Thoughts

AG seeding failures are rarely caused by one thing, and no single DMV tells the whole story. You have to look at:

  • Seeding status and error codes
  • Physical progress
  • Data movement
  • And sometimes, whether you need to force SQL Server to reattempt the process

The good news is that with the right scripts and a little patience, most seeding issues can be diagnosed without guesswork. The bad news is that when things break, SQL Server is still not very good at telling you why unless you know exactly where to look.

Hopefully, scripts like these save you a little time the next time seeding decides to go wrong.

Power BI: Easy Data Wins – but Annoying SQL Connections

I want to talk about two Power BI topics today. First, I will give a generic pitch for why it is so great. Then, I will discuss a specific gripe I have about the connection window, which I find lacking.

Power BI: One of the Easiest Wins in Data

I’ve been working with data for a long time. One thing that hasn’t changed is how much easier everything gets once the data is in a pretty picture. Whether you’re troubleshooting a weird performance spike, trying to understand a trend, or making sense of raw logs, a simple visualization can highlight things you’d never catch in a wall of text.

But that’s probably obvious to more data driven people. Power BI is also awesome for people who wouldn’t normally think it is for them. I’ve taught Power BI basics to dozens of people. So far, I think everyone (even the skeptics) appreciated it once I explained how useful it can be.

Developers, DBAs, analysts, managers, administrators…almost anyone can find a good use for Power BI. You don’t have to be a reporting expert to get value out of it. Ingest a dataset from a CSV. Drag a couple visuals in. Suddenly, you can explain something in 10 seconds that used to take a 20-minute conversation and a whiteboard. You can send a quick report to your manager to showcase a win or a loss, and let the visuals do the explaining for you.

Even if you don’t plan on publishing dashboards or rolling out a reporting platform, Power BI is great as a personal tool. You can explore a dataset, build a quick visual, and understand the patterns. Then, you can move on. It turns “staring at CSV files” into something productive and, dare I say, fun. There are other tools like it, but I think Power BI is perhaps the easiest to get started with. I still love SSRS because it makes some complex reports easier, but the performance is horrible, and it is deprecated. Databricks has Dashboards that are shockingly similar to Power BI. If you know how to use one tool, you’ll be able to use the other with minimal effort.

Direct SQL Connection Woes

I do have a few annoyances with Power BI, and the one I want to mention today is how it handles direct SQL connections.

If your SQL Server doesn’t have a trusted certificate, but encryption is not forced, the native SQL connector will complain (see below) but still let you in.

[Screenshot: certificate warning from the native SQL connector]

However, if you are forcing encryption and have an untrusted certificate, things get bad. Ideally you want to have your certificate issued from a trusted certificate authority, but I know this doesn’t always happen quickly. So…unlike SSMS, there is no checkbox for trusting the certificate. You’ll just get a connection error about the untrusted cert.

[Screenshot: connection error caused by the untrusted certificate]

This really irked me, and I thought I was at an impasse until I found a connection workaround.

The Workaround: Use the OLE DB Connection Instead

The good news is that Power BI has other connection types. One option here is using OLE DB. It’s not as obvious or user-friendly as the standard SQL connector, but it gives you something the default connection doesn’t: the trust certificate checkbox.

[Screenshots: OLE DB connection dialog with the trust certificate checkbox]
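
If you’d rather skip the dialog, the OLE DB connector also accepts a pasted connection string. Something along these lines should work, assuming the MSOLEDBSQL driver is installed; the server and database names are placeholders, and the property names are the standard OLE DB keywords rather than anything Power BI specific:

Provider=MSOLEDBSQL;Data Source=YourServerName;Initial Catalog=YourDatabase;
Use Encryption for Data=True;Trust Server Certificate=True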

Bottom Line: Get a Trusted Certificate

This workaround shouldn’t replace proper security. If your SQL Server has an untrusted certificate, the real solution is to fix the certificate. I’ll admit I’m too lazy on most of my dev boxes to do that, but it’s the right way for production.

I’ll probably post a blog sometime about all the nuances of requesting and building a SQL Server certificate, and the requirements that go with it, but that’s for another day.

Understanding SQL Audit Filters: A Guide for DISA STIG Compliance

I’m going to start the blogging back up with a topic that is near and dear to my heart, and something that has bugged me for years.

Every time I work with SQL auditing, especially in environments governed by DISA STIG requirements, I’m reminded how misunderstood, misconfigured, and frankly incomplete most auditing processes are left.

Audits matter. A lot. They are one of the few artifacts that let you trace what actually happened on a SQL Server instance: the who, what, when, where, and how. When something breaks, or something suspicious happens, the audit is where you go to reconstruct the truth, or at least I’d like to say that.

But who actually digs into them? They are a huge pain to deal with.

Audits are normally just a firehose of noise. Unfiltered audits capture everything, including internal system operations SQL Server does behind the scenes thousands of times a minute. If you’ve ever tried to load a 20GB audit file in SSMS and waited long enough to rethink your life choices, you understand. First of all, letting a single audit file get that big is a mistake in itself, but there are so many reasons it might happen.

Filters: Important, Necessary… and Rarely Understood

If you run a DISA STIG compliant audit, you are capturing over 30 audit action items, which easily translates to hundreds of gigabytes of audit data in a day. The irony is that while filters are essential to reducing audit volume to something manageable, almost no one has tried to add a filter, and even fewer actually understand how to create one. And honestly, I can’t blame them. Audit filters are confusing and poorly documented.
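
To show the general shape of one, here is a bare-bones example. The audit name and predicate are made up purely for illustration, this is absolutely not a STIG-approved filter, and the audit has to be stopped before you can alter it:

-- Hypothetical audit name and predicate, for illustration only
ALTER SERVER AUDIT [STIG_Audit] WITH (STATE = OFF);
GO

ALTER SERVER AUDIT [STIG_Audit]
WHERE NOT ([schema_name] = 'sys' AND [server_principal_name] = 'NT SERVICE\SQLSERVERAGENT');
GO

ALTER SERVER AUDIT [STIG_Audit] WITH (STATE = ON);
GO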


To make it even worse, a lot of organizations think “filters = less auditing = bad,” so they leave everything on. All that does is ensure no one ever reviews the audit, because no one has time to wade through millions of rows of system chatter. Combine that with the dreaded action item SCHEMA_OBJECT_ACCESS_GROUP and you are looking at more data than you can shake a stick at.

Which leads me to the part that drives me crazy.

The STIG Problem: Required to Log It… but Never Required to Review It

If you work with DISA SQL Server STIGs, you already know how much of the checklist focuses on required audit actions. Historically, dozens of STIGIDs have covered the required audit action items.

But not a single STIGID actually requires proof of review.

  • We enforce that the audit must exist.
  • We enforce that certain actions must be captured.
  • We enforce retention (kind of), location, configuration, offloading…

…but nowhere do we enforce that the organization has to actually look at the audit on a scheduled basis. We do at least enforce testing backups.

I’ve seen organizations that never look at their audits. They just lock them away and hope the files don’t eat up too much storage space. Sometimes I was successful in teaching them how important those audit files could be, sometimes I wasn’t.

I’d love to see a STIGID that requires scheduled, documented review of the audit logs. What’s the point of collecting all this data if it only gets looked at after an incident? The problem is, I was the main person advising DISA on STIG changes and improvements for the last few years. There are other colleagues still at Microsoft who could champion this cause, but I don’t think they have the time any more.

SQL Audits Are Slow — Painfully Slow

Even if someone wanted to review the audit regularly, SQL Server doesn’t make it easy. Audits grow large quickly, and the built-in functions for reading audit files are notoriously slow.

You can easily hit scenarios where:

  • A single audit file takes minutes to open
  • A month of audit history takes hours to load
  • “Daily review” becomes impossible simply due to the cost of reading the files

This is another reason filters are so important.
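
When I do have to read files directly, I at least filter sys.fn_get_audit_file down to a time window instead of pulling the entire history. It still has to read the files, but the result set stays small enough to actually look at (the path below is a placeholder):

-- Only pull the last 24 hours instead of the entire audit history
SELECT event_time, action_id, server_principal_name, database_name, object_name, statement
FROM sys.fn_get_audit_file('D:\Audits\STIG_Audit*.sqlaudit', DEFAULT, DEFAULT)
WHERE event_time >= DATEADD(DAY, -1, SYSUTCDATETIME());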

The 2016 Filter Mistake (and Why It Still Haunts Us)

One of the biggest audit-filtering issues came from the SQL Server 2016 STIG. The wording was wrong on the STIGID itself. It told administrators not to filter “administrative permissions.” But that is not what the underlying SRG says.

The SRG actually says you cannot filter out “direct database access”.

I corrected this mistake in the 2022 SQL STIG, so newer environments are finally getting the right guidance in the Check Content. The 2016 STIG still has the incorrect language, and now that I’m no longer at Microsoft, I don’t know if it will ever be fixed. Since the 2016 STIG should be sunset soon, it may not be a problem much longer, but it still haunts me.

The worst part is that even the SQL Server product team didn’t want to touch the filter question. The phrase “administrative activities” was so vague that no one would explicitly approve or deny a filter as STIG-compliant when I asked. I couldn’t get a straight answer internally so that I could provide a STIG compliant audit filter in the Fix Text of the audit creation STIGIDs.

The filter was long and complicated. I’ll spare you the full wall of code in this blog, but imagine dozens of NOT clauses filtering sys tables that you probably have never heard of or want to hear of.

[Screenshot: the full audit filter predicate]

Even with explanations, few felt comfortable deploying the audit filter. It was too long and too confusing to read.

Where Do We Go From Here?

There’s no perfect answer here. A unified STIG for SQL Server, not broken out by version, is my dream. This would keep changes more up to date and easier to maintain. A magically faster audit reading solution would also be ideal. I’ve helped build and maintain an automated Audit Data Warehouse in the past, but it was never widely adopted because even trying to summarize the audit data for regular review took massive amounts of storage space. What can you do right now, though?

  • Use audit filters, but test them thoroughly and document their intent
  • Separate system noise from user-driven events if you can
  • Keep audit files small enough to review regularly; many small files are easier to read than a few huge ones (see the sketch after this list)
  • Define a documented review schedule (even if the STIG doesn’t require it), and use those audits for what they are intended for
  • Push for clearer guidance in future STIG cycles. You can request changes from DISA too; they do listen to customers, they just vet those changes through the vendor and their own SMEs before a change cycle (which is about 6 months).
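
On the file-size point, this is roughly what I mean: a hypothetical audit definition that rolls over at a modest size so no single file becomes unreadable. The name, path, and sizes are placeholders to adjust for your environment, and ON_FAILURE should match your organization’s requirements:

CREATE SERVER AUDIT [STIG_Audit]
TO FILE (FILEPATH = 'D:\Audits\', MAXSIZE = 256 MB, MAX_ROLLOVER_FILES = 100)
WITH (QUEUE_DELAY = 1000, ON_FAILURE = SHUTDOWN);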

I plan to continue working with the SQL STIGs even from outside Microsoft. Security is important to me, and having that be consistent and actionable is paramount.

Changing Roles and Blogging Again

It’s been a while since I’ve posted here…ok it’s been a very, very long time.

After 9 incredible years at Microsoft, things have changed and my career is changing directions. Leaving wasn’t easy, but you can’t always stay and be complacent. More importantly, why did I stop blogging? Well, honestly once I moved to Microsoft I didn’t know where I should blog. I wondered if I should blog using an internal account there or if I could even safely keep blogging about topics related to work as before. It just felt easier to stop entirely, and I was busy too, so it made the decision easier. Now, I need to get into the habit of writing again.

Working at Microsoft was the highlight of my professional life so far. I considered it my capstone company to work for, and moving on from it hurts. I will miss the colleagues I built relationships with over the years. I’m an introvert and never really thought I’d miss coworkers this much, but I also was at Microsoft for a long time, so it makes sense.

I’ve had the chance to collaborate with some of the smartest people in the industry, help customers solve complex SQL Server and Azure problems, and learn more than I ever imagined. It was a wild and great ride and I loved every minute of it, even as things changed and I had to adapt. I’ll always be grateful for the experiences, the mentorship, and the friendships I gained.

So what’s next? That’s still taking shape. It is equally stressful and exciting. I’m exploring opportunities that will build on the skills I’ve learned and the brand I’ve built for myself. The idea of building my brand as a consultant is very enticing, but that is going to be a long and delicate process.

In the meantime, I plan to revive this blog and use it as a place to document ideas and workarounds just as before. Learning to write in the time of AI will be odd though; so many things I blogged about before are just a quick prompt away from an answer…but then again, I’ve long enjoyed asking AI for advice on topics only for it to tell me something along the lines of “consult with a professional or Microsoft support to assist if this does not work.” Which of course I was, so it always got a chuckle out of me.

The regular writing was good for me, and it helped cement knowledge I gained along the way. I want to return to what originally made SQL Sanctum special: sharing experiences, insights, and lessons learned (sometimes the hard way).

Thanks to everyone who supported me over the years, and I look forward to expanding my network. Expect more posts on SQL, PowerShell, Azure, Python, Machine Learning, Power BI, and of course my niche and expert skill in SQL STIGs.

SSMS 2016 Policy Management Quote Parsing Error

I discovered a bug today in 2016 Management Studio when creating and updating policies. It drove me crazy, and cost me a lot of time, until I realized what was going on. Hopefully this will get fixed fast; we are reporting it immediately because I couldn’t find any references to it already out there. Special thanks to Kenneth Fisher for helping confirm that it wasn’t just affecting me.

The Problem

In the latest releases of SSMS 2016 (16.5.1 and newer), policy conditions lose their quotes on each save, causing parse errors.

Vote up the Connect item. A fix for this should be released in the next few weeks, but it doesn’t hurt to show your support for Policy Based Management.

Example

I’ll walk through a full, simplified policy creation showing how I discovered the problem, but it can be recreated by just editing a condition.

I created a new policy named Test and a new condition, also named Test. I set the condition facet to Server and input the following code into the field to create an ExecuteSql statement. Anything requiring a quote inside the string has to use at least doubled single quotes.


Executesql('string',' Select ''One'' ')

conditionscript

Once the code was input, you can see below that the code parsed correctly. SSMS was happy with it, so I hit OK to continue.

conditionready

I finished creating the policy, everything was still looking fine.

createpolicy

I then went to Evaluate the policy. The policy failed, as I expected. That’s not the point. If you look closely, you’ll notice that the Select One statement is no longer surrounded by doubled single quotes. That shouldn’t have happened.

evalresults
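
In other words, after the save the condition effectively contained this, which is no longer a valid string:

Executesql('string',' Select 'One' ')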

I opened the Condition itself and received a parse error. Without the required doubled single quotes, the Condition was broken.

parseerror

Summary

I tested this by creating or editing a condition on its own, without a policy or an evaluation, and got the same results using SSMS 2016 on two separate computers, versions 16.5.1 and 17.0 RC1. When using SSMS 2012 or 2014, the code was not altered and everything worked as it should have. Finally, Kenneth happened to have an older version of SSMS 2016 and could not reproduce my error until he updated to the latest version, indicating that it is a recently introduced bug.

And again, if you haven’t already, vote up the Connect item.

Hyper-V VM Network Connectivity Troubleshooting

Last week I detailed my problems in creating a Virtual Machine in Hyper-V after not realizing that I had failed to press any key and thus start the boot process. Well, I had another problem with Hyper-V after that. Getting the internet working on my VM turned out to be another lesson in frustration. Worse, there was no good explanation for the problem this time.

Problem: A new VM has no internet connectivity even though a virtual switch was created and has been specified.

nonetworkaccess

Solution: Getting the internet working on my VM was a multistep process, and I can’t really say exactly what fixed it. Here are the steps I tried though:

RESET EVERYTHING!

Sadly, that is the best advice I can give you. If you have created a virtual switch, and the internet isn’t working correctly, select everything, and then uncheck whatever settings you don’t actually want. It sounds screwy, but it worked for me. This forces the VM to reconfigure the settings and resets connectivity.

Supposedly the only important setting on the virtual switch properties would be to ensure that you have Allow management operating system to share this network adapter. That will allow your computer and your VM to both have internet access. When I first set this, however, the PC lost internet while the VM had an incredibly slow connection. Needless to say, that was not good enough. Disabling the option did nothing but revert back to my original problem though.

For good measure, I then checked Enable virtual LAN identification  for management operating system. Nothing special still, but I left it to continue troubleshooting. Later, I would uncheck that feature, but I wanted results first.

SwitchSettings.png

Next I went into the Network Adapter properties and checked Enable virtual LAN identification. This is another setting I would later turn back off.

LanSetting.png

Finally I restarted my PC, restarted the Virtual Machine, and for some reason, I then had consistent internet on both the VM and the PC.

Ultimately, the problem was that features needed to be reset. I’m still not sure specifically which one had to be turned on and off again, but toggling everything and restarting worked well enough for me in this case. I was just tired of fighting with it by the time it was working.

At least now I have a VM running Java so it won’t touch my real operating system.

Hyper-V VM Troubleshooting

I’ve made VMs before in Hyper-V, it’s a nice way to keep things separate from your main OS and test out configurations. When you haven’t used it lately, it can also be a lesson in frustration.

My solution? It was just embarrassing.

I had a VM set up and working fine; however, I didn’t need that OS anymore and wanted a brand new VM to play with. I spun up a new VM with the same configuration settings as last time, just a different OS. Every time I tried to boot the VM, though, I got the same error.

 

bootfailure

The boot loader failed – time out.

 

Maybe the new ISO file was corrupt? I switched back to the original that worked for Server 2012R2 in my old VM. That didn’t make a difference.

I hunted online, I asked around. There were a few suggestions.

Review Configuration Settings. Maybe I screwed up the configuration? I rebuilt the VM and made sure all the file paths were perfect, with a new Virtual Hard Disk, just in case I had moved files or changed some folders. That didn’t change anything though.

Disable Secure Boot. I heard that caused OS boot failures. Except that didn’t change anything, and it didn’t really apply to my situation.

Unblock the files. I hear that’s always a problem on new downloads, but I’ve never seen it actually happen to me. My problems are never that simple. This was the first time I actually checked the file properties and – they were blocked! I was very excited, but this did not make a difference. It’s still a good idea to check this anytime you run a new file, as it is a common issue.

unblock

The Solution

Finally, at wits’ end, I reopened the VM console, started the machine, and tried it again. I smashed the keyboard in frustration as it came up. This time, it went straight to installing Windows.

My nemesis in this case was a simple five word phrase that disappeared almost instantly.

Press any key to continue...

It only shows up for a couple seconds at most, and if you start the VM before you connect to it, you’ll never have a chance to hit a key. VMs don’t automatically boot from the installation media; instead they just try to load the (nonexistent) OS on the virtual disk.

So after all that confusion, I just wasn’t hitting a key FAST enough. Sure all those other things can be important and you should always verify your settings, but it shouldn’t have been this difficult.

Next week I’ll share the fun I had trying to get internet connectivity on my VM…

 

 

Get and Set Folder Permissions with PowerShell

Managing permissions for numerous servers is the theme today. Drilling down into a folder, right-clicking properties, then reviewing security on the same folder for potentially dozens of computers is time consuming and, with the capabilities of scripting, unnecessary.

PowerShell lets us do this very easily. The first script allows you to view each account and their corresponding read/write permissions on any number of computers. By default the script will only search the local computer. You can filter to only display a specific right. A full list and explanation of each right is available here.

Function Get-Permission
{
    [Cmdletbinding()]
    Param(
        [string[]]$ComputerName = $Env:COMPUTERNAME,
        [Parameter(Mandatory=$true)]
        [string]$Folder,
        [string]$Rights
    )
    Process {
        $ComputerName |
        ForEach-Object {
            $Server = "$_"
            Write-Verbose "Getting Permissions for \\$Server\$Folder"
            (Get-Acl "\\$Server\$Folder").Access |
            Where-Object { $_.FileSystemRights -like "*$Rights*" } |
            Select-Object IdentityReference, FileSystemRights, AccessControlType
        }#EndForEach
    }#EndProcess
}#EndFunction

Now for a simple example. Remember to supply a $ instead of a : after the drive letter, as this is designed to run remotely.

#Example of Get-Permission
Get-Permission -ComputerName "COMP1","COMP2" -Folder "C$\logs\SQL"

Now that you have verified the permissions list, you might need to make some adjustments. This set command will allow you to change $Access and $Right for a specific $Account with minimal effort across your domain.
Function Set-Permission
{
[Cmdletbinding()]
Param(
  [string[]]$ComputerName = $env:COMPUTERNAME,
 [Parameter(Mandatory=$true)]
  [string]$Folder,
 [Parameter(Mandatory=$true)]
  [string]$Account,
  [string]$Access = "Allow",
  [string]$Right = "FullControl"
)
Process {
  $ComputerName|
  ForEach-Object {
  $Server = "$_"
  $Acl = Get-Acl "\\$Server\$Folder"
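  # Disable inheritance on the folder ($True); the $False argument discards previously inherited rules instead of copying them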
  $Acl.SetAccessRuleProtection($True,$False)
  $Rule = New-Object System.Security.AccessControl.FileSystemAccessRule("$Account","$Right","ContainerInherit,ObjectInherit","None","$Access")
  $Acl.AddAccessRule($Rule)
  Set-Acl "\\$Server\$Folder" $Acl
  Write-Verbose "Permission Set for \\$Server\$Folder"
}#EndForEach
}#EndProcess
}#EndFunction
And here is a quick example of how to execute the function. This can be used to allow or deny rights to the folder.

#Example Set-Permission
Set-Permission -ComputerName "Comp1","Comp2" -Folder "C$\logs\sql" -Account "Domain\ServiceUser" -Access "Allow" -Right "FullControl"
 

Changing Roles

I’ve made a major career change as of last month. I didn’t even get a blog posted last week due to the uncertainty and travel I’ve been doing. It’s already been a huge change, and I haven’t done much more than my new hire orientation yet!

As of last week, I am now a Microsoft employee. I’ve accepted a position as a Premier Field Engineer in SQL Server. Microsoft has been on my list as kind of a “capstone” company that I would like to work for, so when I got the chance to actually apply, I couldn’t pass it up. Working for the company that produces the product I work on will be an amazing experience, and I count myself extremely lucky to have achieved this at such a relatively young age for a SQL Server professional.

Normally this type of role would entail a great deal of travel, but I expressed my distaste for flying and the company was willing to work with me. Instead, I’ve opted to relocate about 1,000 miles, all the way to Arizona. This new experience is both exciting and stressful. It’s a new climate and a smaller town. I’m not a very outgoing person, so meeting new people here is going to be tough, and frankly I’m not even sure how to go about it. That’s going to be an ongoing challenge…

I already have, and will continue to do, a lot of flying around the country as my onboarding with Microsoft continues. The consequence (other than having to fly) is that blogs may continue to be a bit haphazard for the next month or so. Hopefully I will find some spare time between unpacking, stocking the house, and learning the area to find a good subject and queue up a stock of scheduled posts. That is the only reason I had any posts while I was moving!

I’m hoping that as I brush up on some skills and build some new test environments, I’ll have some good topics to cover in the upcoming weeks. I am very excited to start this new role that I am sure will provide me with a wealth of knowledge in the coming years.

 

Failover Cluster Manager Connection Error Fix

A few days ago I encountered a new error with Failover Cluster Manager.  A couple of servers had been rebuilt to upgrade them from Windows Server 2008 to 2012. They were added back to the cluster successfully. However, one of the servers would not open Failover Cluster Manager properly, and tracking down the solution took a long time.

The problem server successfully joined the cluster, but now it would not connect to the cluster using Failover Cluster Manager. If you opened up the application, it didn’t try to automatically connect, and manually connecting with the fully qualified name failed too. Below is the generated error.

failoverclustermanager_wmierror

I love how this error has absolutely no useful information to it. Luckily I was able to track Error 0x80010002 down online.

Research indicated that there was some sort of WMI error on the computer. Rebooting didn’t help anything, and after numerous attempts to correct/rebuild the WMI repository, not much was accomplished. Eventually, the server could connect to the cluster, but that only worked about 30% of the time, and it nearly timed out even when it did succeed! The cluster still never connected automatically.

After further poking around on the internet, I found a few suggested solutions, with my ultimate fix closely following this post. I still had to combine everything together and run scripts all over the cluster before things returned to normal.

First of all, this is a condensed version of the Cluster Query from the TechNet post linked above.

1) Cluster Query


$Nodes = Get-ClusterNode
ForEach ($Node in $Nodes)
{
 If($Node.State -eq "Down")
  { Write-Host "$Node : Node down skipping" }
 Else
 {
  Try
  {
   $Result = (Get-WmiObject -Class "MSCluster_CLUSTER" -NameSpace "root\MSCluster" -Authentication PacketPrivacy -ComputerName $Node -ErrorAction Stop).__SERVER
   Write-Host -ForegroundColor Green "$Node : WMI query succeeded"
  }
  Catch
  {
   Write-Host -ForegroundColor Red "$Node : WMI Query failed" -NoNewline
   Write-Host  " //"$_.Exception.Message
  }
 }
}

Any server that throws an error with the above query needs to have the following scripts run on it:

2) MOF Parser
This recompiles the Cluster WMI provider definition (cluswmi.mof) back into the WMI repository.

cd c:\windows\system32\wbem
mofcomp.exe cluswmi.mof

FCM was still not working correctly, so I reset WMI with the following command.

3) Reset WMI Repository


Winmgmt /resetrepository

That will restart the WMI service, so you’ll probably have to try running it multiple times until all the dependent services are stopped. The command shouldn’t take more than a few seconds to process either way though.

After that, the server that failed the Cluster Query (1) was reporting good connections, but FCM still wouldn’t open properly!

I decided to try the two WMI commands (2 & 3) again on the original server that couldn’t connect to FCM. I had already run those commands there during the initial troubleshooting, so I was starting to think this was a dead end. Still, it couldn’t hurt, so I gave it a shot.

I reopened FCM and voila! Now the cluster was automatically connecting and looking normal.

As a further note, after everything appeared to be working correctly, SQL was having trouble validating connections to each node in the cluster during install, and I had to run commands 2 & 3 on yet another node in the cluster before things worked 100%, even though that node never had a connection error using the Cluster Query (1).