Hyper-V: Cluster Shared Volumes, SQL Server, & FlashArray Snapshots

Many organizations (arguably most) are currently evaluating hypervisor alternatives to VMware, and one obvious contender is Microsoft Hyper-V. And if you’re a Pure Storage customer today, one thing you may be wondering is whether you can keep utilizing storage-array snapshots to do things like replicate SQL Server data from one VM to another. The good news is yes, absolutely! Let’s take a look at a common scenario as an example.

TL;DR – Show Me the Code!

A PowerShell example that orchestrates this workflow is now available on Pure’s GitHub in the SQL Server Scripts repository. In addition to having the proper setup detailed below, read the README file’s Important Usage Notes section, which outlines additional prerequisites.

If you just want to take Hyper-V CSV snapshots, without SQL Server involved, check out this GitHub repository, where I’ve shared a PowerShell solution. This is particularly useful if you want to clone an entire VM and present a second copy of it back to a Hyper-V cluster.

And to see this in action, watch this narrated demo recording!

Updating Non-Prod Databases with a New Prod Copy

One extremely common task and time sink for DBAs is taking a backup of a production database and restoring it to a non-production environment. Because it is a size-of-data operation, it can be quite time consuming. To help with that, Pure offers solutions, which I’ve showcased before, that leverage FlashArray snapshots to orchestrate such workflows nearly instantaneously. But because (anecdotally) ~90% of all SQL Servers are on VMware, our examples were written for VMware. So now I want to show you how to do it in Hyper-V.

Hyper-V: Key Tidbits Overview

In Hyper-V, virtual disks take the form of VHDX files that reside on a Cluster Shared Volume (CSV) (or, generally pre-Windows Server 2012, VHDs that reside on a Clustered Disk). While CSVs were around in Windows Server 2008 R2, they only really became useful and relevant in their second iteration, released with Windows Server 2012; this linked blog has a fantastic deep dive into the CSV stack.

If you’re familiar with VMware, then I would say that a CSV is somewhat analogous to a VMFS datastore. A VMFS datastore contains one or more VMDK files, with each file representing a virtual disk for a virtual machine. With Hyper-V, each VHDX file is like a VMDK file.

So when you take a snapshot on FlashArray, you’re taking a snapshot of the entire volume – the CSV in Hyper-V or the VMFS datastore in VMware (unless you’re using vVols – a separate conversation for another time).

The Setup & Workflow

For this, imagine that you have two SQL Servers: SQL-Prod and SQL-NonProd. These are Hyper-V VMs and may reside on the same CSV or different CSVs; it does not really matter. The idea is that we do not want to snapshot the ENTIRE VM, but instead isolate one or more SQL Server user databases to clone from Prod to non-Prod.

So what you’ll do is create two more CSVs, one for each SQL Server VM. We’ll call them CSV-UserDB-Prod and CSV-UserDB-NonProd for the sake of discussion. Inside each CSV will be the VHDX file(s) that correspond to the virtual disks containing your user database data files and log files. If you have, say, a D:\ drive for data and an L:\ drive for logs, then you should expect to have two VHDX files in each CSV.
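If you want to confirm which VHDX files back a VM and which CSV paths they live on, here is a quick, hedged sketch using the built-in Hyper-V and FailoverClusters PowerShell modules; the VM name below is the hypothetical SQL-Prod from this example.

# List every virtual disk attached to the VM, along with the path to its VHDX file
Get-VMHardDiskDrive -VMName 'SQL-Prod' |
    Select-Object VMName, ControllerType, ControllerNumber, ControllerLocation, Path

# List the Cluster Shared Volumes and their mount points under C:\ClusterStorage
Get-ClusterSharedVolume |
    Select-Object Name, @{ n = 'Path'; e = { $_.SharedVolumeInfo.FriendlyVolumeName } }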

The workflow is that we will take a snapshot of just CSV-UserDB-Prod, then clone that snapshot over CSV-UserDB-NonProd, thus updating the underlying data and files. Production will not be interrupted at all by the snapshot, but because these are volume-level operations, non-Prod will be disrupted.
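To make that concrete, below is a minimal sketch of just the FlashArray portion of that workflow, using the PureStoragePowerShellSDK2 module. The array endpoint and volume names are hypothetical placeholders, and the sketch deliberately omits the SQL Server and cluster orchestration (offlining the databases, taking the non-prod CSV offline, and bringing everything back online afterward) that the full script on Pure’s GitHub handles.

# Sketch only: snapshot the prod volume and overwrite the non-prod volume with it
Import-Module PureStoragePowerShellSDK2

$Credential = Get-Credential    # FlashArray API credentials
$FlashArray = Connect-Pfa2Array -Endpoint 'flasharray1.example.com' -Credential $Credential -IgnoreCertificateError

# 1. Snapshot the FlashArray volume backing CSV-UserDB-Prod (no impact to production)
$Snapshot = New-Pfa2VolumeSnapshot -Array $FlashArray -SourceName 'csv-userdb-prod'

# 2. Overwrite the FlashArray volume backing CSV-UserDB-NonProd with that snapshot
#    (the non-prod CSV and its databases must already be offline at this point)
New-Pfa2Volume -Array $FlashArray -Name 'csv-userdb-nonprod' -SourceName $Snapshot.Name -Overwrite $true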

Thanks for reading – happy snapshotting!

Hyper-V Cluster Shared Volumes + Write Caching

This blog post will probably never apply to anyone except me, but I’m writing it anyway.

Let’s say you have some VMware VMs… and decide to run a nested Hyper-V cluster within those VMware VMs. And let’s say you start running into some headaches, where the crash-consistent snapshots you take do not come back in the consistent state you would expect. That’s been my headache recently.

Write Caching?

There are several different layers of caching that can come into play, as documented here: “Hyper-V storage: Caching layers and implications for data consistency.” So I searched high and low to figure out whether I had unexpected write caching enabled on my Cluster Shared Volumes. Looking at Disk Management -> Properties (of a CSV), one would normally see this:

[Screenshot: Write-caching policy -> Enable write caching on the device]

However… that is what one would see IF the Hyper-V host were on a bare-metal server. That’s NOT what one sees, though, if the Hyper-V host resides on a VMware VM backed by a vVol (Virtual Volume). THIS is what you get…

[Screenshot: Note that there are TWO policies now – Removal Policy]

Removal Policy?! Where did that come from? After some digging, I learned that it is normally only present for removable storage like USB drives. And look at the default setting’s description… “enables write caching in Windows…”

So I disabled that on each of my Cluster Shared Volumes, bounced both Hyper-V hosts (not 100% sure that was required, but I did so just to be safe), and HURRAY – my crash-consistent snapshot code worked as expected!!!

One reason why this was so frustrating is that I was not doing most of this investigative work via the GUI, but rather via PowerShell commands. So I was digging around in things like Cluster Parameters and turning CsvEnableBlockCache and CsvEnforceWriteThrough on and off (https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs & https://learn.microsoft.com/en-us/previous-versions/windows/desktop/mscs/physical-disk-csvenforcewritethrough). But nuking this Removal Policy wound up doing the trick for me.
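For reference, here is roughly how I was reading those values from PowerShell; treat it as a hedged sketch with a hypothetical CSV name, and note it only inspects the settings rather than changing them.

# Read the cluster-wide CSV block cache size (in MB)
(Get-Cluster).BlockCacheSize

# Read the caching-related private properties on a specific Cluster Shared Volume
Get-ClusterSharedVolume -Name 'Cluster Disk 2' |
    Get-ClusterParameter -Name CsvEnableBlockCache, CsvEnforceWriteThrough |
    Select-Object Name, Value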

TL;DR

If you’re trying to disable any and all write caching, for whatever reason, be sure to check whether Windows thinks the disk is removable media and has a Removal Policy applied (whose default setting also enables write caching).
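If you’d rather do that check from PowerShell than click through Disk Management, one quick way (a hedged sketch using only a built-in CIM class) is to look at what Windows reports as each disk’s media type; a SAN- or vVol-backed disk showing up as removable media is the tell-tale sign.

# List each physical disk and what Windows believes its media type to be
Get-CimInstance -ClassName Win32_DiskDrive |
    Select-Object Index, Model, InterfaceType, MediaType |
    Sort-Object Index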

Thanks for reading.

Combat Database Bloat with Data Virtualization

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. Since SQL Server 2022’s release, it’s been one of those underrated capabilities that I think many have glossed over. But I strongly believe it is insanely useful, and you should take a few minutes to learn more!

Data Bloat Anyone?

Ask yourself: do you have a database with a large volume of data that will never change again? Sales order history data is a generic but very relatable example. Once an order is placed, it might be amended, but after a few months it’s highly probable that it will never change again. We’ll pretend for the sake of our discussion that after three months, that data is considered set in stone.

Can We PLEASE Archive/Delete That Old Data?

Who has begged their business colleagues to let them de-bloat a database by archiving off and deleting old, ancient data from a primary production database? I see your raised hands and raise mine too. There are many practical reasons to de-bloat a database, from performance to maintenance. But then there are also the business folks who say “no we can’t, because we MIGHT need to report on it.” Might… So like my stubborn beer belly fat, that data bloat remains.

Meet Parquet (… not butter)

We should all know what a Comma Separated Values (CSV) flat file is. Because of its flat, singular nature, it’s not very efficient to query, because the entirety of the file must be scanned, consumed, processed, etc. But what if we had a different file format whose entirety did not need to be fully processed when querying? Wouldn’t it be awesome to have a flat file format that could offer column elimination, row elimination, and compression for our data and queries? Let me introduce you to Parquet.

Pretend your sales order history table has 30 different columns of data. And let’s say you were to export all of your sales order history into individual CSVs, one per calendar year. If you wanted to query, say, all sales in May, June, and July of 2021 and get a SUM of total sales, you’d have to consume the entirety of the 2021 CSV file. But with a Parquet file, metadata is utilized to enable you to zero in on the rows for May, June, and July only. And instead of all 30 columns, you can retrieve the sales date and sales amount ONLY, and not consume the other 28 columns! That amounts to a tremendous savings from a workload perspective!

Data Virtualization

So how does SQL Server 2022 come into play here? First, we create something called an External Data Source, which in this case will point to S3 object storage somewhere. Then we combine it with an External File Format for our Parquet files. The third piece of the solution is CREATE EXTERNAL TABLE AS SELECT (aka CETAS).
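For context, here is a hedged sketch of what those first two pieces might look like against S3-compatible storage, using the same (hypothetical) object names the CETAS example below references – cetas_demo and parquet_file_format_object. The endpoint, bucket, and credential values are placeholders, and the PolyBase feature must be installed and enabled on the instance.

-- Sketch only: prerequisites for the CETAS example below (placeholder names and values)
-- Assumes PolyBase is installed and 'polybase enabled' has been turned on via sp_configure

-- Schema that will hold the external tables
CREATE SCHEMA parquet;
GO

-- A database master key protects the credential secret
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
GO

-- Credential holding the S3 access key ID and secret access key
CREATE DATABASE SCOPED CREDENTIAL s3_cred
WITH IDENTITY = 'S3 Access Key',
     SECRET = '<AccessKeyID>:<SecretAccessKey>';
GO

-- External Data Source pointing at an S3-compatible endpoint and bucket
CREATE EXTERNAL DATA SOURCE cetas_demo
WITH (
	LOCATION = 's3://flashblade.example.com/sales-archive',
	CREDENTIAL = s3_cred
);
GO

-- External File Format describing our Parquet files
CREATE EXTERNAL FILE FORMAT parquet_file_format_object
WITH (FORMAT_TYPE = PARQUET);
GO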

The end result is that each Parquet file will now become available to us within T-SQL as an External Table entity. So I can write T-SQL to query a Parquet file, and my code will look identical to querying a regular table.

CREATE EXTERNAL TABLE parquet.SalesOrderHistory_2021
WITH (
	LOCATION = '/SalesOrderHistory_2021.parquet', 
	DATA_SOURCE = cetas_demo, 
	FILE_FORMAT = parquet_file_format_object
)
AS 
SELECT 
	[SalesID], 
	[SalesDate],
	[ProductID],
	[Email],
	[PhoneNumber],
	[OtherColumns]
FROM dbo.SalesOrderHistory	-- source table being archived, filtered to the 2021 rows
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';
GO

SELECT SalesID, SalesDate
FROM parquet.SalesOrderHistory_2021
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Now you may have noticed that we have a single external table for the 2021 data – one table per Parquet file. You might be thinking: will that be a problem if I have many years’ worth of data, and thus many Parquet files? This is where another old T-SQL trick comes into play – partitioned views. I can have multiple Parquet files virtualized via multiple External Tables, but overlay them all with a single partitioned view, which then lets me query everything as I once did.

CREATE OR ALTER VIEW dbo.SalesOrderHistory
AS
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2020
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2021
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2022
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2023
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2024
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM dbo.SalesOrderHistory_Current
GO

SELECT SalesID, SalesDate
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Be sure to look closely at the final SELECT in the partitioned view. That’s a reference to a traditional SQL Server table! What this means is that instead of having a dbo.SalesOrderHistory table that contains everything, including new orders, I’ve physically partitioned the data out into five Parquet files, virtualized via External Tables, AND still included all current sales order data as well. So existing T-SQL queries against dbo.SalesOrderHistory (written before the Parquet split) have no idea anything changed and do NOT need to be modified!

You Mentioned Azure & S3… That Means I Need to Cloud, Right?

Whenever people (myself included, for the longest time) hear “S3” and/or “object storage,” they typically default mentally to AWS S3 Object Storage. But let’s break that apart. In brief, “object storage” is essentially a storage architecture geared towards unstructured data like images, videos, and documents. “S3” is Amazon’s “Simple Storage Service,” which I’d loosely describe as Amazon’s implementation approach for object storage. The good news here is that you’re not stuck with just Amazon. S3-compatible object storage is also a thing. This is where other vendors have their own object storage implementations but leverage the Amazon S3 API.

“But my data is on-prem and staying that way.” This is where another common misconception comes in. Because “Amazon S3 object storage” has become such a common term, many do not realize that S3-compatible on-prem storage is a thing! And yes, this is where Pure Storage comes into the picture with our FlashBlade series. You can have all of this awesomeness on-prem without having to compromise by moving to the cloud.

I Want To See This in Action!

Hopefully you’re chomping at the bit to give this a go. I’d like to leave you with three resources to help you get started:

  1. Watch… A New Solution for Data Archiving using Data Virtualization
    This is my Pure Storage vendor session from PASS Summit 2023, that covers everything end to end with demos!
  2. Learn… Microsoft’s Introduction to SQL Server 2022 Data Virtualization learning path
    Go learn straight from Microsoft!
  3. Try… Pure Storage’s Modern Storage Platforms for SQL Server workshop
    Regardless of whether you’re a Pure Storage customer or not, you can review Module 4 of this workshop.
    And if you are a Pure Storage customer WITH a FlashBlade, use this workshop to try it out for yourself!
    And if you are a Pure Storage customer but do not have a FlashBlade, you can contact your Account Team and request a Test Drive voucher for this workshop. This will give you access to a virtual lab with hardware to work through the workshop.

Thanks for reading!

T-SQL Tuesday #182: Integrity

[Image: T-SQL Tuesday logo]

Welcome back to another edition of T-SQL Tuesday! This month’s blog party is hosted by Rob Farley, who is asking bloggers to share some thoughts around a single word: Integrity.

To the Dictionary!

Like many words in the English language, Integrity has several meanings. I decided to look at a handful of dictionaries and am settling on this one from the Cambridge English Dictionary, which states:

the quality of being whole and complete

In the realm of data, what does it mean for data to be “whole” and “complete”? Well, if I’m storing a simple value like “$14,693.31” or “fourteen thousand six hundred ninety-three dollars and thirty-one cents”… I would be rather upset if the value that wound up being stored was “693.31” or “$1 ,6 3 .31”. That data is neither whole nor complete, thus the integrity of my data is compromised!

Garbage In, Garbage Out

I cannot tell you how many times I’ve encountered scenarios where “this data looks wrong.” Well… can you be sure it is being retrieved and displayed correctly from the storage media it resides on in the first place? Are you viewing and validating the data in question correctly? Whatever client or method you are using to review your data is itself suspect, and its integrity is in question first.

Assuming the answer is yes, well, did the data somehow get changed while it was “at rest” on its media? This is highly, highly unlikely – stable media like flash, or even spinning disk, very rarely flip bits while idle. But in extremely rare cases, okay, your storage media has lost its ability to maintain your data integrity.

But more often than not, questionable data was written to your media incorrectly in the first place. That is where the saying “garbage in, garbage out” is definitely applicable. So, what application is writing your data?

“Oh no Andy, it’s not the application – it must be the storage that’s corrupt!”

“If the storage were corrupt, we’d be seeing random garbage data all over the place. Not just these and only these specific dollar values. So that’s HIGHLY unlikely here.”

“It’s got to be the storage is corrupt, check again.” (hours pass)… “… okay, we found a bug in our code.”

Step by Step

What’s the moral of this simplistic story? As you determine the source of an integrity issue, you must carefully walk the stack and assess, at each layer, where your integrity was compromised.

But Andy, shouldn’t your database have already had other things in place, like constraints, foreign keys, etc., to also ensure that all data written down maintains its integrity? Absolutely… but that’s another story for another day (and one I hope another T-SQL Tuesday blogger tackled too).

Thanks for reading.