Practical AI in SQL Server 2025: Introducing Vector Search – Pt 1

SQL Server 2025 is on the horizon, and one of its new capabilities is support for Vector Search.

To start off, I’ll confess that I was an initial skeptic about AI and its practical uses. I’m not a fan of overused blanket terms/technology and clickbait fads… NoSQL will eliminate all SQL databases… Cloud means the end of on-prem data centers… blah, blah, blah. But I’ve taken time to dig deeper into Vector Search, its role under the umbrella of AI, and I now believe it has practical, legitimate value.

Help Me Find Something

SQL (Structured Query Language) is all about querying data to find what we want. WHERE predicates enable us to do keyword searches, sometimes with wildcards, other times with ranges, and sometimes with crazy programmatic logic.

But this is not how we humans converse on a daily basis. If I’m talking to Deborah, I’m not going to ask her questions using formal SQL syntax.

SELECT *
FROM dbo.dinner_recipes
WHERE ingredient_list LIKE '%chicken%'
AND cooking_time_min <= 30
AND rating > 4.0;

Instead, I’ll communicate with her with “natural” language.

“Hey Deborah. I see we have some chicken in the fridge. What can we make for dinner with that, in under 30 minutes, that’s something we’ve liked when we’ve cooked it before?”

And through the use of natural language, I’m not constrained with how I ask my questions. This gives my questioning abilities far greater power and flexibility.

“I see we have an onion, broccoli, and cheese in the fridge, plus some sourdough bread. Any ideas on what we can do with any of those things to accompany the chicken?”

So what if I could utilize the flexibility of natural language to query the data in a database? That is where vector search plays a key role in changing our data querying and retrieval methodology.

A High Level Breakdown First

Let’s pretend I have an application with search functionality. I’d argue that most applications’ searching capabilities are constrained and limited. We can only search via specific keywords, maybe with wildcards, ranges of values, and other fairly strict criteria.

But what if that application is upgraded with vector search capabilities? How would that work?

First, I could type a natural language question into an application. That application will encode that question into a vector; I’ll call it a “question vector” for the sake of conversation. Beforehand, I will have also encoded a subset of my human-readable data into vectors; I’ll refer to that set of vectors as my “dataset vectors.”

My question vector is then compared against my dataset vectors. Vector search will find the set of dataset vectors that are closest to my question vector. This is known as a Similarity Search. That result set of dataset vectors represents the information that answers my original question. The data behind those dataset vectors is then passed to a Large Language Model (LLM), which summarizes the results into something that’s human consumable.

In a nutshell, I’ve just described an AI chatbot like ChatGPT. Yes, there are other details, nuances, and extra pieces of the puzzle, but this is a high-level introduction first.
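
If you’re curious what the similarity search half might look like in practice, here’s a rough sketch in T-SQL, borrowing the dinner recipe example from earlier. The table, column, and model names are purely illustrative, and the exact syntax is something I’ll dig into in later posts.

-- Illustrative sketch only: assumes dbo.dinner_recipes already has a
-- recipe_embedding vector column and that an embedding model has been registered.
DECLARE @question vector(768) = AI_GENERATE_EMBEDDINGS(
    N'What can we make for dinner with chicken, in under 30 minutes?'
    USE MODEL my_embedding_model);

SELECT TOP (5)
    recipe_name,
    VECTOR_DISTANCE('cosine', recipe_embedding, @question) AS distance
FROM dbo.dinner_recipes
ORDER BY distance;  -- smallest distance = most similar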

How Can This Help Me Today?

The thing to really understand about this methodology is that it is very useful for analysis of existing data, particularly when looking for data that is “similar to X” as opposed to data that is “equal to X”.

What I’d like to encourage you to do is think creatively about the data in your databases today. What questions could you ask of your data that you cannot easily answer now with traditional query predicates?

Would it be helpful if you could ask of your data…

Are there any common trends throughout?

Are there any anomalies or interesting outliers?

Do we have any data that is similar to this other data?

How About a Real Business Example?

Let’s pretend you have a support ticket system. You regularly get customer tickets for various issues. Each ticket has free text, like the original customer’s complaint write-up, maybe copies of customer/support e-mail exchanges, and internal notes added by support personnel and other internal employees.

If I’m a new support person and I get a ticket about Error 5920, it’s fairly easy to do a traditional keyword search on “Error 5920.” This will help me to find prior tickets and learn how it was resolved by others in the past.

Now, what if I have a customer-written description of an error scenario? Maybe it has some steps to reproduce, but no explicit error message. How can I search for that? I might try to cherry-pick some keywords to search with, but that can be hit or miss.

If my ticket data were also vectorized and I had vector search capabilities, I could start asking natural language questions. “I have a customer that clicked this button on this page, and it resulted in this unexpected behavior… has anyone else reported this for build version 3.12 or later?” While any search has the potential to yield false positives, this has a far greater chance of zeroing in on exactly what you need.

If I have vector search, I can also do bigger picture analysis of data.

Find common issues:
For all tickets logged in the last 30 days, are there any common issues that were reported repeatedly?

Find regressions:
Compare tickets from Build 3.12, 3.5, and 4.0. Are there any new common issues that appeared in 3.12, did not in 3.5, but are now back in 4.0?

Find interesting issues:
For Build 4.0, were there any tickets with much lengthier working notes that might indicate a very complex issue was encountered?

Find anomalies:
Instead of common issues, what issues were reported five times or fewer in the last 60 days?

I would strongly argue that these are all very powerful questions to ask of your data, and that they can be answered rapidly using the power of Vector Search.

Have an Open Mind & Think Creatively

There’s been a lot of hype and noise around AI. A lot of it is fluff. But I hope that you now have a better idea why I believe that Vector Search can be an extremely powerful tool to more quickly gain key insights into your data in ways that were not reasonably possible before. I would encourage you to think deeper about how these ideas can benefit you in your applications and day-to-day.

Stay tuned for Part 2, when I dive deeper into what a Vector is, and how Vector Search works.

Thanks for reading!

P.S.

Want to give SQL Server 2025 and Vector Search a try? Go download the Public Preview now, and you can use my guide to get Ollama set up on your local machine too.

SQL Server 2025 – New Backup Magic

SQL Server 2025 Public Preview is not even a week old, but I’m impressed with another new capability that was released – a new backup compression algorithm: ZSTD. This one came as a surprise: even though I was part of the Private Preview, this capability only appeared with the Public Preview.

TL;DR – ZSTD Can Be Fast!

Today, I’m only going to do a quick blog post to share some initial test results.

This is against a new database that I’m running on a VMware Workstation VM, on my laptop, on a Samsung 990 Pro. I’m backing up my new RecipesDemoDB, which is 25.13 GB in size.

DISK = NUL (x4 files)      Speed
COMPRESSION                3057442 pages in 49.966 seconds (478.050 MB/sec)
ZSTD – Level = Low         3057442 pages in 44.711 seconds (534.236 MB/sec)
ZSTD – Level = Medium      3057442 pages in 33.428 seconds (714.558 MB/sec)
ZSTD – Level = High        3057442 pages in 73.147 seconds (326.551 MB/sec)

DISK = 4 Backup Files      Speed                                               Total Backup Size    % size of original
COMPRESSION                3057434 pages in 80.761 seconds (295.764 MB/sec)    16.52 GB             65.74%
ZSTD – Level = Low         3057434 pages in 39.920 seconds (598.351 MB/sec)    17.37 GB             69.13%
ZSTD – Level = Medium      3057434 pages in 56.676 seconds (421.451 MB/sec)    15.93 GB             63.38%
ZSTD – Level = High        3057434 pages in 94.440 seconds (252.924 MB/sec)    15.86 GB             63.10%

Observations and Thoughts

First, note that there are now three sub-options for the compression level: Low, Medium, and High, with Low being the default. It’s interesting to see that in this initial (SINGLE) test, ZSTD at Level = Low was the fastest when writing to actual backup files, but the final output was slightly larger than MS_XPRESS (the legacy compression algorithm).
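
If you want to try it yourself, the syntax looks roughly like the sketch below. The COMPRESSION (ALGORITHM = ...) option has existed since SQL Server 2022; the ZSTD value and its level sub-option are new to 2025, so treat the LEVEL keyword here as my assumption and double-check the current documentation.

-- Sketch only: verify the exact ZSTD level keyword against the SQL Server 2025 docs.
BACKUP DATABASE RecipesDemoDB
TO DISK = 'NUL', DISK = 'NUL', DISK = 'NUL', DISK = 'NUL'
WITH COMPRESSION (ALGORITHM = ZSTD, LEVEL = MEDIUM), CHECKSUM, STATS = 5;
GO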

And a note about data composition… ~3-4GB consists of mostly text data (source) and the remaining ~20-21GB consists of vector embeddings + corresponding text chunks. Because of the nature of vector embeddings, that’ll impact the compressibility. I’ll be releasing this new database very soon with additional supporting documentation.

I’ll be writing much more on this topic later, but wanted to share these initial findings. I find them extremely compelling and am looking forward to testing this with larger databases on higher end hardware.

Thanks for reading!

Practical AI in SQL Server 2025: Ollama Quick Start

With the announcement of SQL Server 2025 Public Preview, hopefully you are interested in test driving Vector Search.

Microsoft has already posted demo code, but it’s only for OpenAI on Azure. Many of us are wondering about running things locally instead. So I thought I’d share a step-by-step guide to getting Ollama set up and running locally on my laptop. End-to-end, these instructions should take less than 30 minutes to complete.

If you’re on the Docker bandwagon, jump over to Anthony Nocentino’s fast start with Docker Compose instead!

My Setup

My laptop is a Thinkpad P14s Gen 5 AMD, with a Ryzen 7 PRO 8840 HS, 64GB of RAM, and 4TB of storage and running Windows 11 Pro. I also have a Razer Core X eGPU with an NVidia RTX 3070 connected to my laptop.

I run VMware Workstation on my laptop, so I have multiple SQL Server VMs, including a fresh one for SQL Server 2025. I want Ollama to use my eGPU, but because VMware Workstation does not support GPU passthrough (to my knowledge), I have to install Ollama on my base OS. As such, you’ll see some extra steps here that can be skipped if you’re just doing everything either locally or within a VM.

You’ll need:

  • Ollama for Windows (the installer exe)
  • nginx for Windows (the zip download)
  • mkcert

Ollama Setup

First, just run the exe & click through the wizard. That’s easy!

Open a PowerShell prompt and run this:

ollama run llama3.2

This will download the llama3.2 model and begin an interactive session after it completes setup of the model. You can start typing and have a conversation with the model!


Once you’re done, you can exit interactive mode, but Ollama should still remain running in the background. Verify this by checking your System Tray for Ollama.

Now let’s confirm connectivity with this test. Run this in the same PowerShell Prompt:

curl http://localhost:11434

Next, you’ll want to pull an embedding model down. There are several that you can choose from; I’m choosing nomic-embed-text.

ollama pull nomic-embed-text

Now let’s test again. Run this in your PowerShell Prompt:

Invoke-RestMethod  -Uri http://localhost:11434/api/embeddings `
     -Method Post  `
     -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
     -ContentType 'application/json'

NOTE: If you’re browsing other embedding models like mxbai-embed-large, you may see that there’s a curl REST API example command. When I was first playing around, I kept getting a “405 Method Not Allowed” error with that example command. That’s because curl in Windows PowerShell is an alias for Invoke-WebRequest, not the standalone curl the doc page assumes you’re running on Linux. So I just used Invoke-RestMethod instead.

Next, we must address one annoyance that I have with Ollama – it only listens on HTTP and not HTTPS. SQL Server can only communicate over HTTPS (and rightfully so). To resolve this, we must set up a reverse proxy. This is where nginx and mkcert come into play.

mkcert Setup

Again in your PowerShell Prompt:

mkcert -install

dir C:\Users\<YOUR USERNAME>\AppData\Local\mkcert\rootCA.pem

Next we’ll create a certificate for our Windows machine. My IP address is 192.168.1.160.

mkcert <YOUR IP ADDRESS> localhost 127.0.0.1 ::1

Go to your nginx folder, create a new subfolder called certs, and copy the two new .pem files there.


Next we’ll set up nginx. Just unzip the download and switch to that folder. We’ll now modify the \conf\nginx.conf file within the nginx folder.

In the default nginx.conf file, there will be a large segment that starts with http and is enclosed by { } braces. Replace all of that with this (BE SURE TO MAKE CHANGES FOR YOUR ENV):

http {
    server {
        listen 443 ssl;
        server_name <YOUR IP ADDRESS HERE>;

        ssl_certificate      ../certs/<YOUR CERT FILENAME HERE>.pem;
        ssl_certificate_key  ../certs/<YOUR CERT FILENAME HERE>-key.pem;

        location / {
            proxy_pass http://localhost:11434;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}

Save and start nginx.

nginx

NOTE: This will run nginx actively within the prompt session you have open. To stop it, run the following command in another session.

nginx -s stop

Later, if you want to start nginx and run it in the background:

start nginx

Now let’s quickly test again. The first command goes directly to Ollama, while the second and third go through nginx to Ollama:

curl http://localhost:11434

curl https://localhost

Invoke-RestMethod -Uri https://localhost/api/embeddings `
   -Method Post `
   -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
   -ContentType 'application/json'

Note the differences:
* The first goes to the original HTTP URI with the Ollama port specified.
* The second goes to HTTPS but does not need the Ollama port specified, because the nginx reverse proxy is listening on 443 and passing traffic to 11434.
* And finally, we hit the REST API via HTTPS and create a vector embedding.

VMware Workstation Setup

Next, let’s switch to our VM and get remote access going.

First, we must import the root certificate that we created earlier to our VM.

  1. Copy C:\Users\<YOUR USERNAME>\AppData\Local\mkcert\rootCA.pem to somewhere on your VM.
  2. Open the Windows Certificate Manager (certmgr.msc)
    • Start -> “Manage computer certificates”
  3. Navigate to Trusted Root Certification Authorities -> Certificates
  4. Right click Certificates -> Import
  5. Navigate to your copied rootCA.pem (you’ll want to change the “File name:” field to *.pem)
  6. Complete the wizard

Now let’s test from within the VM:

curl https://192.168.1.160

Invoke-RestMethod -Uri https://192.168.1.160/api/embeddings `
  -Method Post `
  -Body '{"model": "nomic-embed-text", "prompt": "This is a test sentence to embed."}' `
  -ContentType 'application/json'

Test within SQL Server 2025

Finally, let’s confirm that this all works within SQL Server!

First, some setup stuff.

-----
-- SETUP
USE master;
GO

-- Enable rest endpoint
sp_configure 'external rest endpoint enabled', 1
GO
RECONFIGURE WITH OVERRIDE;
GO

-- Turn on Trace Flags for Vector Search
DBCC TRACEON (466, 474, 13981, -1) 
GO

-- Check trace flags status
DBCC TraceStatus
GO

Now we’ll use some new T-SQL to create an External Model. This will communicate with Ollama’s HTTPS REST API endpoint that we took the time to create earlier.

-----
-- Create external model
USE CookbookDemo
GO

DROP EXTERNAL MODEL ollama_nomic_embed_text
GO

CREATE EXTERNAL MODEL ollama_nomic_embed_text
WITH (
    LOCATION = 'https://192.168.1.160:443/api/embed',
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nomic-embed-text'
);
GO

-- Verify
SELECT *
FROM sys.external_models;
GO

Now we will create some vector embeddings using the new T-SQL function AI_GENERATE_EMBEDDINGS().

The below code uses my CookbookDemo database. Of course, you’re welcome to use AdventureWorks or any other database of your choosing, but just make sure to change the final example SELECT statement.

-----
-- Test!
SELECT AI_GENERATE_EMBEDDINGS(
        N'this is some text that I want to turn into a vector embedding' USE MODEL ollama_nomic_embed_text
    ) AS GeneratedEmbedding;


-----
-- Run against a table with some data
SELECT 
    TOP 5000
    RecipeName,
    Ingredients,
    AI_GENERATE_EMBEDDINGS(
            Ingredients USE MODEL ollama_nomic_embed_text
        ) AS Ingredients_Embeddings
FROM dbo.Recipes_Flat;

Note that Ollama is using my external GPU (GPU 1 – 3D) to process the Ingredients records into Vector Embeddings! It took about 3 minutes to generate 5000 embeddings with my hardware. I do plan on looking into tweaking Ollama and my setup to improve performance. But hey, this works!

What Next?

Hopefully this quick start guide helped you get Ollama up and running, so you can start playing around now.

If you want to keep diving deeper, you’ll want to create another table with a Vector datatype, and store a foreign key plus the embedding value. Then you can start using vector search T-SQL to do some KNN searching, or create a DiskANN Index to do some ANN search. Have no idea what I just said?!? Don’t worry – I’ll be writing more blog posts to walk you through all of this!
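
If you want a rough idea in the meantime, here’s a minimal sketch of that shape. It assumes nomic-embed-text’s 768-dimension output and uses exact KNN via VECTOR_DISTANCE; the table and column names are just illustrative.

-- Sketch only: a side table holding one embedding per recipe,
-- keyed back to the source row in dbo.Recipes_Flat.
CREATE TABLE dbo.Recipes_Embeddings
(
    RecipeID int NOT NULL PRIMARY KEY,           -- key back to dbo.Recipes_Flat (assumed)
    Ingredients_Embedding vector(768) NOT NULL   -- nomic-embed-text returns 768 dimensions
);
GO

-- Exact KNN: embed the question, then order rows by distance to it.
DECLARE @q vector(768) = AI_GENERATE_EMBEDDINGS(
    N'quick weeknight chicken dinner' USE MODEL ollama_nomic_embed_text);

SELECT TOP (10)
    e.RecipeID,
    VECTOR_DISTANCE('cosine', e.Ingredients_Embedding, @q) AS Distance
FROM dbo.Recipes_Embeddings AS e
ORDER BY Distance;  -- smallest distance = closest match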

And don’t forget to check out my other blogs on Vector Search too!

Thanks for reading!

T-SQL Tuesday #184: My Top Mentoring Obstacle

T-SQL Tuesday Logo

Welcome to another edition of T-SQL Tuesday! This month’s host is Deborah Melkin, who asks bloggers to share some thoughts around mentoring.

My Experiences with Mentoring

Over the course of my career, I’ve had the wonderful opportunity to mentor others in addition to having a mentor. It has been a fantastic experience almost each and every time too.

I value mentorship because I believe that it benefits both parties: the mentor and the mentee. Some of those shared benefits result from the sharing of experiences and perspectives. For example, a mentee may bring a question to a mentor that presents a situation from an angle the mentor may never have considered before. In general, it is an amazing experience for both parties involved and a wonderful opportunity for mutual growth, not just growth of the mentee seeking mentorship.

So Why Don’t I Do This More Often?

As I think back, I’ve mentored others far more often than I’ve been mentored myself. One might think this is rather peculiar, given how I professed the benefits of being mentored just a few moments ago.

“Andy, why haven’t you sought out a mentor more regularly? Is it because you have a difficult time finding a mentor (a commonly voiced challenge)?”

Because I hate setting goals.

For me, I don’t have a difficult time thinking of someone who I would like to have as a mentor. And I am perfectly okay ASKING to be mentored (as that is another common challenge voiced by others). My biggest obstacle is if that person says yes, then what?

One of the generally accepted best practices for a successful mentoring relationship is to try and think of some goals of what you’d like to get out of the mentoring relationship. It helps to create alignment for both parties, so they know what the time is being spent for. And that’s something I’m terrible at.

Unpopular Opinion…

In my opinion, a mentoring relationship CAN work just fine without a specifically set goal. But what’s the point then Andy?

The point is that your mentor is your trusted advisor, helping you work through whatever challenges life may throw at you that week. And just like work life for most of us, our priorities and challenges change frequently.

While I’ve framed this as an unpopular opinion, I will also fully admit that I still need to ask, but just set the expectation up front. It’s no different than saying, “hey, can you help me for a few hours to develop this presentation” vs “hey, can you help me just get a better handle on my career?” And the person being asked to mentor MAY be okay with a less focused mentoring relationship. What I’m guilty of is assuming that they would not be, since it is counter to the typical best practice.

Moral of the Story… is?

Simply ask anyway. But do be clear on what your expectations are, even if you do not have any specific goals in mind. Heck, the kick-off for the mentoring relationship could be a brainstorming conversation around goals!

(I should take my own advice sometimes… )

Thanks for reading – happy T-SQL Tuesday!

SQL Server Backups & Mirror To

I love that, while I’ve been working with SQL Server for far too many years, I’m still constantly learning new things, even about old aspects of SQL Server.

MirrorDirectory / MIRROR TO… Wut?

About a month ago, I was having a conversation with a Pure Storage customer about their SQL Server backups. They were using Ola Hallengren’s scripts but mentioned something about writing the backups to multiple locations at once. I presumed they were executing the backup operation as normal, then using another script of some sort to copy the backup files elsewhere. That’s when I learned that Ola has a parameter called MirrorDirectory, which exposes the MIRROR TO option of the native BACKUP command.

This conversation raised the question: what is really happening behind the scenes when this parameter is utilized? More specifically, are write I/Os done in parallel, as when you stripe your backups across multiple backup files, or are they serial? Or maybe a combination thereof: parallel within a backup set, but serial to each different MIRROR TO destination?

Simple Test

I finally got some spare time, so I put together a simple test to dig deeper. First, here’s the backup command that I created:

-- Turn on Trace Flags
DBCC TRACEON(3604, 3004, 3014, 3213, -1);
GO

BACKUP DATABASE AutoDealershipDemo
TO
     DISK='I:\AYunBackups\AutoDealershipDemo-1a.bak',
     DISK='I:\AYunBackups\AutoDealershipDemo-1b.bak',
     DISK='I:\AYunBackups\AutoDealershipDemo-1c.bak'
MIRROR TO
     DISK='H:\Backup\AutoDealershipDemo-2a.bak',
     DISK='H:\Backup\AutoDealershipDemo-2b.bak',
     DISK='H:\Backup\AutoDealershipDemo-2c.bak'
MIRROR TO
     DISK='D:\Backup\AutoDealershipDemo-3a.bak',
     DISK='D:\Backup\AutoDealershipDemo-3b.bak',
     DISK='D:\Backup\AutoDealershipDemo-3c.bak'
MIRROR TO
     DISK='V:\Backup\AutoDealershipDemo-4a.bak',
     DISK='V:\Backup\AutoDealershipDemo-4b.bak',
     DISK='V:\Backup\AutoDealershipDemo-4c.bak'
WITH INIT, CHECKSUM, FORMAT, STATS = 3
GO 

If you’re wondering about the Trace Flags, I wrote about them here.

Then to monitor I/O, I opted for the simple, brute-force method of using Resource Monitor on the SQL Server itself. So I simply executed the above backup statement, switched windows to the SQL Server, and here’s what I saw:

Simple explanation – all writes to all backup sets and backup files were occurring simultaneously. Nothing kicked off serially. Neat!

Conclusion

So there you have it folks. MIRROR TO can enable you to write backup files to multiple locations simultaneously. But be cautious that it will take more CPU and you’ll increase the volume of write I/O, meaning you could bottleneck on your storage interconnect. Could be useful, but could become painful as your databases grow.

Thanks for reading!

Hyper-V: Cluster Shared Volumes, SQL Server, & FlashArray Snapshots

Many (arguably most) are currently evaluating hypervisor alternatives to VMware, and one obvious contender is Microsoft Hyper-V. And if you’re a Pure Storage customer today, one thing you may be wondering is whether you can keep utilizing storage-array snapshots to do things like replicate SQL Server data from one VM to another. The good news is yes, absolutely! Let’s take a look at a common scenario as an example.

TL;DR – Show Me the Code!

A PowerShell example to orchestrate this workflow is now available on Pure’s GitHub in the SQL Server Scripts repository. In addition to having the proper setup detailed below, read the README file’s Important Usage Notes section, which outlines other prerequisites.

If you just want to take Hyper-V CSV snapshots, without SQL Server involved, check out this Github repository, where I’ve shared a PowerShell solution. This is particularly useful if you want to clone an entire VM and present a second iteration of it back to a Hyper-V cluster.

And to see this in action, watch this narrated demo recording!

Updating Non-Prod Databases with a New Prod Copy

One extremely common task and time sink for DBAs is needing to take a backup of a production database and restore it to a non-production environment. Because it is a size-of-data operation, it can be quite time consuming. To help with that, Pure offers solutions that I’ve showcased before, which leverage FlashArray snapshots to orchestrate such workflows nearly instantaneously. But because (anecdotally) ~90% of all SQL Servers are on VMware, our examples were written for VMware. So now I want to show you how to do it in Hyper-V.

Hyper-V: Key Tidbits Overview

In Hyper-V, virtual disks take the form of VHDX files that reside on a Cluster Shared Volume (CSV) (or, generally pre-Windows Server 2012, VHDs that reside on a Clustered Disk). While CSVs were around in Windows Server 2008 R2, they only really became useful and relevant in their second iteration, released with Windows Server 2012. This linked blog has a fantastic deep dive into the CSV stack.

If you’re familiar with VMware, then I would say that a CSV is somewhat analogous to a VMFS datastore. A VMFS datastore contains one or more VMDK files, with each file representing a virtual disk for a virtual machine. With Hyper-V, each VHDX file is like a VMDK file.

So when you take a snapshot on FlashArray, you’re taking a snapshot of the entire volume – the CSV in Hyper-V or the VMFS datastore in VMware (unless you’re using vVols – a separate conversation for another time).

The Setup & Workflow

For this, imagine that you have two SQL Servers: SQL-Prod and SQL-NonProd. These are Hyper-V VMs and may reside on the same CSV or different CSVs; it does not really matter. The idea is that we do not want to snapshot the ENTIRE VM, but instead isolate one or more SQL Server user databases to clone from Prod to non-Prod.

So what you’ll do is create two more CSVs, one for each SQL Server VM. We’ll call them CSV-UserDB-Prod and CSV-UserDB-NonProd for the sake of discussion. Inside each CSV will be the VHDX file(s) that correspond to the virtual disks containing your user database data files and log files. If you have, say, a D:\ drive for data and an L:\ drive for log, then you should expect to have two VHDX files in your CSV.

The workflow is that we take a snapshot of just CSV-UserDB-Prod, then clone that snapshot and overlay CSV-UserDB-NonProd, thus updating the underlying data and files. Production will not be interrupted at all by the snapshot, but because these are volume-level operations, non-Prod will be disrupted.

Thanks for reading – happy snapshotting!

Hyper-V Cluster Shared Volumes + Write Caching

This blog post will probably never apply to anyone except me, but I’m writing it anyway.

Let’s say you have some VMware VMs… and decide to run Hyper-V hosts/cluster within those VMware VMs. And let’s say you start running into some headaches: you take crash-consistent snapshots, but those snapshots do not come back in a consistent state as you would expect. That’s been my headache recently.

Write Caching?

So there are several different layers of caching that can come into play, as documented here: “Hyper-V storage: Caching layers and implications for data consistency.” I searched high and low to try to figure out if I had unexpected write caching set up on my Cluster Shared Volumes. Looking at Disk Management -> Properties (of a CSV), one would normally see this:

Write-caching policy -> Enable write caching on the device

However… that is what one would see IF their respective Hyper-V host was on a bare-metal server. That’s NOT what one sees though, if the Hyper-V host resides on a VMware VM backed by a vVol (Virtual Volume). THIS is what you get…

Note that there are TWO policies now – Removal Policy

Removal Policy?! Where did that come from? After doing some digging, I learned that this policy is normally only present for removable storage like USB drives. And look at the default setting’s description… “enables write caching in Windows…”

So I disabled that on each of my Cluster Shared Volumes, bounced both Hyper-V hosts (not 100% sure if that was required, but I did so just to be safe), and HURRAY – my crash-consistent snapshot code worked as expected!!!

One reason why this was so frustrating is that I was not doing most of this investigative work via the GUI, but rather via PowerShell commands. So I was digging around things like Cluster Parameters and turning CsvEnableBlockCache and CsvEnforceWriteThrough on and off (https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs & https://learn.microsoft.com/en-us/previous-versions/windows/desktop/mscs/physical-disk-csvenforcewritethrough). But nuking this Removal Policy wound up doing the trick for me.

TL;DR

If you’re trying to disable any and all write caching, for whatever reason, be sure to check if Windows thinks the disk is removable media and has a Removal Policy applied (that also enables write caching).

Thanks for reading.

Combat Database Bloat with Data Virtualization

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. In my opinion, it’s one of those underrated capabilities that many have glossed over since SQL Server 2022’s release. But I strongly believe it is insanely useful, and you should take a few minutes to learn more!

Data Bloat Anyone?

Ask yourself, do you have a database that has a large volume of data that will never change again? Sales order history data is a generic but very relatable example. Once an order is placed, it might be amended, but after a few months it’s highly probable that it’ll never change again. We’ll pretend for the sake of our discussion that after 3 months, that data is considered set in stone.

Can We PLEASE Archive/Delete That Old Data?

Who has begged their business colleagues to let them de-bloat a database by archiving off and deleting ancient data from a primary production database? I see your raised hands and raise mine too. There are many practical reasons to de-bloat a database, from performance to maintenance. But then there are also the business folks who say “no we can’t because we MIGHT need to report on it.” Might… So like my stubborn beer belly fat, that data bloat remains.

Meet Parquet (… not butter)

We should all know what a Comma Separated Values (CSV) flat file is. Because of its flat, singular nature, it’s not very efficient to query: the entirety of the file must be scanned, consumed, processed, etc. But what if we had a file format whose entirety did not need to be fully processed when querying? Wouldn’t it be awesome to have a flat file format that offers column elimination, row elimination, and compression for our data and queries? Let me introduce you to Parquet.

Pretend your sales order history table has 30 different columns of data. And let’s say you were to export all of your sales order history into individual CSVs per calendar year. If you wanted to query, say, all sales in May, June, & July of 2021 and get a SUM of total sales, you’d have to consume the entirety of the 2021 CSV file. But with a Parquet file, metadata is utilized to enable you to zero in on rows for May, June, and July only. And instead of all 30 columns, you can retrieve sales date and sales amount ONLY, and not consume the other 28 columns! That amounts to a tremendous savings from a workload perspective!

Data Virtualization

So how does SQL Server 2022 come into play here? First, we create something called an External Data Source, which in this case will point to S3 object storage somewhere. Then we combine it with an External File Format for our Parquet files. The third piece of the solution is CREATE EXTERNAL TABLE AS SELECT (aka CETAS).
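
Here’s a minimal sketch of what those first two pieces might look like against S3-compatible storage. The object names match the CETAS example below; the endpoint, bucket, password, and credential values are placeholders you’d swap for your own.

-----
-- One-time prerequisites (sketch): PolyBase must be installed and enabled,
-- and CETAS export must be allowed.
EXEC sp_configure 'polybase enabled', 1;
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- Run the rest in the database where the external tables will live.
-- A database master key is required before creating a scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
GO

-- Credential holding the S3 access key and secret (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL s3_dv_credential
WITH IDENTITY = 'S3 Access Key',
     SECRET = '<AccessKeyID>:<SecretAccessKey>';
GO

-- External data source pointing at an S3-compatible endpoint and bucket.
CREATE EXTERNAL DATA SOURCE cetas_demo
WITH (
    LOCATION = 's3://<your-s3-endpoint>/<your-bucket>',
    CREDENTIAL = s3_dv_credential
);
GO

-- External file format telling SQL Server these files are Parquet.
CREATE EXTERNAL FILE FORMAT parquet_file_format_object
WITH (FORMAT_TYPE = PARQUET);
GO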

The end result is that each Parquet file will now become available to us within T-SQL as an External Table entity. So I can write T-SQL to query a Parquet file, and my code looks just like a query against a regular table.

CREATE EXTERNAL TABLE parquet.SalesOrderHistory_2021
WITH (
	LOCATION = '/SalesOrderHistory_2021.parquet', 
	DATA_SOURCE = cetas_demo, 
	FILE_FORMAT = parquet_file_format_object
)
AS 
SELECT 
	[SalesID], 
	[SalesDate],
	[ProductID],
	[Email],
	[PhoneNumber],
	[OtherColumns]
FROM dbo.SalesOrderHistory   -- source table holding the full order history (name assumed)
WHERE SalesDate >= '2021-01-01' AND SalesDate < '2022-01-01';
GO

SELECT SalesID, SalesDate
FROM parquet.SalesOrderHistory_2021
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Now you may have noticed that we have a single external table for 2021 data, i.e. one table per Parquet file. You might be thinking: will that be a problem if I have many years’ worth of data, and thus many Parquet files? This is where another old T-SQL trick comes into play – partitioned views. I can have multiple Parquet files virtualized via multiple External Tables, but overlay them all with a single partitioned view, enabling me to query everything as I once did.

CREATE OR ALTER VIEW dbo.SalesOrderHistory
AS
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2020
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2021
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2022
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2023
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM parquet.SalesOrderHistory_2024
UNION ALL
SELECT 
	[SalesID], [SalesDate], [ProductID], [Email], [PhoneNumber], [OtherColumns]
FROM dbo.SalesOrderHistory_Current
GO

SELECT SalesID, SalesDate
FROM dbo.SalesOrderHistory
WHERE SalesDate >= '2021-05-01' AND SalesDate < '2021-08-01'

Be sure to look closely at the final SELECT in the partitioned view. That’s a reference to a traditional SQL Server table! What this means is that instead of having a dbo.SalesOrderHistory table that contains everything, including new orders, I’ve physically partitioned the data out into 5 Parquet files, virtualized via External Tables, AND still included all current sales order data as well. So existing T-SQL queries against dbo.SalesOrderHistory (written before the Parquet split) have no idea anything changed and do NOT need to be modified!

You Mentioned Azure & S3… That Means I Need to Cloud, Right?

Whenever people (myself included, for the longest time) hear “S3” and/or “object storage,” they typically mentally default to AWS S3 Object Storage. But let’s break that apart. In brief, “object storage” is essentially a storage architecture geared towards unstructured data like images, videos, and documents. “S3” is Amazon’s “Simple Storage Service,” which I’d loosely describe as Amazon’s implementation of object storage. The good news here is that you’re not stuck with only Amazon: S3-compatible object storage is also a thing. This is where other vendors have their own object storage implementations but leverage the Amazon S3 API.

“But my data is on-prem and staying that way.” This is where another common misconception comes in. Because “Amazon S3 object storage” has become such a common term, many do not realize that S3-compatible on-prem storage is a thing! And yes, this is where Pure Storage comes into the picture with our FlashBlade series. You can have all of this awesomeness on-prem without having to compromise with the cloud.

I Want To See This in Action!

Hopefully you’re chomping at the bit to give this a go. I’d like to leave you with three resources to help you get jump-started:

  1. Watch… A New Solution for Data Archiving using Data Virtualization
    This is my Pure Storage vendor session from PASS Summit 2023, that covers everything end to end with demos!
  2. Learn… Microsoft’s Introduction to SQL Server 2022 Data Virtualization learning path
    Go learn straight from Microsoft!
  3. Try… Pure Storage’s Modern Storage Platforms for SQL Server workshop
    Regardless of whether you’re a Pure Storage customer or not, you can review Module 4 of this workshop.
    And if you are a Pure Storage customer WITH a FlashBlade, use this workshop to try it out for yourself!
    And if you are a Pure Storage customer but do not have a FlashBlade, you can contact your Account Team and request a Test Drive voucher for this workshop. This will give you access to a virtual lab with hardware, to work through the lab workshop.

Thanks for reading!

T-SQL Tuesday #182: Integrity

T-SQL Tuesday Logo

Welcome back to another edition of T-SQL Tuesday! This month’s blog party is hosted by Rob Farley who is asking bloggers to share some thoughts around a single word: Integrity.

To the Dictionary!

Like many words in the English language, Integrity has several meanings. I decided to look at a handful of dictionaries and am settling on this one from the Cambridge English Dictionary which states:

the quality of being whole and complete

In the realm of data, what does it mean for data to be “whole” and “complete?” Well, if I’m storing a simple value like “$14,693.31” or “fourteen thousand and six hundred and ninety three dollars and thirty one cents”… I would be rather upset if the value that wound up being stored was “693.31” or “$1 ,6 3 .31”. That data is neither whole nor complete, thus the integrity of my data is compromised!

Garbage In, Garbage Out

I cannot tell you how many times I’ve encountered scenarios where “this data looks wrong.” Well… can one ensure that it is being retrieved and displayed correctly from the storage media that it resides on in the first place? Are you viewing/validating the data in question correctly? Whatever client/method you are using to review your data – that is suspect and its integrity is in question.

Assuming the answer is yes, well, did the data somehow get changed while it was “at rest” on its media? This is highly, highly unlikely – stable media like flash or even spinning disk very rarely flip bits when idle. But in extremely rare cases, okay, your storage media has lost its ability to maintain your data integrity.

But more often than not, questionable data was written to your media in an incorrect fashion. That is where the saying, “garbage in, garbage out” is definitely applicable. So, what application is writing your data?

“Oh no Andy, it’s not the application – it must be the storage that’s corrupt!”

“If the storage were corrupt, we’d be seeing random garbage data all over the place. Not just these and only these specific dollar values. So that’s HIGHLY unlikely here.”

“It’s got to be the storage is corrupt, check again.” (hours pass)… “… okay, we found a bug in our code.”

Step by Step

What’s the moral of this simplistic story here? As you determine the source of an integrity issue, you must carefully walk the stack and assess at each point in the stack, where your integrity was compromised.

But Andy, shouldn’t your database have already had other things in place like constraints, foreign keys, etc. to also ensure that all data written down maintains its integrity? Absolutely… but that’s another story for another day (that I hope another T-SQL Tuesday’s blogger tackled too).

Thanks for reading.