How I Spent My Last 17 Months

I recently started a new role, and I wanted to talk about what I did in my last role—it wasn’t top secret or anything, but it was somewhat different from my usual work. I’d like to thank everyone at Designmind for giving me this opportunity, as it was a very interesting role. I didn’t leave because I disliked the project; I mainly didn’t see my skillset as being compatible with a firm like Cognizant, but I wish everyone there the best.

The Project

When I started at Designmind, we weren’t quite billing on the project—my role was to be the data architect, I guess, but that pretty quickly evolved into being the everything architect. While I did some other client work while I was there, nearly all of what I did in my last job was supporting this project. The project itself was a new application development effort for a client who aimed to build a supply chain resiliency system. We had a small team of a project manager, a data engineer, and two developers. Sadly, early in the project, we had to remove the data engineer, as his skills didn’t really align with the project. That left all of the data tasks to me and my project manager, who was a massive help. We’ll talk more about data later.

The client specified a fairly specific technology stack (AWS, Python/Flask/SQLAlchemy, PostgreSQL, Kubernetes) which I was happy to adopt. They also had some specific requests around DevOps and security where we differed and went in another direction, based on other business requirements that the application had. While I’ve worked with all of these technologies, I also had to help our developers get up to speed. We had a sample app one of the developers wrote, so at the beginning of the project, I containerized all of that code and wrote a bunch of scripts to automatically deploy the app on Windows, Mac, and Linux, as our target would ultimately be Linux.

Getting Started with AWS

I’ve always worked with AWS, just not as much as I’ve worked with Azure. I’ve had the good fortune to work with both vendors and customers on various AWS projects, so I had a good feel for sizing and performance. The first thing I did was to define the network configuration and build a VPN—I built everything on a private network from the beginning of the project. The basic architecture of the app was that we were going to have front-end containers running React and Nginx, connecting to a couple of middle-tier containers running Python/Flask, with a database running Postgres. Given the fairly narrow scope of this deployment, I used Elastic Kubernetes Service and Amazon RDS for PostgreSQL. For local testing and development, I instructed the team to have a local PostgreSQL instance and Docker with Kubernetes enabled, so we could install the same stack locally and use similar deployment scripts.

This kind of leads us into DevOps—which, I’m not sure why, but it fell into my lap. The client wanted us to use Jenkins; however, its ecosystem and community are dying, and it was really complicated. I have very good bash scripting skills, so using GitHub Actions was a more natural fit for me. We faced a couple of challenges in building out our DevOps workflows—the first was that the builds operated differently on AWS as opposed to locally. This is easily taken care of by making a call to 169.254.169.254, the cloud metadata endpoint, which works on all of the major clouds and lets you identify where your code is running. This mattered because we were using IAM authentication and other conditional deployment steps based on build location. Not just in our build process, but in our Python middle tier, I implemented conditional logic to decide how to authenticate to the database.
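The detection trick is simple enough to show. This is a minimal sketch of the idea, not our actual build code—the timeout and the exact path are assumptions:

```python
import urllib.error
import urllib.request

# Link-local metadata address used by AWS (and, with different paths,
# by Azure and GCP). Reachable only from inside a cloud instance.
METADATA_URL = "http://169.254.169.254/latest/meta-data/"

def running_in_cloud(timeout: float = 0.25) -> bool:
    """Return True if the instance metadata endpoint is reachable."""
    try:
        urllib.request.urlopen(METADATA_URL, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The endpoint answered but rejected the request (e.g. IMDSv2
        # requires a token) -- that still means we're on a cloud instance.
        return True
    except OSError:
        # Connection refused or timed out: laptop or local server.
        return False
```

Both the build scripts and the middle tier can branch on a check like this one to decide, for example, whether to use IAM auth or a local password.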

The Data

I’ve written here before about our data flow process and how we used some AI tooling to improve it. Our data flow and engineering process was really confusing to a lot of traditional data engineering pros I talked to about the project. The biggest issue was that we didn’t have regular flows of inbound data—we had two data sets coming from the federal government that were published nightly. Those were easy—I built an AWS Glue job that downloaded the files, wrote them to S3, and did some degree of cleanup. That Glue job, when complete, triggered a data ingestion process—Glue has its limitations, but for this data, it was fine.
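For flavor, the cleanup step of a job like that looks roughly like the following. The column names and rules here are placeholders (not the real federal feed), and the S3 write appears only as a comment:

```python
import csv
import io

def clean_rows(raw_csv: str) -> list:
    """Drop blank records and strip stray whitespace from a downloaded CSV.

    A stand-in for the real Glue cleanup logic; the actual job also wrote
    the raw download to S3 before transforming it.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Skip rows that are entirely empty once whitespace is removed.
        if any((v or "").strip() for v in row.values()):
            rows.append({k.strip(): (v or "").strip() for k, v in row.items()})
    return rows

# In the Glue job, the cleaned output lands back in S3, roughly:
# boto3.client("s3").put_object(Bucket="example-bucket",
#                               Key="clean/feed.csv", Body=serialized_rows)
```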

The rest of our data sources were either what we could find or sporadic. From the earliest days of this project it was very obvious to me that we would have a data sourcing problem. While the federal government did supply some data, we were either going to have to gather, scrape, or buy data. We ultimately bought data from a vendor, who was terrible (bad, inconsistent data). I asked the vendor at one point if they could provide a delta file (a file of just changes) and they didn’t know what that was. If you know my email, and are in the data market—email me and I’ll tell you who not to buy data from. That vendor data did provide a basis for our webscraping efforts, which was pretty cool. Most of our reference data, like geography, congressional districts, etc., was open source and downloaded into our environment. My project manager helped a lot here, identifying and vetting potential new data sources and getting them integrated into the app.

What I Learned

At this point in my career, I’ve been functioning as an architect since about 2013. A lot of people try to define what an architect does, and they try to do it in the context of what actual building architects do, or what some business book, written by someone who’s never done the job or written a line of code, thinks an architect does. A good architect has to be ahead of the project, to understand the priorities of both the project and the client. The architect needs to be flexible and forward thinking. I always make culinary comparisons, but the role is a lot like being an executive chef. It doesn’t matter if you create amazing recipes (or designs) if your line cooks (or developers) don’t have the skill set to execute them. You either have to increase the skills of those workers (best), hire new workers (hardest), or dial back the recipe/design to match the skill of your team. The architect also needs to think about the problems the team is going to have next—mostly technical problems, but also organizational and tooling ones. If you can get in front of those problems, you can help your developers do their jobs better.

It was a cool experience to be working on an app dev project. I was happy to get to push my skill set and help others grow their own. I would have loved to have completed the project, but unfortunately circumstances got in the way.

Reduce Your Cloud Storage Costs by Storing Files and Metadata in Parquet Files

Ever since the Parquet format came out over a decade ago, it has been very popular for analytics workloads. Being columnar in format, it allows for massive-scale analytics while delivering strong, lossless compression. Various engines, including Snowflake, Databricks, Synapse, SQL Server, and other databases I’m likely ignoring, can all interact with Parquet. In its newer incarnations, like Delta, you can also update those files.


There is a notion of a transaction log for each Delta table–it exists in the form of JSON files, and isn’t as efficient as a singular transaction log, especially for multi-table (or file) transactions. It’s not a replacement for an OLTP database, but for an analytics workload where you occasionally have to update something, it works.

What I’m writing about today has nothing to do with analytics, per se. It has everything to do with cloud storage, and the way operations there are priced–specifically, metadata operations. In the demo code I’ve shared we’re going from five files to one, but you can imagine going from a much larger number of files to a much smaller number. You may ask, “Joey, that sounds dumb, why are you reinventing zip and iso files?” Well, the main reason is that many cloud operations are priced on the number of objects–for example, if you had to calculate a checksum across a number of files on S3 (for files/objects that were created before S3 automatically did checksums).

So the notion of this code, that I wanted to play with, was storing files within a parquet file. At first, I loaded 5 text files into a single parquet file. Then I added an index to the parquet file–thinking forward, I added a mapping parquet file, in order to support multiple parquet files with five files each. You can see the demo in this GitHub repo. This is pretty basic code, but the notion is clear–if you have a very large number of small files that you need to store in object storage, and want to reduce that number, and potentially the storage volume, you can use parquet to do it.

Scaling AI Projects: A Controlled Approach to Web Data Processing

There’s a lot of crap out there when you read about Artificial Intelligence projects, especially on LinkedIn, where I suspect half of the posts may have been created by AI bots. However, we recently implemented a process that included the use of an LLM, in a very controlled fashion. The overall implementation process was pretty interesting, and I wanted to talk about some of the decisions I made, and why.

I obviously can’t share all the details of my current project, but at a high level it’s a custom application that we are developing. One of the biggest challenges my team has faced in the project is acquiring data to support the application. We tried to engage with several data vendors, and when we finally landed on one, we weren’t very happy with the quality and depth of the data they provided. I can say we were seeking information about companies in specific sectors. The obvious answer here was “webscraping”—I don’t know if you’ve ever written code to try to scrape websites, but given there is no common standard for websites, and they are developed with a wide array of frameworks, languages, and formats, it’s just a mess.

One day during our standup, my product manager/data engineer suggested that instead of using traditional webscraping, we capture images of web pages and then feed them to an optical character recognition (OCR) model. This immediately piqued my interest—he had tried it as a one-off, and it seemed effective. This led me to try it out with about 100 sites—I wrote some Python code on my machine to scrape the 100 sites using a package called Playwright. I initially ran the images through Azure Computer Vision, because I have a free account through my MVP award. I had the scraping code grab the home page and the About Us page of each of my targets.
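The capture code was a few lines of Playwright. What follows is an illustrative sketch rather than my actual script—the filename helper is something I’m inventing here for the example, and the Playwright portion is shown as a comment since it needs a browser to run:

```python
import re

def screenshot_name(url: str) -> str:
    """Derive a filesystem-safe screenshot filename from a URL."""
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return re.sub(r"[^A-Za-z0-9.-]", "_", host) + ".png"

# The capture step itself, using Playwright's sync API (untested sketch;
# the real code also grabbed the About Us page of each site):
#
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page()
#     page.goto("https://example.com", wait_until="networkidle")
#     page.screenshot(path=screenshot_name("https://example.com"),
#                     full_page=True)
#     browser.close()
```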

I looked at the output and it was reasonably good—I had a CSV file with a domain name (which effectively acts as our primary key) and a long text description of the company and what they did. My plan was to feed this to an LLM and get it to summarize what the company did, and pull some other specific data features we were looking for. I first tried using the Azure Document Services summary tool, and that worked pretty poorly. I then used Azure AI Foundry to try one of the OpenAI models to see if I would get better results. I got a lovely summary, and my other data features were extracted as I expected. Now that I could see this working, I had to make it work in a production environment.

I quickly threw together a script to scrape 200k websites—I decided to get smart and split the load across 8 nodes. But I cheated—I just split the file into 8 parts and ran the Python script to do the screen scraping. I knew this was a bad idea, but it was late on a Friday and I wanted to get this going. Predictably, all of my worker nodes died over the weekend, and I had to start over from square one.

I’ve been working in Linux for a very long time, so the next part of this process was fairly easy for me, but I still learned a few new things. I implemented a package called Supervisor, which let me build a cluster. I wrote some additional code to be able to easily add additional nodes, and to take advantage of an AWS Simple Queue Service (SQS) queue, pulling URLs off the queue. This gave me resumability and scale—and because the queue maintained state, if nodes were rebooted, it didn’t impact my workload. In fact, I added an additional script, running as a service on my controller node, which checked for unavailable worker nodes—if they were unavailable, we simply rebooted them. I ultimately scaled this scraping cluster to 36 nodes, and we completed our process in about 2.5 days.
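The worker loop boils down to a simple pattern: keep pulling batches until the queue runs dry. A hedged sketch of that shape (the loop is written generically so it can run against anything that returns batches; the SQS hookup in the comment uses real boto3 calls but placeholder names):

```python
def drain_queue(receive, handle) -> int:
    """Generic worker loop: pull message batches until the queue is empty.

    `receive` returns a list of messages (possibly empty, which ends the
    loop); `handle` processes one message and returns True on success.
    """
    processed = 0
    while True:
        batch = receive()
        if not batch:
            break
        for msg in batch:
            if handle(msg):
                processed += 1
    return processed

# Against SQS, `receive` wraps boto3 with long polling, roughly:
# sqs = boto3.client("sqs")
# def receive():
#     resp = sqs.receive_message(QueueUrl=QUEUE_URL,
#                                MaxNumberOfMessages=10,
#                                WaitTimeSeconds=20)
#     return resp.get("Messages", [])
```

Because SQS only deletes a message after the worker confirms it, a rebooted node simply leaves its in-flight URLs to reappear on the queue—that is where the resumability comes from.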

Diagram illustrating a data processing workflow consisting of three clusters: Webscraping, OCR, and Summarization, each with control and worker nodes interacting with AWS services and S3 Buckets for data storage.

I was able to reuse the same cluster to perform the OCR and summarization tasks. Both were much less time consuming than the scraping process. I was able to get away with using eight nodes for both of those processes. The same basic idea applied—publish the data into the queue and let the worker nodes operate on them.

The summarization process is important to us—we wanted to have high quality data and avoid the risk of hallucinations that LLMs can have. I did a couple of things to reduce the risk there—I dropped the temperature parameter to 0.1, greatly reducing the creativity of the model. I also carefully crafted a system prompt instructing the model to only use its input text to create a summary of the site. I ended up using one of the Amazon Nova models—you don’t need a big cutting-edge model to summarize text and extract features. This means the inference costs were extremely low.
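Those two guardrails live in the request itself. Here is a sketch of what such a request looks like using the Bedrock Converse API shape—the model ID and prompt wording are illustrative, not our production values:

```python
def build_summary_request(page_text: str) -> dict:
    """Assemble a Bedrock Converse-style request for low-creativity summarization."""
    return {
        "modelId": "amazon.nova-lite-v1:0",   # small, cheap model is enough here
        "system": [{"text": "Summarize ONLY the provided text. Do not add "
                            "facts that are not present in the input."}],
        "messages": [{"role": "user", "content": [{"text": page_text}]}],
        # Low temperature keeps the model from getting "creative".
        "inferenceConfig": {"temperature": 0.1},
    }

# The actual call would then be:
# response = boto3.client("bedrock-runtime").converse(**build_summary_request(text))
```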

AI tools are best used when we tightly control the input data and put tight guardrails around the process. In this post, I wanted to demonstrate how you can take advantage of the benefits of an LLM at a low cost, and walk you through how I scaled this into a production-level process.

Shut the Front Door–How to Get It Back Open

This week Microsoft Front Door suffered another major outage. I wrote about the last outage(s) in my column at Redmond just a couple of weeks ago. Microsoft Front Door is a global content delivery network that provides a number of other services for websites/APIs/endpoints. One of the challenges around Front Door is that, being a global service, when it goes down there’s no native failover process that you can easily use.

closed red wooden door
Photo by Harrison Haines on Pexels.com

Microsoft has published an initial incident report and there were some interesting details.

How did we respond?

  • 15:45 UTC on 29 October 2025 – Customer impact began.
  • 16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
  • 16:15 UTC on 29 October 2025 – We began the investigation and started to examine configuration changes within AFD.
  • 16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
  • 16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
  • 17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.
  • 17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
  • 17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
  • 18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
  • 18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
  • 23:15 UTC on 29 October 2025 – PowerApps mitigation of dependency, and customers confirm mitigation.
  • 00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.

Nothing reads too out of the ordinary for a cloud outage–but a couple of things stand out. First, there was around 8.5 hours of downtime for the service. The other notable thing is that Microsoft failed the Azure Portal away from Front Door. There were some comments about this in the earlier incident report. So that brings up the question–do you need to have a plan to fail away from Front Door?

Do You Need to Be Multi-Cloud?

I talked about this in my Redmond column, but implementing a backup solution to Azure Front Door is inherently a multi-cloud solution. There are a few choices for global WAF solutions–not just from hyperscalers like AWS, Azure, and GCP, but also Cloudflare. But if your application is global, has a low recovery time objective, and is critical to your business, then you need to be multi-cloud.

The bigger question is: does your entire stack need to be multi-cloud? I would argue, at least in light of our knowledge of cloud failures, probably not. Unless you have an extremely tight SLA, you are greatly increasing the cost and complexity of the network stack. In fact, I would argue most applications don’t need this kind of highly available network stack.

In designing this, I took some lessons from what I think the Azure Portal team has done–I suspect they have their servers behind Application Gateways, and Front Door interacts with those applications.

Diagram illustrating a global web content delivery and load balancing architecture involving TM-Failover, FD-Global, AppGW for US West and East regions, and Cloudflare.

The basic notion is that we use Azure Traffic Manager with priority routing; the Front Door instance pictured here would be the initial fallback. That gives us some degree of protection against Front Door failures, and that approach seemed to work for the most recent outages. However, there were a lot of downstream DNS issues in other Azure services that raised concerns. For example, you could log in to the portal, but a symptom was that you could only see Resource Groups, and no other resources.

Cloudflare comes into play here, presuming you can’t make any app updates, or your app gateways go sideways. You could recreate all of your Front Door functionality, and have easy failover, in effectively a completely different provider. That doesn’t help if Azure were to go completely down, but we haven’t seen an outage like that since the great certificate expiration failure of 2013. Generally speaking, failures are limited to regions–these Front Door outages are exceptions to that rule, as Front Door is a “global” service and isn’t homed to a single region (which makes the recent outages more infuriating).

Traffic Manager being in Azure is a concern to me–I put it into this architecture because it’s relatively easy to configure, but putting DNS in another cloud could be a good option. Both Google (Cloud DNS) and AWS (Route 53) have services that allow for multiple IP addresses and failover based on health probes, or you could use a service like DNS Made Easy to handle this. DNS is really the ultimate challenge in any sort of multi-cloud scenario–where do you put it?
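To make the Route 53 option concrete, here is a hedged sketch of what a PRIMARY/SECONDARY failover pair looks like. The hostnames, set identifiers, and health check ID are placeholders; the record fields themselves follow the Route 53 ResourceRecordSet shape:

```python
def failover_record(name, set_id, role, target, health_check_id=None):
    """Build one half of a Route 53 PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,              # "PRIMARY" or "SECONDARY"
        "TTL": 60,                     # keep TTL low so failover propagates fast
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        # The PRIMARY record needs a health probe; when it fails,
        # Route 53 starts answering with the SECONDARY target.
        record["HealthCheckId"] = health_check_id
    return record

primary = failover_record("www.example.com.", "frontdoor", "PRIMARY",
                          "myapp.azurefd.net", health_check_id="abc-123")
secondary = failover_record("www.example.com.", "cloudflare", "SECONDARY",
                            "myapp.cdn.example-cf.net")

# Applied via boto3, roughly:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z123EXAMPLE",
#     ChangeBatch={"Changes": [
#         {"Action": "UPSERT", "ResourceRecordSet": primary},
#         {"Action": "UPSERT", "ResourceRecordSet": secondary}]})
```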

There’s a lot more detail here than in a normal blog post, and yet I held back a lot of detail. There’s a method to my madness: I’ll be publishing a white paper and doing a webinar with my friends from Denny Cherry and Associates Consulting, John Morehouse and Denny Cherry, to discuss pros and cons, and detailed configurations, for how to make your applications more resilient when the Front Door closes, as they say. Look for more details on that over at dcac.com in the next few weeks.

Group Security Improvements in Azure PostgreSQL

I’m a big fan of using cloud services if you are going to use open source databases like PostgreSQL or MySQL. The cloud services abstract away a lot of the messiness around high availability and backups that is commonly associated with, well, frankly, clustering on Linux. (I’ve built some really nice MySQL clusters on Windows Server Failover Clusters, believe it or not.) They also have some value-added features that you can’t easily get running your own solutions–in the case of Azure, that would be the query store and Entra authentication (amongst other features like AI connectivity).

Postgres 18 adds built-in support for OAuth, but the experience can still be a little rough around the edges. As I’ve mentioned here in the past, my current project runs on Amazon RDS, and while we do use IAM auth, getting it up and running was a couple of days of work, particularly around making OAuth work with SQLAlchemy, the ORM we are using on the project. What made that harder was that we couldn’t use OAuth in our local dev environments, so all of the code I wrote had to be conditional based on whether it was running in a cloud or not (thank you, https://169.254.169.254).
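The conditional-auth pattern looks roughly like this. Host and user names are placeholders, and the SQLAlchemy hookup is shown as a comment—this is a sketch of the approach, not our project’s actual code:

```python
def get_db_password(in_cloud: bool, host: str = "db.example.internal",
                    user: str = "app") -> str:
    """Return the credential to hand to the database driver.

    In the cloud we mint a short-lived RDS IAM token; locally we fall
    back to a dev password.
    """
    if in_cloud:
        import boto3  # only needed on the cloud path
        return boto3.client("rds").generate_db_auth_token(
            DBHostname=host, Port=5432, DBUsername=user)
    return "local-dev-password"  # placeholder for a local dev credential

# With SQLAlchemy, this plugs in via the "do_connect" event so every new
# connection gets a fresh token (IAM tokens expire after 15 minutes):
#
# @event.listens_for(engine, "do_connect")
# def provide_password(dialect, conn_rec, cargs, cparams):
#     cparams["password"] = get_db_password(in_cloud)  # in_cloud from the metadata check
```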

Entra (the artist formerly known as Azure Active Directory) authentication for databases has come a long way. I remember when Azure SQL Database first launched, it was an absolute ordeal to configure, which I somewhat appreciate, as it forced me to learn a lot of the intricacies of the authentication service. Azure PostgreSQL similarly had a multi-step process. Fortunately, things have improved, and enabling Entra auth is simply clicking a radio button in the Azure portal, or setting a flag in your Terraform/Bicep/PowerShell code.

One of the limitations of Azure PostgreSQL’s Entra integration was group login. The login process for members of a group required the user to use the group name as their login ID, and get a bearer token which was used as the password.

Terminal output displaying an access token request for Azure, highlighting JSON structure with parameters such as 'accessToken', 'expires_on', and 'subscription'.

Once logged in, the user was shown in Postgres system views as the group name.

A screenshot of a PostgreSQL database session showing active connections, including user IDs, application names, client IP addresses, and timestamps.

As you can imagine, in firms that have lots of regulations and auditors, this could be problematic. Well, this week Microsoft fixed this problem–there is a new server parameter for your Azure PostgreSQL servers, called pgaadauth.enable_group_sync.

Screenshot of Azure Database for PostgreSQL server parameters showing the 'pgaadauth.enable_group_sync' parameter to enable synchronization of Entra ID group members.

After enabling this parameter, you can wait 30 minutes, or call the function it uses:

 SELECT * FROM pgaadauth_sync_roles_for_group_members();  

And your group membership will be synced with your PostgreSQL server. The docs on this are still a bit of a work in progress. They are here–but let me give you a quick walkthrough, because I was confused.

  • The only real change to the login process is that instead of using the group name as your login (like above, where I used PG_DBA), you use the Entra ID that is a member of the group.
  • You still need to authenticate to Azure/Entra using your favorite CLI, and get the bearer token value to use as a password.
Screenshot of the PGDemo connection settings for PostgreSQL, displaying fields for host name, port, maintenance database, username, and Kerberos authentication toggle.

Now that I’ve logged in as a group member, I can see that I’m logged in as [email protected], who only has access through the PG_DBA group.

Table showing connection information in a PostgreSQL database, including process ID, username, application name, client address, and backend start timestamp.

This is a big improvement–while using OAuth-based authentication to Postgres still isn’t as easy as SQL Server, we now have similar levels of auditability, which is a huge help, even to a non-regulated organization.

In Defense of Kubernetes

I’ve seen a couple of posts (of course they were chock full of AI slop images) on LinkedIn in the last couple of weeks, talking about how challenging it is to implement Kubernetes. The most recent post I saw stated that “it took 5 months for our CEO to implement Kubernetes for our app”, to which I would ask: why the hell is your CEO configuring your clusters? I designed and implemented the Kubernetes infrastructure on my current project, which I’ve worked on for a while, so of course I felt the need to share my opinions on the matter.

If you are trying to build bare metal Kubernetes (are you also compiling your own Linux?), it is probably pretty difficult. If you are like the rest of the world, just use your preferred cloud provider’s Kubernetes distribution (Azure Kubernetes Service, or Elastic Kubernetes Service on AWS) and run with that. Even if you suck at AWS security like I do, you can get this up and running in a couple of hours. I’d even say you could run it on-prem on VMware, but Broadcom threatened me with a lawsuit for saying that without a license. After that, you really don’t have to think about Kubernetes that much, other than deploying containers to run as pods.

Yes, this does mean your developers have to learn YAML (or ask an LLM to make it for them), which they should already know, understand how containers work (which I hope they do already), and learn a few organizational things about security and labeling in K8s. But after that, Kubernetes handles auto-scaling (especially if you checked the auto-scale box in your cloud provider), does a lot of heavy lifting for networking in your microservices app, and provides a pretty good level of high availability.

Kubernetes has about the same level of complexity as your average cloud deployment, and the infrastructure as code scenarios are far simpler. If you just think of it as VMware, but for containers, and roll with changes, you’ll have a good time and gain a lot of functionality for a bit of work.

SQL Server 2025 Release Candidate 0 Drops: Big News–Vector Search Works on my Mac

This morning, SQL Server 2025 Release Candidate 0 was released. You can see the new features list and details here. Of course, I’ll be writing more about the forthcoming release over the next several months, both here and in my column at Redmondmag.com. However, I wanted to get a quick post out about something that made me really excited. In my column over at Redmond, I mentioned that the vector search functionality was not available on my Mac, because of some problems with Rosetta, the Mac’s x86 translation layer.

Well, big news this morning. Using Anthony Nocentino’s sample code here, I got vector search running locally on my MacBook Pro.

Screenshot of a SQL Server 2025 environment displaying code for utilizing vector search functionality, along with terminal output showing container status and configuration on a MacBook Pro.

Yes, that’s Azure Data Studio (Azure Data Studio for life, or until it stops working for real, or we get SSMS on ARM). And that’s me doing a vector query against my RC0 container, running all on Apple Silicon. Great work, SQL team, and I hope the rest of you have fun testing SQL Server 2025–stay tuned for more exciting news.

Security, AI, and Databases–AI in SQL Server 2025

I once heard a product manager say “we’ll do security after go-live” to describe an early-phase product’s glaring lack of granularity in its security controls. Anyone reading this who has written even the tiniest piece of software will know that “doing security after go-live” is a complete recipe for disaster. One of the challenges of emerging technology is that security is never sexy–it’s overhead, it doesn’t demo well to investors, and in a telemetry-driven software world, “we only see 10% of customers using this advanced security feature” is a common refrain. Which brings us to AI–I saw this post from Scott Hanselman on Bluesky last week, and had a good laugh:

The S in MCP stands for security— Scott Hanselman 🌮 (@scott.hanselman.com) August 2, 2025 at 1:29 AM

If you aren’t old, this joke goes back to the early days of the Internet of Things, where the joke was “the S in IoT stands for security”. MCP stands for “Model Context Protocol”, which is the de facto standard for AI model access and communications. If you want a good read on the many flaws with MCP, read this excellent post by Julien Simon, detailing how MCP ignores 40 years of learnings from distributed systems. Where have we seen that go well before?


This brings us back to the security risks around AI. GIS consultant Faine Greenwood posted on Bluesky: “ChatGPT is probably the biggest honey pot of willingly-turned-over highly confidential information that has ever been created in human history.” A follower replied:

I was the Director of IT at an org and one day, found out our CFO was putting all our MOUs into his personal ChatGPT account and HR was having conversations with a personal ChatGPT account to determine what salaries we should be offering staff. Very concerning!— Toneloaf (@toneloaf.bsky.social) August 11, 2025 at 2:33 PM

MOU=memorandum of understanding.

You don’t have to be a distributed systems expert to understand why it’s a terrible idea to paste sensitive business data into a third-party system that you have no security controls over, or no data sharing agreement with. OpenAI will use your contracts to train their future models–and that’s if nothing happens that’s even more nefarious. Or if OpenAI has a data breach, but that could never happen.

Doing AI, But with Security

My article at Redmondmag.com this month is about SQL Server 2025’s AI capabilities. One of Microsoft’s selling points for the AI model support in SQL Server is that your AI model can be self-hosted, in your own public cloud environment, or with a third party. This gives the IT organization the controls needed to ensure that sensitive business data stays within a controlled environment. Beyond that, you control all access to your data, using the robust, mature model of SQL Server security.

This gives you a few ways to have control–you have fine-grained access controls, like row-level security and column-level security, and features like dynamic data masking to protect sensitive values in user-viewed data. SQL Server Audit allows you to track all of your inputs and outputs at the database level. By bringing AI into a database with robust security controls, you leverage a mature security model. Having full control of which AI models you are using, and more importantly where you are running them, gives you full ownership of the data flow in your AI pipelines.

Anytime we have a large technology hype cycle, security always gets put on the back burner. Wait, as a cook, that’s the wrong way of looking at it–security never makes it onto the stove; it’s just an onion, sitting in a walk-in somewhere. Whether it was the early era of cloud (remember when the only two Azure roles were Admin and Co-Admin?), big data, or now AI, technology companies tend to worry about security later rather than sooner. This is one of the major pain points of being on the bleeding edge of technology. Leveraging a robust security model, like SQL Server’s, can help you adopt new technology while protecting your data.

The DBA’s New Nuclear Weapon–ABORT_QUERY_EXECUTION

Yesterday on the r/SQL subreddit, there was a post about someone whose boss was using ChatGPT to generate queries against an operational database, and in shocking news, the queries sucked and were causing large amounts of resource contention. While this is very much a 2025 problem, developers and their applications executing terrible code is a longstanding problem that has helped me pay off my house early. The first time I really remember this was when I fixed a vendor’s Oracle application by adding a couple of indexes, which dropped CPU by something like 80%. I felt like a superhero. The other one I remember was supporting a vCenter database, back when it still ran on SQL Server.

grayscale photo of explosion on the beach
Photo by Pixabay on Pexels.com

VMware vCenter was absolutely hammering my shared SQL Server FCI with IO (yes, this was in 2010, if you were guessing), and I wanted to limit what it could do. Since we were on SQL Server 2008, I had the newly released resource governor feature. In 2008, I couldn’t use it to directly limit IO requests, but I could cut the CPU available to the vCenter login enough that it couldn’t generate that much IO. This step was admittedly draconian, but in shared environments you do what you need to do.

Resource governor is a powerful feature, and it works quite well; however, it can be tricky to implement. Its classifier function is limited to things that can be captured at login, like login name or program name, as opposed to being able to be scoped to a database. However, in most cases, if you have a shared server, resource governor is the best way to balance resources between workloads of different criticality.

However, SQL Server 2025 gives us a bigger hammer (DBAs love hammers). Building on top of the query store hints feature that was added in SQL Server 2022, ABORT_QUERY_EXECUTION simply blocks the execution of known problematic queries. When you specify this hint for a query, any attempt to execute it fails with error 8778, severity 16: “Query execution has been aborted because the ABORT_QUERY_EXECUTION hint was specified.”

You can implement this using the following T-SQL block.

EXEC sys.sp_query_store_set_hints
     @query_id = 39,
     @query_hints = N'OPTION (USE HINT (''ABORT_QUERY_EXECUTION''))';

I wouldn’t go around using this regularly, but if you have identified some really terrible queries, maybe coming from an older application or a rogue user, this can be a good way to kill them very early in the optimization process.

MVP Renewal and One Extra Thing

I have had the distinct privilege to be a member of the Microsoft MVP program for the last 11 years. If you aren’t familiar with the MVP program, it exists to provide technical community members a channel to work more closely with Microsoft in testing new software, understanding requirements, and building communities. MVPs are awarded based on their technical community activities, like speaking, writing, and organizing events. The program has given me opportunities to speak around the world, and meet some of the smartest people I’ve ever met.

When I was originally awarded as an MVP in 2014, it was for SQL Server, which through the advent and maturity of Azure morphed into the vague Microsoft marketing term “Data Platform”. MVPs are awarded by technical area, and in most cases in a single technical area. Which brings me to this bit of news: today I was awarded for both Azure SQL and Azure Compute. In more fun marketing terms, Azure SQL falls under Data Platform, while Azure Compute falls under Azure.

Email from Microsoft congratulating Joseph D'Antoni on his continued membership in the MVP program, highlighting his contributions in Azure SQL and Azure Compute.

I’ve made a big push in recent years to become more deeply knowledgeable about topics like networking, storage, and security in Azure, including the video series I do with John Morehouse from DCAC, called The Azure Cloud Chronicles. There aren’t a lot of dual-MVPs, so I am happy to have been recognized by Microsoft for this. Congrats to all the other renewed MVPs.