<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Striveworks on Medium]]></title>
        <description><![CDATA[Stories by Striveworks on Medium]]></description>
        <link>https://medium.com/@striveworks?source=rss-ed8c8902bd67------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*aGhuKIqlv6V4yJWH49Y97g.png</url>
            <title>Stories by Striveworks on Medium</title>
            <link>https://medium.com/@striveworks?source=rss-ed8c8902bd67------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 11:18:31 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@striveworks/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Soldier-Built Models on Striveworks’ AIOps Platform Accelerate the BDA Process in Ivy Sting 2]]></title>
            <link>https://medium.com/@striveworks/soldier-built-models-on-striveworks-aiops-platform-accelerate-the-bda-process-in-ivy-sting-2-d29589cc2942?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/d29589cc2942</guid>
            <category><![CDATA[aiops]]></category>
            <category><![CDATA[wargaming]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[army]]></category>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Wed, 26 Nov 2025 16:19:54 GMT</pubDate>
            <atom:updated>2025-11-26T16:19:54.574Z</atom:updated>
            <content:encoded><![CDATA[<p>The complexity of modern conflict demands decision superiority at machine speed. But a major bottleneck remains for commanders: Battle Damage Assessment (BDA). This process — frequently cumbersome, manual, and risky — slows down the fighting force’s operational tempo. Striveworks is changing that.</p><p>As a core partner in the US Army’s Next Generation Command and Control (NGC2) prototyping effort, Striveworks is delivering the AI platform and model catalog necessary to manage and deploy AI models from the cloud-native enterprise to resource-constrained infrastructure at the tactical edge. At the recent Ivy Sting 2 exercise, these capabilities were used by the 4th Infantry Division (4ID) to make BDA not just faster and more rigorous but also automated and instantaneous — fundamentally transforming how commanders assess the battlefield.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*XufUkun5rAaxuls8E9OpDw.jpeg" /><figcaption>Soldiers fire an M777 howitzer during 4ID’s Ivy Sting 2 command post exercise, simulating wargaming and leveraging the Next Generation Command and Control (NGC2) ecosystem. Photo Credit: Sgt. William Rogers</figcaption></figure><h4>Setup and Deployment</h4><p>During Ivy Sting 2, 4ID Soldiers used curated data to train and deploy a computer vision model specifically for BDA. The ability for Soldiers to quickly build a model tailored to their immediate use case is critical, but the time and effort to manage data collection, data labeling, model training, and model testing and validation have historically made this impractical. By accelerating and automating these processes, the Striveworks platform enabled 4ID Soldiers to innovate and customize AI models to specific operational needs within the exercise.</p><p>These models were deployed onto compact tactical edge computers installed in Stryker combat vehicles and connected to a tactical network aligned with the exercise’s operational infrastructure.</p><h4>Detection and Target Nomination</h4><p>The Striveworks AIOps platform ingested live video from the NGC2 data layer that originated from Anduril’s Ghost-X. The AI models monitored the feed in real time. When they detected objects of interest linked to commanders’ priority targets, Striveworks’ Sky Saber nominated these objects as tracks through direct Lattice integration. This automated nomination process refined the target set for commanders’ consideration of potential engagements.</p><h4>Engagement and Automated BDA</h4><p>Once the commander approved targets for engagement, 4ID artillery fired on them. The AI models then consumed follow-up data collection on the targets. Successful hits were communicated to commanders from within Sky Saber via the data layer, reducing the need for Soldiers to manually generate BDA. This streamlined, automated process greatly reduces the time and uncertainty associated with BDA, speeding planning for and execution of follow-on operations.</p><h4>Key Takeaways for Leadership</h4><p>The Ivy Sting 2 exercise underscored a strategic shift in how the Army can leverage AI. 
By employing Soldier-built, edge-deployed models on Striveworks’ AIOps platform — and integrating real-time insights through Sky Saber — the 4ID demonstrated that BDA can be automated, accurate, and immediate.</p><p>For senior leaders, the implications are clear:</p><ul><li><strong>Rapid Decision Cycles:</strong> Automated BDA dramatically compresses the time between engagement and assessment, enabling commanders to act faster with greater confidence.</li><li><strong>Force Multiplication:</strong> Soldier-built models scale across formations, empowering units to tailor AI tools to their mission without relying on external engineering support.</li><li><strong>Operational Resilience:</strong> Running on resource-constrained, survivable tactical-edge systems ensures that critical AI capabilities remain available even in degraded or contested environments.</li><li><strong>Better Use of Finite Assets:</strong> Faster, more reliable BDA strengthens the commander’s ability to allocate fires, maneuver forces, and sequence follow-on operations for maximum effect.</li><li><strong>Path to Full-Scale Adoption:</strong> Ivy Sting 2 proves the architecture works. Subsequent NGC2 exercises will build on this success — expanding automation, refining track correlation to reduce cognitive load, and integrating additional data sources to give leaders a clearer, more decisive operational picture.</li></ul><p>In short, Ivy Sting 2 showed that AI-enabled BDA is no longer theoretical. It is operational, repeatable, and ready to scale — delivering the decision superiority required for modern, multi-domain conflict.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d29589cc2942" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Model Remediation: The Solution to The Day 3 Problem of Model Drift]]></title>
            <link>https://medium.com/@striveworks/model-remediation-the-solution-to-the-day-3-problem-of-model-drift-392a22e087a3?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/392a22e087a3</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Wed, 17 Sep 2025 20:45:01 GMT</pubDate>
            <atom:updated>2025-09-17T20:45:01.203Z</atom:updated>
            <content:encoded><![CDATA[<p>Model failure is a huge concern for any AI-driven organization. The phenomenon that causes models to stop working in production (known as model drift, model degradation, or The Day 3 Problem) is a massive drain on resources.</p><p>Our recent blog post on model drift breaks down the details of this problem: what it is, why it happens, and how it ruins the utility of AI in production. But it doesn’t explain what to do about it.</p><p>You can’t prevent model drift, but you can overcome it. In fact, controlling and responding to model drift is essential for long-term success with AI.</p><p>So, how do you stop model drift from tanking your AI projects? You take the following steps:</p><ol><li>Identify model drift as soon as it occurs.</li><li>Retrain your models so they understand incoming data.</li><li>Redeploy those models in hours, not weeks or months.</li></ol><p>Together, these steps are known as model remediation: the key to unlocking sustained value from your AI program.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xxl96VTS2gd4F6aIOxj3Yw.jpeg" /><figcaption><em>On the left, an object detection model in production has misidentified a street sign as a person. This situation calls for model remediation. By working through the process discussed below, users can rapidly retrain AI models to reduce errors — removing the wrong inference and delivering trustworthy object detections, as seen on the right.</em></figcaption></figure><h3>What Is Model Remediation?</h3><p>Model remediation is the process of restoring a failing AI model to good health and returning it to production. By starting remediation at the first sign of a problem, you stem the flow of bad inferences and, ultimately, get more effective uptime and return on investment (ROI) from your model.</p><p>On the surface, remediation seems simple: Catch a model drifting, retrain it, and then swap the new version in for the old.</p><p>In reality, it’s more complicated. Many questions factor into each step of the process:</p><ul><li>How do you know if your models are drifting?</li><li>If they are, how can you tell if they need rapid remediation?</li><li>Where do you get the data to retrain them?</li><li>How do you trust that a newly trained model will perform better than the old one?</li></ul><p>Fortunately, a structured process and tooling can automate much of the work of model remediation, making it easier for data teams to get effective models back into production fast.</p><h3>Key Takeaways:</h3><ul><li>Model remediation is the process of fixing a broken AI model and returning it to production.</li><li>It is instrumental to maximizing the ROI of your AI models.</li><li>Although remediation is complicated, a streamlined process and tools can automate much of the remediation workflow.</li></ul><h3>Why Do Models Need Remediation?</h3><p>Sooner or later, most AI models (91%, according to <a href="https://www.nature.com/articles/s41598-022-15245-z">this <em>Scientific Reports</em> article</a>) start to develop problems in production. Model remediation is simply the process of fixing those problems so that your model continues to work the way you want.</p><p>That said, not all AI models need remediation, at least not an emergency, “all hands on deck” approach to it. Drifting models may continue to perform well enough for your purposes, at least in the short term (such as recommendation engines for streaming services). 
Others may deliver inferences that are only valuable for a few seconds (object detection models used in self-driving cars, for example).</p><p>But models that produce critical, persistent, or time-sensitive outputs need to maintain the highest level of performance possible. AI models for high-speed trading, medical diagnostics, and national intelligence applications all need quick remediation to continue generating safe and trustworthy results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UFp-IAJBLM58Rryk_PSGcw.png" /></figure><h3>Key Takeaways:</h3><ul><li>AI models need remediation to restore their performance, which commonly degrades in real-world applications.</li><li>Models that produce critical, persistent, and time-sensitive outputs (financial trading, medical diagnostics, national defense, etc.) need remediation to occur rapidly.</li></ul><h3>How Have Data Teams Remediated Models Until Now?</h3><p>Until recently, most data teams lacked a standardized or streamlined process for model remediation. With so much emphasis on building and deploying models, remediation was often treated as an afterthought. When models did need fixing, the process involved so much manual coding and data curation that it took far longer than necessary.</p><p>But the increasing adoption of AI applications across business and government has created a need to keep models performant. With millions of dollars and lives on the line, enterprise leaders can no longer let their AI projects limp along or sit idle without a working model. Organizations now depend on this technology, and they have new expectations and standards of quality that require effective models, which means they need a process for remediating their models when they inevitably start to struggle.</p><h3>Key Takeaways:</h3><ul><li>Data teams have long treated AI post-production (including model remediation) as an afterthought.</li><li>Widespread adoption of enterprise AI has made model remediation a high priority for data teams.</li></ul><h3>What Steps Are Involved in Model Remediation?</h3><p>Once an AI model goes live, one of three things can happen.</p><ol><li>The model continues to perform well forever.</li><li>Drift degrades the model’s performance over time.</li><li>The model fails right off the bat.</li></ol><p>Automated monitoring tools can identify when model drift starts to occur. (At Striveworks, we use statistical tests to determine whether a model is experiencing drift.)</p><p>When monitoring detects drift and data teams decide the predictions are not sufficiently accurate, they need to kick off a remediation workflow.</p><p>Model remediation consists of a five-step process:</p><ol><li><strong>Evaluate the data:</strong> Determine if the problem is caused by your data or your model.</li><li><strong>Curate a dataset:</strong> If your model needs attention, source training data that more closely matches the data your model is seeing in production.</li><li><strong>Train a model:</strong> Use your new training data to fine-tune your existing model, a process that’s much faster than training a new model from the ground up.</li><li><strong>Evaluate the model:</strong> Test your new model version to confirm that it exhibits appropriate performance.</li><li><strong>Redeploy the model:</strong> Push your remediated model back into production.</li></ol><p>Carrying out each of these steps manually takes days or weeks of effort, if you can do them at all. Tools that support these steps (automatically creating a training set from production data or centralizing evaluations in a persistent, queryable evaluation store) enable data teams to take a production model from inadequate to exceptional in a matter of hours.</p>
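<p>To make the shape of that workflow concrete, here is a minimal sketch of the five-step loop in Python. Every helper function is a hypothetical stub standing in for real tooling (monitoring, curation, training, evaluation, deployment); none of them are actual Striveworks APIs.</p><pre>  # Hypothetical remediation loop; each stub marks where real tooling plugs in.

def drift_detected(model, batches): ...        # automated drift monitoring
def data_pipeline_is_broken(batches): ...      # step 1: evaluate the data
def curate_dataset(batches): ...               # step 2: curate from production data
def fine_tune(model, dataset): ...             # step 3: retrain via fine-tuning
def evaluate(model, dataset): ...              # step 4: evaluate the candidate
def deploy(model): ...                         # step 5: redeploy to production

def remediate(model, production_batches):
    """Run one pass of the five-step model remediation workflow."""
    if not drift_detected(model, production_batches):
        return model                           # nothing to do
    if data_pipeline_is_broken(production_batches):
        raise RuntimeError("fix the data pipeline, not the model")
    dataset = curate_dataset(production_batches)
    candidate = fine_tune(model, dataset)
    if evaluate(candidate, dataset) > evaluate(model, dataset):
        return deploy(candidate)               # promote the better model
    return model                               # keep serving the current model</pre>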
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dr6rtEHlxpQrrpREzTTJpw.jpeg" /><figcaption><em>Model post-production starts with an automated process for detecting drift in ML models. Two standard statistical tests — Kolmogorov-Smirnov and Cramér-von Mises — are widely used to confirm whether production data is out of distribution with models’ training data. If the test results cross critical thresholds, a data team then intervenes to evaluate if, indeed, model drift is underway. If so, teams need a new, more applicable dataset to fine-tune their model — a process that takes notably less time than training a new model. Remediated models are then re-evaluated for efficacy and redeployed into production, where the cycle begins again.</em></figcaption></figure><h3>Key Takeaways:</h3><ul><li>Automated monitoring can detect when a model stops working.</li><li>Manual model remediation can take days or weeks of effort.</li><li>Tooling that supports key remediation steps, like evaluation, dataset curation, and model fine-tuning, can cut that time to a matter of hours.</li></ul><h3>How Do I Tell If the Problem Is With My Data or My Model?</h3><p>When you detect drift happening to your AI models, two important questions crop up:</p><ul><li>Is the problem with your data pipelines?</li><li>Is the problem with the model itself?</li></ul><p>Both data and model problems can lead to model drift. The first step in the model remediation process (evaluate the data) is to figure out which problem you’re having.</p><h3>Fixing Problems With Data Pipelines</h3><p>If your problem resides in your data, fixing it can pose a challenge. Problems with data can take many forms and can come from anywhere in your machine learning operations (MLOps) workflow:</p><ul><li>Maybe all your training data was inappropriate for your use case</li><li>Maybe some segments of your dataset worked great, but they were thinner than expected (resulting in overfitting)</li><li>Maybe your data was appropriate, but its resolution (in the case of a computer vision model) was too low to work for incoming data</li><li>Maybe an upstream model changed and it now transforms your data in a completely different way</li><li>Maybe you’re dealing with a combination of these issues</li></ul><p>For the best chance at solving a data problem, you really need an unobstructed view into your data lineage. If you have auditability over your data, you can comb through it and search for anomalies or errors, like mislabeled data points or upstream data sources with new parameters. Then, you can apply a point solution to the specific issue, like sourcing new training data or patching an API.</p><p>Unfortunately, data auditability is still immature in most machine learning workflows. Data teams can often examine part of their data pipeline but not all of it, especially with deep learning models that tend to operate like black boxes. Even to get that basic understanding of data, engineers typically need to alter their software or write a lot of custom code, especially when their workflows involve calls to external data services. That process just isn’t scalable.</p><p>“Custom coding for data lineage is simply insufficient,” says Jim Rebesco, Co-Founder and CEO of Striveworks. 
“Today, any organization can ensure that their AI-powered workflows have automated processes that can alert nontechnical users to identify, confirm, and report errors.”</p><p>Fortunately, standards are starting to change. Striveworks, for instance, has developed a patented process that gives engineers access to their full data lineage, including calls to external services.</p><h3>Fixing Problems With Models</h3><p>Compared to fixing data pipeline problems, fixing models appears straightforward: You retrain your current model or replace it with a new one. But it’s more complex than it seems.</p><p>What caused the problem with your model? Did your incoming production data move out of distribution with the model’s training data (i.e., data drift)? Or did its use case shift and render the results of your model inadequate (i.e., concept drift)?</p><p>According to Daniel Vela and his coauthors in <a href="https://www.nature.com/articles/s41598-022-15245-z">their 2022 <em>Scientific Reports</em> article</a>, “Retraining a model on a regular basis looks like the most obvious remedy to [model drift], but this is only simple in theory. To make retraining practically feasible, one needs to, at least:</p><ul><li>develop a trigger to signal when the model must be retrained;</li><li>develop an efficient and robust mechanism for automatic model retraining; and</li><li>have constant access to the most recent ground truth.”</li></ul><p>Automated monitoring serves as that initial trigger, letting data teams know when models need retraining. Deploying a ready-to-go model is as simple as loading it onto the inference server and connecting any APIs.</p><p>But if you don’t have a better model ready, you need to train one on data that’s more appropriate for your actual use case. Finding that data is tricky. If you had better data, you probably would have used it to train your model in the first place.</p><p>Of course, if you had a model in production, you do have better data: your production data itself. You just need to have captured it for use in creating a new training dataset. Not all MLOps platforms support the persistent inference storage needed (although Striveworks does). But using the data from your actual model in production is still your best bet for updating your model into one that works for your use case.</p><p>Fortunately, you don’t have to start from scratch. Your current model is often still a viable option; it just needs some adjustment. Retraining your model on a new dataset takes much less time than training a whole new model. You already have model weights loosely dialed in. Your model really just needs fine-tuning to improve it for your specific use case. Instead of processing 100,000 new data points, you can zip through a smaller, highly relevant dataset, improving the model’s performance in a few hours instead of the several days it may take to train a brand-new model.</p>
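<p>As a rough illustration, here is a minimal fine-tuning sketch in PyTorch. It assumes an image classifier with a ResNet-18 backbone; the class count, the commented-out checkpoint path, and the random tensors standing in for a curated production dataset are all illustrative, not details of any specific workflow described above.</p><pre>import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.resnet18()                        # same architecture as the deployed model
# model.load_state_dict(torch.load("deployed.pt"))  # hypothetical checkpoint path
num_classes = 10                                 # assumption for this sketch
model.fc = nn.Linear(model.fc.in_features, num_classes)

for p in model.parameters():                     # freeze the backbone...
    p.requires_grad = False
for p in model.fc.parameters():                  # ...and retrain only the head
    p.requires_grad = True

# Random tensors stand in for a small dataset curated from production data.
curated = TensorDataset(torch.randn(64, 3, 224, 224),
                        torch.randint(0, num_classes, (64,)))
loader = DataLoader(curated, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):                           # a few passes is often enough
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()</pre>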
<h3>Key Takeaways:</h3><ul><li>If your model is drifting, you either have a problem with your data or a problem with your model.</li><li>Problems with data are often difficult to identify; access to data lineage is vital.</li><li>Problems with models are resolved by retraining the model or replacing it with a new one.</li><li>Using your own production data is a great way to curate a dataset that works ideally for your use case.</li><li>Retraining an existing model is much faster than training a wholly new model.</li></ul><h3>How Do I Know My New Model Will Work Better Than My Old Model?</h3><p>After retraining your model, you need to reevaluate it to confirm that it is better tuned to handle your production data. Even if you are using a world-class dataset, problems can arise during training.</p><p>“When remediating, you want to ask some hard questions,” says Rebesco. “Is this model better than the old one? Will it continue to be better? What are its failure modes? Is there an ability to evaluate and compare the models with other ones?”</p><p>Maybe your model responds well to most segments of your data but not all. Maybe you selected a set of hyperparameters that resulted in an overall worse model. The only way to make sure your model will perform well is to evaluate it.</p><p>Historically, data teams would often evaluate their models ad hoc, running an evaluation in a Jupyter Notebook and saving the results in an email, Slack thread, or Confluence page. Obviously, this process doesn’t scale. But new tools are now coming online to support the model remediation workflow with standardized, centralized model evaluations. Striveworks’ <a href="https://github.com/Striveworks/valor">Valor is an open-source option available on our GitHub</a> that streamlines evaluations as part of your AI workflow. Teams can use it to easily compare the suitability of their remediated model versus other models and to check that their new model is better aligned with their production data.</p>
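<p>The comparison itself can be as simple as scoring both model versions on the same labeled slice of recent production data. The sketch below uses scikit-learn metrics with hypothetical <em>old_model</em> and <em>new_model</em> objects; a dedicated evaluation store like Valor would persist and centralize these results rather than leaving them in a notebook, and this is not Valor’s API.</p><pre>from sklearn.metrics import accuracy_score, f1_score

def compare_models(old_model, new_model, X_recent, y_recent):
    """Score the deployed and remediated models on the same labeled
    slice of recent production data. Both models are assumed to
    expose a .predict() method; all names here are illustrative."""
    report = {}
    for name, model in (("deployed", old_model), ("remediated", new_model)):
        preds = model.predict(X_recent)
        report[name] = {
            "accuracy": accuracy_score(y_recent, preds),
            "macro_f1": f1_score(y_recent, preds, average="macro"),
        }
    return report</pre>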
<h3>Key Takeaways:</h3><ul><li>You need to reevaluate your model after retraining to confirm that it will work better.</li><li>A centralized evaluation service like <a href="https://github.com/Striveworks/valor">Striveworks’ open-source tool, Valor</a>, lets you evaluate models consistently at scale.</li></ul><h3>Model Remediation Is the Next Link in the MLOps Value Chain</h3><p>It may help to think of model remediation as the next step in the MLOps value chain.</p><p>“In the MLOps space, everybody has been saying, ‘You can build and deploy models.’ Of course that’s valuable, but even the language people are using is wrong,” says Eric Korman, Co-Founder and Chief Science Officer of Striveworks. “They need to think about the full life cycle of a machine learning model: build, deploy, and maintain.”</p><p>Before now, data teams would stop after getting a model into production. The standard workflow involved the following steps:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/966/1*JNv3tYI49heNIS7L4AWunw.jpeg" /></figure><p>But that process has left a lot of AI models breaking in production, and a lot of data teams left holding the bag. Model remediation completes the real life cycle of AI models, taking them through degraded performance and back out again.</p><p>It also lets organizations respond to the shifting value center of MLOps. The greatest value from AI no longer comes from putting more and more models into production; the opportunities for value generation come from keeping those models performant longer, which enables AI to become a trusted part of your tech stack. Model remediation is the process that makes it happen.</p><p>***</p><h3>Frequently Asked Questions</h3><p><strong>What is model remediation?</strong></p><p>Model remediation is the process of restoring a failing AI model to good health and returning it to production. It is a necessary step in getting long-term value from AI models.</p><p><strong>Why do AI models need remediation?</strong></p><p>91% of AI models degrade over time. To make them work effectively again, data teams need to diagnose the reason behind that degradation, retrain their models, reevaluate them to confirm effectiveness, and redeploy them. That process is model remediation.</p><p><strong>How do I remediate a failing AI model?</strong></p><p>To remediate a failing AI model, you need to follow this five-step process:</p><ul><li>Determine if the cause of your model failure is your data or your model.</li><li>If your model needs attention, source training data that more closely matches the data your model is seeing in production.</li><li>Use your new training data to fine-tune your existing model, a process that’s much faster than training a new model from the ground up.</li><li>Test your new model version to confirm that it exhibits appropriate performance.</li><li>Push your remediated model back into production.</li></ul><p>With enough time and resources, data teams can execute these steps manually, but an MLOps platform makes the process faster and more scalable.</p><p><strong>What happens if I don’t remediate my AI model?</strong></p><p>Certain AI models don’t need rapid remediation because their inferences are nonessential or transient. However, AI models that are used in critical workflows for medical diagnostics, high-speed trading, national defense, and other consequential scenarios need to remain high-functioning as consistently as possible. Without remediation, these models deliver wrong inferences that lead to bad or even catastrophic outcomes.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=392a22e087a3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[To Build or Buy an AIOps Platform?]]></title>
            <link>https://medium.com/@striveworks/to-build-or-buy-an-aiops-platform-ea69b2cb980b?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/ea69b2cb980b</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Wed, 17 Sep 2025 20:08:58 GMT</pubDate>
            <atom:updated>2025-09-17T20:08:58.246Z</atom:updated>
            <content:encoded><![CDATA[<p>It’s the oldest debate in business (or at least the oldest since Silicon Valley invented software-as-a-service).</p><p>“If we need software, should we buy something off the shelf or try to build it ourselves?”</p><p>That’s a tricky question. Dozens of variables can sway the ups and downs of buying or building software. The right choice can lead to hockey-stick growth for your bottom line, kudos from your board, and high-fives from grateful employees. The wrong choice can leave you slack-jawed, wondering how you got into such a mess.</p><p>The stakes are even higher when the question involves AI and machine learning (ML): technologies that are transforming the world but changing rapidly as they do. With AI, new solutions and vendors pop up every day. Making a choice can feel like playing three-dimensional chess.</p><p>So, what do you really need to consider when looking for an AIOps platform? What are the pros and cons of buying a platform versus building one? Which approach can deliver the goods for your organization, so you get those high-fives and keep the awkward silences to a minimum?</p><p>Let’s find out.</p><h3>What Is an AIOps Platform?</h3><p>First, let’s clarify what we mean by “AIOps platform.” An AIOps platform is a software tool that gives you end-to-end support for your AI lifecycle. With an AIOps platform like Striveworks, you can build new models, deploy them into production, and, most importantly, maintain them so they deliver good results over the long haul.</p><p>An AIOps platform should have strong functionality for these basic steps in the AI value chain:</p><ul><li>Data ingestion, storage, and processing</li><li>Data exploration and visualization</li><li>Model development and training</li><li>Model validation, testing, and evaluation</li><li>Model deployment and serving</li><li>Model monitoring and observation</li><li>Model governance and auditability</li><li>Model remediation and retraining</li></ul><p>Additionally, an AIOps platform typically needs to make it easy for your team to collaborate on machine learning projects, while also maintaining strict security and compliance protocols.</p><h3>Why Do I Need an AIOps Platform?</h3><p>Not every organization needs an AIOps platform. But if you plan to fast-track AI adoption and avoid getting left in the wake of your competitors, yours does.</p><p>An AIOps platform is a foundational tool for organizations to use AI effectively. Without one, teams are left to their own devices to hand-code models and email Jupyter Notebooks back and forth. If you’re a small organization, maybe that works. But that process isn’t very efficient, and it certainly doesn’t scale. These clunky processes are part of why so many AI projects fail to deliver any value.</p><p>An AIOps platform should provide clear, measurable return on investment in the following areas.</p><ul><li><strong>Faster time-to-value:</strong> By automating and standardizing the AI workflow, you can reduce the time and effort required to build, test, and deploy AI models.</li><li><strong>Higher model quality</strong>: By using an AIOps platform to test and validate your models, you can improve their accuracy, reliability, and performance. 
You can also identify and address issues like drift and model bias.</li><li><strong>Better scalability:</strong> Following a standardized process in a collaborative workspace lets you handle increasing amounts and varying types of data, as well as the growing demand for insights. An AIOps platform also gives you access to cloud services and architectures that can optimize your costs and workflows, especially when you have a lot of models in production.</li><li><strong>Increased efficiency:</strong> An AIOps platform enables you to automate low-value and repetitive tasks, freeing up your data scientists and machine learning engineers to focus on more complex and creative activities. It also improves productivity through easier collaboration.</li><li><strong>Enhanced security and compliance: </strong>AIOps platforms enforce security and compliance controls that protect your data and models from unauthorized access, tampering, or theft. Platforms also make it easy to ensure that your models meet the ethical and regulatory standards of your industry, which is especially critical if you work in highly sensitive environments with strict data handling requirements.</li></ul><h3>OK, I Get It. I Need an AIOps Platform. Can I Just Build One In-House?</h3><p>If you have the resources available, you can certainly instruct your team to build a custom platform for your AIOps work. Building your own AIOps platform can offer real advantages to your organization.</p><ul><li><strong>Customizability: </strong>Because you decide which features and capabilities your team should build, you can trust that a homegrown AIOps platform will fulfill your needs. If you have an especially rare or unconventional use case, building a custom solution is only a matter of resources.</li><li><strong>Control:</strong> If you build your own AIOps platform, you maintain full control over its features, integrations, and security. If you have a problem with your system, you decide when it gets solved without going through any gatekeepers.</li><li><strong>Knowledge:</strong> Building an AIOps platform from the ground up will give your team an unmatched understanding of how it works — knowledge they can apply as they further develop your AIOps program.</li></ul><h3>Those Are Great Advantages. So, Why Shouldn’t I Build an AIOps Platform?</h3><p>It’s true: Any bespoke AIOps platform will be finely tailored to your current operational needs. But building a platform from scratch is no easy feat. There is a huge range of pitfalls to consider before you take on an AIOps platform development project.</p><ul><li><strong>Time in development: </strong>Creating a workable prototype of your AIOps platform, let alone a version ready for general availability, requires a huge investment of work hours from expert data scientists and software engineers. The project needs a series of phases (scoping, design, development, quality assurance, and more), each of which can take weeks or even months to complete. Not only does such a project take the time of your highly skilled (and highly paid) staff, but it also takes time away from their normal projects, presumably ones that are vital to your organization, such as building your products or supporting your customers.</li><li><strong>Unknown obstacles</strong>: Few projects ever go off without a hitch. Even experts’ best estimates tend to fall short of what is truly needed to accomplish the job. Scope creep, cost overruns, and project mismanagement are rarely planned for, but they can seriously hinder progress. 
These obstacles are common in home renovations and infrastructure projects, but they’re <a href="https://en.wikipedia.org/wiki/List_of_failed_and_overbudget_custom_software_projects">just as frequent in custom software projects</a>, driving costs through the roof, delaying deployment, and sometimes leading to project abandonment altogether.</li><li><strong>Ongoing maintenance:</strong> Even if the project goes well, an AIOps platform still needs regular maintenance. Software doesn’t exist in a vacuum. It needs someone to perform regular system upgrades, fix bugs, install security patches, and make sure the tool does what you need it to do. When building a custom platform, you need to consider how much attention and budget it will require in years to come. It’s also critical to maintain institutional knowledge about the project, or the whole thing may grind to a halt when a key project contributor retires or moves to a new job.</li><li><strong>Moving targets:</strong> The field of AI/ML is changing so rapidly that it can be hard to keep up. Even with good insight into your current needs, you’d need a crystal ball to predict how the field will evolve just in the time it takes to develop your platform. AI is an ecosystem. Data providers, tools, and applications are in constant flux. Because homegrown AIOps tools are merely a side project to enable your monetizable work, internal developers are likely to struggle with emerging developments (new data types, new architectures, new models) that threaten to leave an outdated platform in their dust. Conversely, AIOps vendors stay on top of these changes as a core function of their business, so their platforms are prepared to handle things like upcoming versions of <a href="https://huggingface.co/merve/yolov9">YOLO</a>, <a href="https://huggingface.co/FacebookAI/roberta-base">RoBERTa</a>, or <a href="https://huggingface.co/openai/whisper-large-v3">Whisper</a>.</li><li><strong>Risk and compliance:</strong> It’s one thing to build a platform. It’s another thing to build a platform that complies with industry standards and regulations, especially ones that are prone to change as the field of AI evolves. For example, <a href="https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/">the White House’s executive order from October 2023</a> set out new requirements for safe and explainable AI, and that’s only one of several government actions set to steer development of the technology as it gains more and more prominence in business and society.</li></ul><h3>That Sounds Challenging. Should I Buy an AIOps Platform, Then?</h3><p>While there are real advantages to buying a commercial AIOps platform, it’s important to understand the full range of considerations that apply when you’re looking for an off-the-shelf solution.</p><ul><li><strong>Applicability:</strong> What do you really need in an AIOps platform? How do you plan to use it? Is your organization prepared to handle a new tool and the change management that comes with it? It’s not uncommon for a forward-thinking leader to see the value in a technology without considering how it may fit into their organization, with all its unique quirks and requirements. Consider the questions below. There’s no right or wrong answer, but you want to understand them before you explore a costly software purchase.<ul><li>How advanced is your data team? 
Do you have one data scientist or several? How well established are their processes?</li><li>How many models do you have in production? How many do you plan on building and deploying over the next couple of years?</li><li>How frequently does your data change? Are you doing high-frequency trading, real-time GEOINT, or other work where time is of the essence? Can you handle the rate of change without an AIOps platform?</li><li>What results are you getting from your current AI projects? Are they effective, or do they need adjustment to start bearing good results?</li></ul></li><li><strong>Compatibility:</strong> While a homegrown AIOps platform would likely be tailored to your priorities, a commercial solution may not meet 100% of your criteria. One-size-fits-all technology is designed to satisfy the most common requirements. Is the solution you’re exploring close to what you need? Are any of your requirements more “nice-to-have” than “mission critical”? Can custom development bridge the gap for what you need the platform to do? <em>Is it worth ignoring all of the advantages of a commercial platform to take on the headache of building a wholly new one?</em></li><li><strong>Vendor lock-in:</strong> Certain AIOps platform vendors make it difficult to extract your data from their system if you want to switch to a new provider. This is a real problem for organizations and end users alike. With proprietary technology, users can end up forced into staying on a platform with limited functionality and high licensing fees to avoid the pain and cost of switching to another vendor. But Striveworks customers don’t need to worry about vendor lock-in. Our platform uses open standards, so our customers can always access their data and models. Users can even export them in one click.</li></ul><h3>What Are the Advantages of Buying an AIOps Platform?</h3><p>If your AI team is sophisticated enough to use an AIOps platform, it pays to get started sooner rather than later. Buying an AIOps platform often makes the most sense for the following reasons.</p><ul><li><strong>Speed:</strong> Obviously, buying a platform that has already been developed is a much faster process than developing one from scratch, and it gives a much faster time-to-value. You could conceivably build, deploy, and maintain your models on a commercial AIOps platform on the same day you sign a contract. A new model could start returning useful inferences for your business needs in hours. Pre-built integrations speed your progress too, saving you the trouble of configuring wholly new ones yourself. Plus, you don’t have to worry about slowing down due to platform updates and maintenance. System upkeep is the vendor’s job.</li><li><strong>Accessibility: </strong>You don’t need to become a platform engineer if you buy a commercial AIOps solution. Instead, you can start with AIOps right away, taking advantage of a dedicated team’s expertise in designing and developing a worthwhile, stable solution. Junior engineers whose experience may be limited can still get models into production without knowing how to architect an entire platform. Many platforms, including Striveworks, also include no-code features that empower analysts and other non-PhDs to build and deploy new models in a few clicks.</li><li><strong>Scalability:</strong> Commercial platforms are meant for heavy use by large numbers of users, which makes them well suited to scaling your organization’s AIOps program. 
While a homegrown system may work in its limited scope, a commercial AIOps platform has been tested and tweaked to behave well in lots of scenarios, even when you have five times as many models crunching 50 times as much data.</li><li><strong>Security:</strong> Any commercial AIOps platform has to adhere to stringent standards for security and compliance in order for its vendor to stay in business. It’s much easier for a homegrown solution with only a small team maintaining it to overlook a critical security update that leaves your data (or your organization) vulnerable.</li></ul><h3>What About Cost?</h3><p>Cost is a complicated factor when it comes to choosing to build or buy an AIOps platform. It may seem like a two-way street, but it’s more like an interchange of freeways looping around one another.</p><p>Building an AIOps platform requires more funding up front than buying one does. To build one, organizations need to hire new staff or direct existing employees to the project, secure cloud resources or on-prem servers, and fund the project for months or years before it can start to produce a return on investment, if it ever does. Conversely, any organization can buy a license for an AIOps platform at a much lower up-front cost and, conceivably, get profitable results the same day.</p><p>Of course, platform licenses add up. Over time, the steady licensing fees of a commercial platform never go away and could outpace the ultimate cost of building a platform in the first place.</p><p>Simple, right? Not quite. Both of these scenarios ignore an important additional cost: the ongoing cost of maintenance. Once a homegrown AIOps platform is built, there are no persistent licensing fees, but the platform still needs to be maintained. At a minimum, that includes cloud services and internal staff attention to support the platform. If you plan to further invest in the platform to scale or adjust its capabilities, the cost of supporting it could soar to much more than equivalent costs for a commercial platform. After all, economies of scale let a vendor maintain its platform more cost-effectively.</p><p>There’s also the question of opportunity cost. How much profit would you stand to generate if your data team were putting models into production and monetizing your insights instead of trying to construct a brand-new solution?</p><p>Of course, cost isn’t a great metric for comparing your AIOps platform options anyway. The right platform should be able to return exponentially more value for your organization (whether it’s built or bought), rendering the question of cost an afterthought. Instead, it makes more sense to focus on time-to-ROI. The faster you can produce effective models, and the longer you can keep them producing, the faster they can generate value and the faster you can scale.</p><h3>What Do I Need to Know When Deciding to Build or Buy an AIOps Platform?</h3><p>In AI, like in any hot field, there is no one right or wrong answer for building or buying a platform. An organization’s satisfaction with its tools depends on many factors: company culture, urgency, risk tolerance, budget, customization needs, and more.</p><p>That said, as you evaluate your options for an AIOps platform, here are some useful questions to consider.</p><ul><li>How much money and time can you invest in AIOps? How much can you do with your current resources? How much support and guidance do you need from a vendor?</li><li>How much flexibility and control do you need over your platform? 
How much customization do you need?</li><li>How do you plan to manage your platform over the long haul? Do you have the skills and resources available to do it? If someone leaves, can your platform still function?</li><li>How do you see your needs changing down the road? Will you need more capabilities? More data types? More models? More staff to manage your workflows?</li><li>What tradeoffs are you comfortable making? Can you tolerate less security for more control on your end? Can you make do with a general platform that has broad functionality, or do you need a tool tailored for your specific domain? Can you work with a vendor to build out the missing functionality you need?</li><li>What’s your contingency plan? If your vendor doesn’t work out or your homegrown project stalls, how can you ensure that your organization still makes headway with AI/ML?</li></ul><p>In the end, the most important thing is that your organization is able to use AI to do the things it promises to do: analyze more data faster, unlock new capabilities, and support decision-making that drives your business forward. Building or buying can get you there. But which one will get you there the quickest? Which one will produce ROI the longest? And which one will keep you focused on your team’s mission?</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ea69b2cb980b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Is AI Model Drift?]]></title>
            <link>https://medium.com/@striveworks/what-is-ai-model-drift-85f9c5456a37?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/85f9c5456a37</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Wed, 17 Sep 2025 19:49:42 GMT</pubDate>
            <atom:updated>2025-09-17T19:49:42.768Z</atom:updated>
            <content:encoded><![CDATA[<p>In our recent post on model retraining, we touch on an unfortunate but unavoidable fact of machine learning (ML) models: They often have remarkably short lifespans.</p><p>This fact is old news for ML engineers. But the problem comes as a shock to business leaders who are investing millions in the AI capabilities of their organizations.</p><p>Models typically perform well in the lab, where factors are tightly controlled. However, once they start taking in real-world data, their performance suffers. Certain models may start out delivering great inferences only to have the quality fade in the weeks following deployment. Other AI models may fail right away.</p><p>In either case, model drift is incredibly common. <a href="https://www.nature.com/articles/s41598-022-15245-z">Scientific Reports notes that 91% of ML models degrade over time</a>.</p><p>What is going on? How can a technology relied upon by medicine, finance, defense, and other sectors just stop working? More importantly, what can be done about it?</p><p>Let’s dive deeper into the world of data science and explore the most important, but least talked about, problem in the field of AI: model drift.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BmCeu4u_dzdLLV2h_mPLTQ.png" /><figcaption><em>Model drift occurs when incoming data in production shifts out of distribution with the data used in an AI model’s training.</em></figcaption></figure><h3>AI Model Drift — The Day 3 Problem Defined</h3><p>Model drift describes the tendency for an ML model’s predictions to become less and less effective over time. It is also known as model decay or model degradation. At Striveworks, we call it “The Day 3 Problem” in reference to a standard artificial intelligence operations (AIOps) process:</p><ul><li><strong>Day 1: You build a model.</strong></li><li><strong>Day 2: You put that model into production.</strong></li><li><strong>Day 3: The model fails, and you’re left to deal with the aftermath.</strong></li></ul><p>The Day 3 Problem crops up regularly at any organization that has more than a handful of AI models in production. Yet, it is often overlooked.</p><p>“Temporal model degradation [is] a virtually unknown, yet critically important, property of machine learning models, essential for our understanding of AI and its applications,” say Daniel Vela and his fellow researchers in their <em>Scientific Reports</em> article, “Temporal Quality Degradation in AI Models.”</p><p>“AI models do not remain static, even if they achieve high accuracy when initially deployed, and even when their data comes from seemingly stable processes.”</p><h3>Key Takeaways</h3><ul><li>Model drift is the tendency for AI models to fail in real-world applications.</li><li>It is also known as “model decay,” “model degradation,” or “the Day 3 Problem.”</li><li>Although incredibly common, model drift is “virtually unknown.”</li></ul><h3>Why Do AI Models Stop Working in Production?</h3><p>As surprising as it is, AI models that evaluate well (or that even work perfectly well in production today) will stop working at some point in the coming days or weeks.</p><p>Why? The answer is simple: Things change.</p><p>“Models fail because the world is dynamic,” says Jim Rebesco, co-founder and CEO of Striveworks. “The statistical phrase is ‘non-stationary,’ which means that the data being put into a model in production is different from the data it was trained on.”</p><p>AI models are created by training an algorithm on historical data. 
They ingest thousands or even millions of data points (images, rows of numbers, strings of text) to identify patterns. In production, these models excel at matching new data to similar examples that exist in their training data. But in the real world, models regularly encounter situations that appear different from a set of specially curated data points, and even slight differences can lead to bad outcomes.</p><p>“Take a predictive maintenance model trained on a particular engine,” says Rebesco. “Maybe we originally deployed this model in the summer and trained it on production data from the same period of time. Now, it’s winter, and you’ve got thermal contraction in the parts, or the lubricating oil is more viscous, and the data coming off the engine and going into the model looks a lot different than what it was trained on.”</p><p>Engines are one example. But model drift isn’t confined to any single AI model, or even a type of model. It’s a fundamental property of AI. It happens to predictive maintenance models drawing on structured data, but it also happens to computer vision models looking for airplanes on airfields as the seasons shift from summer to fall. It happens when a voice-to-text model trained on American accents is used to transcribe a meeting with Scottish investors. It happens when a large language model trained on pre-2020s data gets asked to define “<a href="https://www.vox.com/culture/23989120/rizz-definition-oxford-word-of-the-year-colloquial">rizz</a>.”</p><p>Even popular, widely used AI models aren’t immune from model drift. In 2023, a paper from Stanford researchers showed that, over a few months, <a href="https://fortune.com/2023/07/19/chatgpt-accuracy-stanford-study/">OpenAI’s flagship GPT-4 dropped in accuracy by 95.2 percentage points</a> on certain problems.</p><p>In every case, the result is a machine learning model that produces inaccurate predictions from real-world data, rendering it useless or even harmful.</p><h3>Key Takeaways</h3><ul><li>Model drift happens because the world is always changing.</li><li>It happens to every type of AI model.</li><li>It is an unavoidable, fundamental property of AI.</li></ul><h3>What Causes Model Drift?</h3><p>Because the world is always changing in unpredictable ways, many different factors can spur a misalignment between a model’s production data and its training data. The following table shows the most common causes of model drift.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UFp-IAJBLM58Rryk_PSGcw.png" /></figure><h3>Key Takeaways</h3><ul><li>Model drift can result from various factors, including natural, human-produced, and technology-produced changes in model operating conditions.</li></ul><h3>How Severe a Problem Is AI Model Drift?</h3><p>Model drift directly ruins an AI model’s performance. However, the importance and urgency of this bad performance can vary widely depending on how you use your AI model. In some cases, a drifted model may still be “good enough.”</p><p>For example, consider a streaming service’s recommendation engine. If it suggests an unexpected TV show because its model has misunderstood your taste, management may not need to panic about a Day 3 Problem. The model’s output is relatively trivial. So what if you don’t want to watch <em>Suits</em>? You aren’t about to cancel your subscription because Netflix queued it up for you.</p><p>In other cases, a model’s output is only useful for a particular window of time. 
Once that window passes, the model’s prediction is inconsequential, whether it was good or bad. This is the case with self-driving cars. If your model thinks a tree is a person, it only matters for the amount of time it takes for the car to navigate past it. A one-off bad inference in this case doesn’t matter that much. But if the drift is substantial enough that the AI model always identifies trees as people, then the problem is much more severe.</p><p>Of course, a vast range of outcomes exists between those scenarios, where model drift can spell catastrophe. A financial system executing high-speed trades can rapidly lose millions of dollars if its model drifts. Likewise, drift occurring with a model used in cancer screenings can result in a misidentified tumor, with life-or-death consequences. AI models used in defense and intelligence applications that can no longer distinguish between friendly and adversarial aircraft become immediately unusable in combat.</p><p>In all these scenarios, a failing model should trigger an immediate red alert. All too often, model drift happens for long periods of time before a human notices the problem. By the time a person can intervene, organizations may have made huge business or operational decisions based on flawed insights.</p><h3>Key Takeaways</h3><ul><li>Not all model drift is important or urgent, but the problem is often very severe.</li><li>High-stakes industries such as financial services, medicine, and defense are especially vulnerable.</li><li>Drifted models can deliver bad inferences for a long time before anyone notices.</li></ul><h3>How Can You Tell When an AI Model Is No Longer Functioning Properly?</h3><p>Often, data scientists have good heuristics that suggest when a model has drifted. Experience dealing with models in production gives them a sense that a model isn’t working the way it should.</p><p>However, this sense isn’t quantifiable, and it doesn’t scale when a data team has hundreds or thousands of AI models in production. In these situations, data scientists need more exacting ways of determining model drift.</p><p>At Striveworks, we use two statistical methods to determine if drift is occurring. They are the Kolmogorov-Smirnov test and the Cramér-von Mises test. Both of these tests are common statistical measures to determine if a dataset is “out of distribution” with a model’s training data.</p><h3>Kolmogorov-Smirnov Test</h3><p><a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm">The Kolmogorov-Smirnov test</a> works by comparing the distribution of a dataset with a reference distribution. It looks for the maximum difference between the empirical <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> (CDF) of the data and the reference CDF; in the two-sample version used for drift detection, it compares the empirical CDFs of the two datasets. The test is most sensitive around the center of the distribution and less sensitive to extremes at the tails. It’s a simple and versatile test that is non-parametric (i.e., it doesn’t require any assumptions about the underlying data), which makes it useful across a wide range of data distributions.</p><h3>Cramér-von Mises Test</h3><p>Like the Kolmogorov-Smirnov test, <a href="https://bookdown.org/egarpor/NP-UC3M/nptests-dist.html">the Cramér-von Mises test</a> also assesses the goodness of fit of two data distributions. However, instead of taking the maximum difference between the CDFs, the Cramér-von Mises test sums the squared differences between them. It considers the entire distribution (across the center and the tails), which makes it more effective at capturing deviation across the full distribution.</p><p>Both the Kolmogorov-Smirnov test and the Cramér-von Mises test are valuable, if different, ways to quantify whether production data is out of distribution with a model’s training data. For a broader understanding of data distribution and drift detection, it makes sense to use both of them.</p>
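<p>As a minimal sketch of how the two tests complement each other, the snippet below runs SciPy’s two-sample versions of both on a single model input feature. The synthetic arrays stand in for training and production distributions, and the 0.01 threshold is an arbitrary choice for illustration, not a Striveworks default.</p><pre>import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # feature as seen in training data
prod_feature = rng.normal(0.4, 1.2, size=1_000)   # same feature in production, drifted

ks = stats.ks_2samp(train_feature, prod_feature)               # max CDF difference
cvm = stats.cramervonmises_2samp(train_feature, prod_feature)  # summed squared differences

print(f"KS  statistic={ks.statistic:.3f}  p-value={ks.pvalue:.2e}")
print(f"CvM statistic={cvm.statistic:.3f}  p-value={cvm.pvalue:.2e}")

# Small p-values on either test suggest this feature has fallen out of
# distribution with the training data: a candidate for remediation review.
if min(ks.pvalue, cvm.pvalue) &lt; 0.01:
    print("Drift suspected; flag the model for remediation")</pre>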
<h3>Key Takeaways</h3><ul><li>Data experts can usually tell when an AI model is drifting, but statistical measurements are needed to standardize drift detection at scale.</li><li>The Kolmogorov-Smirnov and Cramér-von Mises tests are two different, complementary options for quantifying model drift.</li></ul><h3>What Can Be Done About AI Model Drift?</h3><p>Unfortunately, data scientists and machine learning engineers can do little to prevent model drift. The build-deploy-fail cycle that creates a Day 3 Problem persists, even with the massive expansion of AI capabilities in recent years.</p><p>But the news isn’t all bad. Even though data teams can’t stop model drift from happening, they can take steps to reduce its effects and extend the productive uptime of models. At Striveworks, we refer to this process as model remediation.</p><p>Once drift is detected through automated monitoring that tests incoming data using the Kolmogorov-Smirnov and/or Cramér-von Mises methods detailed above, models become candidates for remediation. Model remediation involves confirming that an AI model has drifted and then initiating a rapid retraining process to update the model and return it to production. Unlike removing a failing model from production and training an entirely new one to replace it, remediation happens much more quickly. It typically leverages a baseline model and fine-tunes it with appropriate data to restore performance in hours, not the days, weeks, or months frequently needed to build a new model from scratch.</p>
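<p>In practice, the monitoring half of this loop can be sketched in a few lines: test each window of production data against a training sample and flag the model for remediation when both tests reject. The example below is illustrative only; the window size, threshold, and <code>remediate</code> callback are hypothetical stand-ins for whatever retraining workflow an organization actually runs:</p><pre>
# Illustrative monitoring loop: batch production values into fixed windows
# and hand the model off for remediation when drift is detected.
from collections import deque
import numpy as np
from scipy import stats

WINDOW = 500  # production inferences per drift check
ALPHA = 0.01  # significance level shared by both tests

def monitor(train_sample, production_stream, remediate):
    window = deque(maxlen=WINDOW)
    for value in production_stream:
        window.append(value)
        if len(window) == WINDOW:
            batch = np.asarray(window)
            ks_p = stats.ks_2samp(train_sample, batch).pvalue
            cvm_p = stats.cramervonmises_2samp(train_sample, batch).pvalue
            if ALPHA > max(ks_p, cvm_p):  # both tests reject: likely drift
                remediate()               # kick off retraining and redeployment
            window.clear()
</pre>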
<p>We’ll explore model remediation in more detail in an upcoming post. In the meantime, learn more about the model retraining step of remediation in our recent blog post “Why, When, and How to Retrain Machine Learning Models.”</p><h3>Key Takeaways</h3><ul><li>Model drift cannot be prevented, but its effects can be reduced.</li><li>The process of resolving the effects of model drift is called <strong>model remediation</strong>.</li><li>Remediation is much faster than training and deploying a wholly new model.</li><li>By remediating ML models, data teams can maximize models’ effective time in production.</li></ul><h3>Understanding Model Drift Is Essential for Effective AI</h3><p>Model drift is an inevitable fact for organizations using machine learning. Because the world is always changing, machine learning models in the real world soon begin to ingest data that looks different from their training data. When this data falls out of distribution, it can wreak havoc on model performance, especially in critically important applications like medicine and defense.</p><p>Fortunately, there are solutions to fix the problems that come with model drift. Model remediation quickly retrains struggling AI models to restore their performance and return them to production. By detecting drift quickly and immediately starting the model remediation process, data teams can reduce the effects of model drift and keep their models performing in production over the long haul.</p><h3>Frequently Asked Questions</h3><p><strong>What is model drift?</strong></p><p>Model drift, also known as model decay, model degradation, or the Day 3 Problem, occurs when the performance of a machine learning model deteriorates over time due to changes in the underlying data or the environment.</p><p><strong>Why does AI model drift happen?</strong></p><p>Model drift happens because the real-world data that models encounter in production can differ significantly from the data they were trained on. This non-stationarity in data may stem from natural changes, adversarial actions, time sensitivities, and other factors.</p><p><strong>How can we detect AI model drift?</strong></p><p>Model drift can be detected through continuous monitoring of model performance metrics, comparing predictions with actual outcomes, and conducting periodic evaluations against updated datasets.</p><p><strong>What are the consequences of ignoring AI model drift?</strong></p><p>Ignoring model drift can lead to inaccurate predictions, misguided business decisions, operational failures, and, in high-stakes scenarios, catastrophic consequences.</p><p><strong>Can AI model drift be prevented?</strong></p><p>While model drift cannot be entirely prevented, its impact can be mitigated through model remediation: the process of rapidly retraining models using data whose distribution more closely matches production data.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Operationalizing America’s AI Action Plan]]></title>
            <link>https://medium.com/@striveworks/operationalizing-americas-ai-action-plan-107fc1c07e35?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/107fc1c07e35</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[white-house-policy]]></category>
            <category><![CDATA[ai-action-plan]]></category>
            <category><![CDATA[ai-evaluation]]></category>
            <category><![CDATA[ai-adoption]]></category>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Fri, 29 Aug 2025 19:16:55 GMT</pubDate>
            <atom:updated>2025-08-29T19:16:55.692Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fVNIfBTPDWycjj_8" /></figure><p>The recent release of the White House’s AI Action Plan signifies a critical juncture in national strategy and technology, firmly establishing artificial intelligence as a central element of US national security and economic growth. This comprehensive blueprint is designed to accelerate the operationalization of AI at scale, with a strong focus on the Department of Defense.</p><p>At Striveworks, we are proud to have been at the forefront of operationalizing AI for critical defense and national security applications, and we applaud the ambitious objectives articulated in this Action Plan. The explicit focus on “winning the AI race” in the White House’s plan underscores the stakes of the intensely competitive global landscape. The “race” is not merely about the theoretical development of AI but, crucially, about its effective deployment in critical, often contested, operational settings. Conflicts in Europe and around the world have clearly demonstrated that if the United States fails to convert its academic and commercial leadership in AI into operational leadership, our adversaries will.</p><p>We commend the AI Action Plan for prioritizing real-world AI adoption and promoting a regulatory framework grounded in responsible use, with guidelines, best practices, and transparency rather than cumbersome, anti-competitive frameworks that stifle progress. With technology as paradigm-defining as AI, we have deep humility and skepticism about anyone’s ability to regulate AI correctly <em>a priori</em>. In particular, three of the Action Plan’s core themes (enabling AI adoption, leveraging commercial items, and the imperative of continuous AI evaluation) stand out as having a uniquely profound impact on its goal of making AI useful for those putting themselves at risk in defense of the United States.</p><h4>Enabling AI Adoption</h4><p>This pillar of the AI Action Plan emphasizes building an environment that fosters swift AI development and deployment. The speed with which AI is evolving, both in its potential capability and in its application to real-world tasks, underscores the necessity for the United States to reform the way it procures technology, especially AI-enabled technology. Two Striveworks founders have <a href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fwarontherocks%2Ecom%2F2024%2F08%2Frethinking-the-role-of-a-systems-integrator-for-artificial-intelligence%2F&amp;urlhash=8cH2&amp;trk=article-ssr-frontend-pulse_little-text-block">written previously</a> on how the federal government can and must adopt a fundamentally novel procurement model for AI that incentivizes competition, fast onboarding of AI vendors, and novel pricing models to give every AI developer the “right to inference.” Taking steps to make this a reality would foster the objectives of the AI Action Plan; shifting to a real-time bidding and matching market for AI, as has been done in the financial and advertising industries, would dramatically decrease the time and costs associated with acquisitions.</p><h4>Leveraging Commercial Items</h4><p>The private sector is the undisputed engine of progress in AI, as in other fields. 
Certain DOD programs have already embraced the Action Plan’s goal of “creat[ing] the conditions where private-sector-led innovation can flourish,” including the Army’s Next Generation Command and Control (NGC2) and Enterprise LLM Workspace programs and the Navy’s Project Overmatch. Efforts such as these should be expanded toward an economic model that is viable for large-scale procurement. Realizing this aspect of the AI Action Plan means embracing existing market-tested AI solutions and adapting government requirements to them, rather than wasting critical time and resources on bespoke, and often inferior, alternatives.</p><p>This emphasis on commercial items directly fuels competition, driving down costs and accelerating the deployment of secure, reliable AI systems. It bypasses lengthy development cycles, ensuring that our defense agencies acquire the best tools now, at the speed of relevance, while maintaining the ability to shift quickly to other commercial platforms if needed.</p><p>This agile approach empowers our nation to win the AI competition, leveraging the full force of American ingenuity to equip our warfighters and protect national security against emerging threats.</p><h4>Continuous AI Evaluation</h4><p>The White House’s AI Action Plan calls for building a robust AI evaluation ecosystem, including the establishment of a Virtual Proving Ground at DOD. To truly realize the plan’s vision and ensure that mission-critical AI systems remain effective, the United States must rethink traditional, discrete model evaluation. Evaluating AI should not be a one-time event run in a laboratory prior to deployment; rather, it should be a continuous, real-time process that runs in parallel with every model, especially those in production.</p><p>AI models, particularly in dynamic national security environments, are not static entities. They operate on ever-changing data, facing evolving threats, adversaries, and operational conditions. AI systems deployed in real-world environments face a host of dynamic challenges that static, one-time evaluations cannot predict or mitigate. Data drift and concept drift can erode a model’s accuracy as the characteristics of input data shift or as the relationships between inputs and outputs evolve. Adversarial attacks further complicate matters, with malicious actors actively manipulating data to exploit system vulnerabilities. Even benign changes, like failing sensors, network disruptions, or new environmental conditions, can degrade performance in unpredictable ways. In high-stakes defense contexts, this brittleness doesn’t just reduce effectiveness; it can introduce serious operational risk and erode trust in the reliability of AI-driven decision support.</p><p>To counter these challenges, a continuous evaluation ecosystem is essential. AI systems must detect performance degradation in real time and respond proactively rather than reactively. This enables rapid retraining and remediation through a responsive feedback loop that preserves operational readiness. Moreover, continuous evaluation fosters transparency and trust by providing an auditable record of system behavior over time. It also supports the development of more adaptive and resilient AI: models that remain effective even as mission needs and operational conditions evolve.</p><p>The AI Action Plan’s call for optimizing workflows and transitioning to AI-based implementations “as quickly as practicable” necessitates this continuous vigilance. 
True acceleration isn’t just about initial deployment; it’s about sustained, reliable performance to secure a national security advantage for the United States.</p><h4>A Vision for Continuously Evolving AI Dominance</h4><p>These tenets, underscored in the AI Action Plan, can help create a dynamic AI ecosystem driven by continuous, real-time optimization. This integrated approach, leveraging commercial innovation that is driven by continuous, fine-grained competition and rigorously assured through real-time data-to-model matching, transforms AI acquisition into an efficient market structure. It directly empowers the United States to maintain and expand its technological lead, ensuring that our national security remains at the forefront of AI, delivering a decisive advantage in an ever-accelerating global competition.</p><p><em>Originally published at </em><a href="https://www.linkedin.com/pulse/operationalizing-americas-ai-action-plan-james-rebesco-rv6sc/"><em>https://www.linkedin.com</em></a><em>.</em></p><p>Written by Jim Rebesco, Cofounder of Striveworks</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Choose an AIOps Vendor]]></title>
            <link>https://medium.com/@striveworks/how-to-choose-an-aiops-vendor-57d260d5633d?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/57d260d5633d</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Mon, 25 Aug 2025 20:32:01 GMT</pubDate>
            <atom:updated>2025-08-25T20:32:01.695Z</atom:updated>
<content:encoded><![CDATA[<p>Once you’ve settled the “build vs. buy” debate, your research enters a new phase: how to choose an artificial intelligence operations (AIOps) vendor.</p><p>That first decision is thorny enough. (Hint: Unless AIOps is a core competency, save your money, time, energy, and focus.) But choosing an AIOps vendor will start a cascade of questions that can rapidly overwhelm even the savviest business leader. “What AIOps vendors are out there?” “How can I tell them apart?” “What do I really need to know?”</p><p>These are important questions. You’re about to enter into a relationship that both you and your vendor hope will thrive for years. Yet every business leader has a story about a poor vendor choice that resulted in weak results, wasted time, and a lost opportunity for value generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/860/0*6yW0-MyLQ-8j1YLj" /><figcaption><em>Making the decision to buy a commercial AIOps platform unlocks a whole series of research and questions. Explore our guide below to figure out the most important factors in choosing the right AIOps vendor for your operations.</em></figcaption></figure><p>AIOps is no different. And with so many vendors popping up amidst the AI boom, choosing the right AIOps vendor can feel like a game of <em>Where’s Waldo?</em> You know the right option is out there. But, at first glance, it’s hard to tell a great AIOps vendor apart from all the lookalikes.</p><p>So, how can you separate the wheat from the chaff to get the AIOps platform your organization needs?</p><p>We’ve put together this list of questions we’d ask if we were selecting an AIOps vendor (instead of using the <a href="http://www.striveworks.com">Striveworks AIOps platform</a> that we built). As you start evaluating your options, make sure your short list has good, trustworthy answers to these questions, or you may end up picking the wrong guy in a striped shirt.</p><h3>1. How does your AIOps platform handle data governance and compliance?</h3><p>It may seem strange to start your exploration of AIOps capabilities here, but data governance and compliance are among the most foundational features for reliable AI and machine learning (ML). Organizations that operate in regulated spaces already know this fact. But all organizations should take note of data governance as the AI landscape shifts in coming years.</p><p>AI and ML are in their early stages. They’ve only recently left the lab to find useful application in the real world. In fact, many AI frameworks and use cases are so new that legislation hasn’t caught up to them yet. AIOps platforms that struggle with data governance are at serious risk of displacement when regulators get around to inspecting and establishing guidelines around AI auditability and oversight. If your AIOps vendor is cavalier about security, privacy, and data lineage, you’re at risk of losing access to your critical models when the regulators find out.</p><h3>Follow-up questions:</h3><ul><li>How does your platform handle data encryption?</li><li>Do you have role-based access controls?</li><li>How do you handle versioning and data lineage?</li></ul><h3>2. Is your platform open or closed?</h3><p>AIOps isn’t the end point for your data; it’s part of the process for transforming your data into insights. Ideally, this grunt work would disappear into the background. 
For that to happen, you need an AIOps platform that integrates easily with the other solutions you use: data sources, annotation suites, orchestration tools, business intelligence studios, and so on. An open platform ensures that you can configure the workflow you need, plugging in AIOps capabilities as appropriate. Closed platforms (you know who you are) prevent these integrations and impose heavy burdens on extracting data, switching vendors, or customizing your solutions. As you screen vendors, envision how AIOps will sit within your workflow, and consider whether the vendor lock-in of a closed platform is worth a “killer feature.”</p><h3>Follow-up questions:</h3><ul><li>What’s your data ownership policy?</li><li>Are your model and data formats proprietary?</li><li>Do you have an open application programming interface (API)?</li><li>What integrations does your platform have?</li></ul><h3>3. Does your platform have no-code support for any AIOps processes?</h3><p>User experience is the ultimate make-or-break characteristic of a new technology. A user-friendly workflow is essential for your team to adopt an AIOps platform and use it to advance their productivity. If the experience is lacking, you may as well just use Jupyter Notebook.</p><p>A low-code to no-code user interface can greatly accelerate standard parts of the model development process, such as dataset creation, annotation, and training. At the same time, you want to maintain options for code-first users who need more fine-grained control over model building, deployment, and maintenance. Before investing in a relationship with an AIOps vendor, consider how a platform will streamline steps in your team’s workflow to help them work faster or more effectively toward your business outcomes.</p><h3>Follow-up questions:</h3><ul><li>What no-code elements does your platform offer?</li><li>Is a software development kit (SDK) necessary to build and deploy an ML model with your platform?</li><li>What options do you have for code-first users?</li><li>How does your platform save time for both experts and novice users?</li></ul><h3>4. How does your AIOps platform perform at 50 models? At 5,000?</h3><p>Today, you may only have a few models in production. But if you are reviewing AIOps platforms, you obviously plan to expand that number in the near future, and you need a platform that grows with you. Ask your candidate vendors how their platforms perform at each tier of service. Can they handle serving thousands of concurrent models without a decline in performance? What are the technological requirements for that increase?</p><p>Consider pricing as well. Does the vendor charge a standard licensing fee? Or does it charge per model, GPU hour, or inference? Depending on your use case, you may find that certain pricing models make sense at the start of your AIOps journey but become prohibitively expensive as your adoption scales.</p><h3>Follow-up questions:</h3><ul><li>How does platform licensing work?</li><li>Can you explain your process for model licensing? Data licensing?</li><li>How does auto-scaling ensure high availability while minimizing compute costs?</li></ul><h3>5. Can you deploy on premises?</h3><p>Flexible deployment may not matter to most organizations, but it’s a critical capability for those who need it. Most corporate enterprises are best off deploying their AIOps capabilities on a commercial cloud. 
But plenty of organizations, such as defense and intelligence agencies, aid groups and NGOs, and mining companies, deal with highly sensitive data or operate in remote locations far from robust, affordable internet connectivity. These types of organizations need the ability to deploy machine learning tools on premises, potentially in a disconnected environment. Many AIOps platforms rely on commercial cloud technologies to operate, blocking them from projects like delivering a large language model (LLM) and retrieval-augmented generation (RAG) pipeline on an air-gapped, top-secret network. You already know if you need this flexibility for your data, but it’s important to recognize that not all vendors offer it.</p><h3>Follow-up questions:</h3><ul><li>What cloud providers have you deployed into?</li><li>Does your platform have an authority to operate in IL4, IL5, IL6, or IL7 environments?</li></ul><h3>6. What are your areas of expertise in AIOps?</h3><p>Like all vendors, AIOps providers will be stronger in certain areas based on their technology and customer profiles. Certain companies may have strengths with structured data, others with unstructured data. Some specialize in handling the sensitivities needed for public sector customers while others are commercial-focused. One platform may shine with early stages of the AIOps workflow, like data preparation and annotation. Others may lead the industry in post-production: monitoring, evaluation, and retraining.</p><p>There’s no right or wrong answer here, just a spectrum of strengths and weaknesses that make platforms right or wrong for your applications. Consider your biggest needs. Are you just getting models into production? Do you have a primary data type you interact with? Are managing drift and remediating degraded models major concerns? Your answers will point you toward one solution over another.</p><h3>Follow-up questions:</h3><ul><li>Does your platform support video streaming?</li><li>Does your platform support multispectral and hyperspectral imagery?</li><li>Does your platform support specialized image formats, such as NITF and TIFF?</li><li>Does your platform fully use metadata from files?</li><li>How do I build a new model with your AIOps platform?</li><li>How do I deploy models into production with your AIOps platform?</li><li>How do I maintain models in production using your AIOps platform?</li></ul><h3>7. How involved is your customer success team?</h3><p>Any good AIOps vendor will assign you a dedicated account manager when you become a customer. But every customer success and support organization operates differently. Find out how many customers each account manager oversees and what their standard level of involvement is. No one wants a vendor that disappears once your system is up and running. But too much involvement from an account manager can also raise a red flag, indicating problems with the software or other difficulties with self-service.</p><h3>Follow-up questions:</h3><ul><li>How can you prove the value of the platform?</li><li>Can I see your product documentation?</li><li>What training do you offer?</li><li>What’s your experience delivering professional services?</li><li>Which phases of the model life cycle can your team help support?</li></ul><h3>8. Do you have a case study with a customer who is similar to me?</h3><p>When free trials aren’t reasonable (a common challenge with enterprise B2B solutions), case studies are your best assurance that an AIOps platform will work for you. 
Success stories and client testimonials from potential vendors are worth evaluating closely for key details. Specifically, consider:</p><ul><li>What data types were involved?</li><li>What did the vendor do (vs. what did the customer do)?</li><li>What industries were involved?</li><li>What kind of results did the project produce?</li><li>How reproducible are the results?</li></ul><p>Obviously, case studies are marketing tools meant to make the vendor look good. But they also contain essential details that suggest whether a vendor has the experience to help you the way it has helped similar customers.</p><h3>Follow-up questions:</h3><ul><li>Who was involved with this case study?</li><li>What gave rise to the results?</li><li>Can I speak with your customers?</li></ul><h3>9. Can you share your product roadmap?</h3><p>Innovation isn’t mandatory in all areas: A mousetrap is still effective over 100 years after its invention. But AI and ML are changing so rapidly that you want to partner with an AIOps vendor that has clear foresight into developments coming down the pike. Technological obsolescence can ruin the value of your AI models, so you want to make sure that your vendor is investing in and developing capabilities that are meaningful to you over the long haul. Look at your prospective vendors’ roadmaps to confirm that they plan to add new, relevant capabilities in the coming months, and that those capabilities meet your expectations for the growth of your AI practice.</p><h3>Follow-up questions:</h3><ul><li>What’s your area of focus over the next quarter? And the next year?</li><li>What is the long-term vision for the company?</li><li>What capabilities are you adding for <em>X</em>?</li></ul><h3>10. What’s the time to value for your AIOps platform?</h3><p>Cost makes a difference. That said, for a tool that accelerates results and profits as rapidly as AI does, total cost of ownership is the wrong metric. Instead, explore each platform’s time to value.</p><p>AI use cases, models, and data all vary widely. At the same time, each AIOps vendor may use a different pricing model: usage-based pricing, subscription models, model-deployment pricing, or something else entirely. One vendor’s low up-front cost may work best for some scenarios while another’s monthly subscription may make sense for others. Ideally, you want a custom estimate of the time to value for your most challenging problem. From that number, you can extrapolate the value possible from deploying additional models and putting your platform to full use.</p><h3>Follow-up questions:</h3><ul><li>How does your pricing model work?</li><li>Do you have cost comparisons with other platforms?</li><li>What cost-saving programs or features do you offer?</li></ul><p>***</p><p>Choosing an AIOps vendor is a major decision for the maturity of your AI program and your organization as a whole. AI and ML can deliver transformative capabilities, and AIOps platforms are instrumental in putting models into production and keeping them working at scale.</p><p>Let these questions guide you in selecting an AIOps vendor that delivers what you need for your organization. With the rise of AI, more options are available than ever before, with the expertise in data types, workflows, integrations, deployment options, and industry specifics to propel your AI program forward.</p><p><em>Want to know more about the Striveworks AIOps platform? 
</em><a href="http://www.striveworks.com"><em>Request a demo today.</em></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Eric Korman Explains Valor and Its Step Change for Model Evaluation]]></title>
            <link>https://medium.com/@striveworks/eric-korman-explains-valor-and-its-step-change-for-model-evaluation-01f3cb616e75?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/01f3cb616e75</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Fri, 22 Aug 2025 20:05:57 GMT</pubDate>
            <atom:updated>2025-08-22T20:05:57.506Z</atom:updated>
<content:encoded><![CDATA[<p>Eric Korman is the Chief Science Officer at Striveworks. He leads our Research and Development Team, which recently released Valor, our first-of-its-kind evaluation service for AI models.</p><p>We caught up with Eric to learn more about Valor, AI model evaluation, and why this open-source tool is a game changer for maintaining the reliability of AI models in production.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2XwNqPYfRQVZNBz_H5He-Q.png" /></figure><h3>How did your machine learning research ultimately lead to Valor?</h3><p>In the AIOps space now, there are a lot of point solutions around model deployment and data management and experiment tracking. But what was really lacking, before we launched Valor, was a modern evaluation service. This is a service that will compute evaluations for you, store them centrally, make them shareable and queryable, and also provide more fine-grained evaluation metrics than just a single, all-encompassing number. It lets you really get an understanding of how your model performs across different segments of your data, properties, those things. That’s the need we saw, so we built Valor to address that need.</p><h3>Can you explain why that’s valuable? Couldn’t I just put a model in production and see how it performs for myself?</h3><p>You can definitely do an eye test, but that’s not always reliable or quantitative. Plus, we’re seeing this explosion of AI, so there’s a proliferation of models available to use. Teams are deploying multiple models at once. Deploying models and spot-checking how they perform is not a very scalable way to evaluate them. You want something systematic and also something you can trust. Valor is open-source. You can see exactly how it computes metrics, so you know exactly where the number it spits out came from.</p><p>In my experience, on teams I’ve been on and speaking to other data scientists, evaluation may be done programmatically, but then it’s stuck in some spreadsheet somewhere, or some report, or some Confluence page. There’s a lack of auditability and trustworthiness. So, Valor handles that, not just by computing the metrics for you but also by storing them for you, so you can trust your model evaluations and do it at scale.</p><h3>Valor is the first of its kind, in terms of an open-source solution. What other approaches were people using for model evaluation before we launched it?</h3><p>Valor is pretty unique. People have built their own internal solutions. We see a lot of that in general in the AIOps space: a mix of taking something that’s open-source, expanding upon it, and integrating it with something you build internally. A lot of stuff is still done in Jupyter Notebooks, which are great for exploratory data analysis. You can do some model evaluation there. But really, we haven’t seen an evaluation service before, because we’re just now getting to the point where data science is not a science experiment but something done at scale and in production.</p><p>You see a lot of tools where the end output is a report that goes to someone. That’s cool, but when you start deploying these things, you want systems and processes that tie into each other. You don’t want a report; you want a service that you can query to get metrics, and then you might want to act on that information in an automated way. So far, people have had to build that in-house. 
There are not many solutions that are open-source and general-purpose the way Valor is.</p><h3>Would you say that being open-source is a big part of what Valor brings to the table?</h3><p>I don’t think being open-source makes it unique. Its functionality is what’s unique. Valor is able to encompass a lot of use cases with the right kind of abstractions and generalities. There’s uniqueness in that it both computes and stores the metrics. Maybe most important is the flexibility it gives you to attach different metadata and information to your models or datasets or data points and, really, evaluate model performance against that data and stratify evaluations with respect to different metadata. It gives you ways of determining bias, and things like that.</p><p>We see lots of AI research and AI tools that came up from these benchmark datasets: your <a href="https://paperswithcode.com/sota/image-classification-on-imagenet">ImageNets</a>, your <a href="https://paperswithcode.com/sota/object-detection-on-coco">COCO</a>. They’re simple datasets in the sense that, in the real world, you have imagery datasets, but you’ll have a host of metadata attached to them. “What sensor took this picture?” “What was the location of this picture?” “What time was it taken?” All that rich metadata can be useful to understand model performance. If you’re running the same model on images that come from a bunch of different cameras, you might want to know, “Does my model perform better on one camera versus another?” “What about nighttime versus daytime?”</p><p>For all those more in-depth questions, there were not a lot of tools, until Valor, that helped you get that insight into model performance.</p><h3>How do you see Valor getting integrated into AI workflows? How are people using it on a day-to-day basis?</h3><p>With our open-source offering of Valor, we’re hoping that it’ll get community adoption and integrate with an organization’s AIOps tech stack. It fills a gap. If an organization has its setup for doing model training and deployment, it’s probably missing this evaluation piece, and Valor was engineered so that it can easily integrate with what comes to the left of model evaluation.</p><h3>What motivated the decision to open-source Valor? How do you think it will shape the evolution of the product?</h3><p>There were a few reasons for wanting to open-source it. The virtuous reason is that the AI community does a really good job of open-sourcing things. Everything in AI is built on top of open-source foundational pieces. All the deep learning libraries that are used are open-source. Underlying databases are open-source. So, we wanted to give back.</p><p>Then, obviously, the business case: It’s a piece of our brand. It’s showing off what we can do. It’s also showing off our worldview. We think we have a unique and emerging view of AIOps, where we really want every piece to be not just scalable but also auditable, and, again, where the output is not a report but something that you can take action upon in an automated way. Another phrase we like to use: People are used to “infrastructure as code,” but we want to take that further and really do “process as code.” That gives you auditability of processes.</p><h3>Where do you think Striveworks fits within the broader AI community?</h3><p>Valor is a good example of one of the ways that we are trying to reach out and be an axon to connect to the rest of the AIOps community. 
We’ve done a really good job of building partnerships on the application layer: companies and industries we can partner with to make their products or applications intelligent by putting Striveworks machine learning under the hood. Now, this is a way to do that sort of thing but at a lower level: the AIOps platform level.</p><h3>How do you see the future of model evaluation? Where do you see Valor fitting into that future?</h3><p>At Striveworks, we talk about the Day 3 Problem. On Day 1, you build a model. Day 2, you deploy a model. Day 3, when the model is in production and things change, is when that model fails. Talking about Day 1, Day 2, and Day 3 implies a line, but really, at failure, you go back to Day 1. You want to refine and retrain your model to make it more performant. Striveworks has a lot of IP around that Day 3 Problem, not just through Valor but also through our model monitoring techniques and other tools.</p><p>How do we detect model failure? When we say failure, we don’t mean when the network goes down or some computer crashes, so your model’s not deployed. You can detect that straightforwardly. Rather, failures due to data drift are what we’re trying to build technologies to detect. Valor does that by doing model evaluation against human ground truth. Our monitoring capability does that in an unsupervised way, where it detects changing input data and flags it.</p><p>Going back to our viewpoint that everything should be a process and easily integrate with the rest of the pieces of the pipeline, that’s where we’re going with our offerings, including Valor. So, if model drift is detected or evaluation numbers go down, that’s not just an email that gets sent to someone; it can be fed into a retraining pipeline.</p><p>For example:</p><p>“Through monitoring, determine the data points where the model performs poorly.”</p><p>“From that, automatically create an annotation job to get a human to annotate that data and get it retrained.”</p><p>“Via Valor, compute metrics and see if the model actually improved on that newly annotated data compared to the previous model.”</p><p>It’s just really building this pipeline. It’s not going to automate everything, at least right now, but it’s going to automate the parts that are automatable, have a human do what humans excel at, and let computers do what computers excel at.</p><h3>Looking ahead, what’s your vision for Striveworks? How does Valor align with these long-term goals?</h3><p>The lofty goal is to be the premier company for Day 3 technology: model monitoring, detecting model failure, model evaluation. Then, being able to do that in a way that makes models easy to remediate through retraining and fine-tuning. Valor fits into that as one of the critical components for identifying whether your models have problems.</p><p>Interested in model evaluation with Valor? Try it yourself. Get Valor from the <a href="https://github.com/Striveworks/valor">Striveworks GitHub repository</a> to start using it in your machine learning workflows today, and read <a href="https://striveworks.github.io/valor/">Valor’s official documentation</a> to learn more.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why, When, and How to Retrain Machine Learning Models]]></title>
            <link>https://medium.com/@striveworks/why-when-and-how-to-retrain-machine-learning-models-31264d285283?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/31264d285283</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Fri, 22 Aug 2025 20:05:41 GMT</pubDate>
            <atom:updated>2025-08-22T20:05:41.669Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fa_9WxOtI_SjfpZYmiUdGw.png" /></figure><p>Once you train a machine learning model effectively, it can perform incredibly useful tasks, like detecting objects in real-time video, summarizing huge documents, and even predicting the likelihood of wildfires.</p><p>Unfortunately, the useful lifespan of many trained models is very short, which is one reason that the majority of trained models never make it into production, and why AI initiatives fail to reach maturity. These limitations raise important questions:</p><ul><li>Why do so many models have such a short lifespan?</li><li>What causes model degradation?</li><li>How can you detect when a model is no longer functioning properly?</li><li>And, most importantly, how do you resuscitate a failing model?</li></ul><p>Let’s explore the reasons why so many organizations find themselves with models that fail, and how we can use the <a href="http://www.striveworks.com">Striveworks</a> platform to ensure that models remain relevant.</p><h3>Why Do Machine Learning Models Need Retraining?</h3><p>Once trained, most models are static.</p><p>For models to adapt, they need feedback. To train, and to get better at its task, a model has to know whether the output it generated was right.</p><p>An AI model playing a game can determine whether or not it won, which gives it feedback. A model predicting the rest of the word you are typing gets feedback when you finish that word and move on.</p><p>But an AI model that is detecting people in a security camera feed has no way to know if it missed someone or if it thought the mannequin in the background was actually a person. There’s no automatic feedback mechanism.</p><p>However, while models are generally static, the world around them is constantly evolving. Sooner or later, data from the real world looks very different from the data used in the model’s training. Any model’s performance will be poor (or, at best, occasionally lucky) when the data it is ingesting is significantly different from the data on which it was trained. As a result, the model’s performance (its correctness) degrades. This is commonly known as model drift. If the operating environment evolves very fast, that drift can happen in days or even hours.</p><p>At Striveworks, we call this the Day 3 Problem:</p><ul><li>On Day One, your team builds a model.</li><li>On Day Two, you put that model into production.</li><li>On Day Three, that model fails and you’re left to deal with the aftermath.</li></ul><p>When a previously effective model performs poorly on current data, the solution is always the same: Retrain or fine-tune the model using that same current data.</p><p>Below, we illustrate a few scenarios in which model drift can negatively impact model performance. 
These examples focus on computer vision models, but the same issues occur with natural language processing models, tabular classification models, and every other domain.</p><h3>Example 1: Facial Recognition Systematically Failing for People of Color</h3><p>It has been <a href="https://gizmodo.com/how-apple-says-it-prevented-face-id-from-being-racist-1819557448">widely</a> <a href="https://www.wired.com/story/can-apples-iphone-x-beat-facial-recognitions-bias-problem/">reported</a> that early systems for facial recognition failed dramatically on people with darker skin tones.</p><p>In the early days of facial recognition, the problem was that all (or nearly all) of the training data used to build the facial recognition systems consisted of white faces.</p><p>In effect, the world the model saw, because of its training, consisted only of white people. Obviously, the world in which the model was deployed looked very different.</p><p>In this case, the real world didn’t change. However, the model still needed retraining because of the discrepancy between training data and the data the model encountered in production.</p><h3>Example 2: Seasonality Changing the Natural Environment</h3><p>Let’s suppose you want to use a model to identify building footprints, roadways, and waterways in the United States from satellite imagery.</p><p>To train your model, you collect recent sample imagery from each state and then diligently label all the collected imagery. This effort gets you a good representation of built features across the country.</p><p>Although the model may initially work very well, as summer turns to fall or fall turns to winter, the image characteristics change significantly.</p><p>Since the training data was all recent, it all came from the same season. While it is true that spring in Alaska looks quite different from spring in Texas, a change in the seasons will still cause a significant shift in the makeup of the imagery. The images could have more green or brown in warmer months and more white in colder months. The majority of water may be frozen in January but liquid (or even evaporated) by August.</p><p>Each of these factors alone could impact a model. All of them together will almost certainly produce negative effects.</p><h3>Example 3: New Vehicles on the Roads</h3><p>Consider a model that is monitoring vehicles in a parking lot. One of its tasks is to classify vehicles by their make and model.</p><p>Let’s assume your AI model is well trained on imagery provided by your cameras over a reasonable time period. It performs well.</p><p>Because every year there are new makes or models of vehicles on the road, your model will eventually start to see vehicles that weren’t included in its training data. Over time, vehicles that didn’t exist when the training data was collected will start to visit the parking lot.</p><p>If the new vehicles are similar enough to old vehicles, then your model will likely continue to correctly identify them. 
But in the case of a brand-new make or model, your neural network may not even have the “language” required to provide a correct response.</p><p>Until the machine learning model is retrained with an expanded vocabulary, it will make wrong identifications when encountering these new vehicles, consistently confusing them with a vehicle already in its vocabulary.</p><h3>Example 4: New Sensors Altering Imagery</h3><p>What if you have a model trained to recognize people in your security camera feed?</p><p>In this scenario, the model is well trained on imagery taken directly from the camera over a long period of time (e.g., in varying weather, seasons, congestion, etc.).</p><p>After some time, the security camera receives an upgrade to a much higher resolution.</p><p>Intuitively, you may expect that the model would perform better with sharper, clearer imagery. But the reality is that the model will likely perform worse.</p><p>Again, the reason is that image characteristics have significantly changed. Many parameters change with new cameras, including the perceived size of the people in the image. The model may have learned that people are typically 30 pixels wide and 100 pixels tall. This learned feature, along with others operating under this assumption, served the model well with the original camera resolution. But the new camera’s resolution is at least twice that of the original. As a result, people are now typically 60 pixels wide and 200 pixels tall, or more if the resolution is even higher.</p><p>With this higher resolution, all the model’s internal learning about what a person looks like is wrong. As a result, the model will almost certainly make many mistakes.</p><h3>When Do Machine Learning Models Need Retraining?</h3><p>The short answer to this question is obvious: When a machine learning model no longer performs well enough, it needs retraining. And, of course, “well enough” depends on the user and the use case.</p><p>But the follow-up is more serious: How do we know when a model isn’t performing well? Keep in mind: When a model is in production, there are usually no ground truth labels we can use to measure our model’s output.</p><p>One option is to require human-in-the-loop monitoring of a model’s output, say, by randomly sampling inference outputs, having a human supply the correct label for each, and scoring the model against those human-assigned labels.</p><p>But this method is labor intensive and slow: By the time problems are detected, a bad model may have been in use for some time. It is much better to take poorly performing models out of production before they cause too much damage.</p><p>Yet, there is a second answer to our original question that is less obvious but lends itself to fully autonomous monitoring.</p><p><strong><em>When do machine learning models need retraining? When you no longer trust the model.</em></strong></p><p>Trust in a model can and should erode quickly when we apply the model to data that is unlike its training data. Fortunately, there are a variety of ways to automatically determine if new data that a model is seeing is similar to the model’s training data. 
Out-of-distribution (OOD) detection algorithms can show us whether new data is similar enough to old data.</p><h3>Out-of-Distribution (OOD) Detection With the Striveworks AIOps Platform</h3><p>In the Striveworks AIOps platform, OOD detection begins with a characterization of the dataset.</p><p>Figure 1 illustrates the process of characterization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*pi3IN03V7L9Qn8JJwyfxEg.png" /><figcaption><strong><em>Figure 1:</em></strong><em> Illustration of how a collection of images gets transformed through a neural network model into a collection of embeddings (or lower-dimensional representations of data). We can describe the embedded dataset statistically by fitting a multivariate normal distribution and recording the mean and covariance.</em></figcaption></figure><p>First, we take a dataset and use a generic neural network embedding model to generate low-dimensional representations of each datum (i.e., each image).</p><p>Think of each embedding as just a vector: a list of numbers.</p><p>Figure 1 illustrates these embeddings as two-dimensional vectors: x-y coordinates.</p><p>With our collection of embedding vectors in hand, it is a straightforward process to statistically describe them by computing their sample mean and covariance.</p><p>The mean and covariance characterize our data by a statistical distribution. As Chebyshev’s inequality from introductory statistics tells us, at least 75% of the data points in a dataset lie within two standard deviations of the mean, at least 89% lie within three standard deviations, and so on. (These bounds hold for any dataset, regardless of distribution. The percentages are even higher if the data is normally distributed.)</p><p>This implies that it is rare for data to be a large number of standard deviations away from the mean.</p><p>If you observe a single point that is, say, five standard deviations away from the mean, that is rare but not necessarily shocking: By the same inequality, up to 4% of a dataset could be that far. However, if you start to observe many points that are five or more standard deviations away from the mean, then you can trust that at least some of these new points are from a different data distribution than your training data. They are “out of distribution.”</p><p>Using the mean and covariance calculated above, we can compute the <a href="https://en.wikipedia.org/wiki/Mahalanobis_distance">Mahalanobis distance</a> from any new datum’s embedding to the training distribution. This distance plays the role of “number of standard deviations from the mean” when the data has more than one dimension.</p><p>Of course, there is nothing special about five standard deviations or a particular Mahalanobis distance. Rather, a data science practitioner must decide the exact distance at which data points are surprising enough to be called OOD.</p>
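<p>The mechanics are simple enough to sketch in a few lines. The example below is illustrative only, not Striveworks platform code; the synthetic two-dimensional embeddings and the threshold of five are hypothetical stand-ins for values a practitioner would choose:</p><pre>
# Illustrative OOD check: characterize training embeddings by their mean and
# covariance, then score new embeddings by Mahalanobis distance.
import numpy as np

def characterize(train_embeddings):
    """Fit a mean and inverse covariance to the embedded dataset."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    delta = x - mean
    return float(np.sqrt(delta @ cov_inv @ delta))

rng = np.random.default_rng(0)
train = rng.normal(size=(5_000, 2))   # stand-in for real embeddings
mean, cov_inv = characterize(train)

THRESHOLD = 5.0                       # practitioner-chosen cutoff
new_point = np.array([6.0, -4.5])     # embedding of a new datum
score = mahalanobis(new_point, mean, cov_inv)
print(f"distance={score:.1f}:", "OOD" if score > THRESHOLD else "in distribution")
</pre>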
<p>In addition to identifying individual data points that are far from the training distribution, the Striveworks AIOps platform allows users to check whether an entire data stream being passed to a model has drifted from that model’s training distribution. Along with aggregating the OOD detection scores discussed above, we also apply a few classical tests from non-parametric statistics to the inference data stream. As in the case of OOD detection, we use a neural network model to compute embeddings for both the training and inference data, and then we apply the multidimensional <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov-Smirnov</a> and <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion">Cramér-von Mises</a> tests to check for model drift. These tests use two different criteria for measuring the distance between the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution functions</a> of the training and inference data.</p><h3>How Do I Retrain a Machine Learning Model?</h3><p>The mechanics of retraining a machine learning model are (nearly) the same as those used when initially training the model. But this time, we have a model in hand that already performs well on relevant data.</p><p>Model training (or retraining) always starts from some <a href="https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html">set of weights</a>.</p><p>Those weights may be chosen randomly from some statistical distribution. This is what happens if the model has never before been trained. Alternatively, weights may be seeded from a previous training run when one is available. In some cases, weights may come from a training of the model on a completely different dataset. For example, for computer vision models, it’s common to use “pre-trained” weights resulting from training on the ImageNet dataset. When this happens, we generally don’t consider it to be retraining a model. Instead, we consider this a case of transfer learning.</p><p>Model retraining relies on a good seed: initializing the weights from the original model training.</p><p>Having good initial weights for retraining allows the process to run much faster than the initial, from-scratch training of the model. This is because we are essentially starting out with a 90% solved problem and just looking to improve at the margins.</p><p>When retraining a machine learning model, the single most critical component is the curation of the dataset. We cannot simply resume training on the original dataset and expect to get better performance on the new, OOD data because the training set, not the model being trained, dictates what is in- or out-of-distribution. If the dataset doesn’t change, the distribution hasn’t changed, and OOD data is still OOD. You can’t feed the model more examples of dogs and expect it to learn to recognize cats.</p><p>It is therefore essential to include recent, novel data in the new dataset for retraining.</p><p>Because this data is novel and OOD with the original training data, we need a human to help annotate it prior to training.</p><p>There are still big questions outstanding, though: Should you include the original dataset in the new training data? What ratio of novel OOD data to original data should you maintain?</p><p>In many cases, it will be beneficial to include the original data and simply augment it with some novel data. These are cases when the real-world data has expanded in scope, rather than changed.</p><p>Consider the first example above that discussed facial recognition. To solve the lack of representation, it wouldn’t make sense to replace all the white faces with non-white faces. That would simply skew the model in a different, wrong direction. Instead, the distribution needs to expand to include both.</p>
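<p>In code, these retraining mechanics reduce to a short loop: seed the model with its previous weights, then fine-tune it on a curated dataset that augments the original data with the novel, annotated data. The PyTorch sketch below is a hypothetical illustration of that pattern, not Striveworks platform code; the <code>retrain</code> helper, dataset objects, and hyperparameters are all stand-ins:</p><pre>
# Illustrative retraining: start from the original weights ("a 90% solved
# problem") and fine-tune on original-plus-novel data at a low learning rate.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader

def retrain(model, weights_path, original_ds, novel_ds, epochs=5):
    # Seed the weights from the original training run.
    model.load_state_dict(torch.load(weights_path))

    # Augment, don't replace: keep the original distribution and expand it.
    loader = DataLoader(ConcatDataset([original_ds, novel_ds]),
                        batch_size=32, shuffle=True)

    # A small learning rate nudges the model toward the new data
    # without destroying what it already knows.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model
</pre>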
<p>In other cases, there may be multiple viable routes to good results. With seasonal changes (e.g., the second example above), we expect the data to eventually return to the original distribution, but most or all of the current data may be different from the original training data. In this case, we can try to build a single model that can operate in both conditions. To do so, we’d want to augment the initial training data to include new seasonal data. Or, we could build a small collection of specialized models, each operating only on in-season data, and swap them in or out as seasons change. Here, we would want to have distinct training datasets for each season. New data would become its own dataset.</p><p>If the world has fundamentally changed and will not return to the old normal (e.g., Example 4 above with new sensors collecting data), then it no longer makes sense to include the old data in our new dataset.</p><h3>What Tools Does Striveworks Have to Help?</h3><p><a href="http://www.striveworks.com">The Striveworks AIOps platform</a> is centered on postproduction machine learning: monitoring, evaluating, and retraining machine learning models back to good performance.</p><p>When you first register a dataset in the platform, registration triggers a statistical analysis of the dataset.</p><p>The resulting characterization (mean, covariance, etc.) is stored and associated with the dataset. This information is used to recommend models that may work well for the dataset. It also forms a useful starting point for training a new model and recognizing OOD data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DK-Zrqkkdc-WFppYPTkWBA.png" /><figcaption><em>Figure 2: The view of a dataset in the Striveworks AIOps platform. Registering the dataset triggers an analysis of its statistical description, which gets associated and stored with the dataset.</em></figcaption></figure><p>If a model is trained on a registered dataset, then we can automate drift detection on the model while it is in production. The Striveworks AIOps platform has several options for assessing drift, including semantic monitoring, the Kolmogorov-Smirnov test, and the Cramér-von Mises test. Each option characterizes model inputs as either in distribution or OOD, flagging the associated inference to indicate whether drift has been detected.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Udl57Ry2pkumonkLGVox0w.png" /><figcaption><em>Figure 3: Models trained in the Striveworks AIOps platform have several options for detecting drift on incoming data. When drift is detected, the platform flags the inference in red, as shown above.</em></figcaption></figure><p>If the user observes enough OOD data, they need to take action.</p><p>The first step, as discussed above, is curating improved, more current data. Striveworks users can easily load new datasets or choose a registered dataset from their existing catalog.</p><p>However, the most useful data for retraining is already stored on the platform. The Striveworks inference store captures every output generated by your production models. When you receive an alert that your production data is OOD with a model’s training data, you can explore this inference store to confirm or refute any errors in prediction. If your data has indeed drifted away from your model’s training set, you fortunately have a ready-made collection of specific, current data on which to retrain your machine learning model. 
<p>When retraining a machine learning model, the single most critical component is the curation of the dataset. We cannot simply resume training on the original dataset and expect to get better performance on the new, OOD data because the training set, not the model being trained, dictates what is in- or out-of-distribution. If the dataset doesn’t change, the distribution hasn’t changed, and OOD data is still OOD; you can’t feed the model more examples of dogs and expect it to learn to recognize cats.</p><p>It is therefore essential to include recent, novel data in the new dataset for retraining.</p><p>Because this data is novel and OOD relative to the original training data, we need a human to help annotate it prior to training.</p><p>There are still big questions outstanding, though: Should you include the original dataset in the new training data? What ratio of novel OOD data to original data should you maintain?</p><p>In many cases, it will be beneficial to include the original data and simply augment it with some novel data. These are cases when the real-world data has expanded in scope rather than changed.</p><p>Consider the first example above, which discussed facial recognition. To solve the lack of representation, it wouldn’t make sense to replace all the white faces with non-white faces. That would simply skew the model in a different, wrong direction. Instead, the distribution needs expansion to include both.</p><p>In other cases, there may be multiple viable routes to good results. With seasonal changes (e.g., the second example above), we expect the data to eventually return to the original distribution, but most or all of the current data may differ from the original training data. In this case, we can try to build a single model that can operate in both conditions. To do so, we’d want to augment the initial training data to include new seasonal data. Or, we could build a small collection of specialized models, each operating only on in-season data, and swap them in or out as seasons change. Here, we would want distinct training datasets for each season; new data would become its own dataset.</p><p>If the world has fundamentally changed and will not return to the old normal (e.g., the fourth example above, with new sensors collecting data), then it no longer makes sense to include the old data in our new dataset.</p><h3>What Tools Does Striveworks Have to Help?</h3><p><a href="http://www.striveworks.com">The Striveworks AIOps platform</a> is centered on postproduction machine learning: monitoring, evaluating, and retraining machine learning models back to good performance.</p><p>When you first register a dataset in the platform, the registration automatically triggers a statistical analysis of the dataset.</p><p>The resulting characterization (mean, covariance, etc.) is stored and associated with the dataset. This information is used to recommend models that may work well for the dataset. It also forms a useful starting point for training a new model and recognizing OOD data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DK-Zrqkkdc-WFppYPTkWBA.png" /><figcaption><em>Figure 2: The view of a dataset in the Striveworks AIOps platform. Registering the dataset triggers an analysis of its statistical description, which gets associated and stored with the dataset.</em></figcaption></figure><p>If a model is trained on a registered dataset, then we can automate drift detection on the model while it is in production. The Striveworks AIOps platform has several options for assessing drift, including semantic monitoring, the Kolmogorov-Smirnov test, and the Cramér-von Mises test. Each option characterizes model inputs as either in distribution or OOD, flagging the associated inferences so users know when drift has been detected.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Udl57Ry2pkumonkLGVox0w.png" /><figcaption><em>Figure 3: Models trained in the Striveworks AIOps platform have several options for detecting drift on incoming data. When drift is detected, the platform flags the inference in red, as shown above.</em></figcaption></figure><p>If the user observes enough OOD data, they need to take action.</p><p>The first step, as discussed above, is curating improved, more current data. Striveworks users can easily load new datasets or choose a registered dataset from their existing catalog.</p><p>However, the most useful data for retraining is already stored on the platform. The Striveworks inference store captures every output generated by your production models. When you receive an alert that your production data is OOD relative to a model’s training data, you can explore this inference store to confirm or rule out errors in prediction. If your data has indeed drifted away from your model’s training set, you fortunately have a ready-made collection of specific, current data on which to retrain your machine learning model. You just need to annotate it.</p><p>(The Striveworks AIOps platform also provides a utility to assist with these annotations.)</p><p>Using a pre-trained model to assist with annotations has many benefits. For current data where the model is already good, a human only needs to verify that the annotation is correct; usually, this is very fast. This process allows human annotators to focus their attention on data where the model is failing, concentrating effort where it is most needed. The platform also allows models to be trained as annotations occur, improving annotation hints over time. This feature is very effective for fine-tuning the model.</p><p>After enough current data has been annotated, it is a simple matter to retrain the model. Striveworks users can get a head start by using our training wizard to fine-tune models in their catalog.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*zkBND7bBVAjkA6_yq5he9w.png" /><figcaption><em>Figure 4: To retrain or fine-tune a model, click the wrench icon on the model overview page. The wizard will walk the user through the remaining steps to begin retraining.</em></figcaption></figure><h3>Evaluating Retrained Models</h3><p>Once you have retrained your model, it should be ready to redeploy into production. But not every retraining run is created equal. Certain training data or hyperparameters can produce more effective models for your use case than others. So, before you spin up another inference server, the best practice is to conduct an evaluation of your retrained model. Evaluations let you test whether or not the retraining dataset and settings were truly appropriate to address your issue in production. They also provide quantifiable metrics for performance by checking whether or not a dataset falls within distribution for your retrained model.</p><p>Depending on their training, models may not generalize well to fine differences and, therefore, may exhibit performance bias: performing better on one subset of data than another. Pre-deployment evaluations expose instances of performance bias before models return to production, letting you further tweak your models to ensure that they produce trustworthy results.</p><p>Striveworks users can evaluate their models using the platform’s built-in evaluation service. (<a href="https://github.com/Striveworks/valor">Check out our open-source evaluation service, Valor, on GitHub</a>.) Compare metadata and evaluation metrics for models trained on the same dataset and for a single model across datasets. The goal is to better understand expected model performance across a full dataset, plus changes in performance based on fine differences in data segments.</p>
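<p>In rough code, such a comparison might look like the sketch below, using the open-source Valor client’s basic workflow; the dataset and model names are hypothetical, and the elided arguments stand in for your data:</p><pre>from valor import Annotation, Dataset, Datum, GroundTruth, Model, Prediction<br><br># Register the curated, current dataset and its human annotations.<br>dataset = Dataset.create(&quot;curated-drift-set&quot;)<br>dataset.add_groundtruth(GroundTruth(datum=Datum(...), annotations=[Annotation(...), ...]))<br><br># Register predictions from the original and retrained models.<br>original = Model.create(&quot;classifier-v1&quot;)<br>retrained = Model.create(&quot;classifier-v2&quot;)<br>original.add_prediction(dataset, Prediction(datum=Datum(...), annotations=[Annotation(...), ...]))<br>retrained.add_prediction(dataset, Prediction(datum=Datum(...), annotations=[Annotation(...), ...]))<br><br># Evaluate both models on the same data and compare the metrics.<br>original.evaluate_classification(dataset)<br>retrained.evaluate_classification(dataset)</pre>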
<h3>Preparing for Next Time</h3><p>Of course, the unfortunate truth is that even trustworthy models only last so long. Redeploying your model is as simple as spinning down your old inference server and spinning up a new one, but as soon as you do, the clock starts ticking until your model needs retraining again.</p><p>Consider how you want to manage these repetitive cycles. If you have few models or little urgency (or an abundance of time), you can likely remediate by hand. But for managing multiple models, especially those in real-time operations, consider an AIOps platform designed for your model and data types.</p><p>Look for a platform with established infrastructure and standardized processes for managing monitoring, datasets, inferences, annotation, evaluation, retraining, and so on. Activity, versioning, and data lineage should all be centralized, making it easy to execute the tasks associated with model remediation in a consistent way at scale, instead of searching through Slack or Confluence for information or rebuilding infrastructure from scratch each time you need to update a model.</p><h3>The Striveworks Platform Is a Workstation for Retraining Machine Learning Models</h3><p>It’s important to remember that models perform best on data that is most similar to their training data. But a static dataset is a snapshot in time, and, eventually, current production data will not look like the model’s original training data because the world is constantly changing. When this happens, the model will likely perform poorly. At a minimum, it will no longer be trustworthy.</p><p>It is essential to remove bad models from production as quickly as possible when they are no longer trustworthy, before they lead to bad decisions or outcomes. The Striveworks AIOps platform provides tooling to quickly recognize a shift in data and the accompanying loss of trust in a model, making it as easy as possible to retrain or update that machine learning model and get it back into production. Automated drift detection monitors production inferences for OOD data to alert users to the need to retrain. The inference store saves all model outputs, creating a highly appropriate, turnkey dataset for fine-tuning. Model-assisted annotation pipelines speed labeling along, and a persistent model catalog makes it easy to check out or reshelve models as needed. The Valor evaluation service ensures you deploy retrained models that are effective for your production data.</p><p>Ultimately, the platform serves as a complete workstation for model remediation, keeping models in production and generating value longer.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=31264d285283" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Performance Bias With the Valor Model Evaluation Service]]></title>
            <link>https://medium.com/@striveworks/understanding-performance-bias-with-the-valor-model-evaluation-service-90dd3585d68c?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/90dd3585d68c</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Fri, 22 Aug 2025 20:05:25 GMT</pubDate>
            <atom:updated>2025-08-22T20:05:25.285Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qwI9ReAE6KOE0QPZafg-GQ.jpeg" /></figure><p>Machine learning benchmarks like <a href="https://paperswithcode.com/sota/image-classification-on-imagenet">ImageNet</a>, <a href="https://paperswithcode.com/sota/object-detection-on-coco">COCO</a>, and <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">LLM Leaderboard</a> usually target a single metric, such as accuracy for classification tasks or mean average precision for object detection. But for real-world problems, using a single metric to judge performance is usually not a good idea, and it can even be misleading. Consider a fraud detection model: If 0.1% of transactions are fraudulent, then a machine learning model that predicts that every transaction is not fraudulent will be 99.9% accurate, but it is actually completely useless.</p>
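<p>This accuracy paradox is easy to reproduce. The following minimal sketch, using scikit-learn for the metrics, shows the always-predict-not-fraudulent model scoring 99.9% accuracy while catching zero fraud:</p><pre>from sklearn.metrics import accuracy_score, recall_score<br><br>y_true = [1] + [0] * 999   # 1 fraudulent transaction in 1,000<br>y_pred = [0] * 1000        # model predicts &quot;not fraudulent&quot; every time<br><br>print(accuracy_score(y_true, y_pred))  # 0.999<br>print(recall_score(y_true, y_pred))    # 0.0: no fraud is ever caught</pre>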
<p>Even considering a host of other metrics, such as class-wise precision and recall, confusion matrices, or receiver operating characteristic (ROC) curves, will not give a complete picture. The crucial thing lacking from these metrics is an understanding of performance bias: when a model performs worse on a particular segment of the data than on the whole. The history of machine learning has plenty of examples of performance bias, including many newsworthy ones. There are many instances of AI models being biased against people of color, such as <a href="https://www.scientificamerican.com/article/racial-bias-found-in-a-major-health-care-risk-algorithm/">healthcare models</a>, <a href="https://news.berkeley.edu/2018/11/13/mortgage-algorithms-perpetuate-racial-bias-in-lending-study-finds">lending models</a>, and <a href="https://www.scientificamerican.com/article/police-facial-recognition-technology-cant-tell-black-people-apart/">facial recognition</a> <a href="https://sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/">models</a>. Some LLMs have been shown to have a <a href="https://news.vt.edu/articles/2023/12/cnre-geographic-biases-ai.html">geographic bias</a>.</p><p>Striveworks now has an open-source tool, <a href="https://github.com/Striveworks/valor">Valor</a>, for understanding these different types of biases. This model evaluation service exposes performance bias by letting users define subsets of data through filters on the data and on any arbitrary metadata attached to Valor objects. It has first-class support for:</p><ul><li>Simple data types (numeric data, strings, booleans)</li><li>Dates and times</li><li>Geospatial data (via GeoJSON)</li><li>Geometric data</li></ul><p>Below, we explore how machine learning teams can use Valor to gauge these sorts of model biases.</p><h3>What Is Valor?</h3><p><a href="https://github.com/striveworks/valor">Valor</a> is an open-source model evaluation service created to assist machine learning practitioners and teams in understanding and comparing model performance. It’s designed to fit into a modern AIOps tech stack; in particular, Valor is the model evaluation service for <a href="http://www.striveworks.com">the Striveworks end-to-end AIOps platform</a>.</p><p>Valor does the following:</p><ul><li>Computes various metrics for different task types, including classification (for arbitrary data modalities), object detection, and semantic segmentation</li><li>Stores the metrics centrally for discoverability, shareability, and queryability</li><li>Supports defining data subsets using metadata to enable analyses, such as bias detection</li><li>Maintains model lineage so that metrics can be trusted, allowing users to see exactly what went into the metrics and how they were computed</li></ul><p>Valor runs as a back-end service that users interact with via a Python client. For detailed information on setting up and using Valor, see the official <a href="https://striveworks.github.io/valor/">documentation</a>.</p><h3>How Do I Use Valor to Understand Model Performance Bias?</h3><p>Valor identifies model performance bias through its robust metadata and attribute filtering.</p><h3>Metadata and Evaluation Filtering</h3><p>To represent datasets, models, predictions, and ground truth data, the Valor Python client has the following fundamental classes:</p><ul><li><em>valor.Dataset</em>: Represents a dataset</li><li><em>valor.Datum</em>: Represents a single element in a dataset, such as an image in a computer vision dataset, a row in a tabular dataset, or a chunk of text in a natural language processing dataset</li><li><em>valor.Model</em>: Represents a predictive model</li><li><em>valor.GroundTruth</em>: Represents a ground truth, linking an annotation with a dataset</li><li><em>valor.Prediction</em>: Represents a prediction, linking an annotation with a dataset and model</li><li><em>valor.Annotation</em>: Used to store ground truth and prediction class labels, bounding boxes, etc.</li></ul><p>Using Valor, the basic workflow is as follows:</p><ol><li>Create a <em>valor.Dataset</em>, which we will call dataset in the examples below.</li><li>Add <em>valor.GroundTruth</em> objects to it.</li><li>Create a <em>valor.Model</em>, which we will call model in the examples below.</li><li>Add <em>valor.Prediction</em> objects to it.</li><li>Call one of the task-specific evaluation methods on the model, such as <em>evaluate_classification</em> or <em>evaluate_detection</em>. In rough code, this looks something like:</li></ol><pre>from valor import Annotation, Dataset, Datum, GroundTruth, Model, Prediction<br><br>dataset = Dataset.create(&quot;dataset name&quot;)<br>dataset.add_groundtruth(GroundTruth(datum=Datum(...), annotations=[Annotation(...), ...]))<br><br>model = Model.create(&quot;model name&quot;)<br>model.add_prediction(dataset, Prediction(datum=Datum(...), annotations=[Annotation(...), ...]))<br><br>model.evaluate_classification(dataset)</pre><p>One of the powers of Valor is that it allows all of the above objects to have arbitrary metadata associated with them. Users can filter on metadata and attributes (such as class label or bounding-box size) to define subsets of data and then use those subsets for evaluation. This provides a means for quantifying model performance on different segments of the data.</p><p>Based on these metadata and attributes, Valor users can pass different types of filtering to evaluations.</p><h4>Date and Time Filtering</h4><p>Dates and times can be added as metadata, using Python’s datetime library.
For example:</p><pre>from datetime import datetime, time<br>from valor import Datum<br><br>Datum(<br>    uid=&lt;UID&gt;,<br>    metadata={&quot;date&quot;: datetime(year=2024, month=2, day=12), &quot;time&quot;: time(hour=17, minute=49, second=25)}<br>)</pre><p>Then, if we want to evaluate the performance of an object detection model on images taken during the day, we would do something like:</p><pre>model.evaluate_detection(<br>    datasets=dataset,<br>    filter_by=[Datum.metadata[&quot;time&quot;] &gt;= time(hour=8), Datum.metadata[&quot;time&quot;] &lt;= time(hour=17)]<br>)</pre><p>Or, to know how a classification model performs on data from the year 2023 onward, we would do:</p><pre>model.evaluate_classification(<br>    datasets=dataset,<br>    filter_by=[Datum.metadata[&quot;date&quot;] &gt;= datetime(year=2023, month=1, day=1)]<br>)</pre><h4>Simple Data Type Filtering</h4><p>The standard data types (int, float, str, boolean) and filters over them are all supported in Valor as metadata values.</p><p>For example, demographic information may be attached as:</p><pre>Datum(uid=&lt;UID&gt;, metadata={&quot;sex&quot;: &quot;Female&quot;, &quot;age&quot;: 62, &quot;race&quot;: &quot;Pacific Islander&quot;, &quot;hispanic_origin&quot;: False})</pre><p>Then, to evaluate how a model performs on all female- and Hispanic-identifying people under the age of 50:</p><pre>model.evaluate_classification(<br>    datasets=dataset,<br>    filter_by=[<br>        Datum.metadata[&quot;sex&quot;] == &quot;Female&quot;,<br>        Datum.metadata[&quot;age&quot;] &lt; 50,<br>        Datum.metadata[&quot;hispanic_origin&quot;] == True,<br>    ]<br>)</pre><p>Metadata can be attached to objects besides datums. For example, suppose we’re evaluating an object detection model for a self-driving vehicle, and we want to know how well the model performs on pedestrians in the road versus not in the road. In this case, we can attach a boolean metadata field to every person-bounding-box annotation and use it to filter the object detection evaluation:</p><pre>dataset.add_groundtruth(<br>    GroundTruth(<br>        datum=Datum(...),<br>        annotations=[<br>            Annotation(<br>                task_type=TaskType.OBJECT_DETECTION,<br>                bounding_box=person_bbox,<br>                labels=[Label(key=&quot;class&quot;, value=&quot;person&quot;)],<br>                metadata={&quot;in_road&quot;: True},<br>            )<br>        ],<br>    )<br>)<br><br>model.evaluate_detection(dataset, filter_by=[Annotation.metadata[&quot;in_road&quot;] == True])</pre><p>We explore this particular example in end-to-end detail in one of our <a href="https://github.com/Striveworks/valor/blob/main/examples/pedestrian_detection.ipynb">sample notebooks</a>.</p><h4>Filtering on Geospatial Metadata</h4><p>Valor supports <a href="https://geojson.org/">GeoJSON</a> dicts as metadata, which can then be filtered by geometric operations, such as checking if a point is inside a region or if two regions intersect. For example, suppose every piece of data has a location of collection. We can add this as metadata to the datum:</p><pre>Datum(uid=&lt;UID&gt;, metadata={&quot;location&quot;: {&quot;type&quot;: &quot;Point&quot;, &quot;coordinates&quot;: [-97.7431, 30.2672]}})</pre><p>Now, if we want to see how a model performs on data that was collected from a certain city, we can do the following (where city_geojson is a GeoJSON dict specifying the city):</p><pre>model.evaluate_classification(<br>    datasets=dataset,<br>    filter_by=[Datum.metadata[&quot;location&quot;].inside(city_geojson)]<br>)</pre>
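<p>For concreteness, city_geojson might look like the following: a GeoJSON Polygon whose ring closes on its first point (the coordinates here are illustrative, not a real city boundary):</p><pre>city_geojson = {<br>    &quot;type&quot;: &quot;Polygon&quot;,<br>    &quot;coordinates&quot;: [[<br>        [-97.94, 30.13], [-97.56, 30.13],<br>        [-97.56, 30.52], [-97.94, 30.52],<br>        [-97.94, 30.13],  # the ring closes on the first point<br>    ]],<br>}</pre>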
<h4>Filtering on Geometric Metadata</h4><p>Finally, for geometric tasks (such as object detection and segmentation), we can filter regions by geometric properties (such as area). For example, to evaluate an object detection model on bounding boxes with an area of less than 100,000 square pixels, we can use:</p><pre>model.evaluate_detection(<br>    datasets=dataset,<br>    filter_by=[Annotation.bounding_box.area &lt; 100000]<br>)</pre><h3>A Tool for Understanding Model Performance in the Real World</h3><p>Valor is a game changer when it comes to understanding model performance bias. By filtering model evaluations based on metadata and attributes, machine learning practitioners gain a world of insight into how their models perform on datasets and, crucially, on different segments within a single dataset. Most importantly, this information is essential to understanding model performance in the real world.</p><p>We encourage you to experiment with Valor and let us know how you use it to evaluate your ML models. Check out the Valor <a href="https://github.com/Striveworks/valor">GitHub repository</a> to start using it in your machine learning workflows today, and read Valor’s <a href="https://striveworks.github.io/valor/">official documentation</a> to learn more.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=90dd3585d68c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Text Classification With LLMs: A Roundup of the Best Methods]]></title>
            <link>https://medium.com/@striveworks/text-classification-with-llms-a-roundup-of-the-best-methods-bd170c3e4854?source=rss-ed8c8902bd67------2</link>
            <guid isPermaLink="false">https://medium.com/p/bd170c3e4854</guid>
            <dc:creator><![CDATA[Striveworks]]></dc:creator>
            <pubDate>Thu, 21 Aug 2025 17:06:40 GMT</pubDate>
            <atom:updated>2025-08-21T17:06:40.983Z</atom:updated>
            <content:encoded><![CDATA[<p><em>This blog post is part of a series on using large language models for text classification. By Benjamin Nativi, Will Porteous, and Linnea Wolniewicz</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fc5tUEnDCYyQPIVx0cTuYw.png" /></figure><p>The first two posts in our series on text classification using large language models (LLMs) each focused on a specific approach to working with these deep learning tools: supervised learning and unsupervised learning, respectively.</p><p>Our experiments have shown that both approaches can use LLMs to classify text successfully, provided certain criteria are met.</p><p>But after all the experiments conducted by the Striveworks research and development team, how do the two approaches compare? Which method has proven most effective? How can you best use LLMs to classify text? Let’s find out.</p><h3>Why Use LLMs for Text Classification?</h3><p>LLMs are such a step change in how machine learning models handle natural language that they have sparked a revolution in artificial intelligence applications. Millions of people now use LLMs every day to accomplish tasks like building a travel itinerary, writing an email to a colleague, and preparing a cover letter for a job application.</p><p>Because LLMs are so adept at certain natural language processing (NLP) functions, we hypothesized that they might be effective at others. Text classification, used widely for spam detection, sentiment analysis, and other purposes, is one of the most fundamental applications of NLP. LLMs were not built for text classification, but their strength in NLP may enable them to perform the task at a high level. If so, they could spawn a new set of applications for this technology. Our experiments set out to put LLMs to the test.</p><h3>What’s the Difference Between Supervised and Unsupervised Learning?</h3><p>When looking to use LLMs for text classification, a range of options presents itself. What text should you use for training? Which models? How many data points do you need?</p><p>But the most fundamental choice is between supervised learning methods and unsupervised learning methods.</p><p>These two approaches make up the two most basic branches of machine learning. To put it simply, supervised learning involves training a model with labeled data. To perform supervised learning, a human needs to take time to annotate a training dataset and feed it into a model. As a result, the model learns the associations between the data points and their labels, which helps it accurately predict labels for similar data it has never seen before.</p><p>Unsupervised learning is different. In our case, we were attempting to coax accurate text classifications from a model that had not been trained for this purpose. Instead, the model leveraged the underlying patterns in our prompts to predict labels from the text alone.</p><p>Both of these approaches have wide applications, from recommendation engines to face recognition to high-frequency trading. But how do they work with LLMs for text classification?</p><h3>Supervised Learning for Text Classification With LLMs: A Review</h3><p>In the first post in this series, we looked at our experiments with supervised learning.
Specifically, we discussed how three distinct supervised learning approaches worked for text classification:</p><ul><li>fine-tuning an existing model derived from <a href="https://huggingface.co/blog/bert-101">BERT</a> (a framework with a strong history in NLP)</li><li>fine-tuning an existing model using <a href="https://huggingface.co/docs/diffusers/en/training/lora">Low-Rank Adaptation (LoRA)</a> (see the sketch after this list)</li><li>applying transfer learning to a classification head, which leveraged a trained model from another use case</li></ul>
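<p>To make the LoRA option concrete, here is a minimal setup sketch using the Hugging Face peft library. The base checkpoint, rank, and target modules below are illustrative assumptions, not the exact configuration from our experiments:</p><pre>from transformers import AutoModelForSequenceClassification<br>from peft import LoraConfig, TaskType, get_peft_model<br><br>base = AutoModelForSequenceClassification.from_pretrained(<br>    &quot;distilroberta-base&quot;,  # a BERT-family base model<br>    num_labels=6,           # TREC has six coarse classes<br>)<br>config = LoraConfig(<br>    task_type=TaskType.SEQ_CLS,<br>    r=8,                    # rank of the low-rank update matrices<br>    lora_alpha=16,<br>    lora_dropout=0.1,<br>    target_modules=[&quot;query&quot;, &quot;value&quot;],  # attention projections to adapt<br>)<br>model = get_peft_model(base, config)<br>model.print_trainable_parameters()  # only the small adapter weights train</pre>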
<h3>Supervised Learning Results</h3><p>For our experiments, we tested all three methods of text classification against the <a href="https://trec.nist.gov/data.html">Text REtrieval Conference (TREC)</a> Question Classification dataset, a benchmark widely used for training, testing, and evaluating text classification models.</p><p>As expected, all three of these methods achieved high F1 scores, showing that they can produce accurate results. Yet the results were highly dependent on the amount of training data used.</p><p>The fine-tuned LoRA model consistently performed best across all training sizes, returning an F1 score of 0.8 or above regardless of the number of data points it was trained on. However, the pre-LLM DistilRoBERTa model displayed poor performance with less training data, returning F1 scores below 0.7 until it had ingested 512 examples from the TREC training set. The transfer-learned classification head split the difference, starting with better results than DistilRoBERTa, even with only 16 training datums, and scaling to an F1 score greater than 0.8 by the time it had seen 256 data points.</p><p>But F1 score isn’t the only metric to consider. While both the LoRA model and the classification head produced strong results, the LoRA model took much longer to train. Because of the design of the model, fine-tuning a LoRA model on 512 data points took 50 minutes on a single GPU. Conversely, we were able to train the transfer-learned classification head on the same data in only 2.56 minutes. This speed and reduced compute give an edge to transfer-learned classification heads for text classification, unless you have access to a lot of computing power.</p><h3>Unsupervised Learning for Text Classification With LLMs: A Review</h3><p>Unsupervised learning offers an alternative to training LLMs for text classification, one with real advantages if effective. While fine-tuned LLMs can produce accurate results, there are challenges to using them, specifically the need to work from an annotated dataset. Data annotation demands a lot of effort from humans, who must manually label hundreds or thousands of data points. If unsupervised learning with very little labeled data could produce good results from an LLM through prompt engineering, it would reduce the resources needed to deploy these technologies.</p><p>Our experiments in unsupervised learning focused on constructing prompts for LLMs to classify articles from the <a href="https://huggingface.co/datasets/ag_news">AG News dataset</a>, a common dataset of news articles used for training and evaluating NLP models.</p><p>For these experiments, we crafted a standard instruction to use for all trial runs:</p><p><em>“Classify the news article as world, sports, business, or technology news.”</em></p><p>As part of our prompts, we also provided our test models with the headline of the article to classify, as well as a class field to elicit an appropriate response. For example:</p><p><em>“Article: Striveworks secures patent for innovative data lineage process</em></p><p><em>Class:”</em></p><p>For each article in the AG News test set, the prompt included the instruction, the article to be classified, and a cue for the model’s answer. We also conducted several experiments that involved <em>k</em>-shot prompting, where we included examples (aka “shots”) as part of the prompt, ranging from batches of four shots to 23 shots. Additional experiments aimed to determine whether or not different sampling strategies would increase performance without increasing computation. We also explored the effects of ensembling, a process that aggregates the results from multiple models or multiple runs of the same model. (A sketch of this prompt construction and ensembling appears after the model list below.) We tested several LLMs, including:</p><ul><li><a href="https://platform.openai.com/docs/models"><strong>GPT-3.5</strong></a></li><li><a href="https://huggingface.co/huggyllama/llama-7b"><strong>Llama-7b</strong></a></li><li><a href="https://huggingface.co/meta-llama/Llama-2-7b"><strong>Llama-2-7b</strong></a></li><li><a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"><strong>Alpaca</strong></a></li><li><a href="https://gpt4all.io/index.html"><strong>GPT4All</strong></a></li></ul>
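<p>Here is a minimal sketch of that prompt construction and majority-vote ensembling. The classify_fn stub is a hypothetical stand-in for a real LLM call:</p><pre>from collections import Counter<br><br>INSTRUCTION = (&quot;Classify the news article as world, sports, &quot;<br>               &quot;business, or technology news.&quot;)<br><br>def build_prompt(headline, shots):<br>    # shots: (headline, label) pairs used as in-context examples<br>    parts = [INSTRUCTION, &quot;&quot;]<br>    for shot_headline, label in shots:<br>        parts += [&quot;Article: &quot; + shot_headline, &quot;Class: &quot; + label, &quot;&quot;]<br>    parts += [&quot;Article: &quot; + headline, &quot;Class:&quot;]<br>    return &quot;\n&quot;.join(parts)<br><br>def ensemble_classify(headline, shot_sets, classify_fn):<br>    # classify_fn stands in for a call to the LLM; it takes a prompt<br>    # and returns one of the four class names. Majority vote decides.<br>    votes = [classify_fn(build_prompt(headline, shots)) for shots in shot_sets]<br>    return Counter(votes).most_common(1)[0][0]</pre>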
<h3>Unsupervised Learning Results</h3><p>Results from these experiments showed that simple, zero-shot prompts were unable to generate good text classification results from smaller LLMs such as Llama-7b, Llama-2-7b, Alpaca, and GPT4All. Zero-shot prompting did produce an F1 score above 0.8 with GPT-3.5, but this model has limitations as a proprietary model from <a href="https://openai.com/">OpenAI</a>.</p><p>Incorporating shots into our prompts raised the F1 score when using the smaller, open-source models. Sixteen shots were necessary to initially return an F1 score above 0.8 from these models, and 23 shots were required to maintain that score reliably across runs. The ensembling experiments showed that ensembling raised F1 scores and reduced variation in performance, but they also showed that smaller ensembles with high shot counts outperformed large ensembles with low shot counts.</p><h3>How To Use LLMs for Text Classification</h3><p>So, what is the most effective means of classifying text using LLMs? It depends.</p><p>Our experiments showed that various methods of working with LLMs for text classification can deliver worthwhile results. As shown in Figure 1, four different unsupervised methods are capable of producing F1 scores that rival the accuracy of the transfer-learned classification head.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*wag2cDYQ0pVu8_EJrKoagg.png" /><figcaption><em>Figure 1: Results of supervised and unsupervised learning experiments in using LLMs for text classification.</em></figcaption></figure><h3>Can I just use ChatGPT to classify text?</h3><p>If model deployment isn’t a concern, users can classify text using GPT-3.5. However, it was the worst-performing model in our tests.</p><h3>What text classification method should I use if I want better results?</h3><p>If data teams want stronger results, they need to deploy and maintain an LLM over which they have control, such as Llama. In our experiments, Llama proved more effective at text classification because we were able to interact directly with the machinery of the open-source model, allowing us to examine its logits, which we could not do with the closed, proprietary GPT-3.5.</p><h3>Do I need to include shots in my prompts to get accurate text classifications?</h3><p>Yes: Shots improve performance. When working with Llama and other LLMs, our findings show that users can produce accurate text classifications by including a large number of shots in their prompts: as many as possible given the available context size (for Llama, that’s approximately 23 shots per prompt).</p><h3>Do I need to ensemble several runs to get good results?</h3><p>Ensembling is very helpful, but not at the expense of including shots in the prompt. As shown in Figure 1, ensembles of four runs with 23 shots each consistently outperform ensembles of 16 runs with eight shots each, suggesting that the former is the most effective strategy for working with these LLMs.</p><p>Of course, a classification head trained on a full 120,000-point dataset produces the most accurate text classifications. However, this approach requires significant work up front to annotate a dataset. Ensembling four runs of 23-shot prompts generates an F1 score almost as good with only 96 labeled data points.</p><h3>What text classification method should I use if I have labeled data?</h3><p>Ultimately, if users have access to labeled data or are willing to label more data to improve performance, supervised learning methods are the way to go. A fine-tuned LoRA model showed exceptional performance, and a classification head performed almost as well with much less strain on computational resources. However, even users without a lot of labeled data are in luck. <em>K</em>-shot prompting with ensembling, especially ensembles that include as many shots as possible, can also produce quite accurate results.</p><p>Even though LLMs are designed for generative purposes rather than text classification tasks, they are a robust new technology for broad NLP projects. They just require a commitment to experimentation, and a lot of shots.</p><p><em>Want to know more about using LLMs? </em><a href="http://www.striveworks.com"><em>Reach out today</em></a><em> to learn how Striveworks can help you build, deploy, and maintain LLMs in both cloud and on-prem environments.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bd170c3e4854" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>