<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Anomify on Medium]]></title>
        <description><![CDATA[Stories by Anomify on Medium]]></description>
        <link>https://medium.com/@anomify?source=rss-62324f9ab80f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*3ie49V0SgcNTE949NXVmKw.png</url>
            <title>Stories by Anomify on Medium</title>
            <link>https://medium.com/@anomify?source=rss-62324f9ab80f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 12 Apr 2026 21:35:44 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@anomify/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How Monitorama changed our lives — a decade on
by Gary Wilson (@earthgecko)]]></title>
            <link>https://medium.com/@anomify/how-monitorama-changed-our-lives-a-decade-on-by-gary-wilson-earthgecko-91d9a5f3f906?source=rss-62324f9ab80f------2</link>
            <guid isPermaLink="false">https://medium.com/p/91d9a5f3f906</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[anomaly-detection]]></category>
            <dc:creator><![CDATA[Anomify]]></dc:creator>
            <pubDate>Wed, 28 Jun 2023 19:11:09 GMT</pubDate>
            <atom:updated>2023-06-28T19:11:09.677Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/498/1*fM2k6pdSfAt2PabwMEvYgg.png" /><figcaption>Monitorama 2013 logo</figcaption></figure><p><strong>How Monitorama changed our lives — a decade on<br></strong>by <a href="https://medium.com/u/1d560099dac3">Gary Wilson</a> (@earthgecko)</p><p>I virtually attended Monitorama in Boston over Easter in 2013. It was an event that was to change my life and set me on a path on which, a decade later, I am still walking.</p><p>For an infrastructure engineer scaling a rapidly growing video adtech platform, Monitorama 2013 was chock full of awesome things. By the end of the weekend I had deployed Graphite, statsd, sensu and riemann.</p><p>20130404 — Class #9489: graphite<br>20130406 — Class #9515: statsd<br>20130406 — Class #9519: sensu<br>20130407 — Feature #9529: sensu_graphite (Extended the sensu-puppet class to include amqp support for transporting at scale to graphite. Pattern.)<br>20130407 — Feature #9531: real_time_graphite (#monitorama #monitorallthethings <a href="http://twitter.com/1second_resolution">@1second_resolution</a>)<br>20130413 — Class #9631: riemann</p><p>It was a crazy busy month and it started out with Monitorama, at a time when we needed to start seriously scaling. We had our own data centre racks, but already in Feb of 2011 we had started running our new video adtech services on the clouds: AWS, Rackspace and Linode to start. All fully automated: node creation was via libcloud, and node manifest commits to Github triggered node creation, bootstrapping and then puppeting into existence, flawlessly most of the time. puppet-provisioner was nascent at that stage, only dealing with AWS, so with libcloud we tamed the clouds.</p><p>It was a heady ride and a rush, 16 hour days, machines all over the place, scale, scale, scale.</p><p>April 2013 saw us cross a rubicon where our data centre monitoring setup was no longer fit for purpose; we were struggling to keep up with adding cloud nodes and things to cacti, collectd, awstats, nagios and our pfsense firewalls, with some of the data centre being automated (or partially automated) with puppet and some not.</p><p>Even though in April we were scaling madly, adding a 4th large DB node to our multi region cloud DB cluster, adding 3 READ slave DBs in 3 regions to spread the load, and adding web servers 11 thru 18 to the web nodes, we also took on adding a modern monitoring layer. Through April we also added Graphite, statsd, sensu, riemann, rsyslog, logstash, elasticsearch and kibana to our stack, along with 11 additional nodes to service these. That brought a whole slew of new problems to deal with, like “elasticsearch — java heap space — OutOfMemoryError”, etc, etc.</p><p>But all in all, just like magic, with some puppet runs, we had telemetry on everything, #allthethings — stats.varnish, stats.memcache, stats.apache, stats.lighttpd, stats.redis, stats.mysql, stats.haproxy, etc.</p><p>We had time series data. We even quickly turned elasticsearch queries and riemann data into time series data.</p><p>We made dashboards.</p><p>We made a number of scripts to alert on various metrics from Graphite.</p><p>That year flew by, more scaling, more things. It was only in mid October that I got to the last component of my Monitorama stack wishlist, which was Etsy’s Skyline, which had hit my radar during our implementation journey.
I guess that was fitting, seeing as we needed all the data first to feed Skyline with.</p><p>After seeing Abe Stanway’s slidedeck from September 2013 in Berlin, my interest peaked, and by October we realised that collecting all this data and having telemetry on everything is awesome (by now dev had added statsd to all our web apps), but what do you do with it all? How do you know, in the 1000s upon 1000s of metrics, what is performing badly or unusually? Things still did bad things and we could see it in the graphs AFTER the fact, that is if you could even find it or knew what to look at or for…</p><p>I deployed a Skyline node. It took a while to get to grips with Skyline and, as David Gildeh once said, “I still remember taking Skyline and applying it to one of our customer’s metrics, and turning 100,000 metrics into 10,000 anomalies. It just created more noise from the noise.”</p><p>We found the same and so did Etsy, ultimately finding one size did not fit all.</p><p>However, the potential of Skyline was not lost on me: it was the ONLY thing in the stack that was <strong>ACTIVELY</strong> monitoring everything! There was noise, but there were signals too!</p><p>In a multi-cloud, multi-region architecture with 100s of different dimensions in ad metrics, 100s of publishers, 100s of campaigns, realtime bidding, multiple 3rd party services and all spread over 13 data centres globally… not to mention the 1000s of server and application metrics, for a 2 person ops team, finding signals was gold!</p><p>The noise, what to do about the noise?</p><p>By December, having run Skyline for 2 months and assessed its performance, I had come to the simple conclusion that 24 hours of data was not enough. Skyline made me look at all the things, things I had no idea what they were, meaningless metrics I had no understanding of. But whenever I looked, I looked back 7 days, 30 days, “Is this unusual? If so, why?” (more often than not also thinking “wtf is this? What does it represent?”).</p><p>My behaviour over the 2 month period was:</p><p>Skyline alert -&gt; open Graphite and look at 7 days of data -&gt; does it look normal or anomalous?</p><p>The result was that a lot of the time, if not most of the time, at 7 days the metric did not appear to be unusual or abnormal; over 7 days that spike/peak/trough occurs quite frequently, it just does not occur frequently in a 24 hour period. So, not anomalous. But damn, I learnt a lot about our things and their metrics!</p><p>If you have to do something often, automate it, and in a time of automation…</p><p>Even though I had zero Python knowledge, other than some very simple libcloud scripts I had cobbled together, I thought the proper way to automate that was to make Skyline do exactly that: instead of alerting, grab 7 days of data, analyse that, and if the behaviour is still anomalous, then alert.</p><p>skyline/mirage</p><p>Skyline somewhat tamed. Those 10,000 anomalies became 100s of anomalies, much less noise, more signal. We pruned back the volume of metrics we were sending, made sure all our critical revenue-affecting metrics were pushed through, made more of those, and Skyline stayed. It was not perfect, but it was definitely much more than having nothing!</p><p>On to the next problem. “You cannot change statistics” (or algorithms): they will find what they find (or they will drift). This is quite a difficult problem and probably one of the key problems in anomaly detection. What the algorithms decide are anomalies are not necessarily what you deem to be anomalous.</p>
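<p>As a toy sketch (plain numpy, not Skyline’s actual algorithm suite, just an illustration of the point), even two of the simplest detectors will often flag different points on the same series:</p><pre>
# Two very simple detectors applied to the same series: a 3-sigma rule
# and a MAD-based rule. Illustrative only, not Skyline code.
import numpy as np

rng = np.random.default_rng(42)
series = np.concatenate([rng.normal(100, 5, 1440), [118.0, 160.0, 35.0]])

def three_sigma_anomalies(ts):
    # flag points more than 3 standard deviations from the mean
    mean, std = ts.mean(), ts.std()
    return set(np.where(np.abs(ts - mean) &gt; 3 * std)[0])

def mad_anomalies(ts, threshold=6.0):
    # flag points far from the median, measured in robust MAD units
    median = np.median(ts)
    mad = np.median(np.abs(ts - median)) + 1e-9
    return set(np.where(np.abs(ts - median) / mad &gt; threshold)[0])

print("3-sigma flags:", sorted(three_sigma_anomalies(series)))
print("MAD flags:    ", sorted(mad_anomalies(series)))
# On real metrics the two sets frequently differ, and neither necessarily
# matches what an operator would call anomalous.
</pre>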
<p>Change the algorithms? The other algorithms may not agree on the same anomalies, but for certain you will often not agree with what the other algorithms class as anomalous either.</p><p>Solution — we need a human in the loop, saying “that is not anomalous”. How? If we cannot train statistical algorithms or blackbox ML algos, and we cannot provide “normal” training on EVERYTHING, how do we “teach” the system when these are the things it is using?</p><p>Solution — pattern matching (DIY algo), similarity searches (SOTA algo). These work! In the world of machine learning, achieving an error rate of ~0.34% on a method is :mind-blown:</p><p>So 10 years on and it is still not perfect. After literally spending a decade trying to make anomaly detection better, I can tell you it is no small feat. The interesting thing is that anomaly detection on time series data is one of the most difficult domains in machine learning and data analysis.</p><p>You can test and add more and more SOTA algorithms, but anomaly detection is not a problem that can be solved with a silver bullet; it is something that can be made better, inch by tedious inch, by investigating and testing in feet, yards or even miles at a time to gain but a single inch, but it is possible. VAMOOS — Visualise, Analyze, Modelize, Over and Over until Satisfied (courtesy of Dr Neil J. Gunther’s 2013 talk).</p><p>A decade on, and I believe that anomaly detection is, always was, and will be part of the roof that the pillars of observability are there to hold up.</p><p>We have spent a decade perfecting how we collect observability data; we have massively increased the observability data that we collect, increasing and improving the volume, I/O, storage, query and response times, etc, etc. We are now drowning in data, and what are we doing with it all?</p><p>A decade on from Monitorama 2013 and we still have monitoring problems.</p><p>An emerging feeling in the community seems to be that observability has a problem: we have binged on metrics (and logs and traces) because we can, because it became standard to instrument everything. We have got application metrics up the wazoo, every developer adding tons of metrics for debug, just in case, because they can or think they should, because of k8s, because of pods, because of churn. When was the last time you saw a metric labelled important or critical in the INFO metadata? Maybe the developers often do not know themselves.</p><p>Do we have so many metrics now that it is almost impossible for us to know exactly which ones are important, or may be important at times, in the sea of metrics we now collect?</p><p>But there are important metrics in the sea of metrics.</p><p>What are your important metrics?</p><p>We think in SLIs and SLOs, but how often do our SLOs incorporate internal app metrics? We have ring-fenced performance, and anything outside those fences is rarely contemplated, just some leaf metric in a forest of trees with 1000s of leaf metrics per tree. We scale to a million time series per second because we can, not because it is necessarily useful to do so. How much utility do we get per metric? Perhaps we should cost it that way? How many data points get collected, run through the retention period and get pruned without ever being looked at, queried or analysed? Odds are it is a very large percentage of those that are collected. How many devs, ops or SREs even know and/or understand what all the metrics are? Most of us do not.</p><p>Isn’t it time we started constantly putting that data to use?
Stop thinking about alerting and start thinking about a change event stream that you can look at and learn from. Anomaly detection is NOT about alerting; it is about information, change, insights and learning.</p><p>You cannot do useful real time anomaly detection on all your millions of metrics. Well, you could, but it would cost a fortune, and why would you want to? Figure out and decide what is important, at least, and put more of your data to work.</p><p>I had such high ambitions for this blog post. You can get so excited at times trying to share some pearls of wisdom with the world, imagining your words will change everything. That is rarely the case, and you realise that you cannot change the world with a blog post; it is impossible to convey the stoke you feel, the insight and knowledge you have of something that is abstract and hard to explain. “No one wants to hear about your anomaly detection”, which is a pity, because it can teach us all so much.</p><p>I should have done a Monitorama talk, because those can change the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=91d9a5f3f906" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Leaping up the reliability ladder — jumping from step 1 to 5 in one giant leap]]></title>
            <link>https://medium.com/@anomify/leaping-up-the-reliability-ladder-jumping-from-step-1-to-5-in-one-giant-leap-8409876a7b8d?source=rss-62324f9ab80f------2</link>
            <guid isPermaLink="false">https://medium.com/p/8409876a7b8d</guid>
            <category><![CDATA[anomaly-detection]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[reliability]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[time-series-analysis]]></category>
            <dc:creator><![CDATA[Anomify]]></dc:creator>
            <pubDate>Fri, 31 Mar 2023 15:42:22 GMT</pubDate>
            <atom:updated>2023-03-31T15:42:22.842Z</atom:updated>
<content:encoded><![CDATA[<h3>Leaping up the reliability ladder — jumping from step 1 to 5 in one giant leap</h3><p>In 2022 Steve McGhee and James Brookbank from Google published a roadmap for Reliability Engineering —<a href="https://r9y.dev"> https://r9y.dev/</a></p><p>The roadmap is a simple tech tree that can be implemented in various aspects to achieve the appropriate number of “nines” for your organisation.</p><figure><img alt="Five levels of reliability" src="https://cdn-images-1.medium.com/max/1024/1*LIdNxkxhVic2oMD75RnbkA.png" /></figure><p>The roadmap divides the maturity of a reliability engineering team into stages. Those organisations with the tooling and processes described in the lowest tier should be aiming for 90.0% uptime. Organisations with the tooling and processes described in the top tier should be aiming for 99.999% uptime.</p><p>The roadmap places anomaly detection in the highest and last stage, aligned with the observability tier, which would be deployed by a “99.999 Well engineered business” as part of an autonomic system (a system that provides self-healing and self-protection capabilities) which accepts an unreliability of 5.26 minutes/year.</p><p>That is a long road, but it does not have to be. Although anomaly detection will not get you three 9s on its own, it is a component that definitely helps you to get there. One of the reasons it probably sits in the last stage is that the authors are implying that it needs a lot of the other stages in place before you can implement it, such as telemetry collection, and automated host provisioning and configuration.</p><p>Another reason it sits in the last stage is that, besides the automated infrastructure and telemetry that are required, it also requires specialised and skilled staff who can implement and run anomaly detection, and it is unrealistic that smaller orgs lower down the ladder would have those types of employees at that part of their journey.</p><p>Luckily, to actually implement anomaly detection you really only need HALF of the first step on the observability tier: HOST METRICS. With host metrics alone you can start to do anomaly detection. It does help if the host metrics are sent by an automated process, but even if your infrastructure is not automated you can still configure your things manually to send the host metrics somewhere.</p><p>Ironically, it was host metrics that enabled Anomify to climb the reliability engineering ladder. We built our anomaly detection specifically to give us visibility and monitoring on 10s of 1000s of metrics from 430 hosts and their applications, spread across 13 data centers globally and serving up to 6.4 million ad requests per minute with realtime bidding.</p><p>We needed to develop a cutting edge internal anomaly detection platform in order to identify and understand changes in our globally distributed ad platform. With 4 different cloud providers and 100s of partners and customers who could all cause significant changes, either in error or intentionally (friendly fire), via launching exceptional campaign traffic, publishing an incorrect tab or the Hong Kong data center being network partitioned, anomaly detection was the only technology able to keep tabs on it all.
This was the only way to identify, pinpoint and understand vectors of change in a large, global and very dynamic platform, especially with a small ops team of two, and then one.</p><p>With Anomify you can jump directly from step 1 on the observability tier of the reliability engineering ladder to partially fulfilling step 5! Anomaly detection alone will not give you 3 9s, but it will give you information about changes in your things as if you were a 3 9s org. A virtual member of your SRE team. No SRE team? Well then, call it your own virtual SRE team member that keeps track of all the significant changes for you.</p><p>You do not need to be a mature, well engineered business to have and use anomaly detection; you just have to be on the road, at any stage of that journey.</p><p>Anomaly detection for everyone.<br><a href="https://anomify.ai">https://anomify.ai </a>— stay on top of your metrics</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8409876a7b8d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A comparison of unsupervised anomaly detection algorithms for time-series]]></title>
            <link>https://medium.com/@anomify/a-comparison-of-unsupervised-anomaly-detection-algorithms-for-time-series-9816953bc800?source=rss-62324f9ab80f------2</link>
            <guid isPermaLink="false">https://medium.com/p/9816953bc800</guid>
            <category><![CDATA[anomaly-detection]]></category>
            <category><![CDATA[time-series-analysis]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Anomify]]></dc:creator>
            <pubDate>Wed, 01 Mar 2023 17:35:57 GMT</pubDate>
            <atom:updated>2023-03-01T17:35:57.194Z</atom:updated>
<content:encoded><![CDATA[<p>Anomaly detection is currently a fairly hot topic in many areas, including SRE, finance, observability, platform engineering, IoT and social infrastructure; the list goes on. Today all the main cloud providers have added some kind of anomaly detection offering to capitalise on this trend. AWS have anomaly detection offerings in CloudWatch and SageMaker, etc. Google Cloud Platform are offering it via an ML.DETECT_ANOMALIES ARIMA_PLUS time series model. Microsoft Azure Machine Learning have an Anomaly Detector based on the Spectral Residual algorithm.</p><p>With all the interest in anomaly detection in recent years, there is a lot of hype too. Anomaly detection is often misinterpreted as something that tells you when something is wrong. This is a very common misconception; in reality, anomaly detection identifies significant changes. Herein lies the rub of anomaly detection: what is significant?</p><p>Another misconception, often found in the hype, is that there is some perfect formula or algorithm. This is just not true. There is no magic bullet in anomaly detection, there are only varying degrees of insight. This article sets out to provide the reader with a realistic view of the results of various algorithms and what one can expect from anomaly detection. It is not about saying <em>X</em> is good or <em>Y</em> is bad, in the sense of either the data or the algorithms; it is about demonstrating what anomaly detection can output.</p><p>By running a number of the popular state-of-the-art unsupervised algorithms against normal types of application and server/machine metrics, it is possible to compare these algorithms in terms of their detection rates.</p><p>It is important to state from the outset that this is <strong>not a benchmark</strong> of any kind. The algorithms have not been specifically tuned, other than some standardisation or transformation of the data where required by the algorithm. Algorithms which have specific hyperparameters have just been set to default or are automatically calculated.</p>
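<p>To make “default settings” concrete, here is a minimal sketch, assuming scikit-learn and a simple sliding-window representation of the series; it illustrates the approach rather than the exact pipeline used to produce the plots below:</p><pre>
# A minimal sketch (scikit-learn assumed) of running two of the algorithms
# listed below with default hyperparameters on a single time-series,
# using overlapping windows as features. Illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def windows(ts, size=10):
    # turn a 1-D series into overlapping windows so each point keeps context
    return np.array([ts[i:i + size] for i in range(len(ts) - size + 1)])

rng = np.random.default_rng(0)
ts = np.sin(np.linspace(0, 20 * np.pi, 2000)) + rng.normal(0, 0.1, 2000)
ts[1500] = 5.0  # inject an obvious spike

X = windows(ts)
iso = IsolationForest(random_state=0).fit(X)  # default contamination="auto"
lof = LocalOutlierFactor()                    # default n_neighbors=20

iso_flags = np.where(iso.predict(X) == -1)[0]
lof_flags = np.where(lof.fit_predict(X) == -1)[0]
print(f"IsolationForest flagged {len(iso_flags)} windows")
print(f"LocalOutlierFactor flagged {len(lof_flags)} windows")
</pre>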
<p>Furthermore, not all of these algorithms are suited to running in real time on streaming data; they are included here to demonstrate detection patterns.</p><p>The bag of algorithms being used here is:</p><ul><li>Robust Random Cut Forest</li><li>Local Outlier Factor</li><li>One Class SVM</li><li>DBSCAN (Density-based spatial clustering of applications with noise)</li><li>PCA (Principal Component Analysis)</li><li>Prophet</li><li>Isolation Forest (with a contamination setting of auto and 0.01)</li><li>Spectral Residual</li><li><a href="http://anomify.ai">anomify.ai</a> analysis</li></ul><p>These cover a wide range of popular unsupervised algorithms but are in no way exhaustive; there are of course many other state-of-the-art algorithms which are not sampled here, however, as with all of them, they generate varying results.</p><h4>Unstable time-series</h4><p>To begin with, a time-series which would be classified as unstable is analysed to demonstrate at what points the various algorithms identify anomalies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*wzoIpCaoqsJxb5a8.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*rlHtoJPn6HweS8bv.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*fiR1-Cv6pAKp9Ud8.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*6HEdNILF4vuihakh.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*w2ozX_4EESwqY3ZM.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*y6N13fHIcStF_hrH.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*_dDLzP5SJi09mC2r.png" /></figure><p>Obviously, one of the points of this article is to demonstrate how the analysis we use at <a href="https://anomify.ai">Anomify</a> compares to other methods.
The same data, analysed in real time via Anomify, produced the following results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*nXyIA-3tqEKS_As2.png" /></figure><h4>Stable time-series</h4><p>Next, the analysis of a time-series which would be classified as stable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*s3TbNvQYynh_L9l3.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*OjujrbUSseJfz7xx.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*obj5gJEbLKMd2Fe_.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*uZRkDBd67sPxdotx.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*H687CuDTwMGPK9b4.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*VSDm1xGdWp6o_u_D.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*_q59GQF9gc8PvHNS.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*nVXrgjpcLy8632PR.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*TYKJiUMjv30I4g-g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/789/0*6uKxqad-eA-SL1Wr.png" /></figure><h4>Another stable time-series</h4><p>With a bit more volatility.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*tFxyK0Jk9fN5bCgE.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*vnXBHQGycOg6wumy.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*Qy7dn7jGveJJk_SD.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*PRCg3Qxu2gG7nIsz.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*2q0wri5ATKFF7mft.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*isNr0ZOHROvsIK9G.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*wDobnMp6CiGEWI5b.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*gElMatbBYHx-wAlq.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*nBmmoG-npSNq6Wkm.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/787/0*oIemq0iKrAGbPOo9.png" /></figure><p>The above illustrations show that both the DBSCAN and OneClassSVM algorithms have failed due to inappropriate default parameters (and, in DBSCAN’s case, the epsilon value), which were not suitable for the specific time-series data. This highlights the challenge of using unsupervised algorithms with multiple tunable parameters. Such algorithms may perform well on some data with one set of parameters and poorly on another with the same parameters. Even if efforts are made to automatically determine optimal parameters, the variety of patterns in metrics may result in overfitting or underfitting.</p>
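<p>As a rough illustration of that parameter sensitivity, here is a small sketch, again assuming scikit-learn rather than the exact configuration behind the plots above, showing how the number of points DBSCAN labels as noise swings with its eps value on the same standardised series:</p><pre>
# A rough sketch (scikit-learn assumed) of DBSCAN's sensitivity to eps:
# the same standardised series gives very different noise counts.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
ts = rng.normal(50, 2, 1000)
ts[500] = 120.0  # one injected spike

X = StandardScaler().fit_transform(ts.reshape(-1, 1))
for eps in (0.0001, 0.1, 10.0):
    labels = DBSCAN(eps=eps).fit_predict(X)  # default min_samples=5
    noise = int((labels == -1).sum())        # DBSCAN labels outliers as -1
    print(f"eps={eps}: {noise} points labelled as noise")
# Depending on eps, anywhere from nearly all points to nearly none are
# labelled as noise, which is why one default cannot suit every metric.
</pre>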
<p>This makes unsupervised algorithms especially challenging for detecting patterns in a large metric population.</p><h4>Seasonal time-series</h4><p>Next, a seasonal time-series is analysed with anomalies (and matches); here we introduce the concept of matching, with more on that below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*LUR_j0LUg9iIEwtM.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*1H6h6n8s6cJ6NbWS.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*WG1DiCQczvNkq5ng.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*xqk5_WHIXe1uE8f4.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*reMGypmaExfiprXa.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*NFGqmC5DMEzbSN0E.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*K4QzMaLuCeZpqVbe.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*ARlvMh20K2NPJnt8.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/0*vC5dU8kIBAdvuFf4.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/0*zH6MAkwRJx5BkzYa.png" /></figure><p>In the above graph, matches can be seen. Matches are instances that were classified as potentially anomalous but then reclassified after having matched very similar patterns in data that the system has been trained on or has learnt as normal.</p><h3>A review of the results</h3><p>The results demonstrated here make it quite clear that some unsupervised anomaly detection algorithms can generate a fair amount of noise, and the results across different algorithms can vary quite dramatically. Most are usually correct in some context, however no single algorithm or method is going to achieve useful insights out of the box on <em>all</em> types of data. Although some algorithms do have “silences” built in to them, e.g. only flagging a single instance in X period as anomalous, many do not, which can lead to excessive false positives when applied in real world settings.</p><h3>Unsupervised, supervised, statistical, ML and deep neural network detection methods</h3><p>Even though unsupervised detection methods are not perfect and on their own in isolation are less than ideal for real world application, they are an absolute requirement in large metric populations, and in these settings they are required to have a very high performance. Some of the reasons unsupervised detection is required in this type of setting are:</p><ul><li>It is not feasible to identify and train everything so that supervised detection can be deployed.</li><li>Supervised models drift and need to be updated and validated frequently.</li><li>Supervised methods generally need hyperparameters tuned, often specifically to the type of data.</li><li>Supervised methods and machine learning tend to be expensive in computational time complexity.</li><li>Supervised methods often require additional storage for models and incur increased latency when models need to be loaded, updated or created.</li></ul><p>Although machine learning and neural networks have gained a lot of attention from the artificial intelligence community in recent years and do bring exciting new possibilities to time-series anomaly detection, in the unsupervised arena statistical methods still generally outperform these methods on point and collective anomaly detection.
This is also true in terms of analysis speed, with fast training and prediction times.</p><p>Getting a pipeline of suitable unsupervised methods is key to making anomaly detection useful, and further extending that with additional methods is the next evolution.</p><h3>Semi-supervised methods</h3><p>Having highlighted the challenges that all unsupervised anomaly detection methods pose, the next part is to look at how semi-supervised methods can come to the rescue, where false positives are identified as normal and that data is used as training for semi-supervised methods. This is where the concept of matches comes into play, and how a little training can result in a massive reduction in false positives.</p><p>Be sure to check out our next blog post in this series:</p><p><em>The way forward — semi-supervised anomaly detection</em></p><p>To find out more about anomaly detection for time-series metrics, check out Anomify <a href="https://anomify.ai">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9816953bc800" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>