How machine learning can drastically limit your downtime

02/06/2021 Tom De Blende

Did you know that the first gasoline-fueled cars were built at the end of the 19th century? At the time, there was no way of knowing when you were about to run out of fuel. If you wanted to know how much gas you had left, you had to turn off the engine, get out of the car, pop open the hood, and insert a dipstick to measure the tank’s fuel level. Can you even imagine the hassle? The first dashboard fuel gauges were introduced in cars in the 1920s, but it wasn’t until 1983 that the low fuel indicator made its debut. And then, it took many more years for cars to be equipped with the mileage predictors we know today. Now, while mileage prediction isn’t quite powered by machine learning, the two do have something in common: both take several factors into account to arrive at a prediction. If you are suffering from a severe case of lead foot, for example, your mileage prediction will plummet.

From fuel gauge to disk space alert

Let’s use this trip down memory lane as a stepping stone. Imagine your car is a simple EC2 instance. How would you feel if you had to log on to your server and open up Explorer or run df just to find out how much disk space you had left on your system drive? What would you think of having to check your dashboards every day? Would you be content with a virtual dipstick, as it were?

Thankfully, this is not our reality today. Even traditional monitoring systems have long been capable of providing threshold-based alerts. Think of them as configurable low fuel indicator alerts for your systems.

Threshold-based alerts: a solid start

The question we need to ask ourselves is whether we are content with threshold-based alerts. A single gigabyte of free disk space might be plenty on one server and impending doom on the next. How do we fix this? Percentage-based thresholds are an interesting option, but they also come with a few caveats. While 5% might be ample free space for a 20TB file server (that is still 1TB free), it would be problematic for a build server.
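To make the caveat concrete, here is a minimal sketch of a threshold check that combines an absolute and a percentage limit per server role. The role names, the `DiskPolicy` class and every number in `POLICIES` are assumptions made up for illustration, not recommendations.

```python
from dataclasses import dataclass

GB = 1024 ** 3
TB = 1024 ** 4


@dataclass
class DiskPolicy:
    """Hypothetical per-role policy: alert when free space drops below either limit."""
    min_free_bytes: int
    min_free_fraction: float


# Assumed values, for illustration only.
POLICIES = {
    "file-server": DiskPolicy(min_free_bytes=200 * GB, min_free_fraction=0.02),
    "build-server": DiskPolicy(min_free_bytes=50 * GB, min_free_fraction=0.20),
}


def needs_alert(role: str, total_bytes: int, free_bytes: int) -> bool:
    """Return True when free space violates the role's absolute or percentage limit."""
    policy = POLICIES[role]
    free_fraction = free_bytes / total_bytes
    return (
        free_bytes < policy.min_free_bytes
        or free_fraction < policy.min_free_fraction
    )


# 5% free on a 20 TB file server still leaves 1 TB.
print(needs_alert("file-server", 20 * TB, 1 * TB))     # False
# 5% free on a 500 GB build server leaves only 25 GB.
print(needs_alert("build-server", 500 * GB, 25 * GB))  # True
```

Note that someone still has to pick those numbers for every role, which is exactly the guessing game described next.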

So you need to differentiate. And guess. Can you predict your system needs based on historical events and best practices?

Now, if there is one thing computers are better at than humans, it’s spotting patterns. They can learn from those patterns and deduce when things are “not normal”. Then, they will kindly alert you when there is cause for alarm. Much like the “miles to empty” feature in your car.
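As a rough illustration of the “miles to empty” idea applied to disk space, the sketch below fits a straight line to daily usage samples and extrapolates to the point where the disk is full. It assumes one sample per day and roughly linear growth. The function name `days_until_full` and the sample data are hypothetical, and real tools use far more sophisticated models.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Fit a least-squares line to daily usage and extrapolate to capacity.

    Returns None when there is too little data or usage is flat or shrinking,
    because no fill date can be predicted in that case.
    """
    n = len(used_gb)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_gb))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance  # growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_today = intercept + slope * (n - 1)
    return max(0.0, (capacity_gb - fitted_today) / slope)


# Five days of usage growing by 2 GB per day on a 128 GB disk.
history = [100, 102, 104, 106, 108]
print(days_until_full(history, capacity_gb=128))  # 10.0
```

Unlike a fixed threshold, the answer adapts to how fast each individual server is actually filling up.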

Putting machine learning to good use

And that is exactly what AWS DevOps Guru will do for you. The service uses machine learning techniques to spot anomalies in your environment and alert you before accidents happen. DevOps Guru uses machine learning models that were informed by years of Amazon.com and AWS operational excellence. You can use these models to identify anomalous application behavior (such as increased latency, error rates, or resource constraints) and surface critical issues that could cause outages or service disruptions.
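To give a feel for what “spotting anomalies” means at its very simplest, here is a toy sketch that flags a latency sample when it deviates strongly from the samples just before it. This is not how DevOps Guru works internally (its models are not described here). The function name `find_anomalies`, the window size and the limit of three standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_limit=3.0):
    """Return the indexes of samples that deviate from the preceding window
    by more than z_limit standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to compare against; skip it.
            continue
        if abs(samples[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Made-up latency samples in milliseconds, with one obvious spike.
latency_ms = [120, 118, 122, 121, 119, 120, 123, 480, 121, 119]
print(find_anomalies(latency_ms))  # [7]
```

No threshold in milliseconds was configured anywhere: what counts as “normal” is derived from the data itself.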


As an official AWS Managed Service Provider, Cloudar is constantly looking for ways to improve customer service. We help our customers to reach the highest levels of uptime. Or more specifically: the lowest levels of unplanned downtime, which is different, but I digress.

And while we still use traditional threshold-based monitoring tools (which do a great job in their own right), using ML-based predictive monitoring and anomaly detection is something we have been doing for a while. With, I must admit, mixed feelings…

The predictive monitoring trap

Let’s circle back to our story about fuel. Imagine your car yelling at you to stop for gas every few hundred miles, only for you to arrive at the gas station and learn that the tank is half full. That is exactly the issue with a lot of tools that provide anomaly detection: false positives. 

At first glance, false positives don’t seem too bad. But trust me, they are. I would even argue that they can be as bad as missing a true alert. Any on-call engineer will tell you what happens when you receive too many false positives: you stop paying attention.
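One common way to cut down on noisy alerts, sketched below, is to page only when a condition has held for several consecutive samples. This is a generic technique and not a description of how any particular product handles false positives. The function name `page_indexes` and the default of three samples are assumptions for illustration.

```python
def page_indexes(breaches, required_consecutive=3):
    """Return the sample indexes at which an on-call page would be sent.

    A page fires once per streak, at the moment the condition has been true
    for required_consecutive samples in a row.
    """
    pages = []
    streak = 0
    for i, breached in enumerate(breaches):
        streak = streak + 1 if breached else 0
        if streak == required_consecutive:
            pages.append(i)
    return pages


# Made-up check results: two short blips, then a sustained problem.
noisy = [False, True, False, True, True, False, True, True, True, True, False]
print(page_indexes(noisy))  # [8]
```

The two short blips never reach the engineer, but the real problem is reported two samples later than it could have been. That trade-off between noise and delay is part of what makes this hard.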

So yes, there is a lot of complexity in building an ML-based monitoring tool. And our experience as a launch partner (bragging rights here) is that DevOps Guru does a good job when it comes to limiting false positives. But of course, we are looking forward to seeing the tool grow even more.

Did my blog post spark your interest? Cloudar is hosting an EMEA DevOps Immersion Day on June 22 with a strong focus on DevOps Guru. The Immersion Days in the US were a big success and we are organizing this EMEA session with AWS to accommodate EMEA timezones. Sign up here.
