Service Monitoring Strategies

If a service falls in the forest and there is nothing around to notice, does the service exist?

By now you’ve likely heard me or others chant the refrain that your services need good monitoring. This is especially true if you’re building out a microservice architecture. Knowing when your services are unhealthy is key to reducing the downtime of your systems.

But what does it mean to monitor a system?

Today I’ll describe three ways to monitor your services.

These are all strategies intended to notify you that your system has gone down in a production environment. They’re intended to find the bugs that slipped past your test suites or alert you when issues arise due to an external force (network issues, a server dying, a DDoS attack, etc.).

Metric Monitors

The first, and often easiest, way to know if your service is healthy is to watch the stream of data emitted from your application. These metrics can be published directly or derived from the logs generated while processing requests.

This strategy assumes that you have a place to aggregate these metrics. This could be a self-hosted instance of Prometheus or one of the numerous offerings from the likes of Datadog, Splunk, Sumo Logic, and more.

Once you have this data collected in a central place, you can set up monitors that watch for trends in your application and trigger an alarm if that trend is going in the wrong direction.

Some example metric monitors:

  • Number of HTTP 500 response codes within a time period reaches a certain threshold.
  • Size of a Dead Letter Queue (DLQ) is greater than 0
  • Size of a message queue is larger than some threshold
  • Response time for a service or particular set of API endpoints exceeds a threshold

In some cases, the check is binary: does the current value exceed the configured threshold? In other cases, you’re looking at a series of data over a time period, such as the number of error responses in the past, say, 5 minutes.

Figuring out which metrics to monitor and what thresholds to set can be something of an art and often requires some tuning over time.
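To make the two kinds of check concrete, here is a minimal Python sketch. It is not how any particular monitoring product works; the names exceeds_threshold and WindowedCountMonitor are made up for illustration, and it assumes event timestamps (in seconds) are recorded in order.

```python
from collections import deque


def exceeds_threshold(current_value: float, threshold: float) -> bool:
    """Point-in-time check, e.g. dead letter queue size greater than 0."""
    return current_value > threshold


class WindowedCountMonitor:
    """Hypothetical monitor: counts events (e.g. HTTP 500s) in a sliding window."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()  # event times, oldest first

    def record(self, timestamp: float) -> None:
        """Record one event; timestamps are assumed to arrive in order."""
        self._timestamps.append(timestamp)

    def is_alarming(self, now: float) -> bool:
        """Drop events that fell out of the window, then compare the count."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

For example, WindowedCountMonitor(window_seconds=300, threshold=3) would alarm once three errors have been recorded within the last 5 minutes. In practice your metrics platform does this aggregation for you; the sketch only shows the shape of the rule you end up configuring.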

Heartbeats

The second way to monitor your application is to continually make a request to it. This is often referred to as a heartbeat check or API/HTTP test.

The idea is that you set up something that will call a specific endpoint in your application and check that the response is a healthy, expected value. Sometimes this is a purpose-built endpoint such as /healthcheck, or it could be a particular page on your website (e.g. the home page). In either case, you’re usually looking for an HTTP 200 response code and possibly looking at the contents of the response to make sure certain text or HTML elements are present.

How you set this up will depend on the monitoring system you are using. Datadog and New Relic call these API Tests. In AWS you can set this up using a CloudWatch Synthetics canary.

Regardless of the system, you typically configure the endpoint to hit, how long you’re willing to wait for a response, and the expected response code and page contents. If the request fails after a certain number of attempts, the system will alert someone to investigate.
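If you wanted to roll a heartbeat by hand, the configuration described above maps onto a small function. The following is a sketch using only the Python standard library; the function names, default values, and the injectable fetch parameter are assumptions for illustration, not the API of any monitoring product.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout: float) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; report the status code instead.
        return err.code, ""


def heartbeat(
    url: str,
    expected_text: str = "",
    timeout: float = 5.0,
    attempts: int = 3,
    delay: float = 1.0,
    fetch: Callable[[str, float], Tuple[int, str]] = http_fetch,
) -> bool:
    """Return True if any attempt gets a 200 containing expected_text."""
    for attempt in range(attempts):
        try:
            status, body = fetch(url, timeout)
            if status == 200 and expected_text in body:
                return True
        except OSError:
            # Connection errors and timeouts count as a failed attempt.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before retrying
    return False
```

A scheduler (cron, for instance) could call heartbeat against your /healthcheck URL every minute and page someone when it returns False. The fetch parameter exists only so the check can be exercised without a real network call.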

Synthetics

The third and most complicated mechanism for monitoring your application is to use it like your users would and verify that important pathways are functioning as expected.

This is often called a Synthetic Test.

Setting this up is very similar to writing an integration test and might even use the same frameworks (e.g. Cypress or Selenium). The monitoring system makes multiple requests to the application by driving a headless browser and verifies that each step along the way is working as intended.

Some examples of such a test include:

  • User can log in
  • User can sign up
  • User can add products to the cart and check out

These can be trickier to build as you don’t usually want to taint your production data with these test operations. To deal with this, you might:

  • Create a test user in the production system for the purposes of running these tests.
  • Clean up the test data after each run of the synthetic.
  • Add code to your application to ignore certain operations when it’s a synthetic test.
  • Exclude test data from your reporting or data warehouse environment.

While they may require more work, sometimes this type of test is the best way to understand that the critical paths within your application are working correctly.
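The overall shape of a synthetic, stripped of the browser automation, is a sequence of named steps followed by cleanup that runs whether or not a step fails. Here is a rough Python sketch of that shape; run_synthetic and the step format are hypothetical, and in a real test each step would drive a browser through a framework such as Cypress or Selenium.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_synthetic(steps: List[Step], cleanup: Callable[[], None]) -> Optional[str]:
    """Run steps in order; return the first failing step's name, or None."""
    try:
        for name, action in steps:
            try:
                action()
            except Exception:
                # Stop at the first broken step so the alert can name it.
                return name
        return None
    finally:
        # Always remove test data, even when a step failed part-way through.
        cleanup()
```

Returning the name of the failing step makes the alert more useful: "check out failed" tells whoever is on call where to start looking.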

Which to Choose?

So which strategy should you use?

The choice often comes down to a balance between cost to build and maintain and the criticality of the functionality being monitored.

  • Metrics are usually cheap. You’re often capturing them anyway and the cost of setting up a monitor is usually negligible.
  • Heartbeats tend to be low cost. The implementation is usually straightforward and the cost per execution is usually fractions of a penny.
  • Synthetics are the most expensive. They’re difficult to write, and often come with a higher cost to run.

Look at the important functions within your application. What would be the cost to the business if that functionality went down? With that information, you can choose whether metrics, heartbeats, or synthetics are the right monitoring solution for you.

Question: What is your favorite way to monitor your applications?