https://www.henrik.org/


Sunday, March 28, 2021

Building for high availability: Measuring success

Although high availability is easy to grasp conceptually, it can be quite hard to define in practice. To strive for higher and higher availability you need to figure out how to measure it, and to measure it you need to define exactly how to calculate it.

A typical API call flows from a client, across the internet, to the boundary of your service, and the response then travels back the same way.

It is important to realize that any part of this chain can fail, and if it does, it will lead to a drop in availability as perceived by your clients. A large part of this chain is outside your control and is fiendishly hard to even measure. If you only measure availability for requests inside your service, you are missing a lot of potential failure modes: if one of your hosts goes bad it might not be able to report metrics for failing requests, or incoming network traffic might stop altogether.

It is often sufficient to measure availability from the first system you have consistent logs from. This usually means the gateway or, if you are not using one, a load balancer. Amazon API Gateway can give you excellent request logs that are very useful for measuring availability and latency, among other things. It also emits Amazon CloudWatch Metrics that measure availability directly, both for the entire API and for individual methods.

How do you define availability?

The first thing you need to do is separate errors and faults in your metrics. An error is a request that could not be processed because of a problem with the contents of the request. A fault is a request failure caused by a problem in the communication chain or in the implementation of the service. It is important to separate these because, as a service owner, you have little to no control over errors; they are due to mistakes in the client that calls you. Faults, however, do reflect your availability and do not depend on mistakes made by the calling client. Worth noting is that even though errors generally do not count against availability, they can if they represent failures that should not happen because of a bug in your code, so it is worth having some visibility into unusually high error rates.

If you are using HTTP to implement your API, errors should be any response status code from 400 to 499, and faults any status code of 500 or higher. Make sure that you implement your service to follow this pattern (basically, do not invent your own usage pattern for the HTTP status codes). If you are using Amazon API Gateway, you get one metric for 4xx responses and a separate metric for 5xx responses. If you need better visibility into exactly what kind of error you are receiving, you can also set up an Amazon CloudWatch Logs metric filter on the request log from Amazon API Gateway.
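
As a minimal sketch of this classification (the helper name is mine, not part of any AWS API), the bookkeeping boils down to something like:

  def classify_status(status_code: int) -> str:
      """Classify an HTTP response status code for availability accounting."""
      if 400 <= status_code <= 499:
          return "error"   # client problem, excluded from availability
      if status_code >= 500:
          return "fault"   # service problem, counts against availability
      return "success"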

How to calculate availability

Usually, availability is calculated as a percentage: the share of traffic that is not faulty compared to the total request count. Exactly how this percentage is calculated, though, is not as easy as it might sound; more on that in a bit.

When it comes to picking a goal for availability, it is up to you as an engineer to come up with a goal you are comfortable with. A common pattern is that once you have set a proper availability goal and have good visibility into it on an ongoing basis, you can strive higher by tightening the goal incrementally. As an example, most Amazon Web Services have an availability Service Level Agreement (SLA) of 99.95% or higher. Most services can probably make do with a lower goal if you implement appropriate retries in your clients.

Simple Availability

The most obvious and simple way of defining availability is the ratio of non-fault requests to the total number of requests. With this definition, a goal of 99.95% availability means you can have at most 1 faulty request for every 2000 requests. The advantage of this approach is that the value generally comes straight from your metrics and is very easy to monitor and calculate. Using Amazon API Gateway, this availability can be calculated directly from metrics emitted to Amazon CloudWatch Metrics. This is also a metric that is suitable for putting on a graph over time to visualize availability.
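
As a rough sketch (assuming boto3 credentials are configured and your API is named YourAwesomeApi, as in the dashboard below), you can pull this number with CloudWatch metric math using the same expression the dashboard widget uses:

  import datetime
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
  now = datetime.datetime.utcnow()

  response = cloudwatch.get_metric_data(
      MetricDataQueries=[
          # e1 turns the 5XXError rate (0..1) into an availability percentage.
          {"Id": "e1", "Expression": "100*(1-m1)", "Label": "Availability"},
          {
              "Id": "m1",
              "MetricStat": {
                  "Metric": {
                      "Namespace": "AWS/ApiGateway",
                      "MetricName": "5XXError",
                      "Dimensions": [{"Name": "ApiName", "Value": "YourAwesomeApi"}],
                  },
                  "Period": 60,
                  "Stat": "Average",
              },
              "ReturnData": False,
          },
      ],
      StartTime=now - datetime.timedelta(hours=1),
      EndTime=now,
  )
  print(response["MetricDataResults"][0]["Values"])  # one availability value per minute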

Calculating Availability for a Service Level Agreement

This way of measuring availability has its issues, though. With this definition, if you have otherwise perfect availability you can have an almost 4.5-hour outage without breaking a 99.95% availability goal for the year. But if you have a continuous background level of imperfect availability, that does not generally affect your consumers significantly, yet it will significantly reduce how long an outage you can sustain before you have broken your goal. This difference becomes increasingly important once you have an actual Service Level Agreement (SLA) for your service.

One way of addressing this shortcoming is to define your availability as the number of minutes in which you are above a certain minimum availability. An example of this definition would be measuring availability as the number of minutes in which the average availability was above 99.99%. You can now have an availability SLA of 99.95%, and if your availability normally stays above 99.99%, you get to use the full 4.5 hours of outage before you break that SLA over a year. The bad news is that there is no easy way to calculate this metric without looking at the individual availability data point for every minute of the period. The same method can also be used with any period other than a minute.
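
A minimal sketch of the bookkeeping, assuming you already have one availability value per minute (for example the Values list returned by the get_metric_data call above):

  def sla_availability(per_minute_availability, threshold=99.99):
      """Return the percentage of minutes whose availability met the threshold."""
      if not per_minute_availability:
          return 100.0
      good = sum(1 for a in per_minute_availability if a >= threshold)
      return 100.0 * good / len(per_minute_availability)

  # Example: a 30-day month with one 30-minute outage and perfect availability otherwise.
  minutes = [100.0] * (30 * 24 * 60 - 30) + [80.0] * 30
  print(f"{sla_availability(minutes):.3f}%")  # ~99.931%, breaching a 99.95% per-minute SLA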

Optimizing for client experience

If you are looking for the best experience for your clients, though, the previous methods still have shortcomings. To illustrate this, take an example where you introduce a bug that makes 100% of calls fail for 1% of your clients. In this example, the way your API is used, clients normally make an initial list request followed by 25 detail requests. For the 1% of clients that hit the bug, the initial list call fails, so while working clients make 26 calls on average, the failing clients only make a single call. The simple availability is then 99 * 26 successful requests out of 99 * 26 + 1 total requests, which translates to a simple availability of 99.96%. However, this hides the fact that 1% of your clients cannot use your service at all.
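
To make the arithmetic concrete, here is the same scenario scaled to a population of 100 clients (the numbers are exactly those from the example):

  working_clients, broken_clients = 99, 1
  calls_per_working_client = 26          # 1 list call + 25 detail calls
  successful = working_clients * calls_per_working_client
  failed = broken_clients * 1            # the single failing list call
  availability = 100.0 * successful / (successful + failed)
  print(f"{availability:.2f}% request availability")  # 99.96%
  print(f"{100.0 * broken_clients / (working_clients + broken_clients):.0f}% of clients locked out")  # 1%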

The way to measure availability and catch cases like this is to define your availability goal per time period and per client. As an example, you can define availability as the number of minutes in which 99.5% of your clients have more than 99.99% availability. In the example above only 99% of clients have any availability, which means that every minute is an unavailable minute by this metric until the bug is fixed. The bad news is that there really is no way to calculate this kind of availability without processing every request in every minute to determine whether you are in breach, so it is by far the most complicated and expensive way of calculating availability. This method could potentially save you money on SLA refunds, though: if you apply it to SLA calculations you can track which clients your service has breached the SLA for on a per-client basis, instead of the previous method which would apply equally to all clients once in breach.
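
A sketch of that per-minute, per-client check might look like the following; the request tuples and field names are made up for illustration, and in practice they would come from your gateway's request log:

  from collections import defaultdict

  def minute_is_available(requests, client_threshold=99.99, population_threshold=99.5):
      """Decide whether one minute counts as available.

      `requests` is an iterable of (client_id, status_code) tuples for that minute.
      The minute is available if at least `population_threshold` percent of clients
      reached `client_threshold` percent non-fault requests.
      """
      totals = defaultdict(int)
      faults = defaultdict(int)
      for client_id, status_code in requests:
          totals[client_id] += 1
          if status_code >= 500:
              faults[client_id] += 1
      if not totals:
          return True
      happy = sum(
          1 for c in totals
          if 100.0 * (totals[c] - faults[c]) / totals[c] >= client_threshold
      )
      return 100.0 * happy / len(totals) >= population_threshold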

How to detect network outages

There is a problem with measuring availability by simply instrumenting the boundary of the service: what if you encounter an issue outside that boundary? If your internet service provider suffers an outage, it stops all incoming traffic to your service. Your measured availability would still be 100%, because there are no failing requests that you are aware of; they fail before they even reach a point in the communication chain that you can measure.

The solution to this problem is to create a canary that makes at least a minimum number of requests to your service in a way that imitates real client scenarios as closely as possible. This can be as simple as an Amazon CloudWatch Events rule that triggers an AWS Lambda function that generates traffic to your service. On top of this you need to add monitoring that alerts you when there is no traffic coming into your site. Ideally, as your service grows, you can tune this alarm to fire when the traffic pattern drops abnormally low instead of close to 0. That way you can also detect partial outages that are normally out of your control to measure. Furthermore, make sure that your canary emits metrics on the success of the calls it is making. Your canary traffic metrics represent a true measurement of availability and latency covering the entire communication chain. They only represent a small portion of all traffic, but they properly measure every potential failure a real client could encounter.
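
A minimal sketch of such a canary Lambda handler follows; the endpoint URL, metric namespace, and single GET request are placeholders, and a real canary would exercise representative client scenarios instead:

  import time
  import urllib.error
  import urllib.request
  import boto3

  cloudwatch = boto3.client("cloudwatch")
  ENDPOINT = "https://api.example.com/health"  # placeholder for a real client scenario

  def handler(event, context):
      start = time.monotonic()
      try:
          with urllib.request.urlopen(ENDPOINT, timeout=10):
              success = 1.0
      except urllib.error.HTTPError as err:
          success = 0.0 if err.code >= 500 else 1.0  # 4xx is a canary bug, not a fault
      except Exception:
          success = 0.0  # network failures count as faults from the client's point of view
      latency_ms = (time.monotonic() - start) * 1000
      cloudwatch.put_metric_data(
          Namespace="Canary/YourAwesomeApi",
          MetricData=[
              {"MetricName": "Success", "Value": success, "Unit": "Count"},
              {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
          ],
      )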

Latency as an aspect of availability

Even though latency technically does not affect your availability, it is extremely important for a good client experience. Latency can also be hard to visualize. You might be tempted to believe that just taking the average of your request times will give you a good idea of what the latency of your service looks like. However, latency tends to have a very long tail, and using the average is generally not best practice for ensuring that your clients have a good experience. As an example, below is the latency charted for a week of a sample service, aggregated as average, median (p50), p90 and p99. If you are unfamiliar with the pXX notation, it denotes the percentile: the p99 graph shows the latency that 99% of requests complete within, in other words the threshold exceeded only by the slowest 1% of requests.

As you can see in the example above, there is a big difference depending on how you measure latency. The graph for the maximum is cut off and goes all the way to 29 seconds in the worst case. In any environment with software defined networking and a decently high load you will see strange outliers, so the maximum measurement is usually not very useful. Similarly, the average can hide issues that affect a not insignificant amount of your traffic. Using the p99 measurement to visualize your latency is usually a good middle ground. It includes enough of your worst-behaving requests to show significant issues with slow outliers, but it ignores the more egregious network blips that give extremely rare but very high measurements which would otherwise skew your graph.

When measuring anything using p99 aggregation, another very important thing is the period over which you aggregate. You want to make sure that the period you are measuring contains at least 100 measurements. If it does not, then p99 is the same as the maximum, which leads to undesirable results. With at least 100 requests in the period, at least 1 anomalous request can be removed before it affects your p99 measurement. If you have a minimum call rate of 1 call per second, you will need a measurement period of at least 1 minute and 40 seconds or you will fall into this trap. Usually you would use 5 minutes if you do not have enough traffic to measure p99 over 1 minute.
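
A quick sketch of why the sample count matters, using a simple nearest-rank percentile (CloudWatch's interpolation differs in detail, but the effect is the same):

  import math

  def percentile(samples, p):
      """Nearest-rank percentile of a list of latency samples (milliseconds)."""
      ordered = sorted(samples)
      rank = math.ceil(p / 100.0 * len(ordered))
      return ordered[rank - 1]

  one_minute = [20] * 59 + [5000]      # 60 requests: one 5-second blip in a slow minute
  five_minutes = [20] * 299 + [5000]   # 300 requests: the same blip over a longer period

  print(percentile(one_minute, 99))    # 5000 -> with fewer than 100 samples p99 equals the max
  print(percentile(five_minutes, 99))  # 20   -> with enough samples the blip is excluded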

Finally, it is worth pointing out that each hop in your service architecture adds latency. As with availability, it is important to measure latency as close to the client as possible. Apart from canaries you can rarely measure it from the client itself, but the gateway is usually a good place to collect latency measurements that are representative of the general client experience.

Create an availability dashboard

Your goal should always be to strive for higher and higher availability. To reach that goal you need visibility into what your current availability actually is. At minimum this requires you to monitor the following on a continuous basis.

  • Availability - The number of non-fault requests divided by the total number of requests coming into your service.
  • Error rate - The rate of invalid requests that you are receiving. Even though a change here can be a false alarm, an unexpected jump in the rate can indicate a faulty deployment causing previously valid traffic to fail.
  • Transactions per second (TPS) - The number of requests coming into your service. The key thing to look for is a precipitous drop, because that likely means a network failure has occurred before the point where you can measure it. A large, unexpected increase in traffic could also indicate a denial of service attack.
  • Latency - You should have goals on your latency and strive to decrease it. The way to have and keep these goals is to put them on a dashboard to make sure that you are aware of any changes in trends. If your service has different classes of operations that have significantly different latency profiles, you might consider separating each one out as a separate graph.

Below is an example dashboard that you can implement if you are using Amazon API Gateway as the gateway for your API.

Here is the definition of this dashboard in Amazon CloudWatch Dashboards. All you need to do is change the ApiName metric dimension from YourAwesomeApi to whatever your API is called and reuse it. You might also need to tweak the minimum TPS limit and the error rate threshold to something suitable for your traffic patterns.

  {
    "widgets": [
      {
        "height": 6,
        "width": 12,
        "y": 0,
        "x": 0,
        "type": "metric",
        "properties": {
          "metrics": [
              [ { "expression": "100*(1-m1)", 
                  "label": "Availability",
                  "id": "e1", "region": "us-east-1" } ],
              [ "AWS/ApiGateway", "5XXError", 
                "ApiName", "YourAwesomeApi", 
                { "id": "m1", "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Average",
          "period": 60,
          "title": "API Availability",
          "yAxis": { "left": {
            "min": 99.7, "max": 100, "showUnits": false, "label": "%"
          } },
          "annotations": { "horizontal": [
            { "label": "Goal > 99.95%", "value": 99.95 }
          ] }
        }
      },
      {
        "height": 6,
        "width": 12,
        "y": 0,
        "x": 12,
        "type": "metric",
        "properties": {
          "metrics": [
              [ { "expression": "m1 * 100", 
                  "label": "Error Rate", 
                  "id": "e1", "region": "us-east-1" } ],
              [ "AWS/ApiGateway", "4XXError", 
                "ApiName", "YourAwesomeApi", 
                { "id": "m1", "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Average",
          "period": 60,
          "title": "Error Rate",
          "yAxis": { "left": {
             "min": 0, "max": 10, "label": "%"
          } },
          "annotations": { "horizontal": [
            { "label": "Error Rate < 5%", "value": 5 }
          ] }
        }
      },
      {
        "type": "metric",
        "x": 0,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
              [ { "expression": "m1 / PERIOD(m1)", 
                  "label": "TPS", "id": "e1" } ],
              [ "AWS/ApiGateway", "Count", 
                "ApiName", "YourAwesomeApi",
                { "id": "m1", "period": 60, "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Sum",
          "period": 300,
          "title": "Request Rate",
          "yAxis": { "left": {
            "min": 0, "showUnits": false
          } },
          "annotations": { "horizontal": [
            { "label": "TPS > 20", "value": 20 }
          ] }
        }
      },
      {
        "type": "metric",
        "x": 12,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
              [ "AWS/ApiGateway", "Latency", 
                "ApiName", "YourAwesomeApi", 
                { "label": "p99 Latency" } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "p99",
          "period": 60,
          "start": "-P7D",
          "end": "P0D",
          "title": "Latency",
          "yAxis": { "left": {
            "min": 0, "label": "Milliseconds", "showUnits": false
          } },
          "annotations": { "horizontal": [
            { "label": "Latency < 1s", "value": 1000 }
          ] }
        }
      }
    ]
  }
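
If you prefer to create the dashboard from code rather than pasting the JSON into the console, a sketch using boto3 might look like the following (the dashboard name and file name are placeholders):

  import json
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  # The JSON definition above, saved to a file next to this script.
  with open("availability-dashboard.json") as f:
      dashboard_body = f.read()

  cloudwatch.put_dashboard(
      DashboardName="YourAwesomeApi-Availability",  # placeholder name
      DashboardBody=dashboard_body,
  )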

Summary

Do:

  • Count faults against your availability
  • Have a canary to always have some traffic
  • Measure availability and latency as close to the client as possible
  • Have a dashboard that shows at minimum faults, errors, requests over time, and p99 latency

Don't:

  • Count errors against your availability
  • Aggregate latency as average, max, or median
  • Measure availability or latency from your service implementation
