https://www.henrik.org/

Blog

Monday, July 17, 2023

Why I created Underscore Backup

I started running a server for storing all my projects as well as various multimedia artifacts in 1999, with a small desktop computer and a 20GB HDD. As the size and personal importance of this server grew, within a few years I started running RAID5 and then RAID6 to make sure data was not lost from single drive failures. Despite this, in 2006 the then-current incarnation of this server encountered a catastrophic 3-drive failure. I only managed to recover from it after a tremendous amount of work and a fair amount of luck, which included, among other things, manually patching the Linux RAID kernel code to remove certain fail-safes as I pulled data off the partially assembled RAID.


"The Server" in its current iteration.

This episode led me to look for ways to safeguard against this ever happening again. Looking through what options were available to me I found Crashplan which did address all my needs at a reasonable price. My initial backup to Crashplan took several years to complete over my 20mbit/s broadband uplink as my server had at this point grown to several TB.

A few years after I started using Crashplan they stopped offering consumer backups and the only way to keep using them was to migrate to their business plan which I did. However, Crashplan only allowed you to migrate a few TB per computer at the time which meant that I had to re-upload most of my backup again. Fortunately, at this point, I had gotten a fiber internet connection with a reasonable uplink that allowed me to re-upload this data in less than a year. As my backup of this server grew Crashplan also started showing its flaws where it required several GB of memory to be able to back up my server, but it did work and allowed me a reasonable peace of mind for the contents of my server.

This went on for a few years after which I was contacted by Crashplan (Now called Code42) and told that unless I reduced the size of my backup to under 10 TB, they would terminate my account since they considered me violating their terms of service by keeping too large a backup.

From: Support Ops (Code42 Small Business Support) 
Date: Feb 6 2020, 10:38 AM CST 

Hello Administrator,

Thank you for being a CrashPlan® for Small Business subscriber. We appreciate the
trust that you have placed in CrashPlan - that relationship is important to us.
Unfortunately, we write to you today to notify you that your account has
accumulated excessive storage, which will result in degraded performance. You 
have one of the largest archives in the history of CrashPlan. It is so large, we
cannot guarantee the performance of our service. Due to the size of your
archive, full restores of your backup archive, and even selectively restoring 
specific files, may not be possible.

As a result, we are notifying you, per our Master Service Agreement and
Documentation, to reduce your storage utilization for each device to less than
10TB by June 1, 2020. Note that we have extended your subscription to June 1, 
2020 to give you ample time to make changes. If you do not do so by June 1, 
2020, your subscription will not be renewed, and your account will be closed at
the end of your current subscription term.

…

Thank you, 
Eric Wansong, Chief Customer Officer, Code42

The server I was using was Linux based and as far as I could tell Crashplan was the only competitor on the market providing cloud-based backup solutions for that OS. This was when I decided to start working on Underscore Backup as a means for me to continue making backups of my server as I couldn’t find any existing alternatives that fulfilled my needs. The first version was command line only and very primitive even though it did support point-in-time recovery, backup sets as well as obviously efficiently handling my very large backup. Another feature that was built in from the beginning was a strong focus on encrypting everything as much as possible so that any medium could be used for backups even if it was not properly secured from prying eyes. Creating the initial backup of my server using Underscore Backup used a more or less sustained 600mbit/s (To be compared with the at the time impressive 60mbit/s that I experienced using Crashplan on the same connection).

At the same time, I also started using the iDrive service for backing up my laptops and various other smaller Windows and MacOS based machines. I did this because I didn’t think the CLI (Command Line Interface) only implementation of Underscore Backup was convenient enough to be used on these machines. This situation continued for a few years, during which the CLI-only version of Underscore Backup backed up my server data to cloud block storage and my other machines were backed up by the iDrive service. This all came crashing down when my main development laptop of several years had a catastrophic SSD failure and I had to restore my data from iDrive. In the process I found out two things about how the iDrive service works.

The first is that even though iDrive keeps track of versions of your files, it does not keep track of directory contents or deletions of files. This is critical to any developer, and I was restoring a large development repository with files that I had been working on while iDrive was running in the background. For those of you who are not developers, we rename files a lot. Every one of the old names of all my renamed files came back when I did a full restore of the contents of my laptop’s hard drive. That meant that any repository of code that I had worked on since I started using iDrive was no longer in a buildable state without a considerable amount of work.
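To make the rename problem concrete, here is a toy model of the two restore strategies. It is only a sketch for illustration: the `Backup` class and its method names are made up for this post and do not reflect how iDrive or Underscore Backup are actually implemented. It assumes snapshots are recorded in increasing time order.

```python
class Backup:
    """Toy backup log (hypothetical): stores file versions and directory listings."""

    def __init__(self):
        self.versions = {}   # path -> list of (time, content)
        self.listings = []   # list of (time, set of paths present at that time)

    def snapshot(self, time, files):
        """Record the full state of the directory at `time`."""
        for path, content in files.items():
            self.versions.setdefault(path, []).append((time, content))
        self.listings.append((time, frozenset(files)))

    def restore_versions_only(self, time):
        """Newest version of every path ever seen; deletions and renames are ignored."""
        restored = {}
        for path, history in self.versions.items():
            older = [content for t, content in history if t <= time]
            if older:
                restored[path] = older[-1]
        return restored

    def restore_point_in_time(self, time):
        """Only the paths that existed in the latest snapshot at or before `time`."""
        present = [paths for t, paths in self.listings if t <= time]
        if not present:
            return {}
        everything = self.restore_versions_only(time)
        return {path: everything[path] for path in present[-1]}


backup = Backup()
backup.snapshot(1, {"src/util.py": "v1"})
backup.snapshot(2, {"src/helpers.py": "v1"})  # util.py was renamed to helpers.py

stale = backup.restore_versions_only(2)
exact = backup.restore_point_in_time(2)
print(sorted(stale))  # both the old and the new file name come back
print(sorted(exact))  # only the file that existed at time 2
```

With only file versions tracked, the old name is restored alongside the new one, which is exactly what leaves a repository in a non-buildable state after many renames.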

The second surprise was that even though the iDrive backup of my laptop was, to me, relatively small at only around 50GB, it took almost 2 weeks to restore. Granted, it contained a large number of files (around 3 million, mostly small), but I was shocked at how slow it was. I also opened several support cases with iDrive about this, but there was nothing they could do to help me. For comparison, on the same network, with a backup of roughly the same size in both file count and total storage, Underscore Backup would complete a similar restore in about 5 minutes (and it would do it properly, keeping track of deleted files).

At this point, I evaluated other solutions available but could not find any that would be suitable for my needs. Carbonite does not allow you to specify what files should be backed up but instead, in the interest of simplicity, tries to be smart about it. When I tried it on my development files it decided to back up almost none of them, even though I specifically told it to include the directory. Backblaze is a very solid solution but, same as iDrive, does not keep track of deleted files for a true point-in-time recovery. In the end, I decided that I would put in the effort needed to create an easy-to-use user interface for Underscore Backup so that it would be suitable for use on things other than servers. The end result of these efforts was the first stable release of Underscore Backup in the summer of 2022, at which point it graduated to being the only backup solution I used on all my computers.

The problem at this point though was that even though I had a backup solution that fulfilled all my needs it was still very tricky to set up for most users since to use it you generally had to supply your own cloud storage such as Amazon S3. It was also quite tricky to access data from other sources you had backed up since every source had to be set up individually on each client you wanted to restore the source on. The sharing functionality, even though present was also so complicated that I am relatively certain nobody managed to set this up except for myself. To solve all of these problems I decided to leave the service-less nature of the software I had followed up until that point and create a service to both remove the need to provide separate cloud storage and also help manage multiple sources and set up shares. This was a relatively large undertaking, but it eventually led to the launch of Underscore Backup 2.0 in the first half of 2023.

The current release as of this writing is the upcoming 2.2 release, which makes it very easy to set up backups of multiple computers of any size while staying true to the original guiding principles of security, durability, efficiency, and flexibility.

This post was cross posted from the Underscore Backup blog.

Wednesday, April 26, 2023

Announcing Underscore Backup 2.0 and service general availability

The first stable version of Underscore Backup with support for the companion service is now available. The service itself is also generally available.

Even with the new service, the main focus of the application is privacy, resiliency, and efficiency. The new service does significantly simplify setting up cloud backups and sharing though compared to use

The main new feature in version 2.0 is the introduction of a companion service that will help with many aspects of running Underscore Backup such as.

  • Keep all your sources organized in one place to easily restore from any of your backups to any other backup.
  • Help facilitate sharing of backup data with other users.
  • Optionally allow private key password recovery.
  • Easily access application UI even if running in a context where a desktop is unavailable, such as root on Linux.
  • Use as a backup destination. Storing backup data is the only feature that requires a paying subscription, giving you 512GB of backup storage for $5 per month.
  • Support multiple regions of data storage supporting North America (Oregon), EU (Frankfurt), and Southeast Asia (Singapore) regions to satisfy latency and data governance requirements.

On top of the companion service changes, the following features and improvements have also been implemented.

  • Added support for continuous backups by monitoring the filesystem for changes.
  • Introduced a password strength meter which requires a score of at least “ok” when setting up.
  • Switched from pbkdf2 to Argon2 for private key hashing function.

Get started by downloading the client now.

Tuesday, March 7, 2023

Building an online service on a shoestring budget

Photo by Josh Appel on Unsplash
Although I have been working professionally as a software engineer since I was 18 years old, I have always had hobby projects going on the side, and I generally take a somewhat perverse pleasure in figuring out how to build and launch these things on as small a budget as possible. This post is an attempt to go through some of the things I have found that have helped me be productive and successfully build and launch several hobby projects.

I am particularly going to assume that this is for hobby projects and that the skill and time of the participants are free. If you are paying any salary that will dwarf anything you might save by aggressively using free tiers of online services. I am also going to assume your team is small (Less than 5).

What not to skimp on

First, let us go over the things you should not skimp on. The most important thing here is to not use any equipment or software from your day job. The reason for this is that if you do, your employer can usually claim ownership of any IP produced with their equipment. Also check your employment contract to make sure your employer doesn't have a clause claiming ownership of anything you do. That said, if you live in California, even if your employment contract does claim this, it is not enforceable as long as you don't use company equipment, time, or IP and you are not directly competing with your employer (see California Labor Code section 2870 for details).

One more of the things I would advise you to do is to enroll in school if you are not already. Being enrolled in a Community College only costs a few hundred dollars a year and will provide you with free licenses to a huge amount of tools for software development. Telerik, IntelliJ, Autodesk, and many more give students a free non-commercial license to almost their entire catalog of tools and libraries. Granted, once you get to the launch stage you will need to buy real licenses for your tools, but it will still save you tons of money in the development phase. You might even learn something doing it.

Basic development tools

I believe that if code isn't checked into a source repository with change tracking it basically doesn't exist at all. So, the first thing to do when starting a project is to pick a source code repository. GitHub is the giant in the field and they are fantastic. Not only do they give you free private repositories they also give you 2000 minutes a month of build executions (GitHub Actions). If you are building open-source applications you even get unlimited build executions for free.

Next you probably want to choose a cloud provider. I would pick one of either AWS or Azure. If you can go Serverless then I would go with AWS since they have a perpetual free tier for everything you need to launch a Serverless service. If not, then Azure Bizspark is a great program if you qualify. AWS also has a program for $300 to spend getting your prototype ready. Another tip for getting started on AWS is to get a new account for any new project. This is because they have an additional massive free tier that only lasts for 1 year after opening the account. It is also generally best practice to only run 1 microservice per account. Once the freebies are over you can tie your accounts together using AWS Organizations and SSO to help you keep track of them all (Doing this will usually invalidate the free tiers so wait a year after account creation to do this).

You also likely need a web UI testing tool. I use Cypress which has a free tier and is overall very good. They only allow 500 test suites per month so you can't run canaries in the free tier, but it should be sufficient for any deployment-based testing. They also provide a dashboard where you can see which tests have succeeded and failed with videos of the test execution so you can easily troubleshoot failures, something that is very useful when you integrate it into your CI/CD pipeline.

How to build your software

The key thing you want to avoid if you are launching something on the cheap is fixed infrastructure. If possible, use serverless functions instead of hosts or containers to run your code. With some thought, almost everything you build can be run in a true pay-per-use manner. For instance, with AWS you should aim to use API Gateway, Lambda, SQS, and DynamoDB. As your service scales, you might consider moving off some of these for cost reasons, but these primitives are also able to scale to thousands of transactions per second without any change to infrastructure if done right, and none of them have a fixed cost. You generally don't want to use services such as Kinesis, ElastiCache, OpenSearch, relational databases, hosts, or containers since these all come with minimum fixed costs even if your service has no usage.
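As a rough illustration of how little code a pay-per-use endpoint needs, here is a minimal sketch of a Python Lambda function behind an API Gateway proxy integration. The function name `handler`, the `name` query parameter, and the greeting payload are made up for the example; only the shape of the event and response (`queryStringParameters`, `statusCode`, `headers`, `body`) follows the proxy integration format.

```python
import json


def handler(event, context):
    """Entry point that Lambda calls once per API Gateway proxy request."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local smoke test with a hand-built event; no AWS account is needed for this.
response = handler({"queryStringParameters": {"name": "Henrik"}}, None)
print(response["statusCode"], response["body"])
```

Nothing runs, and nothing is billed, until a request actually arrives, which is the property that keeps the fixed cost at zero.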

Useful services with good free tier

Here are a couple of other services worth noting with useful features and good free tiers.

  • Google Analytics is ubiquitous for site analytics. It is having privacy issues in the EU though, with several countries recently declaring it illegal. Another option that I use, with more of a privacy focus, is Clicky.
  • Also useful from Google is Firebase which provides a lot of features such as a basic user database, usage analytics, and monitoring among others. It is a great choice if your primary use case is a mobile app. It is pretty inflexible for building complex applications or services though and you probably want to go with a normal cloud provider for that.
  • Cloudflare offers a Web Application Firewall and has a very useful free tier. They also provide a privacy-focused and less annoying CAPTCHA service called Turnstile.
  • Blogger is a free blogging platform. It will generally not let you build your entire website like WordPress will, but if all you need is blogging it does that well and allows you to use custom domains for free.
  • Crisp is a great platform for providing support for your site and they have a nice free tier for getting started.
  • Auth0 provides a platform for helping you do auth of your users and has a decent free tier to get you started.
  • Most of the payment processors such as Square, Stripe, and Braintree only charge a percentage with no setup costs. Their fees are very similar; I prefer Stripe myself only because they have fantastic developer documentation.

Launching and running a service

When first starting out, I tend to not think too much about schedules and deliverables. The reason for this is that I do this for fun, and the best way to kill the fun is to start making yourself a slave to delivery commitments and launch dates. That said, as you get closer to launch I really do think you need a way to keep track of remaining tasks, open bugs, and so on. In my opinion, Jira from Atlassian is by far the best and most comprehensive tool for this, and as long as you have a small team everything you need is available for free.

You will need monitoring of your service before you go live. Both AWS and Azure have built-in monitoring tools and they work well. Also worth mentioning again in this space is Firebase, which does have some monitoring and analytics capabilities. Another service in this area that has a good free tier is New Relic. One thing that neither AWS nor Azure has is paging for when things actually go wrong. The tool that I found here that has a very functional free tier is PagerDuty. That said, you are likely to want to upgrade from the free tier pretty soon as your service takes off, to be able to have more control over your escalations.

Your service will likely need a single place to aggregate everything that is going on, such as task completions, deployments, and any issues. Here Slack is hard to beat and has a great free tier.

Be frugal, not cheap

As a parting word, I would like to point out that although figuring out how to build and launch your service cheaply is worthwhile, you should not let that stand in the way of building your service right. Never pick the cheap option over the correct option; you will always regret it in the end.

For me, one of the main reasons for building things in a frugal way when I am working on hobby projects is that it allows me to have fun doing them longer, because I don't have the pressure of needing to be done and launched fast because I am bleeding money during the development phase.

Being frugal during the development phase might also allow you to retain a larger portion of your equity if you actually launch your service because it will reduce the amount of help you will need to get started before you get a customer base. As an example, one of my previous projects Your Shared Secret literally has $0 per month of fixed cost. My more recent project Underscore Backup is not quite that cheap but has a fixed cost of less than $50 per month. Most of that cost is for CloudWatch alarms, KMS keys, and Dashboards.

Friday, February 24, 2023

Started another blog

Created another blog at https://www.mauritz.dev for shorter snippets of what I am working on right now. Really it's just something to put on this domain that I've had for a while now without doing anything with it.

Thursday, February 16, 2023

Launching Underscore Backup service and first beta of the version 2.0 of Underscore Backup application to use it

Finally launched the first public release of the Underscore Backup service. A backup service that is a companion to the Underscore Backup application I have now been working on for a few years. I am really excited about this since adding a service component to the application solves a couple of user pain points with my previous releases, such as.

  • It's now easy to coordinate and keep track of multiple sources so you can easily restore data from another computer you are backing up.
  • It comes with the ability to use the service for storage so that you don't have to deal with configuring S3 or something similar. The storage also supports 3 regions in the US, EU and the Asia Pacific region and is priced lower than S3.
  • It makes it easy to set up sharing between users.
  • The service can provide optional secret key recovery.
  • You can easily keep track of where the administration interface of the application is available even if the application is running in a context that does not have access to your desktop.

There are also a ton of other additional features included, such as improved password hashing algorithms, log rotation, an integrated password strength meter, and built-in new version notification.

You can head to the site to download the latest version and sign up for the service. The service is entirely free, however you do need a subscription for using storage as detailed on the pricing page.

Sunday, April 25, 2021

My quest for fiber provided by AT&T

When I moved to a new house about two years ago, I was disappointed to learn that there were no options for fiber-based internet in the area, so I would have to take the step down to cable-based internet. Fortunately, I was pleasantly surprised in September of 2020 to find that AT&T Fiber had added support for my area.

First try in September 2020

I ordered it as soon as I discovered it, even though I was a bit hesitant about having people in my house as COVID cases were on the rise (I have a person in my household who is in a risk group). The person on the phone with AT&T assured me that all AT&T personnel involved with the install would be wearing a mask though, so I proceeded regardless.

The day of the appointment I was excited and had cleared my schedule. The first person to show up was not the technician, but just a salesperson that wanted to make sure I didn't have any trouble creating my AT&T account (Which I had already set up days before as per the instructions in the AT&T communication). This person also assured me that they get in trouble with AT&T if they do not wear a mask which felt reassuring to me.

About an hour later, still within the assigned service window, the installation technician showed up. We tried to figure out where the AT&T connection at my house was and eventually found it. Unfortunately, there was no fiber pulled to my house and it needed to be pulled around 100 feet from a neighbor's access point. He tried snaking the existing conduit but failed. He needed to call in a specialist who had better snaking equipment and who, if that failed, might have to do some digging to fix the conduit.

The second technician made an appointment and showed up around a week later. He also had a helper with him. They spent a good hour trying to get through the conduit. They also failed, and I was told that they would now have to bring in a 3rd party company that would try again and might potentially have to dig a new conduit.

The third technician just showed up with no appointment. He also had no mask and did not have one to put on when I asked about it. I told him to come back when he had a mask and at an appointed time. After this interaction I called AT&T to complain and was told that the mask mandate only really applied to AT&T employees, and since this person was a 3rd party contractor there was nothing they could do. At that point I told the representative that I had only agreed to this after getting the express promise that everybody involved would wear a mask, and since that was not true, I now wanted to cancel the order. The AT&T rep told me that they did not have the authority to cancel my order, but instead had to transfer me to a loyalty specialist. I told them that they could either cancel the order or not, but that I would not open the door or let them on my property, and hung up.

About a week later another man showed up from AT&T, also not wearing a mask. Through the closed door I told him that no, I had not ordered any AT&T Fiber, and that he should go away. A week after that an additional person showed up from AT&T, this time with a mask. I explained to him what was going on and he apologized and said that he knew how to make this issue go away for me. And in fact, it turned out that he did, because that was the last I heard from AT&T for the time being.

In total this first try involved 7 visits from AT&T with a total of 8 people visiting my house.

Second try in March 2021

Skip forward to March 2021, as I am now vaccinated, I decided it was time to make another try. I also happened on an ad for AT&T Fiber with a good introductory offer, so I decided to try again with an order placed online on March 8th. I got an initial appointment for the morning of March 18th. About an hour after the appointed time with no visit I called AT&T customer support. I was told not to worry. The technician was just running late, and he was still on the way. After another 3 hours I called again and was then told that the technician had gone to the wrong house and since nobody at that house had ordered internet he had left. At no point during this had AT&T proactively reached out to me to let me know what was going on.

Slightly miffed, I rescheduled the appointment for a week later. That day came around and no technician showed up that day either. About 2 hours after the appointment window ended, I called support again. The first person told me not to worry, the technician was just running late. I told them about my experience last time and was transferred to a second operator. This operator said the same thing. At this point I told the operator that this was no problem, but that if the technician did in fact not show up, they did not have to try again. At this point the agent transferred me again to a "Loyalty Specialist".

This third person that I spoke to did in fact do some digging and figured out that when the technician had gone to the wrong house and left, he had in fact cancelled the entire installation. My rescheduling it with the support agent did not actually reopen that ticket, so there was no technician coming. He then proceeded to say that I shouldn't worry, he knew how to restart the process again properly. At that point I said "Thank you, but no thank you. You gave it the old college try but couldn't even get a technician to my house in 2 tries so I am done".

Third try using Sonic Internet

I had discovered that Sonic Internet also resold AT&T Fiber at my location and figured that at least in that case I would deal with a support department that was prompt and knowledgeable even though I would still have to deal with AT&T for the actual installation. The same day that I cancelled the second try with AT&T I ordered Fiber from Sonic instead.

The first appointment was scheduled for March 31st. Same as during the original visit in September the year before, the conduit was broken and needed to be fixed. This technician managed to get the specialist team to do the second visit the same day though. They showed up as a 2-person crew and told me in refreshing detail what would need to happen next. First an underground survey needed to be performed, after which a digging crew would be dispatched to fix the conduit. I should expect the survey to happen within a few days and then the digging crew would show up in a week or two.

On April 6th I had a second appointment scheduled at which time an AT&T technician showed up to install my internet with the assumption that the fiber had by this time already been installed. Of course, the underground survey had not even happened yet, so he had to leave without anything done.

On April 8th, 2 big trucks with a team of 5 people showed up. They started by taking an hour-long lunch and after that got down to the work of digging my conduit. When I pointed out that the underground survey had not yet been done, they got a bit flummoxed and told me that unfortunately they could not do any digging until this had been completed. But the foreman told me that he had put in a rush order to make sure the survey would get done as soon as possible.

On April 12th I got an email from Sonic telling me that they had been instructed by AT&T to check that my internet was working correctly. At this point of course, there had still not been any actual work done by AT&T, so I sent an email to Sonic support letting them know this.

On April 15th I got a notification from Sonic telling me that the installation of my internet had been scheduled for April 19th. Since this sounded strange, I contacted Sonic support to tell them that I was not currently waiting for an AT&T technician, but for an underground survey. What I was told is that the last people who were here had marked the installation as complete (which is why I got a notification earlier in the week asking whether my internet was working correctly) and because of that they now had to start over from the beginning. This means that a person must first come out and assess that a dig needs to happen (so starting all the way from scratch again). I was told by the Sonic rep that they had gotten into a discussion with AT&T that got so heated that the AT&T rep hung up.

On April 16th I got a visit from a cheerful AT&T customer service rep asking me how I was enjoying my new AT&T Fiber internet. She got an earful of what I thought of AT&T at that moment.

April 19th comes around and I get a visit from another AT&T customer service rep to help me set up my AT&T account. I explain the situation to him, and he promises to get on the phone with his manager to see if there is anything he can do to help. While he does that the AT&T installation technician shows up. The technician asks if I am speaking for Yvonne. I tell him that I have no idea what he is talking about, and he tells me that his work order says that he is there to install internet for a Yvonne from Florida through the third-party provider Earthlink (not Sonic). There is literally nothing in the work order that is correct except for my address. I do manage to get the technician on the phone with Sonic support and both escalate the issue to their managers. In the end there is nothing AT&T can do to have the technician do the work, even though he is here. He has to come back at a later date when the order has been corrected. At this point the AT&T service rep steps back in and says that he will take me under his wing and sort this out for me. I pointedly ask him if that means that I would become an AT&T customer instead of a Sonic customer. He says yes and I politely refuse.

After AT&T leave, I spend some more time with Sonic support. They promise to get back to me when this is sorted out. While this is happening, on April 22nd, another AT&T technician shows up to do a fiber install for Yvonne of Florida. Later that same day I hear back from Sonic support and they tell me that they have sorted out the issue with AT&T and that I now have an appointment for April 27th (Next Tuesday) to get this process started.

To sum up

So far AT&T have made 15 visits to my house with a total of 21 people. That does not include the visit they made to the wrong house in the second attempt with AT&T or the 2 appointments they scheduled for that try. There has been no progress made whatsoever toward actually installing fiber, and the person that is coming on the 27th is actually the first person AT&T is dispatching for this install from their perspective.

To be continued...

Sunday, March 28, 2021

Building for high availability: Measuring success

Although high availability is easy to grasp conceptually, it can be quite hard to define in practice. To be able to strive for higher and higher availability you will need to figure out how to measure it. To measure it you will need to define exactly how to calculate it.

A typical API request from client to service and back looks something like this: the request starts at a client and traverses the internet before it hits the boundary of your service, and then the response flows back the same way.

It is important to realize that any part of this chain can fail, and if it does, it will lead to a drop in availability as perceived by your clients. A large part of this you have no control over, and it is also fiendishly hard to even measure. If you only measure availability for requests in your service, you are missing a lot of potential failure modes. If one of your hosts goes bad, it might not be able to report metrics for failing requests, or incoming network traffic might stop altogether.

It is often sufficient to measure availability from the first system that you have access to consistent logs from. This usually means either the gateway or, if you are not using one, a load balancer. If you are using Amazon API Gateway it can give you excellent request logs that are very useful for measuring availability and latency among other things. It will also emit Amazon CloudWatch Metrics that can measure availability directly both for the entire API and for individual methods.

How do you define availability?

The first thing you need to do is to separate errors from faults in your metrics. An error is a request that could not be processed because of some problem with the contents of the request. A fault is a request failure that is caused by a problem either in the communication chain or in the implementation of the service. It is important to separate these because as a service owner you have little to no control over errors, since they are due to a mistake in the client that calls you. Faults, however, do reflect your availability and do not depend on mistakes made in the calling client. Worth noting though is that even though errors generally do not count against availability, they can if they are errors that should not happen and are caused by a bug in your code. It is therefore worth having some visibility into an unusually high error rate.

If you are using HTTP to implement your API, errors should be any response status code between 400 and 499, and faults any status code of 500 or above. Make sure that you implement your service to follow this pattern (Basically, do not invent your own usage pattern for the HTTP status codes). If you are using Amazon API Gateway, you get a metric for 4xx responses and a separate metric for 5xx responses. If you need better visibility into exactly what kind of error you are receiving, you can also set up an Amazon CloudWatch Logs Metric Filter on the request log from Amazon API Gateway.
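
As a minimal sketch of that split, here is how the classification could look if you were processing request logs yourself. The function name `classify_status` is made up for this example and is not part of any AWS API.

```python
def classify_status(status_code: int) -> str:
    """Classify an HTTP status code as 'ok', 'error' (4xx) or 'fault' (5xx)."""
    if 400 <= status_code <= 499:
        return "error"  # problem with the request itself; does not count against availability
    if status_code >= 500:
        return "fault"  # problem in the service or the communication chain
    return "ok"


print([classify_status(code) for code in (200, 400, 404, 500, 503)])
# prints ['ok', 'error', 'error', 'fault', 'fault']
```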

How to calculate availability

Usually, availability is calculated as a percentage. This percentage represents the amount of traffic that is not faulty compared to the total request count. Exactly how this percentage is calculated is not as simple as it might sound, though; more on that in a bit.

When it comes to picking a goal for availability it is up to you as an engineer to come up with a goal that you are comfortable with. A common pattern is that once you have implemented a proper availability goal and have good visibility into it on an ongoing basis, you can always strive for higher by raising your goal incrementally. As an example, most Amazon Web Services have an availability Service Level Agreement (SLA) of 99.95% or higher. Most services can probably make do with a lower goal if you implement appropriate retries in your clients.

Simple Availability

The most obvious and simple way of defining this is to just use the ratio of non-fault requests divided by the total number of requests. With this definition, a goal of 99.95% availability means that you should have at most 1 faulty request for every 2000 requests. The advantage of this approach is that the value generally comes right from your metrics and is super easy to monitor and calculate. Using Amazon API Gateway this availability can be calculated directly from metrics emitted to Amazon CloudWatch Metrics. This is also a metric that is suitable for putting on a graph over time to visualize availability.
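
The calculation itself is a one-liner. The sketch below uses a made-up helper name, and treating a period with no traffic as 100% available is an assumption of this example rather than a rule.

```python
def simple_availability(total_requests: int, fault_count: int) -> float:
    """Percentage of requests that did not end in a fault."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic is reported as fully available
    return 100.0 * (total_requests - fault_count) / total_requests


# 1 fault in 2000 requests is exactly the 99.95% goal.
print(simple_availability(2000, 1))
```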

Calculating Availability for a Service Level Agreement

This way of measuring availability has its issues though. With this definition, if you otherwise have perfect availability, you can have an almost 4.5-hour long outage without breaking a 99.95% availability goal for a year. If instead you have a continuous background level of faults, that generally does not affect your consumers significantly, but it does significantly reduce how long an outage you can have before you have broken your goal. This difference becomes increasingly important once you start having an actual Service Level Agreement (SLA) for your service.
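
To put numbers on this, here is a rough sketch of how much total outage a yearly goal leaves room for. It assumes traffic is spread evenly over the year and that the service otherwise runs at a constant background availability; both are simplifications, and the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def outage_budget_minutes(goal_pct: float, background_pct: float = 100.0) -> float:
    """Minutes of complete outage per year that still meet goal_pct.

    With t outage minutes out of T and uniform traffic, the yearly simple
    availability is background * (1 - t / T), so t <= T * (1 - goal / background).
    """
    return max(0.0, MINUTES_PER_YEAR * (1 - goal_pct / background_pct))


for background in (100.0, 99.99, 99.97):
    minutes = outage_budget_minutes(99.95, background)
    print(f"background {background}%: {minutes:.0f} minutes ({minutes / 60:.2f} hours)")
```

With perfect background availability the budget is about 263 minutes, which is the almost 4.5 hours mentioned above. At 99.97% background availability it shrinks to roughly 105 minutes.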

One way of addressing the shortcoming of the previous definition is to define your availability as the proportion of minutes during which you are above a certain minimum availability. An example of this definition would be that you measure your availability as the share of minutes in which you had an average availability over 99.99%. You can now have an availability SLA of 99.95%, and in this case, if your availability normally stays over 99.99%, you get to use the full 4.5-hour outage before you start breaking that SLA over a year. The bad news is that there is no easy way of calculating this metric without looking at each individual availability data point for every minute during the period. The same method can also be used with any period other than a minute.
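
A minimal sketch of this calculation, assuming you already have one availability data point per minute. The function name and the choice to count a minute that sits exactly at the threshold as good are assumptions of this example.

```python
from typing import Sequence


def minute_based_availability(per_minute_pct: Sequence[float],
                              threshold_pct: float = 99.99) -> float:
    """Percentage of minutes whose availability met threshold_pct."""
    if not per_minute_pct:
        raise ValueError("need at least one data point")
    good_minutes = sum(1 for pct in per_minute_pct if pct >= threshold_pct)
    return 100.0 * good_minutes / len(per_minute_pct)


# 10,000 minutes: 4 minutes of total outage, the rest slightly imperfect.
data_points = [99.995] * 9996 + [0.0] * 4
print(minute_based_availability(data_points))  # 9996 good minutes out of 10000
```

Note that the small background fault rate in the good minutes does not eat into the outage budget here, which is the point of this definition.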

Optimizing for client experience

If you are looking for the best experience for your clients though, the previous methods still have their shortcomings. To illustrate this, let us take an example where you introduce a bug that makes 100% of calls fail for 1% of your clients. In this example, the way your API is used is that clients normally make an initial list request followed by 25 detail requests. But for the 1% of clients that get failing calls, the initial list call fails. So the clients for whom the service works make on average 26 calls, whereas the failing clients only make a single call. In this case there are 99 * 26 successful requests for every 99 * 26 + 1 total requests, which translates to a simple availability of 99.96%. However, this hides the fact that 1% of your clients cannot use your service at all.

The way to measure availability to catch cases like this is to define your availability goal per time period and per client. As an example, you can define the availability as the number of minutes where 99.5% of your customers have more than 99.99% availability. In the example above only 99% of clients have any availability, which means that every minute is an unavailable minute by this metric until the bug is fixed. The bad news is that there really is no way of calculating this kind of availability without processing all your requests per minute to determine if you are in breach. So, it is by far the most complicated and expensive way of calculating availability. This method of calculation could potentially save you money on your SLA refunds though, since if you apply it to SLA calculations you can keep track of which clients your service has breached the SLA for on a per client basis, instead of the previous method which would apply equally to all clients once in breach.
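
Here is a sketch of the per-client calculation for a single minute, run against the scenario described above. The input format (a list of `(client_id, is_fault)` pairs) and the function name are assumptions made for this example; in practice you would derive this from your request logs.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def healthy_client_share(requests: Iterable[Tuple[str, bool]],
                         client_goal_pct: float = 99.99) -> float:
    """Percentage of clients whose own availability meets client_goal_pct."""
    totals = defaultdict(int)
    faults = defaultdict(int)
    for client_id, is_fault in requests:
        totals[client_id] += 1
        faults[client_id] += int(is_fault)
    if not totals:
        return 100.0  # assumption: a minute without traffic counts as healthy
    healthy = sum(
        1 for client in totals
        if 100.0 * (totals[client] - faults[client]) / totals[client] >= client_goal_pct
    )
    return 100.0 * healthy / len(totals)


# 99 clients make 26 successful calls each, 1 client makes a single failing call.
minute = [(f"client-{i}", False) for i in range(99) for _ in range(26)]
minute.append(("client-99", True))

simple = 100.0 * sum(1 for _, is_fault in minute if not is_fault) / len(minute)
share = healthy_client_share(minute)
print(round(simple, 2), share, share >= 99.5)
# prints 99.96 99.0 False
```

The simple availability looks healthy at 99.96%, but only 99% of clients meet their goal, so this minute counts as unavailable against a 99.5% client threshold.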

How to detect network outages

There is a problem with measuring availability by simply instrumenting the boundary of the service: what if you encounter an issue outside of that boundary? If your internet service provider suffers an outage, it would stop all incoming traffic to your service. Your availability would still be 100% because there are no failing requests that you are aware of, since they fail before they even reach a point in the communication chain that you can measure.

The solution for this problem is to create a canary that makes at least a minimum number of requests to your service in a way that imitates real client scenarios as closely as possible. This can be as simple as creating an Amazon CloudWatch Events rule that triggers an AWS Lambda function that generates traffic to your service. On top of this you need to add monitoring that alerts you when there is no traffic coming into your site. Ideally, as your service grows, you can tune this alarm to alert you when the traffic goes below anything that is abnormally low instead of close to 0. That way you can also detect partial outages that are normally out of your control to measure. Furthermore, make sure that your canary emits metrics on the success of the calls it is making. Your canary traffic metrics will represent a true measurement of availability and latency covering the entire communication chain. It only represents a small portion of all traffic, but it does properly measure all potential failures that a real client could encounter.
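
Below is a minimal sketch of what the core of such a canary could look like, using only the Python standard library. `TARGET_URL`, `run_canary` and the result format are made up for this example, and counting only 2xx responses as success is an assumption. Publishing the result as a metric (for example through the CloudWatch `put_metric_data` API) is left out to keep the sketch self-contained.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoint; replace with a call a real client would make.
TARGET_URL = "https://api.example.com/items"


def run_canary(url: str, timeout_seconds: float = 5.0,
               opener=urllib.request.urlopen) -> dict:
    """Make one request and report whether it succeeded and how long it took."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the service answered, but with a 4xx or 5xx
    except OSError:
        status = None  # DNS failure, refused connection, timeout and similar
    latency_ms = (time.monotonic() - started) * 1000.0
    return {
        "success": status is not None and 200 <= status < 300,
        "status": status,
        "latency_ms": latency_ms,
    }


def handler(event, context):
    """Entry point in the shape of an AWS Lambda handler for a scheduled trigger."""
    return run_canary(TARGET_URL)
```

The `opener` parameter only exists so that the request function can be swapped out when testing the canary itself.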

Latency as an aspect of availability

Even though technically latency does not affect your availability, it is extremely important for a good client experience. Latency can be hard to visualize. You might be tempted to believe that just taking the average of your request times will give you a good idea of what the latency of your service looks like. However, latency tends to have a very long tail, and using the average generally is not best practice for ensuring that your clients have a good experience. As an example, below is the latency charted for a week of a sample service where it is aggregated as average, median (p50), p90 and p99. If you are unfamiliar with the pXX notation, it denotes the percentile. The p99 graph shows the latency that 99% of requests stayed at or below, so only the slowest 1% of requests took longer.

As you can see in the example above, there is a big difference depending on how you aggregate latency. The graph for maximum is cut off and goes all the way to 29 seconds in the worst-case scenario. In any environment with software defined networking and a decently high load you will be seeing strange outliers, so the maximum measurement is usually not very useful. Similarly, as you can see, the average measurement can also hide issues that affect a not insignificant amount of your traffic. Using the p99 measurement to visualize your latency performance is usually a good middle ground. It includes enough of your worst behaving requests to see if you have significant issues with outliers taking a long time, but also ignores some of the more egregious network blips that can give extremely rare but very high measurements that would otherwise skew your graph.
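
To make the difference concrete, here is a small sketch comparing the aggregations on a made-up sample with a long tail. The numbers are invented for illustration and are not taken from the chart, and the nearest-rank percentile used here is just one of several common definitions.

```python
import math
import statistics


def percentile(values, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 1000 requests: most are fast, 2% take 3 seconds or more.
latencies_ms = [50] * 940 + [400] * 40 + [3000] * 19 + [29000]

print("average:", statistics.mean(latencies_ms))  # 149.0
print("p50:", percentile(latencies_ms, 50))       # 50
print("p90:", percentile(latencies_ms, 90))       # 50
print("p99:", percentile(latencies_ms, 99))       # 3000
print("max:", max(latencies_ms))                  # 29000
```

The average of 149 ms looks perfectly healthy, the maximum is dominated by a single outlier, and only p99 reveals that a meaningful share of requests takes 3 seconds.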

When measuring anything using p99 aggregation, another thing that is very important is the period over which you aggregate. You want to make sure that you have at least 100 measurements during the period you are measuring. If you do not, then p99 will be the same as the maximum, which leads to undesirable results. If you have at least 100 requests during the time period, you get to remove at least 1 request that is an anomaly before it affects your p99 measurement. If you have a minimum call rate of 1 call per second, you will need to use a measurement period of at least 1 minute and 40 seconds or you will fall into this trap. Usually, you would use 5 minutes if you do not have enough traffic to measure p99 over 1 minute.
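
The arithmetic for picking a period can be sketched like this. The list of candidate periods is an assumption for this example; use whatever periods your monitoring system offers.

```python
import math
from typing import Optional, Sequence


def min_p99_period_seconds(min_requests_per_second: float,
                           samples_needed: int = 100) -> int:
    """Shortest period that contains at least samples_needed requests."""
    return math.ceil(samples_needed / min_requests_per_second)


def pick_period(min_requests_per_second: float,
                candidates: Sequence[int] = (60, 300, 900, 3600)) -> Optional[int]:
    """Smallest candidate period (in seconds) that is long enough, or None if none is."""
    needed = min_p99_period_seconds(min_requests_per_second)
    for period in candidates:
        if period >= needed:
            return period
    return None


print(min_p99_period_seconds(1))  # 100 seconds, i.e. 1 minute and 40 seconds
print(pick_period(1))             # 300, the 5 minute period
print(pick_period(2))             # 60, one minute is enough at 2 calls per second
```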

Finally, it is worth pointing out that each point in your service architecture will add latency. Same as with availability, it is important to measure the latency as close to the client as possible. Apart from using canaries you can rarely measure it from the client, but usually the gateway is a good place to collect latency measurements that are a good representation of your general client experience.

Create an availability dashboard

Your goal should always be to strive for higher and higher availability. To reach for this goal though you need to have visibility into what your current availability actually is. At minimum this requires you to monitor the following on a continuous basis.

  • Availability - The share of requests coming into your service that are not faults, that is 1 minus the number of faults divided by the total requests.
  • Error rate - The rate of invalid requests that you are receiving. Even though this can be a false alarm, it can be an indication of a faulty deployment causing existing traffic to now fail if you see an unexpected change in the rate.
  • Transactions per second (TPS) - The number of requests coming into your service. The key thing you want to look at here is if there is a precipitous drop because that likely means a network failure that has occurred before you can measure it. A large, unexpected increase in traffic could also be an indication of a denial of service attack.
  • Latency - You should have goals on your latency and strive to decrease it. The way to have and keep these goals is to put them on a dashboard to make sure that you are aware of any changes in trends. If your service has different classes of operations that have significantly different latency profiles, you might consider separating each one out as a separate graph.

Below is an example dashboard that you can implement if you are using Amazon API Gateway as the gateway for your API.

Here is the definition of this dashboard in Amazon CloudWatch Dashboards. All you need to do is change the metric dimension of ApiName from YourAwesomeApi to whatever your API is called and reuse it. You might also need to tweak your minimum TPS limit and error rate amounts to something suitable for your traffic patterns.

  {
    "widgets": [
      {
        "height": 6,
        "width": 12,
        "y": 0,
        "x": 0,
        "type": "metric",
        "properties": {
          "metrics": [
              [ { "expression": "100*(1-m1)", 
                  "label": "Availability",
                  "id": "e1", "region": "us-east-1" } ],
              [ "AWS/ApiGateway", "5XXError", 
                "ApiName", "YourAwesomeApi", 
                { "id": "m1", "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Average",
          "period": 60,
          "title": "API Availability",
          "yAxis": { "left": {
            "min": 99.7, "max": 100, "showUnits": false, "label": "%"
          } },
          "annotations": { "horizontal": [
            { "label": "Goal > 99.95%", "value": 99.95 }
          ] }
        }
      },
      {
        "height": 6,
        "width": 12,
        "y": 0,
        "x": 12,
        "type": "metric",
        "properties": {
          "metrics": [
              [ { "expression": "m1 * 100", 
                  "label": "Error Rate", 
                  "id": "e1", "region": "us-east-1" } ],
              [ "AWS/ApiGateway", "4XXError", 
                "ApiName", "YourAwesomeApi", 
                { "id": "m1", "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Average",
          "period": 60,
          "title": "Error Rate",
          "yAxis": { "left": {
             "min": 0, "max": 10, "label": "%"
          } },
          "annotations": { "horizontal": [
            { "label": "Error Rate < 5%", "value": 5 }
          ] }
        }
      },
      {
        "type": "metric",
        "x": 0,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
              [ { "expression": "m1 / PERIOD(m1)", 
                  "label": "TPS", "id": "e1" } ],
              [ "AWS/ApiGateway", "Count", 
                "ApiName", "YourAwesomeApi",
                { "id": "m1", "period": 60, "visible": false } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "Sum",
          "period": 300,
          "title": "Request Rate",
          "yAxis": { "left": {
            "min": 0, "showUnits": false
          } },
          "annotations": { "horizontal": [
            { "label": "TPS > 20", "value": 20 }
          ] }
        }
      },
      {
        "type": "metric",
        "x": 12,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
              [ "AWS/ApiGateway", "Latency", 
                "ApiName", "YourAwesomeApi", 
                { "label": "p99 Latency" } ]
          ],
          "view": "timeSeries",
          "stacked": false,
          "region": "us-east-1",
          "stat": "p99",
          "period": 60,
          "start": "-P7D",
          "end": "P0D",
          "title": "Latency",
          "yAxis": { "left": {
            "min": 0, "label": "Milliseconds", "showUnits": false
          } },
          "annotations": { "horizontal": [
            { "label": "Latency < 1s", "value": 1000 }
          ] }
        }
      }
    ]
  }
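
If you would rather script the rename than edit the JSON by hand, a small sketch like the following could do it. The function name `rename_api` is made up for this example, and it only touches values that directly follow an "ApiName" dimension in a widget's metrics list. Uploading the result to CloudWatch is not shown.

```python
import json


def rename_api(dashboard_body: str, api_name: str,
               placeholder: str = "YourAwesomeApi") -> str:
    """Return the dashboard JSON with the ApiName dimension value replaced."""
    dashboard = json.loads(dashboard_body)
    for widget in dashboard.get("widgets", []):
        for metric in widget.get("properties", {}).get("metrics", []):
            for index, item in enumerate(metric):
                # Dimensions are listed as name/value pairs inside the metric array.
                if index > 0 and metric[index - 1] == "ApiName" and item == placeholder:
                    metric[index] = api_name
    return json.dumps(dashboard, indent=2)


example_body = json.dumps({
    "widgets": [{
        "type": "metric",
        "properties": {
            "metrics": [
                [{"expression": "100*(1-m1)", "label": "Availability", "id": "e1"}],
                ["AWS/ApiGateway", "5XXError", "ApiName", "YourAwesomeApi",
                 {"id": "m1", "visible": False}],
            ]
        },
    }]
})
print(rename_api(example_body, "orders-api"))
```

Here "orders-api" is only a stand-in for whatever your API is called.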

Summary

Do:

  • Count faults against your availability
  • Have a canary to always have some traffic
  • Measure availability and latency as close to the client as possible
  • Have a dashboard that shows at minimum faults, errors, requests over time, and p99 latency

Don't:

  • Count errors against your availability
  • Aggregate latency on average, max or median
  • Measure availability or latency from your service implementation