Henrik Johnson's Blog: Designing for failure

One of the first things you hear when you learn how to design for the cloud is that you should always design for failure. This generally means that any given piece of your cloud infrastructure can stop working at any time, so you need to account for this in your architecture and gain reliability through redundancy, so that any given part of your application's infrastructure can fail without affecting the actual functionality of the website.

Here is where it gets tricky, though. Before I actually started running things in a cloud environment, I assumed this meant that every once in a while a certain part of your infrastructure (for instance a VM) would go away and be replaced by another machine within a short time. That is not what designing for failure means. To be sure, this happens too, but if that were the only problem you encountered, you could even design your application so that failures are dealt with manually once they happen. In my experience, even in a relatively small cloud environment you should expect random intermittent failures at least once every few hours, and you really have to design every single piece of your code to handle failures automatically and work around them.

Every non-local service you use, even the ones that are designed for ultra-high reliability like Amazon S3 and Azure Blob Storage, can be assumed to fail a couple of times a day if you make a lot of calls to them. The same goes for database access and any other API.

So what are you supposed to do about it? The key thing is that whenever you do anything with a remote service, you need to verify that the call succeeded and, if it didn't, keep retrying. Most failures that I have encountered are transient and tend to pass within a minute or so at most. Design your application to be loosely coupled, and whenever a piece of the infrastructure experiences a hiccup, just keep retrying for a while; usually the issue will go away.
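
To make the idea concrete, here is a minimal sketch of such a retry loop. It is only an illustration: the names `TransientRetry`, `isTransient`, `budget` and `pause` are made up for this example and do not come from any particular library. It keeps retrying a call while the failure looks transient and the time budget has not run out, and otherwise lets the exception through.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration only.
public static class TransientRetry
{
    // Runs `call` and returns its result. If it throws an exception that
    // `isTransient` accepts, pauses and tries again, as long as another
    // pause still fits inside `budget`. Any other exception, or a transient
    // one after the budget is used up, propagates to the caller.
    public static T Execute<T>(
        Func<T> call,
        Func<Exception, bool> isTransient,
        TimeSpan budget,
        TimeSpan pause)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return call();
            }
            catch (Exception ex) when (isTransient(ex) && elapsed.Elapsed + pause < budget)
            {
                // Assumed to be a passing hiccup: wait a little and retry.
                Thread.Sleep(pause);
            }
        }
    }
}
```

A caller decides what counts as transient, for example `e => e is TimeoutException`, and picks a budget of about a minute to match the kind of failures described above.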

Microsoft has some code that will help you do this as well, called the Transient Fault Handling Application Block. If you are using Entity Framework, everything is done for you and you just have to specify a retry execution strategy by creating a class like this.

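    using System.Data.Entity;
    using System.Data.Entity.SqlServer;

    // Registers a retrying execution strategy for the SQL Server provider.
    public class YourDbConfiguration : DbConfiguration
    {
      public YourDbConfiguration()
      {
        SetExecutionStrategy("System.Data.SqlClient",
                             () => new SqlAzureExecutionStrategy());
      }
    }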

Then all you have to do is add an attribute to your Entity Framework context class that tells it to use the configuration, like so.

    // The attribute points the context at the configuration class above.
    [DbConfigurationType(typeof(YourDbConfiguration))]
    public class YourContext : DbContext
    {
    }

The block also comes with more generic code for retrying execution. However, I am not really happy with the interface of the retry policy functionality. Specifically, I could not figure out a way to create a generic log function that lets me log failures so that I can see what is actually requiring retries. I also don't want a gigantic log file just because, for a while, every SQL call takes 20 retries with each one being logged. I would rather get one log message per call that indicates how many retries were required before it succeeded (or not).

So to that effect I created this little library. It is compatible with the transient fault handling block mentioned earlier, in that you can reuse the retry strategies and transient exception detection from that block. It does improve on logging, though, as mentioned before. Here is some sample usage.

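      RetryExecutor executor = new RetryExecutor();
      // Synchronous action without a return value.
      executor.ExecuteAction(() =>
        { ... Do something ... });
      // Synchronous action that returns a value.
      var result = executor.ExecuteAction(() =>
        { ... Do something ...; return val; });
      // Asynchronous action without a return value.
      await executor.ExecuteAsync(async () =>
        { ... Do something async ... });
      // Asynchronous action that returns a value.
      var asyncResult = await executor.ExecuteAsync(async () =>
        { ... Do something async ...; return val; });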

By default, only ApplicationExceptions are passed through without retries. The default retry strategy tries 10 times, waiting between tries for as many seconds as there have been previous tries (which means it will signal a failure after around 55 seconds). The logging just writes to standard output.
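
For readers who want a feel for what those defaults amount to, here is a rough sketch. It is not the library's actual code: `SummaryRetry` and its parameters are hypothetical names, and the exact wait schedule is my reading of the description above. It lets ApplicationExceptions through untouched, retries anything else with a wait that grows by one second per failed try, and writes a single summary line per call instead of one line per retry.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of the described defaults, for illustration only.
public static class SummaryRetry
{
    public static T Execute<T>(
        string name,
        Func<T> call,
        int maxTries = 10,
        Action<TimeSpan> wait = null,
        Action<string> log = null)
    {
        // Defaults: really sleep, and log to standard output.
        if (wait == null) wait = delay => Thread.Sleep(delay);
        if (log == null) log = message => Console.WriteLine(message);

        int failures = 0;
        while (true)
        {
            try
            {
                T result = call();
                // One message per call, and only if retries were needed.
                if (failures > 0)
                    log($"{name}: succeeded after {failures} retries");
                return result;
            }
            catch (ApplicationException)
            {
                // Passed through without retries.
                throw;
            }
            catch (Exception ex)
            {
                failures++;
                if (failures >= maxTries)
                {
                    log($"{name}: gave up after {failures} tries ({ex.GetType().Name})");
                    throw;
                }
                // Wait one second per try made so far (assumed schedule).
                wait(TimeSpan.FromSeconds(failures));
            }
        }
    }
}
```

Passing `wait` and `log` in as delegates is just a convenience of this sketch: it lets a test record the waits and messages instead of actually sleeping or printing.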

Henrik Johnson aka Mauritz Persson

Code Poet and Architect

Thursday, June 25, 2015
