https://www.henrik.org/


Sunday, January 25, 2015

Choosing your cloud provider

When you start any coding project you generally need some sort of server capability, even if the application you're building is not a web site. When choosing your cloud provider there are several things to think about.

First of all, consider whether your needs are basic: if you do not require a high SLA or the ability to grow with usage, you are probably better off choosing a virtual private server provider. If you are fine with a Linux box these can be had extremely cheaply. I used to be a customer of Letbox, and at the time they provided me with a virtual private server for $5/month, a price that is hard to beat. It is important to realize, however, that this is not a true VM; it is a slice of a shared Linux kernel, similar to doing a chroot but with quotas on memory and CPU usage. This means these VMs can only run Linux. That said, the price is simply in a league of its own, usually cheaper than even spot instances of AWS EC2.

However, once you have something slightly more complicated to run you probably want to go with a "real" cloud provider. These come in two kinds. The first level is companies providing infrastructure as a service (IaaS): basically virtual machines, plus storage and networking for them. It is up to you to build everything you need on top of these primitives. Companies that offer only this kind of computing include Skytap, Rackspace (although Rackspace does have some platform services) and many more.

The next level up is the companies that provide platform as a service (PaaS). All of these companies provide the infrastructure as well if you need it, but on top of it they offer packages that they run for you as managed services, making it easier to create, deploy and run your own services. These usually include at a minimum:

  • Load balancing of traffic to multiple servers.
  • Auto scaling of new servers to handle varying load.
  • A fast and scalable NoSQL key value store.
  • A managed transactional database.
  • Web hosting.

There are, as I see it, three main players in this space: Amazon Web Services (AWS), Microsoft Azure and Google App Engine.

Of these, Amazon is by far the largest. AWS started out as mainly an infrastructure as a service offering, but now has one of the most complete sets of managed services. They also have by far the largest set of data centers, located all around the world, including one region qualified for US government workloads (having an account on it requires you to be a US citizen, so I cannot use it). Their infrastructure is truly top notch, but their development tools are not great; only a few languages have an official SDK (I myself have been missing an SDK for Perl).

Microsoft approached this space from the opposite direction and started out by offering specific platform solutions, tightly integrating the development and deployment of Azure applications into Visual Studio. It is the only cloud provider I am aware of that for a time did not provide IaaS at all (although they do now). The SDK and tooling for all of their products is truly excellent, especially if you are a .NET C# developer, but many other languages are supported as well. They do, unfortunately and understandably, run most of their infrastructure on Windows, which is simply not as solid as other hypervisors out there. If you are building a solution that requires reliably quick processing this can be a problem, especially if you have a cluster of synchronized machines. These synchronization issues usually only occur a few times a month, though, when the service is migrated to new machines as all the VMs running it undergo the monthly Windows patching. As long as your application does not rely on tight synchronization between several systems you are unlikely to notice it.

Finally there is Google. Their solution is, similar to Amazon's, something that has grown out of their own business, and several of their offerings are obviously just external surfacings of their internal systems, BigQuery for instance. Google's infrastructure is fantastic in regard to reliability and performance. They do, though, in my opinion offer the narrowest platform of the big three. What they do provide is truly top notch, and unfortunately it is also priced accordingly.

Price-wise the big three are relatively similar. If your application can take advantage of AWS spot pricing you can get away with really cheap solutions, though. Google is usually the most expensive (I say usually since cloud prices change all the time). One thing worth investigating is whether you qualify for a Microsoft BizSpark membership: if you do, you will receive $150/month of free credits to use for almost anything in Microsoft Azure (and it also includes licenses to almost every product in Microsoft's very extensive portfolio).

Saturday, January 24, 2015

How to get more free build or test minutes with your Visual Studio Online account

If you are one of the lucky ones who has an MSDN or a BizSpark subscription (one of the best deals around on the internet) and use the hosted build environment of Visual Studio Online, it is annoying that you only get 60 minutes of free build time a month if you want to do continuous integration with it (which you should!). However, I just discovered, by accident, a trick to get around this limit.
  1. First of all, log into your Azure management console and go to the tab for Visual Studio Online subscriptions.
  2. Select the subscription you want new build or test minutes for and click the unlink button at the bottom. You will get a warning about losing any licenses you have purchased through your Azure subscription for Visual Studio Online; if you have purchased any, you cannot use this trick.
  3. Click new at the bottom left of the management screen to link your Visual Studio Online account back to your Azure subscription.
  4. Select the Visual Studio Online account you unlinked earlier and make sure the correct subscription is selected in the drop down (it defaults to the pay-as-you-go subscription, so you will need to change this).
  5. Press the link button in the lower right.

That's it. If you go back to your home page on Visual Studio Online you should see that you have a new allotment of build and test minutes.

DISCLAIMER: You might be violating your terms of service with Microsoft by doing this, and I also expect Microsoft to fix it at some point, so use this trick at your own risk.

Thursday, January 22, 2015

Problems with singletons

One of the most basic software design patterns is the singleton pattern, and you'd think it wouldn't be one to cause you problems, but in C# it can be surprisingly difficult to get right. I just spent a couple of hours tracking down a bug caused by a singleton I hadn't implemented properly: the code in question used the first implementation below and was being accessed from multiple threads that all started at the same time.

This is the simple pattern, and it almost always works, except when the first access happens from multiple threads at the same time. When it fails, it can be a really hard bug to find.

  internal class Simple
  {
    private static Simple instance;

    // Private constructor so the class cannot be instantiated from outside.
    private Simple() { }

    public static Simple Instance
    {
      get
      {
        if (instance == null)
          instance = new Simple();
        return instance;
      }
    }
  }

It should be pretty obvious that this class has problems with concurrency, so the simple solution is to add a lock around the whole thing.

  internal class Lock
  {
    private static readonly object lockObj = new object();
    private static Lock instance;

    private Lock() { }

    public static Lock Instance
    {
      get
      {
        lock (lockObj)
        {
          if (instance == null)
            instance = new Lock();
        }
        return instance;
      }
    }
  }

This class is simple and does work, but taking the lock on every access has a performance penalty, which makes it worth looking further.

  internal class DoubleLock
  {
    private static readonly object lockObj = new object();
    // volatile ensures a fully constructed object is published to all threads.
    private static volatile DoubleLock instance;

    private DoubleLock() { }

    public static DoubleLock Instance
    {
      get
      {
        if (instance == null)
        {
          lock (lockObj)
          {
            if (instance == null)
              instance = new DoubleLock();
          }
        }
        return instance;
      }
    }
  }

This class is a little bit more complicated, but it has the advantage that except for the very first accesses no locking is required. It relies on the assignment of a reference being atomic, which is a valid assumption in .NET. Note, though, that for double-checked locking to be strictly correct the instance field should also be declared volatile, so that another thread can never observe a half-constructed object.

However, you can also let the C# runtime create the singleton for you using static initialization.

  internal class Static
  {
    private static readonly Static instance = new Static();

    // An explicit static constructor stops the runtime from initializing the
    // field earlier than the first access (it disables beforefieldinit).
    static Static() { }

    private Static() { }

    public static Static Instance
    {
      get
      {
        return instance;
      }
    }
  }

This is pretty much as efficient as it gets; you even got rid of the null check, and it is thread safe as well. It does have the disadvantage that the singleton is created the first time anything in the class is accessed, which might not be what you want if there are more static members in the class. The following class is based on the same concept but does not create the singleton until the Instance property is first read.

  internal class DoubleLazy
  {
    private static class LazyLoader
    {
      internal static readonly DoubleLazy instance = new DoubleLazy();
    }

    private DoubleLazy() { }

    public static DoubleLazy Instance
    {
      get
      {
        return LazyLoader.instance;
      }
    }
  }

The nested class's static initializer will not run until you first read the instance. If you are on .NET 4.0 or later there is a helper class, Lazy<T>, that makes this easy to do with a lambda expression.

  internal class NewLazy
  {
    private static readonly Lazy<NewLazy> instance =
      new Lazy<NewLazy>(() => new NewLazy());

    private NewLazy() { }

    public static NewLazy Instance
    {
      get
      {
        return instance.Value;
      }
    }
  }

This method also lets you check whether the singleton has been instantiated yet (you can still do that with the first implementations, but not with any of the ones using static initializers). So which one should you choose? It might depend on several aspects, but if the only thing you care about is performance, I did some relatively unscientific measuring and came up with the following list.

  • The simple static initializer is the absolute fastest implementation.
  • The nested static initializer is only slightly slower.
  • The simple non-thread-safe solution is slightly slower again.
  • The double lock solution is only slightly slower than the previous three.
  • The lazy lambda expression solution takes roughly 50% longer to run than any of the previous solutions.
  • The lock solution is roughly 150% slower than any of the first four solutions.

That said, even the slowest solution can still handle roughly 40 million accesses to the singleton per second from a single thread on my laptop, so unless you access it a lot it really doesn't matter.

Wednesday, January 21, 2015

Caching in distributed applications

In concept, caching data is really simple. You have a small, fast storage medium of limited size, and in it you save a subset of items from a larger, slower storage medium that you are likely to use often. Typical examples are the on-die cache in any modern CPU, or disk caching in any modern operating system. Even in those cases it starts getting complicated once you add multiple cores and need to make sure data from one CPU isn't being accessed by another while the new value only exists in the first CPU's on-die cache.

When you start developing distributed applications this problem becomes incredibly complicated, and you need to really think about what kind of data you are dealing with at every moment to make sure you end up with a high-performance final application. Data tends to fall into one of several categories.

  • Static data that never changes, but is too large to keep in memory on every instance needing it. This is the easiest kind, since you can keep as many items as possible in local memory and let old ones slip out once you run low or after they haven't been used for a while.
  • Seldom-changing data where it isn't critical that reads are always completely up to date. This can also be cached locally, but you have to throw away cached entries after a certain amount of time so your data doesn't become too stale. Changes can be written directly to a database since they happen relatively seldom.
  • Seldom-changing data that must always be read transactionally correctly. Since this data changes seldom you could simply not cache it. Alternatively you can use a central in-memory cache like Redis or Memcached; you just have to make sure every place that accesses the data uses the same method. Dealing with transactional changes on in-memory databases is also kind of tricky, but it can be done.
  • Rapidly changing data where it isn't critical that reads are always completely up to date. Works much the same as the seldom-changing case, except that you probably want some sort of in-memory store for the changes so you don't have to post every change to a normal database. You can use the in-memory database to batch up changes and post them periodically instead of with every change.
  • Rapidly changing data that must always be read atomically correctly. This one is tricky, and there isn't really a good way of dealing with it unless you can arrange for all messages that need the same data to end up on the same node for processing every time. Usually this can be done by routing messages based on a specific key and then caching locally keyed on that. Since all messages that need the data will always end up on one node, this is safe, but you do need to properly handle the case where a processing node goes away.

Sunday, January 11, 2015

Check out this bag that my mom and wife made from old Dunkin Donuts coffee bags

Check out this awesome bag that my mom and wife made from used Dunkin Donuts coffee bags! I have some more pictures over here.

Friday, December 26, 2014

Public facing hard to guess identifiers

This might take some explaining of the actual problem. When applications report information, I would like it to at least not be possible for someone to guess an identifier (say, by starting at 1 and counting upwards) and have their data end up on some other user's account. My goal isn't so much to guard against somebody intentionally misreporting as to make it hard enough for all but the most determined attackers.

So how do you do this? One way is to just use GUIDs for every identifier, but I have always hated that and it leads to bad database design, at least in my opinion. My suggestion is instead to use a simple integer identifier counting upwards internally. Whenever that identifier has to be displayed to an end user, I take the ID and encrypt it using a secret key with AES-256. This results in a pretty much random 16-byte block, which I encode using base 64 and present to the user. When a report comes in I simply do the reverse: decode the base 64, decrypt, and recover the internal ID. This makes it almost impossible to guess a valid identifier from the outside, while internally you can still deal with regular integers of varying size for everything.
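The whole scheme fits in a few lines. Here is an illustrative sketch in Go; the key, the function names, and the choice to pack the integer into the low 8 bytes of a single AES block are my own assumptions for the example (in a real system the key would come from secure configuration, not a source file):

```go
package main

import (
	"crypto/aes"
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// Hypothetical 32-byte secret key; 32 bytes selects AES-256.
var key = []byte("0123456789abcdef0123456789abcdef")

// encodeID encrypts an internal integer ID into an opaque public token.
func encodeID(id uint64) string {
	block, _ := aes.NewCipher(key) // key length is fixed, so no error
	plain := make([]byte, aes.BlockSize)
	binary.BigEndian.PutUint64(plain[8:], id) // id in the low 8 bytes
	out := make([]byte, aes.BlockSize)
	block.Encrypt(out, plain) // one 16-byte block, no chaining needed
	return base64.URLEncoding.EncodeToString(out)
}

// decodeID reverses encodeID, recovering the internal integer ID.
func decodeID(s string) (uint64, error) {
	raw, err := base64.URLEncoding.DecodeString(s)
	if err != nil || len(raw) != aes.BlockSize {
		return 0, fmt.Errorf("invalid identifier")
	}
	block, _ := aes.NewCipher(key)
	plain := make([]byte, aes.BlockSize)
	block.Decrypt(plain, raw)
	return binary.BigEndian.Uint64(plain[8:]), nil
}

func main() {
	token := encodeID(42)
	id, _ := decodeID(token)
	fmt.Println(token, id)
}
```

Note that the mapping is deterministic, so the same internal ID always yields the same public token, which is exactly what you want for a stable identifier.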

The performance hit should be negligible since AES is implemented in hardware on recent CPUs, and even without hardware support AES is really fast.

Thursday, December 25, 2014

Dealing with timestamps

I thought I would take a detour and share some thoughts on dealing with timestamps.

Something to take into account when accepting timestamps reported into a system is that you cannot really trust users to have their clocks set correctly. And since data can be collected offline and submitted after the fact, you need to compensate for devices with really weird time settings (a surprising number of people run their machines with the clock set to 1970). I deal with this by having the device include its own current clock reading as part of each submission. Comparing that to the server's clock gives a delta by which all the other timestamps in that particular submission need to be adjusted. This will not handle the case where the user changed their clock between the start of data collection and the time of submission, but hopefully that is a pretty rare occurrence.