https://www.henrik.org/

Blog

Sunday, January 25, 2015

Choosing your cloud provider

When you start any coding project you generally need some sort of server capability, even if the application you're building is not a web site. When choosing your cloud provider there are several different things to think about.

First of all, if what you need is very basic and will not require a high SLA or the ability to grow with usage, you are probably better off choosing a virtual private server provider. If you are fine with a Linux box these can be had extremely cheaply. I used to be a customer of Letbox, and at the time they provided me with a virtual private server for $5/month, a price that is hard to beat. It is however important to realize that this is not a true VM; it is a specialized version of Linux, similar to doing a chroot but with quotas on memory and CPU usage. This means that these VMs can only run Linux. That said, the price is simply in a league of its own, usually cheaper than even spot instances of AWS EC2.

However, once you have something slightly more complicated to run you probably want to go with a "real" cloud provider. These come in two kinds. The first level is companies providing infrastructure as a service (IaaS). This basically means providing virtual machines, storage and networking for them. It is up to you to build everything you need to run off of these primitives. Companies that offer only this kind of computing include Skytap, Rackspace (although Rackspace does have some platform services) and many more.

The next level up are the companies that provide platform as a service (PaaS). All of these companies also provide the infrastructure if you need it, but on top of this they provide useful packages that they will run for you as managed services to make creating, deploying and running your services easier. These usually include at a minimum:

  • Load balancing of traffic to multiple servers.
  • Auto scaling of new servers to handle varying load.
  • A fast and scalable NoSQL key value store.
  • A managed transactional database.
  • Web hosting.

There are, as I see it, three main players in this space: Amazon Web Services (AWS), Microsoft Azure and Google App Engine.

Of these Amazon is by far the largest. AWS started out as mainly an infrastructure as a service offering, but now has one of the most complete sets of managed services, and they have by far the largest set of data centers located all around the world, including one region qualified for US government workloads (having an account on it requires you to be a US citizen, so I cannot use it). Their infrastructure is truly top notch, but their development tools are not great. Only a few languages have an official SDK (I myself have been missing an SDK for Perl).

Microsoft approached this space from the opposite direction from Amazon and started out by offering specific platform solutions, tightly integrating the development and deployment of Azure applications into their development tool Visual Studio. It is the only cloud provider I am aware of that for a time did not provide IaaS at all (although they do now). The SDK and tooling for all of their products are truly excellent, especially if you are a .NET C# developer, but many other languages are supported as well. They do, unfortunately and understandably, run most of their infrastructure on Windows, which simply is not as solid as other hypervisors out there. If you are building a solution that requires reliably quick processing this can be a problem, and if you have a cluster of synchronized machines it can become really problematic. These synchronization issues usually only occur a few times a month though, as the service is migrated to new machines when all the VMs running the service undergo the monthly Windows patching. As long as your application does not rely on tight synchronization between several systems you are unlikely to notice it.

Finally there is Google. Their solution is, similar to Amazon's, something that has grown out of their own business, and they have several offerings that are obviously simply a surfacing of their internal operations, for instance BigQuery. Google's infrastructure is fantastic in regards to reliability and performance. They do, though, in my opinion offer the narrowest platform solution of the big three. What they do provide is truly top notch, and unfortunately it is also priced accordingly.

Price-wise the big three are relatively similar. If your application can take advantage of AWS spot pricing you can get away with really cheap solutions though. Google is usually the most expensive (I say usually since the prices change all the time for cloud services). One thing that could be worth investigating is whether you qualify for a Microsoft BizSpark membership, because if you do you will receive $150/month of free credits to use for almost anything in Microsoft Azure (and it also includes licenses to almost every product that Microsoft has in their very extensive portfolio). In the end, this credit is what convinced me to go with Microsoft Azure, since it allows me to get pretty much the entire service up and running for free instead of having to pay several hundred dollars a month while developing on AWS. A close runner-up was Amazon, which is probably technically better once the software is developed, but the price credit and the accelerated development from better tooling swayed me. Google was unfortunately missing a few of the managed services I required for my design.

Saturday, January 24, 2015

How to get more free build or test minutes with your Visual Studio Online account

If you are one of the lucky ones who has an MSDN or a BizSpark subscription (one of the best deals around on the internet) and use the hosted build environment of Visual Studio Online, it is annoying that you only get 60 minutes of free build time a month if you want to do continuous integration (which you should do!) with it. However, I just discovered a trick, by accident, for getting around this limit.
  1. First of all log into your Azure management console and then go to the tab for Visual Studio Online subscriptions.
  2. Then click on the unlink button at the bottom once you have selected the subscription you want new build or test minutes for. You will get a warning about losing any licenses you have purchased through your Azure subscription for Visual Studio Online, so if you have done that you can't use this trick.
  3. Then click on new at the bottom left of the management screen to link your Visual Studio Online account back to your Azure subscription.
  4. Select the Visual Studio Online account you unlinked earlier and make sure you have the correct subscription selected in the drop down (it defaults to the pay-as-you-go subscription, so you will need to change this).
  5. Press the link button in the lower right.

That's it. If you go back to your home page on Visual Studio Online you should be able to see that you have a new allotment of build and test minutes.

DISCLAIMER: You might be violating your terms of service with Microsoft by doing this, and I also expect Microsoft to fix this at some point, so proceed at your own risk.

Thursday, January 22, 2015

Problems with singletons

One of the most basic software design patterns is the singleton pattern, and you'd think this wouldn't be one to cause you problems, but in C# it can be surprisingly tricky. I just spent a couple of hours tracking down a bug because I hadn't implemented one properly. The place in question was using the first implementation below and was being accessed from multiple threads that all started at the same time.

This is the simple pattern and it almost always works, except when the first access happens from multiple threads at the same time, and when it doesn't work it can be a really hard bug to find.

  internal class Simple
  {
    private static Simple instance;

    public static Simple Instance
    {
      get
      {
        // Not thread safe: two threads can both see null here
        // and end up creating two different instances.
        if (instance == null)
          instance = new Simple();
        return instance;
      }
    }
  }
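
If you want to see the problem in action, something like the following (just an illustration, not part of the pattern itself) fires a handful of threads at the property at the same time and counts how many distinct instances came back. On an unlucky run you will see more than one.

  using System;
  using System.Collections.Concurrent;
  using System.Linq;
  using System.Threading;

  internal static class SimpleRace
  {
    public static void Run()
    {
      var instances = new ConcurrentBag<Simple>();
      var start = new ManualResetEvent(false);
      var threads = Enumerable.Range(0, 8).Select(_ => new Thread(() =>
      {
        start.WaitOne();                // line all threads up first
        instances.Add(Simple.Instance); // then race on the first access
      })).ToList();

      threads.ForEach(t => t.Start());
      start.Set();
      threads.ForEach(t => t.Join());

      // More than one distinct reference means the singleton broke.
      Console.WriteLine(instances.Distinct().Count());
    }
  }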

It should be pretty obvious that this class has problems with concurrency, so the simple solution is to just add a lock around the whole thing.

  internal class Lock
  {
    private static readonly object lockObj = new object();
    private static Lock instance;

    public static Lock Instance
    {
      get
      {
        // Every access takes the lock, even long after the instance exists.
        lock (lockObj)
        {
          if (instance == null)
            instance = new Lock();
        }
        return instance;
      }
    }
  }

This class is simple and does work, but taking the lock on every access has a performance penalty, which makes it worth looking further.

  internal class DoubleLock
  {
    private static readonly object lockObj = new object();
    // volatile ensures that a fully constructed instance is visible to
    // other threads before the reference is published.
    private static volatile DoubleLock instance;

    public static DoubleLock Instance
    {
      get
      {
        if (instance == null)
        {
          lock (lockObj)
          {
            // Re-check inside the lock; another thread may have created
            // the instance while we were waiting.
            if (instance == null)
              instance = new DoubleLock();
          }
        }
        return instance;
      }
    }
  }

This class is a little bit more complicated, but it has the advantage that except for the very first accesses no locking is required. It does rely on the assignment of a reference being atomic, which is a valid assumption in C#, but to be safe the instance field should also be declared volatile so that a half-constructed object is never published to other threads.

However, you can also let the C# runtime create the singleton for you by using a static field initializer.

  internal class Static
  {
    // The runtime guarantees that the static initializer runs exactly once.
    private static readonly Static instance = new Static();

    public static Static Instance
    {
      get
      {
        return instance;
      }
    }
  }

This is pretty much as efficient as it gets; you even got rid of the check for null, and it is thread-safe as well. It does have the disadvantage that the instance of the singleton is created right before the first access of anything in the class, which might not be what you are looking for if there are more static methods in the class. The following class is based on the previous concept but does not create the singleton until the first time it is accessed.

  internal class DoubleLazy
  {
    private static class LazyLoader
    {
      // The nested class is not initialized until something touches it,
      // which only happens when the Instance property below is read.
      public static readonly DoubleLazy instance = new DoubleLazy();
    }

    public static DoubleLazy Instance
    {
      get
      {
        return LazyLoader.instance;
      }
    }
  }

The nested class will not be initialized until you read the instance. If you are running .NET 4.0 or later there is also a helper class, Lazy<T>, that makes this easy to do using a lambda expression.

  internal class NewLazy
  {
    // Lazy<T> defers creation until Value is first read and is
    // thread safe by default.
    private static readonly Lazy<NewLazy> instance =
      new Lazy<NewLazy>(() => new NewLazy());

    public static NewLazy Instance
    {
      get
      {
        return instance.Value;
      }
    }
  }

This method also allows you to check whether the singleton has been instantiated yet, through Lazy<T>.IsValueCreated (you can still do that with the first implementations, but it is not possible with any of the ones that use a static initializer). So which one should you choose? It might depend on different aspects, but if the only thing you care about is performance, I did some relatively unscientific measuring and came up with the following list.

  • The simple static initializer is the absolute fastest implementation.
  • The nested static initializer is only slightly slower.
  • The simple non-thread-safe solution is slightly slower.
  • The double lock solution is only slightly slower than the previous three.
  • The lazy lambda expression solution takes roughly 50% longer to run than any of the previous solutions.
  • The lock solution is roughly 150% slower than any of the first four solutions.

That said, even the slowest solution can still handle roughly 40 million accesses to the singleton per second from a single thread on my laptop, so unless you access it a lot it really doesn't matter.
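
If you want to reproduce numbers like these, a loop around the property with a Stopwatch is enough. This is only a rough sketch; the class being measured and the iteration count are arbitrary.

  using System;
  using System.Diagnostics;

  internal static class SingletonBenchmark
  {
    public static void Run()
    {
      const int iterations = 100000000;
      var dummy = Static.Instance; // make sure the singleton already exists

      var watch = Stopwatch.StartNew();
      for (int i = 0; i < iterations; i++)
      {
        dummy = Static.Instance; // the access we are measuring
      }
      watch.Stop();

      Console.WriteLine("{0:N0} accesses/second",
        iterations / watch.Elapsed.TotalSeconds);
    }
  }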

Wednesday, January 21, 2015

Caching in distributed applications

In concept, caching data is really simple. You have a small, fast storage medium of limited size and in it you save a subset of items from a larger, slower storage medium that you are likely to use often. Typical examples would be the on-die cache in any modern CPU or the disk cache in any modern operating system. However, even in this case it starts getting complicated once you add multiple cores and need to make sure data from one CPU isn't being accessed from another CPU while a newer value is only available in the first CPU's on-die cache.

When you start developing distributed applications this problem becomes incredibly complicated, and you need to really think about what kind of data you are dealing with at every moment to make sure you end up with a high performance final application. Data tends to fall into one of several different categories.

  • Static data that never changes, but is too large to keep in memory on every instance needing it. This is obviously the easiest kind of data, since you can keep as many items as possible in local memory and start letting old ones slip from memory once you run low or after an item hasn't been used for a certain time.
  • Seldom changing data where it isn't critical that it is always completely up to date. This can also be cached locally, but you have to make sure you throw away the cached copy after a certain amount of time so your data doesn't become too stale. Changes can be written directly to a database since they happen relatively seldom.
  • Seldom changing data that always needs to be read correctly in proper transactions. Since this data changes seldom you could simply not cache it. Alternatively you could use a central in-memory cache like Redis or Memcached; you just have to make sure every place that accesses the data uses the same method. It is also kind of tricky to deal with transactional changes in in-memory databases, but it can be done.
  • Rapidly changing data where it isn't critical that it is always completely up to date. Works pretty much the same as the seldom changing case, except that you probably want some sort of in-memory cache for the changes so that you don't have to post every change to a normal database. You can use the in-memory database to help you create batches of changes that you post periodically instead of with every change.
  • Rapidly changing data that always needs to be read atomically correctly. This one is tricky and there isn't really any good way of dealing with it, except if you can figure out a way in which messages that need to access the same data end up on the same node for processing every time. Usually this can be done by routing data based on a specific key and then caching locally based on that key, as sketched right after this list. Since you know that all messages that need the data will always end up on one node this is safe. You do need to make sure you properly handle the case when a processing node goes away though.
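
As a rough sketch of that last bullet, routing by key does not need to be more complicated than a stable hash of the key modulo the number of processing nodes. The node list here is made up, and a real system also has to handle nodes joining and leaving.

  using System;

  internal static class KeyRouter
  {
    // Hypothetical list of processing nodes; in reality this would come
    // from configuration or a service registry.
    private static readonly string[] Nodes = { "node-0", "node-1", "node-2" };

    // All messages with the same key always map to the same node, so that
    // node can safely cache the data for that key in local memory.
    public static string NodeFor(string partitionKey)
    {
      uint hash = 2166136261;
      foreach (char c in partitionKey)
      {
        hash = (hash ^ c) * 16777619; // FNV-1a, a simple stable hash
      }
      return Nodes[hash % (uint)Nodes.Length];
    }
  }

A stable hash is used rather than String.GetHashCode, since the latter is not guaranteed to produce the same value across processes or machines.
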
The application I am working on has all of the above kinds of data.
  • Dimensional id lookup tables are added to, but once an item has been added it never gets deleted, so it is static.
  • Billing and account data is always updated in transactions and always has to be current. Luckily it isn't accessed very often.
  • Information about when an account has gone over its allowed quota seldom changes, and if it takes a minute or two before the account gets blocked that is OK.
  • When we process incoming data it is important that it eventually gets added to the warehouse tables, but if it takes a few extra seconds that is OK.
  • When processing incoming data we need to know what data has been processed before to determine whether an entry is unique or a duplicate. Fortunately all entries are grouped by a user ID that is a perfect partition key for separating work based on this data.
So what architecture am I using to deal with this deluge of data?
  • For all billing and account related data a normal relational database is used.
  • Almost all data is cached in an in-memory NoSQL database for quick access. All data that needs to be processed at high volume uses this store as the main repository of changes, which are then written out in batches to the appropriate permanent storage (a sketch of this batching follows the list).
  • Once the data is properly aggregated it is saved in a proprietary on-disk database format.
  • Archived raw input data is stored in a blob storage service once a day.
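
The batch writing mentioned in the second bullet is conceptually along these lines. This is a simplified sketch; the change type, the flush interval and the storage call are placeholders for whatever the real system uses.

  using System;
  using System.Collections.Concurrent;
  using System.Collections.Generic;
  using System.Threading;

  internal sealed class WriteBehindBuffer
  {
    private readonly ConcurrentQueue<string> pending = new ConcurrentQueue<string>();
    private readonly Timer flushTimer;

    public WriteBehindBuffer()
    {
      // Flush accumulated changes every ten seconds instead of hitting
      // permanent storage on every single change.
      flushTimer = new Timer(_ => Flush(), null,
        TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(10));
    }

    public void Add(string change)
    {
      pending.Enqueue(change);
    }

    private void Flush()
    {
      var batch = new List<string>();
      string change;
      while (pending.TryDequeue(out change))
      {
        batch.Add(change);
      }
      if (batch.Count > 0)
      {
        WriteBatchToPermanentStorage(batch); // placeholder for the real write
      }
    }

    private void WriteBatchToPermanentStorage(List<string> batch)
    {
      // In the real system this would be a bulk insert into the
      // appropriate permanent storage.
      Console.WriteLine("Flushed {0} changes", batch.Count);
    }
  }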

Wednesday, January 14, 2015

Finally finished ingestion of flat input data

It has been a while since my last update. In part that has been because I actually took some time to relax during my vacation, but also I have been working on a relatively large piece.

The piece in question is the code that ingests flat raw data and then in real time aggregates it into the format needed to store it in a cube for quick analytical access. This is a really large piece of code, and there are so many combinations of how the data can be submitted that writing a comprehensive test of it took me a long time (some of that time also being procrastination while I tried to figure out where to even start).

That said, the code is finished, checked in, unit tested and seems to work. It is some of the most beautiful code I have ever written. The whole thing is built from the ground up to be massively multithreaded, with the complication of many layers of caching to keep performance up while still not using too many resources and also being able to scale horizontally.

With this piece done I am finally at a place where almost all the parts of the ingestion are actually in place, and I now plan on working on the code to run and manage these processes on the machines.

Sunday, January 11, 2015

Check out this bag that my mom and wife made from old Dunkin Donuts coffee bags

Check out this awesome bag that my mom and wife made from used Dunkin Donuts coffee bags! I have some more pictures over here.

Monday, December 29, 2014

What is so hard about analyzing usage statistics anyway?

This might seem like a stupid question, but I thought I would go through the key problem that makes analyzing usage statistics hard: quickly and efficiently detecting duplicates in large sets of data.

Basically you have a bunch of devices that submit data about what they are doing at any given time, and that is your input. However, what you generally want to know from this data is how many of the reporting devices are doing a specific thing. This means that as data is being ingested, the system needs to know whether it has seen a specific type of event from this specific device during a certain time period before, or whether it is a duplicate of a previous event. It is usually more interesting to know that 1000 different users use a certain feature than that one user uses the same feature 1000 times while nobody else uses it at all.

This is easy to do for small sets of data by just creating a hash set or sorted list that you keep in memory and do quick lookups into. It is also technically no problem if it doesn't need to happen fast, by just adding everything to a database and doing tons of selects against it. But if you want to do it fast it gets tricky. I'm solving it with extensive use of asynchronous in-memory database lookups.
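
For the small in-memory case, the hash-based approach amounts to something like this. It is only a sketch; the key format is just an example.

  using System;
  using System.Collections.Generic;

  internal sealed class DuplicateDetector
  {
    private readonly HashSet<string> seen = new HashSet<string>();

    // Returns true the first time a given device reports a given event
    // within a given time period, and false for every duplicate after that.
    public bool IsFirstOccurrence(string deviceId, string eventType, DateTime timestamp)
    {
      // Bucket timestamps by hour so repeats within the same period
      // count as duplicates.
      string key = deviceId + "|" + eventType + "|" +
        timestamp.ToString("yyyy-MM-dd-HH");
      return seen.Add(key); // HashSet.Add returns false if already present
    }
  }

The whole difficulty is that this set no longer fits in one process's memory once the data gets large, which is where the asynchronous in-memory database lookups come in.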

Then you encounter another problem: how can you be sure that the data in your in-memory database is actually up to date with new data coming in if all the data is processed asynchronously? Here I am using queues for incoming data in a way that ensures that no new data that depends on data not yet committed to the in-memory database will be processed until that data is available.

Then the next problem is dealing with queues when you have a truly high rate of data coming through them. I'm mostly using Azure Event Hubs to deal with this data, except for certain queues that are not performance critical.
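
For reference, sending an event with a partition key through the Event Hubs SDK of that era (Microsoft.ServiceBus.Messaging) looks roughly like this; the connection string, hub name and payload below are placeholders.

  using System.Text;
  using Microsoft.ServiceBus.Messaging;

  internal static class EventSender
  {
    public static void Send(string userId, string payload)
    {
      // Placeholder connection string and event hub name.
      var client = EventHubClient.CreateFromConnectionString(
        "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=...",
        "ingestion");

      var eventData = new EventData(Encoding.UTF8.GetBytes(payload))
      {
        // Events with the same partition key land on the same partition,
        // which keeps per-user processing on one consumer.
        PartitionKey = userId
      };

      client.Send(eventData);
    }
  }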