Tuesday, March 3, 2020

Announcing the first public release of Underscore Backup

The first Underscore Backup pre-release is available for immediate download from GitHub.
  • Public-key-based encryption allows continuously running backups that can only be read with a key that is not available on the server running the backup.
  • Pre-egress encryption means no proprietary data leaves your system in a readable format; as long as your private key is not compromised, neither is your data.
  • Runs entirely without a service component.
  • Designed from the ground up to manage very large backup sets with multiple TB of data and millions of files in a single repository.
  • Multi-platform support based on Java 8.
  • Low resource requirements; runs efficiently with only 128MB of heap memory.
  • Efficient storage of both large and small files with built-in de-duplication of data.
  • Efficiently handles backing up large files with small changes in them.
  • Optional error correction to support unreliable storage destinations.
  • Encryption, error correction, and destination IO are plugin-based and easily extendable.
  • Currently supports local files and S3 as backup destinations.
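The de-duplication mentioned above can be sketched as a content-addressed block store: data is split into blocks, each block is identified by its hash, and identical blocks are stored only once. The following is just an illustrative Python sketch of the idea (the class name and block size are my own, not taken from Underscore Backup, which is written in Java):

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store: identical blocks are stored once."""

    def __init__(self, block_size=4 * 1024 * 1024):  # hypothetical 4 MB blocks
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes

    def store(self, data: bytes):
        """Split data into fixed-size blocks; return the digests needed to restore it."""
        digests = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # duplicate blocks cost nothing extra
            digests.append(digest)
        return digests

    def restore(self, digests):
        """Reassemble the original data from a list of block digests."""
        return b"".join(self.blocks[d] for d in digests)
```

Backing up the same or a largely unchanged file twice adds almost no new blocks, which is also why large files with small changes are cheap to store: only the changed blocks are new. Real backup tools typically use content-defined rather than fixed-size chunking so an insertion doesn't shift every subsequent block.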

Best of all, it is open source and available for free under the GPLv3 license.

For now this software is still under heavy development and should not be relied upon to protect production data.

Monday, April 15, 2019

Released Your Shared Secret Service

I recently published the Your Shared Secret service, which allows you to safely and securely ensure that private information you have is not lost if you are in any way incapacitated.

The basic premise is that information is submitted through your browser, where it is encrypted before it is ever sent to the service. The key to decrypt the information never leaves your browser. The key is then chopped up into multiple pieces, which are securely handed out to a number of people you choose to act on your behalf, and only by a group of them collaborating (you choose how many) can they assemble the key required to access your information. For a quick introduction you can check out this video.
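The split-key scheme described above has the shape of Shamir's secret sharing, where any k of n shares reconstruct the secret and fewer reveal nothing. The service's actual implementation is not shown here; this is just an illustrative Python sketch of the underlying idea (the prime and function names are my own):

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime, large enough for a 16-byte secret

def split_secret(secret: int, k: int, n: int):
    """Split secret into n shares so that any k of them can recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]

    def eval_poly(x):
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, eval_poly(x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse of den.
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

For example, with k=3 and n=5, any three caretakers can pool their shares to recover the key, while any two collaborating caretakers learn nothing about it.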

I really went all out on the privacy aspect of this website and service and have gone out of my way to not collect any information not needed for its operation. The site has no third-party links except when collecting payments, and it does not collect any visitor analytics such as Google Analytics.

You have complete control over which of your caretakers are able to initiate accessing the information, and also how many of the total group of caretakers need to participate to access it. Even better, the service does not even need to know how to contact the caretakers. This information is only known by the unlocking caretaker and the owner of the information.

Furthermore, the act of one of your caretakers trying to assemble the key will give you, as the creator, a notification that allows you to cancel the unlocking or delete the information altogether within a 7-day quarantine period. For more information on how the service works, see the Usage section on the website.

The entire service operates on a zero-trust model where all the functionality is ensured with cryptographically strong primitives, the single exception being the 7-day quarantine period. There is plenty of detailed information available on how the encryption works and how the service has been built. To verify that what is claimed on the site is actually what is happening, the source code to the entire service is published on GitHub, and you can even run the entire website locally by just cloning the website repository and running:

npm start

The service is available for an introductory price of $1, or if you want complete anonymity you can also pay with either Ethereum or Bitcoin, although that is slightly more costly because of the value fluctuations of these currencies.

The service is available now, so feel free to get started keeping your information safe without you.

Thursday, December 6, 2018

How to maintain work life balance and sanity in the tech industry

As you collect experience and hopefully get more senior in your position, you will undoubtedly reach a point where this is something you need to start thinking about.

I surprisingly often get asked by junior colleagues how I deal with this, with the implication (at least in my mind) that I seem to have gotten some things right in their eyes, so I figured I would try to share some of my insights, tips, and tricks in this area.

Work with something you love

This might sound kind of obvious, but I have found that a lot of people don't like what they do. The key here is that you want to work with something that you don't mind doing when you have nothing else going on. The other part of this tip is that when you don't have anything else going on, try to have the discipline to actually work. What this allows you to do is not work when you do have other things you would like to do.

Make sure you also have a boss that realizes that as long as you get your stuff done, it doesn't matter how or when you do it. If this isn't the case for you, start looking for another place to work.

Set boundaries

In my entire life I don't think I have ever worked on a project that had enough time to get everything we wanted done in the time allotted. Given that, you have to realize that very few people will tell you to work less; it is up to you to set the boundaries of how much you work. There is also a point of diminishing returns beyond which the quantity of work put in doesn't actually increase your productive output.

You also shouldn't compare yourself to your colleagues too much, especially when it comes to the quantity of work put in. First of all, there have been tons of studies concluding that when people estimate how much work they have put in, they usually overestimate it. So when a coworker of yours tells you that they have worked 80-hour weeks, take that with a grain of salt. Secondly, people might be putting in a lot of time without actually getting a lot done. Be the person who works smart instead of hard.

As an example, think of the person who works 80 hours a week, week after week, making sure that an unstable system stays healthy, constantly nursing it back to health when it isn't. Compare this to the person who instead figures out the few defects that cause the system to be unstable and fixes them so the system now runs more or less by itself. Which of these two would you say is more valuable to the team?

Manage distractions

Another aspect of being able to manage your work life balance is to make sure that when you work, you are as productive as possible. In today's world there are so many things that are constantly trying to pull your attention away from what you are supposed to be working on. Especially as you become more senior it will get more important to manage your distractions effectively so that you can actually get things done.

In my own case, I have several ways in which people can connect with me, and I obsessively control which methods are actually able to notify me immediately instead of me only seeing them when I check.

  • My pager (Piece of software on my phone). This is the only thing that I allow to wake me up if I sleep.
  • Phone calls and text messages. This is the only other thing on my phone that is allowed to either vibrate or make a sound. The one exception is that I allow my calendar to make a tiny chirp when I have a meeting. Apart from that my phone is silent.
  • Chat applications. I don't let these in any way interrupt me. This includes no visual popups or sounds in any way. They are the first things I check when I take a break from work to see if anybody needs anything from me though.
  • Email. It amazes me that almost everybody I know allows email to both make a sound and show a popup on their computer. For me, email is something I check a few times a day, and my personal goal is to read an email within no more than a day of it being sent. If you need something faster from me, you need to ping me another way or just get lucky.

All this comes down to trying to get as many prolonged periods of time as possible in which you can actually focus on whatever problem you have at hand without being constantly interrupted. When you inevitably drift off and lose focus, go and check whether anybody needs something from you, not the other way around, where you lose focus because other people need your input (unless it is really important and time critical).

Don't hoard knowledge

An extension to managing distractions is to do your best to not hoard knowledge. First of all, if you are the only person who knows something then you are guaranteeing that people will need to bug you to figure out how things are working. Secondly, you are by extension implying that you are sure that your coworkers can in no way improve your work. Now, if that is true then I feel sorry for you because that doesn't sound like a fun place to work.

I have met several people who do this more or less unconsciously, probably as a safety mechanism to improve job security. This is misguided though, because it ensures that as you progress in your career you will spend more and more of your time maintaining old projects instead of working on new things and leaving the maintenance to others. And I have hardly ever met an engineer who would not prefer to work on new things rather than maintenance.

Also on this topic: be open to people coming to you with questions, and your goal should be to explain things well enough that they could answer the same question if somebody else asked it of them.

As an ending thought, I have a request for junior people reading this: try not to ask the same question too many times before you write the answer down so that you don't have to ask it again (again, just trying to manage your senior colleagues' distractions).

Tuesday, May 3, 2016

Comparing Macbook Pro to Windows 10 based laptop for software development

My post from a few years ago about Why I hate Mac and OSX is by far the most read post I have ever published on this blog (somebody cross-posted it to an OS X advocacy forum and the flame war was on). It has been a few years, both OS X and Windows have moved on since 2009, and hardware has improved tremendously. I have also started a job which more or less requires me to use a Mac laptop, so I have recently spent a lot of time working with a Mac again, and I figured I would revisit the topic of what I prefer to work with.

The two laptops I will be comparing are a Dell Precision 7510 running Windows 10 and a current 2015 MacBook Pro running OS X El Capitan.

Before I start the comparison, I'll describe what and how I use a computer. I'm a software developer and have been for decades. I prefer to use my keyboard as much as possible; if there is a keyboard shortcut, I will probably pick it up pretty quickly. I tend to want to automate everything I do if I can. I have great eyesight, and pretty much the most important aspect of a laptop to me is that it has a crisp, high-resolution screen (preferably non-glossy), which translates to more lines of code on the screen at the same time. So with that in mind, let's get started.

Screen

This one is fortunately easy. For some bizarre reason, OS X no longer allows you to run in native resolution without installing an add-on. Even with that add-on installed, the resolution is a paltry 2880 by 1800 compared to 3840 by 2160. That means that on my Dell I can fit almost twice as much text on the screen. Also, Macs are only available with a glossy screen, which is another strike against them. I don't care at all about color reproduction or anything like that, even though I hear that the Mac is great at it (and so, supposedly, is the Dell).

Windows used to have pretty bad handling of multiple screens before Windows 10, especially with unusually high resolutions. This has gotten a lot better with Windows 10. That said, OS X has great handling of multiple screens, especially when you keep plugging in and out of a bunch of screens; things just seem to end up on the screen they are supposed to be on. Windows is much less reliable in this sense. Still, the better handling of multiple screens is nowhere near making up for the disaster that is the OS X handling of native resolutions, or the low resolution of the Retina display.

Winner: Windows

Size and Weight

The PC is, as a friend of mine referred to it, "a tank". It is amazing how small and light the MacBook Pro is compared to everything that they crammed into it.

Winner: OSX

Battery Life

I can go almost a full day on my Mac; on my PC I can go a couple of hours. No contest here, the MacBook Pro has amazing battery life.

Winner: OSX

Input Devices

Let me start off by saying that the trackpad on the Mac is fantastic, definitely the best I have ever used on any computer in any category. That said, why can't you show me where the buttons are (I hate that)? And the 3D Touch feature is completely awful on a computer (I don't really like it on a phone either, but there it has its place). I started this review by saying that I use the keyboard a lot, and when it comes to productivity there is absolutely no substitute for a TrackPoint. This is that weird little stick in the middle of the keyboard that IBM invented. The reason it is superior is that when I need to use it, I never have to move my fingers away from their typing position on the keyboard, so I don't lose my flow of typing if I have to do something quickly with the mouse.

In regards to keyboards, both the MacBook Pro and the Dell Precision laptops have great keyboards. However, for some weird reason MacBooks still don't have Page Up and Page Down keys. And not only are there no dedicated keys for this, there isn't even a default keyboard shortcut that does it (the scroll up and scroll down shortcuts that are available are not the same thing), so to get it at all you need to do some pretty tricky XML file editing. You also don't have dedicated Home and End keys on a MacBook Pro. Given how much space is left unused by the keyboard when a 15" MacBook Pro is open, I find this inexcusable.

Winner: Windows

Support

With my Windows machine (and this is true for pretty much any tier 1 Windows laptop supplier), I call a number or open a chat, and 1 to 2 days later a technician shows up with the spare parts required to fix it. With Apple, I take it to the store and then they usually have to ship it somewhere; it takes a week or two... if you are lucky. For me that would mean not being able to work for those two weeks if I didn't work for a large company with its own support department to provide me with a replacement where Apple falls short.

Winner: Windows

Upgradability

I can open up my PC and do almost all service myself; Dell even publishes the handbook for doing it on their support site. Replacing the CPU would be very tricky because I think it is soldered to the motherboard, but everything else I can replace and upgrade myself. I also have 64GB of memory and two hard drives, and if I want to upgrade a component in a year or two it won't be a problem. The MacBook Pro has Thunderbolt 2, which is great (although the PC has a Thunderbolt 3 port), but that is pretty much it in regards to self-service upgrades.

Also, my PC beats the Mac on pretty much every spec: HD speed and size, CPU, GPU, and memory.

Winner: Windows

Price

Everybody talks about the Apple tax, but I don't find that to be very true. A good laptop (and don't get me wrong, both of these are great laptops) costs a lot of money, and my PC cost quite a bit more than the MacBook Pro did. Granted, it has better specs, but I don't think there is really any difference in price when you go high end with a laptop purchase.

Winner: Tie

Productivity

For me, productivity is synonymous with simplicity and predictability. Specifically, I move around a lot of different applications, and I need to be able to get to them quickly, preferably through a keyboard shortcut, and I want to do it the same way every time. With that in mind, OS X is an unmitigated disaster in this area. First of all, you have to keep track of whether the window you want to get to is in the same application or another one. And if it is in another application, you first have to swap to that application and then use a different keyboard shortcut to find the specific window within it.

I do like that you can create multiple desktops and assign specific applications to specific desktops (predictable!). However, when you go full-screen with those windows they move to another desktop, and that desktop has no predictability at all in where it is placed relative to the other ones; it is strictly the order in which they were created. Going on, I still don't understand how OS X doesn't have a maximize button that takes a window and just makes it fill the screen. There are some third-party tools that help you a bit with this madness (like being able to maximize windows without going full-screen, for instance). Regrettably, in my opinion this is an area where OS X is moving backwards; the original Exposé was actually pretty good compared to the current mess. Also, I don't like having the menu bar at the top of the screen, because it is usually further away from where my mouse currently is, which means it takes longer to get there.

Meanwhile, Windows 10 took a huge leap in this area with the snapping of windows to the side, optionally letting you select another window to show next to it. And you can easily switch to any window quickly using one keyboard shortcut, same as always.

A side note that doesn't affect me much, but kind of needs to be stated: unsurprisingly, Microsoft Office 2016 is just so much better on Windows than on OS X.

Winner: Windows

Development Environment

In regards to development environments, everything Java is available on both platforms, so as far as I'm concerned this comes down to comparing Visual Studio to Xcode. Obviously this depends on whether you are developing in Swift or C#, but since Visual Studio has recently moved more and more into the multi-platform arena, this is more of a real choice every day.

Xcode has improved in huge leaps and bounds since the original versions I worked with (I started around version 3). However, there is simply no contest here. Visual Studio is the best development environment that I know of, both when it comes to native features and the third-party extension ecosystem that supports it. The only one that might come close, as far as I am concerned, is IntelliJ.

Winner: Windows

Command Line Interface and Scripting

This is also a very easy call. OS X is Unix-based and has a real shell, Perl, and SSH installed with the OS. Sure, PowerShell is OK, but I just don't like it. I would argue that the terminal emulation in PuTTY seems a little better than Terminal's, but on the other hand it doesn't have tabs, and it also isn't installed by default.

Winner: OSX

Software Availability

This is a tricky category, because there is obviously a lot more software available on Windows than on OS X. However, I find OS X has a lot of really good software that isn't available in similar quality on Windows. So I'm going to call this another tie.

Winner: Tie

Stability and Reliability

You would think that this is an easy win for the Mac, and for normal, non-power users I would say that is absolutely true. It is harder for a non-technical user to mess up an OS X system than a Windows system, no question about it. I, however, tend to tinker with stuff that normal people wouldn't, and I have managed to mess up my Mac several times to the point where it would not boot and I had to completely reinstall the OS. That said, I think I have done the same thing more times on Windows than on OS X. I am also a little bit worried about Apple's general stance on solving security issues in a timely manner, something that Microsoft is actually really good at. So even though this is not as much of a slam dunk as you would think, I still have to give this one to OS X.

Another thing I would like to add here is that with pretty much every PC I have bought, there has been some part of the hardware that did not quite live up to expectations. On my previous laptop, a Dell Precision m4800, it was the keyboard (in 2 years I replaced it 6 times); on this one I am still working with support on fixing some flakiness with the TrackPoint. I have never had similar issues with any Apple computer (although I did have an iPad 4 where the screen just shattered when I placed it on a table for no reason).

Winner: OSX

Conclusion

If you travel a lot and need to work on battery a lot I think you might want to give the Macbook a go. It's pretty neat.

That said, the clear winner for me when it comes to productivity, usability, and just raw performance for software development is a Windows machine. The beauty of Windows is that since there are so many of them, you can usually find one that fits you exactly (there are obviously PCs that are very similar to the MacBook Pro; for instance, the bezel-less Dell XPS 15 looks pretty sweet if you are looking for a PC equivalent of a MacBook Pro).

Winner: Windows

Wednesday, April 27, 2016

How I studied for the AWS Certified Solutions Architect Professional exam

I recently took (and passed) the AWS Certified Solutions Architect Professional exam and figured I would share how I studied for this test. When I took the associate level of this exam, I only had 3 days to study and very little existing experience with AWS beforehand, and that is definitely not how I would recommend taking these exams. For the professional level exam, I had around 3 months from the time I started studying until I had to pass the exam or my associate level certification would have expired.

If you are studying for the associate exam I think the study guide below would probably still work (Although it might be a bit of overkill), just skip the professional level white papers and courses on Linux Academy and Cloud Academy.

Full disclosure: I have worked for Amazon Web Services for a couple of months now, but the opinions expressed here are my own.

Prerequisites

Here are the things you should already have done and know before you start thinking about this exam.

  • You will need broad general knowledge in IT. If you don't have it, you can probably still pass the associate level exam, which is more focused on AWS-specific technology. The professional level exam assumes a general understanding of things like how WAN routing works and how non-AWS enterprise software behaves (for instance, do you know that Oracle RAC requires multicast, which EC2 does not support?).
  • You need to have passed the associate level exam within 2 years.
  • I would highly recommend that you have been using AWS for a while. This will make it easier to wrap your head around some of the AWS-specific concepts that other services are built on.

Study Outline

In short here are the things I did to study this.

  1. Start by reading all the recommended white papers listed at the official certification guide site. I would recommend reading both the professional and associate level ones, because everything you knew when you took the associate level exam you will still need for the pro level one.
  2. Sign up for Linux Academy and start taking the classes for first the associate level course and then the professional level course. Don't forget to take the labs as well. Don't take the final quizzes yet (The ones per section are fine though).
  3. Sign up for Cloud Academy and take their classes for associate level and professional level courses. Same thing here, wait with the final quizzes.
  4. Once you have finished all the courses, read the recommended white papers again.
  5. Do all the final quizzes from both Cloud and Linux Academy and make sure you get a passing grade. If there are sections that you are weak in then go back and study deeper in those areas, both Linux Academy and Cloud Academy have a lot of content aside from the lectures they recommend for the CSA certification so you don't have to just listen to the same lectures over and over.
  6. Try the sample questions from Amazon, you should be able to answer these by now. If you feel like shelling out some money for trying the sample exam go ahead. I skipped this step myself.
  7. Sign up for the exam.
  8. Read all the recommended white papers again the day before the exam.
  9. Take the exam.

Additional things you might want to consider.

  • Amazon recommends taking the Advanced Architecting on AWS class. I took this class about 8 months before the exam, and even though it is a good class, I don't think it is that useful for passing the exam.
  • Amazon sometimes holds AWS CSA Professional Readiness Workshops, and if you have the ability to go to one of these I would highly recommend it. I am not sure if they are held outside of the AWS re:Invent conferences though. For the associate level exam I know these workshops are held quite often, and they are great too.
  • Qwiklabs is a great resource for practicing your AWS skills. That said, if you have Linux Academy and/or Cloud Academy accounts, labs are included in those subscriptions too. The Qwiklabs labs are better though, if you can afford them.

If you can I would also recommend to start a study group and get together once a week or so and do sample questions and discuss the answers from one of the sources listed above. I did this with some of my work colleagues and I found that very helpful.

Time Commitment

I would recommend planning for the studying to take at least 2 months. I did it in roughly 3 months, but I only studied actively for about 4 to 6 of those weeks. When I studied, I spent roughly two to four hours every evening. Unless you are already a whizz at AWS, I doubt you can cram this into a few days, which is very doable for the associate level exam. Roughly, I divided my time like this:

  • 10%: Initial studying of the white papers.
  • 50%: Watching the training videos on Linux Academy and Cloud Academy.
  • 15%: Taking labs.
  • 10%: Doing quizzes.
  • 10%: Additional revision based on deficiencies discovered in the quizzes.
  • 5%: Re-reading the white papers (the second and third time I skimmed through them a lot faster than the initial deep read).

Taking the exam

Don't schedule the exam until you feel ready. At least where I live, I could schedule the exam just one day out, so you don't need to plan ahead for this.

I am usually a very fast test taker (I took the associate level exam in less than half the allotted time). However, time management is going to be important when you take this exam. When I took the test, I finished all the questions with around 25 minutes to spare, and at that point I had roughly 30% of them marked to be revisited. After going through them all again, I had less than two minutes left. The description says that the test is 80 questions, but I only had 77 questions in mine; I'm guessing the number of questions varies slightly depending on how they are randomly selected.

Cloud Academy vs Linux Academy

Cloud Academy and Linux Academy have a lot of overlap, and I recommend subscribing to both of them for this. That said, here are the advantages of each as far as I experienced them.

  • Linux Academy has more questions in the final quiz and vastly longer study material for the professional exam than Cloud Academy. The entire course on Linux Academy is around 30 hours long, while the corresponding course on Cloud Academy is only around 3 hours, and this is not material that can be covered in 3 hours. Their associate level courses are much more on par.
  • Cloud Academy has a much better interface for doing quizzes and reviewing: after each question it tells you the answer, along with a short extract of information about it and links to the AWS documentation.
  • Cloud Academy allows you to set the playback speed of the training videos, which I like (I feel I can still assimilate the information at around 1.5x speed, and it saves time). Linux Academy also had occasional streaming issues for me, sometimes requiring me to restart videos.
  • If you are a student or have an edu address, Cloud Academy is a lot cheaper than Linux Academy at $9 per month. If you don't, on the other hand, Linux Academy is cheaper than Cloud Academy by a factor of 2.
  • Both services are very easy to cancel once you are done with your studying in case you don't feel you need them anymore.

When all is said and done though I could probably have passed this with only Linux Academy, but Cloud Academy would not have been sufficient for me (Especially since the training material for the professional level CSA is so short). That said, I still think that the Cloud Academy course provides a valuable alternative to Linux Academy and especially if you can sign up as a student it is so cheap that there is pretty much no reason not to.

Tuesday, July 14, 2015

How to get the most out of your BizSpark azure credits

BizSpark is arguably one of the best deals on the internet for startups. For me, the key benefit it brings is the 5 x $150 per month of free Azure credits. That said, they are a little bit tricky to claim.

The first thing you need to do is claim all your BizSpark accounts and then, from each of those accounts, claim your Azure credits. This blog post describes the process, so start by doing that.

After doing this you have 5 separate Azure accounts, each with $150 per month of usage. However, what we want is one Azure account where we can see services from all of these subscriptions at once, and that requires a couple more hoops to jump through. In the end you will have one account where you can see and create services from all 5 subscriptions without having to log in and out of the Azure management portal to switch between them.

  1. The first step is to pick the one account you want to use to administrate all the other accounts.
  2. This is a bit counterintuitive, but you need to start by adding every other account as a co-administrator to the account from the first step. Yes, I am saying this correctly: all the other accounts need to be added as administrators to the main admin account (don't worry, this is temporary).
  3. The following steps need to be done for each of the accounts except for the main account from step 1.
    1. Log into the management console using one of the four auxiliary accounts and go to settings.
    2. Make sure you are on the subscription tab.
    3. Select the subscription that belongs to the account you are currently logged into. It will be the one that has its account administrator set to the account you are currently logged into. If you have done this correctly, you should see two different subscriptions: one for the account you are logged in as and one for the account from step 1.
    4. Click the Edit Directory button at the bottom.
    5. In the image below make sure you select the directory of the main account from step 1. It shouldn't be hard because it will be the only account in the list and pre-selected. If you have already set up any co administrators to the account you will be warned that they will all be removed.
    6. Add the account from step 1 as a co-administrator to this account, as described in the article linked at the top of the post.
    7. The last step is optional but all the subscriptions will be called Bizspark and hard to keep apart so you might want to rename them.
      1. To do this, go to the Azure account portal. This page tends to be very slow, so be patient when following links.
      2. Click on the subscription name. Your screen might look different depending on how many subscriptions you have.
      3. Click on Edit Subscription Details.
      4. Enter the new name in the dialog presented. You can also optionally change the administrator to the account from step 1; this will remove the owning account as an administrator from the subscription altogether (although it is still responsible for billing).
  4. You can now remove all the other accounts as administrators of the main account (the ones you added in step 2) if you want.

If you follow all these steps, when you log into the account from step 1 you should be able to see all of your subscriptions at the same time in the Azure management console, as in the screenshot below.

Keep in mind this does not mean that you have $750 to spend however you want. Each subscription still has a separate limit of $150, and you have to puzzle your services together across subscriptions as you create them to keep any of the five limits from running out. But at least this way you have a much better overview, in one place, of what services you have provisioned.

Thursday, July 9, 2015

Algorithm for distributed load balancing of batch processing

Just for reference, this algorithm doesn't work in practice. The problem is that nodes under heavy load tend to be too slow to respond in time to hold on to their leases, causing partitions to jump between hosts. I have moved on to another algorithm that I might write up at some point if I get time. Just a fair warning to anybody who was thinking of implementing this.

I recently played around a little bit with the Azure EventHub managed service, which promises high throughput event processing at relatively low cost. At first it seems relatively easy to use in a distributed manner through the class EventProcessorHost, and that is what all the online examples provided by Microsoft use too.

My experience is that the EventProcessorHost is basically useless. Not only does it lack any provision I could find for supplying a retry policy to make its API calls fault tolerant, it is also designed to checkpoint its progress relatively infrequently, meaning you have to design your application to work correctly even if events are reprocessed (which is what will happen after a catastrophic failure). Worse than that, though: once you fire up more than one processing node it simply falls all over itself, constantly, causing almost no processing to happen.

So if you want to use the EventHub managed service in any serious way, you need to code directly against the EventHubClient interface, which means you have to figure out your own way of distributing its partitions over the available nodes.

This leads me to an interesting problem: how do you evenly balance the work over a number of nodes (in the nomenclature below, the work is split into one or more partitions), any of which can at any time suffer a catastrophic failure and stop processing, without a central orchestrator?

Furthermore, I want the behavior that if the load is already evenly distributed between the nodes, the partitions should be sticky, meaning that the partitions of work currently allocated to a node stay allocated to that node.

The algorithm I have come up with uses a Redis cache to handle the orchestration, needing only two hash keys and two subscriptions, but any key-value store that provides publish and subscribe functionality should do.

The algorithm has five time spans that are important.

  • Normal lease time. I'm using 60 seconds for this. It is the normal time a partition will be leased without generally being challenged.
  • Maximum lease time. Must be significantly longer than the normal lease time.
  • Maximum shutdown time. The maximum time a processor has to shut down after it has lost a lease on a partition.
  • Minimum lease grab time. Must be less than the normal lease time.
  • Current leases held delay. Should be relatively short; a second should be plenty (I generally operate in the 100 to 500 millisecond range). This is multiplied by the number of partitions the node is currently processing. It can't be too low, though, or you will run into scheduler-based jitter with partitions jumping between nodes.
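For concreteness, these time spans could be captured in a small configuration object. This is just a sketch; the names are my own and the values are illustrative, following the constraints above (60-second normal lease, 250 ms held delay):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseTiming:
    """Illustrative timing configuration; all values are in seconds."""
    normal_lease: float = 60.0        # how long a lease normally runs unchallenged
    maximum_lease: float = 180.0      # must be significantly longer than normal_lease
    maximum_shutdown: float = 30.0    # grace period for a node that lost a lease
    minimum_lease_grab: float = 30.0  # must be less than normal_lease
    held_delay: float = 0.25          # per-held-partition delay before grabbing

    def validate(self) -> None:
        # Encode the ordering constraints stated in the text.
        assert self.maximum_lease > self.normal_lease
        assert self.minimum_lease_grab < self.normal_lease


timing = LeaseTiming()
timing.validate()
```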

Each node should also listen to two Redis subscriptions (basically notifications to all subscribers). The message sent on each is the identifier of the partition being affected.

  • Grab lease subscription. Used to signal that the lease of a partition is being challenged.
  • Allocated lease subscription. Used to signal that the lease of a partition has ended when somebody is waiting to start processing it.

There are also two hash keys in use to keep track of things. Each uses the partition as the hash field, and the value is the name of the node currently owning it.

  • Lease allocation. Records which node is currently actually processing each partition.
  • Lease grab. Used to race and indicate which node won a challenge to take over processing of a partition.
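In a Redis-backed implementation, the two channels and two hash keys might be laid out like this. The key names are my own invention; the only thing the algorithm requires is that the message published on either channel is the partition identifier:

```python
# Pub/sub channels; the published message on each is the affected partition id.
GRAB_LEASE_CHANNEL = "leases:grab"        # a partition's lease is being challenged
ALLOCATED_LEASE_CHANNEL = "leases:ended"  # a lease ended while another node waits

# Hash keys; field = partition id, value = name of the owning node.
LEASE_ALLOCATION_HASH = "leases:allocation"  # which node is actually processing
LEASE_GRAB_HASH = "leases:winner"            # which node won the latest challenge
```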

This is the general algorithm.

  1. Once per normal lease time, each node will send out a grab lease subscription notification for each partition that:
    • It does not yet own and which does not currently have any value set in the lease grab hash key, or
    • Has gone more than the maximum lease time since a lease grab was last signaled for it (this is required for the case where a node dies somewhere after step 3 but before step 6 has completed). If this happens, also clear the lease allocation and lease grab hash fields for the partition before raising the notification, since it is an indication that a node has gone offline without cleaning up.
  2. Upon receipt of this notification, the timer for this publication is reset (so generally only one publication per partition will be sent during the normal lease time, though it can happen twice if two nodes send them out at the same time). Also, when this notification is received, each node will wait based on the following formula.
    • If the node is already processing the partition, it waits the number of partitions it currently holds times the current leases held delay, minus half of that delay (so basically (locally active partitions - 0.5) * current leases held delay).
    • If the node is not currently processing the partition being grabbed, it waits half a delay more than its partition count (in other words (locally active partitions + 0.5) * current leases held delay).
  3. Once the delay is done, try to set the lease grab hash field for the partition, conditional on it not already being set.
    • Generally the node with the lowest delay from step 2 will win, which also means that the active partitions should distribute evenly among the active nodes, since the more partitions an individual node holds, the longer it waits in step 2 and the less likely it is to win the race to own the partition lease.
    • If a node is currently processing a partition but did not win the race in step 3, it should immediately signal its processor for that partition to gracefully shut down, and once it has shut down it should remove the lease allocation hash field for the partition. Once this is done it should also publish the allocated lease subscription notification. After that is completed this node should skip the rest of the steps.
  4. Check, by reading the lease allocation hash value, whether a node other than the winner in step 3 is currently busy processing the partition. If so, wait either for the allocated lease subscription notification signaling that the other node has finished shutting down (step 3), or, if that does not arrive, at most the maximum shutdown time, and then start the partition anyway.
  5. Mark the lease allocation hash field with the node that is now processing this partition.
  6. Finally, after the minimum lease grab time, remove the winning indication in the lease grab hash field for the partition so that it can be challenged again from step 1.
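The delay formula in step 2 and the conditional set in step 3 are the heart of the algorithm. Below is a minimal, self-contained sketch of that race: `InMemoryHash` is a stand-in for a Redis hash with HSETNX semantics (set a field only if absent), the node names and counts are made up, and the whole thing is single-threaded for illustration only:

```python
class InMemoryHash:
    """Stand-in for a Redis hash supporting HSETNX semantics."""
    def __init__(self):
        self._fields = {}

    def hsetnx(self, field, value):
        # Returns True only for the first caller, like Redis HSETNX.
        if field in self._fields:
            return False
        self._fields[field] = value
        return True


def grab_delay(active_partitions, owns_partition, held_delay):
    """Step 2: busier nodes wait longer; the current owner gets a head start."""
    if owns_partition:
        return (active_partitions - 0.5) * held_delay
    return (active_partitions + 0.5) * held_delay


# Simulate three nodes racing for partition "p0"; node-a currently owns it.
nodes = {
    "node-a": (2, True),   # holds 2 partitions, currently the owner of p0
    "node-b": (2, False),  # holds 2 partitions, equally loaded non-owner
    "node-c": (4, False),  # holds 4 partitions, heavily loaded non-owner
}
delays = {name: grab_delay(n, owns, 0.25) for name, (n, owns) in nodes.items()}

lease_grab = InMemoryHash()
winner = None
for name in sorted(delays, key=delays.get):  # shortest delay reaches HSETNX first
    if lease_grab.hsetnx("p0", name):
        winner = name

print(winner)  # node-a: on even load the owner wins, so the partition is sticky
```

Because the owner's wait is a full held delay shorter than an equally loaded non-owner's, ties go to the incumbent, which is exactly the stickiness property described above.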

When I run this algorithm in my tests, it works exactly as I want. When a new node comes online, the workload is distributed evenly among the new and old nodes within the normal lease time. Another important test: if there is only one partition, it does not skip among the nodes but lands squarely on one node and stays there. And finally, if I kill a node without giving it any chance to clean up, its load is distributed out to the remaining nodes after roughly the maximum lease time.

This algorithm does not in any way handle the case where the load on the different partitions is not uniform. In that case you could relatively easily tweak the formula in step 2 above, replacing the locally active partition count with whatever measurement of load or performed work you wish. It will be tricky to keep the algorithm sticky with these changes, though.
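As a sketch of what that tweak could look like, the step 2 wait can be reweighted by measured load instead of raw partition count. Normalizing through an average per-partition load is my own choice here, not something prescribed by the algorithm:

```python
def load_based_delay(node_load, avg_partition_load, owns_partition, held_delay):
    """Variant of the step 2 wait: weight by measured load instead of the raw
    partition count. node_load is the node's total measured load, and dividing
    by avg_partition_load converts it into 'equivalent partitions'."""
    equivalent_partitions = node_load / avg_partition_load
    offset = -0.5 if owns_partition else 0.5  # the owner keeps its head start
    return (equivalent_partitions + offset) * held_delay


# A node whose total load equals two average partitions waits less for a
# partition it already owns than an identically loaded non-owner would.
assert load_based_delay(4.0, 2.0, True, 0.25) < load_based_delay(4.0, 2.0, False, 0.25)
```

The owner's half-delay head start is what preserves stickiness, but it only holds while load estimates on different nodes are comparable, which is one reason stickiness gets tricky with this change.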