tag:blogger.com,1999:blog-64409829722184682422024-03-17T20:04:06.944-07:00Henrik Johnson's BlogRandom thoughts from random placesHenrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.comBlogger185125tag:blogger.com,1999:blog-6440982972218468242.post-58752168255801067502023-07-17T18:44:00.025-07:002023-07-17T18:44:00.160-07:00Why I created Underscore Backup<p>
I started running a server for storing all my projects, as well as various multimedia artifacts, in 1999 on a small desktop computer with a 20GB HDD. As the size and personal importance of this server grew, within a few years I started running RAID5 and then RAID6 to make sure data was not lost to single drive failures. Despite this, in 2006 the then-current incarnation of this server encountered a catastrophic 3-drive failure, which I only managed to recover from after a tremendous amount of work and a fair amount of luck, including, among other things, manually patching the Linux RAID kernel code to remove certain fail-safes while I pulled data off the partially assembled RAID.
</p>
<div class="separator" style="clear: both;">
<span style="display: block; padding: 1em 1em; text-align: center; clear: right; float: right;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNPLTd_axuWcQYIbMGesbmGDswvremvh5Iqp-2XHcpmNwFLF2YLWb0adg_glKGFuiIr-zIMh1_WV61Ny31MmlxpleUyaY7tWcCbor_4FW0oh2qmJq8m2I_-wZMx3HbuAiAMGToz34s4DpCbsESW6zc5_IzKjbxCslA5-16ZniRiT5E-r1myFOlpVDMO5A/s4080/PXL_20230717_221804136~2.jpg"><img alt="" border="0" width="320" data-original-height="3072" data-original-width="4080" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNPLTd_axuWcQYIbMGesbmGDswvremvh5Iqp-2XHcpmNwFLF2YLWb0adg_glKGFuiIr-zIMh1_WV61Ny31MmlxpleUyaY7tWcCbor_4FW0oh2qmJq8m2I_-wZMx3HbuAiAMGToz34s4DpCbsESW6zc5_IzKjbxCslA5-16ZniRiT5E-r1myFOlpVDMO5A/s320/PXL_20230717_221804136~2.jpg"/></a><br/>"The Server" in its current iteration.
</span>
</div>
<p>
This episode led me to look for ways to safeguard against this ever happening again. Looking through the options available to me, I found Crashplan, which addressed all my needs at a reasonable price. My initial backup to Crashplan took several years to complete over my 20mbit/s broadband uplink, as my server had by this point grown to several TB.
</p>
<p>
A few years after I started using Crashplan, they stopped offering consumer backups and the only way to keep using them was to migrate to their business plan, which I did. However, Crashplan only allowed you to migrate a few TB per computer, which meant that I had to re-upload most of my backup. Fortunately, by this point I had gotten a fiber internet connection with a reasonable uplink that allowed me to re-upload the data in less than a year. As my backup of this server grew, Crashplan also started showing its flaws, requiring several GB of memory to back up my server, but it did work and gave me reasonable peace of mind about the contents of my server.
</p>
<p>
This went on for a few years, after which I was contacted by Crashplan (now called Code42) and told that unless I reduced the size of my backup to under 10 TB, they would terminate my account, since they considered keeping too large a backup a violation of their terms of service.
</p>
<pre>
From: Support Ops (Code42 Small Business Support)
Date: Feb 6 2020, 10:38 AM CST
Hello Administrator,
Thank you for being a CrashPlan® for Small Business subscriber. We appreciate the
trust that you have placed in CrashPlan - that relationship is important to us.
Unfortunately, we write to you today to notify you that your account has
accumulated excessive storage, which will result in degraded performance. <b>You
have one of the largest archives in the history of CrashPlan. It is so large, we
cannot guarantee the performance of our service.</b> Due to the size of your
archive, full restores of your backup archive, and even selectively restoring
specific files, may not be possible.
As a result, we are notifying you, per our Master Service Agreement and
Documentation, to reduce your storage utilization for each device to less than
10TB by June 1, 2020. Note that we have extended your subscription to June 1,
2020 to give you ample time to make changes. If you do not do so by June 1,
2020, your subscription will not be renewed, and your account will be closed at
the end of your current subscription term.
…
Thank you,
Eric Wansong, Chief Customer Officer, Code42
</pre>
<p>
The server I was using was Linux based, and as far as I could tell Crashplan was the only vendor on the market providing a cloud-based backup solution for that OS. This was when I decided to start working on Underscore Backup as a means to continue making backups of my server, as I couldn’t find any existing alternative that fulfilled my needs. The first version was command-line only and very primitive, though it did support point-in-time recovery and backup sets, and, of course, it handled my very large backup efficiently. Another feature built in from the beginning was a strong focus on encrypting everything as much as possible, so that any medium could be used for backups even if it was not properly secured from prying eyes. Creating the initial backup of my server using Underscore Backup sustained more or less 600mbit/s (compared with the, at the time impressive, 60mbit/s that I experienced using Crashplan on the same connection).
</p>
<p>
At the same time, I also started using the iDrive service for backing up my laptops and various other smaller Windows and macOS based machines. I did this because I didn’t think the CLI (Command Line Interface) only implementation of Underscore Backup was convenient enough to be used on these machines. This situation continued for a few years, with the CLI-only version of Underscore Backup backing up my server data to cloud block storage and my other machines backed up by the iDrive service. This all came crashing down when my main development laptop of several years had a catastrophic SSD failure and I had to restore my data from iDrive. I found out two things about how the iDrive service works.
</p>
<p>
The first is that even though iDrive keeps track of versions of your files, it does not keep track of directory contents or deletions of files. This is critical to any developer: the large development repositories I restored contained files I had renamed or deleted while iDrive was running in the background. For those of you who are not developers, we rename files a lot. Every one of the old names of all my renamed files was restored when I did a full restore of the contents of my laptop’s hard drive. That meant that any repository of code I had worked on since I started using iDrive was no longer in a buildable state without a considerable amount of work.
</p>
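<p>The underlying issue is that true point-in-time recovery requires snapshotting directory contents, not just file versions. A minimal sketch of the idea in Python (this is illustrative only, not the actual code of iDrive or Underscore Backup): a tool that records the set of paths at each backup run can diff two snapshots to discover deletions, while a tool that only versions individual files cannot.</p>

```python
import os

def snapshot(root):
    """Record the set of relative file paths currently present under root."""
    files = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            files.add(os.path.relpath(os.path.join(dirpath, name), root))
    return files

def deleted_since(old, new):
    """Files present in the old snapshot but missing from the new one.

    Restoring from a backup that stores these per run reproduces the exact
    directory contents at that point in time, including deletions and renames."""
    return old - new
```

<p>A restore driven by such snapshots would remove (or simply not restore) anything in <code>deleted_since</code>, which is exactly what was missing from the restore described above.</p>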
<p>
The second surprise was that even though my iDrive backup was relatively small, only around 50GB, it took almost 2 weeks to restore. Granted, it contained a large number of files (around 3 million, mostly small ones), but I was shocked at how slow it was. I also opened several support cases with iDrive about this, but there was nothing they could do to help me. For comparison, on the same network, with a backup of roughly the same size in both files and total storage, Underscore Backup would complete a similar restore in about 5 minutes (and do it properly, keeping track of deleted files).
</p>
<p>
At this point, I evaluated other available solutions but could not find any suitable for my needs. Carbonite does not allow you to specify which files should be backed up; in the interest of simplicity it instead tries to be smart about it, and when I tried it on my development files it decided to back up almost none of them even though I had specifically included the directory. Backblaze is a very solid solution but, like iDrive, does not keep track of deleted files for true point-in-time recovery. In the end, I decided to put in the effort needed to create an easy-to-use user interface for Underscore Backup so that it would be suitable for machines other than servers. The end result of these efforts was the first stable release of Underscore Backup in the summer of 2022, at which point it graduated to being the only backup solution I use on all my computers.
</p>
<p>
The problem at this point was that even though I had a backup solution that fulfilled all my needs, it was still very tricky for most users to set up, since you generally had to supply your own cloud storage such as Amazon S3. It was also quite tricky to access data you had backed up from other sources, since every source had to be set up individually on each client you wanted to restore it on. The sharing functionality, though present, was so complicated that I am fairly certain nobody but me ever managed to set it up. To solve all of these problems I decided to leave behind the service-less nature the software had followed up until that point and create a service to both remove the need to provide separate cloud storage and help manage multiple sources and set up shares. This was a relatively large undertaking, but it eventually led to the launch of Underscore Backup 2.0 in the first half of 2023.
</p>
<p>
As of this writing, the latest release is the upcoming 2.2 release, which makes it very easy to set up backups of multiple computers of any size while staying true to the original guiding principles of security, durability, efficiency, and flexibility.
</p>
<p>This post was cross posted from the <a href="https://underscorebackup.com/blog/">Underscore Backup blog</a>.Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-26857902129771994612023-04-26T21:54:00.005-07:002023-04-26T21:54:59.999-07:00Announcing Underscore Backup 2.0 and service general availability<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://underscorebackup.com/images/frontbackground.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="800" height="480" src="https://underscorebackup.com/images/frontbackground.jpg" width="640" /></a></div>First stable version of <a href="https://underscorebackup.com" rel="nofollow" target="_blank">Underscore Backup</a> with support for the companion service is now available. The service itself is also generally available.<p></p>
<p>Even with the new service, the main focus of the application remains privacy, resiliency, and efficiency. The new service does, however, significantly simplify setting up cloud backups and sharing compared to supplying your own cloud storage.</p>
<p>The main new feature in version 2.0 is the introduction of a companion service that helps with many aspects of running Underscore Backup:</p>
<ul><li>Keeps all your sources organized in one place so you can easily restore from any of your backups to any other backup.</li><li>Helps facilitate sharing of backup data with other users.</li><li>Optionally allows private key password recovery.</li><li>Provides easy access to the application UI even when running in a context where a desktop is unavailable, such as root on Linux.</li><li>Can be used as a backup destination. Storing backup data is the only feature that requires a paid subscription, which gives you 512GB of backup storage for $5 per month.</li><li>Supports multiple data storage regions, with North America (Oregon), EU (Frankfurt), and Southeast Asia (Singapore) available to satisfy latency and data governance requirements.</li></ul>
<p>On top of the companion service changes, the following features and improvements have also been implemented.</p>
<ul><li>Added support for continuous backups by monitoring the filesystem for changes.</li><li>Introduced a password strength meter that requires a score of at least “ok” when setting up.</li><li>Switched from PBKDF2 to Argon2 as the private key hashing function.</li></ul>
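<p>Continuous backup of the kind listed above hinges on noticing filesystem changes as they happen. A toy sketch of the idea in Python (real implementations typically use OS facilities such as inotify or FSEvents rather than rescanning; this is not Underscore Backup's actual code): take a baseline of modification times, rescan, and back up only what changed.</p>

```python
import os

def scan_mtimes(root):
    """Map each file path under root to its last-modification time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                pass  # file disappeared between listing and stat
    return mtimes

def changed_paths(before, after):
    """Paths that are new, or whose modification time differs from the baseline."""
    return [path for path, mtime in after.items() if before.get(path) != mtime]
```

<p>A continuous-backup loop would run <code>scan_mtimes</code> (or subscribe to change notifications), feed <code>changed_paths</code> to the uploader, and make the new scan the next baseline.</p>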
<p>Get started by <a href="https://underscorebackup.com/downloads" rel="nofollow" target="_blank">downloading the client</a> now.</p>
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-87810850412754246542023-03-07T19:44:00.004-08:002023-07-17T15:23:56.404-07:00Building an online service on a shoestring budget<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwyEormFo4zwCihUfWyo7pZIVirPpiAmkfF500DIhAomw6Llp60tGe7scP1x92ICoSRqrsEZageyT6JipPOyCCUq-i0vsItGZ0IwYZtp_i1_EL0twurULNwEUWXFvUvU0wqMw8E8DqfU7zTic4sbp0Vt3gXA_cUBCTUiQUFvFAO4vQI2_jPh794e_c/s4471/josh-appel-NeTPASr-bmQ-unsplash.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="3357" data-original-width="4471" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwyEormFo4zwCihUfWyo7pZIVirPpiAmkfF500DIhAomw6Llp60tGe7scP1x92ICoSRqrsEZageyT6JipPOyCCUq-i0vsItGZ0IwYZtp_i1_EL0twurULNwEUWXFvUvU0wqMw8E8DqfU7zTic4sbp0Vt3gXA_cUBCTUiQUFvFAO4vQI2_jPh794e_c/s320/josh-appel-NeTPASr-bmQ-unsplash.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Photo by <a href="https://unsplash.com/@joshappel?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Josh Appel</a> on <a href="https://unsplash.com/images/things/money?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></td></tr></tbody></table>Although I have been working professionally as a software engineer since I was 18 years old I have always had hobby projects I have been working with on the side and I generally take a somewhat perverse pleasure in figuring out how to build and launch these things on as small of a budget as possible. 
This post is an attempt to go through some of the things I have found that have helped me be productive and successfully build and launch several hobby projects.<br /><p></p><p>I am particularly going to assume that this is for hobby projects and that the skill and time of the participants are free. If you are paying any salary, that will dwarf anything you might save by aggressively using free tiers of online services. I am also going to assume your team is small (fewer than five people).</p><h3 style="text-align: left;">What not to skimp on<br /></h3><p>First, let us go over the things you should not skimp on. The most important thing here is to not use any equipment or software from your day job. The reason for this is that if you do, your employer can usually claim ownership of any IP produced with their equipment. Also check your employment contract to make sure your employer doesn't have a clause claiming ownership of anything you do. That said, if you live in California, even if your employment contract does claim this it is not enforceable as long as you don't use company equipment, time, or IP, and you are not directly competing with your employer (<a href="https://law.justia.com/codes/california/2011/lab/division-3/2870-2872/2870" target="_blank">See labor code 2870 for details</a>).</p><p></p><p>Another thing I would advise you to do is to enroll in school if you are not already. Being enrolled in a community college only costs a few hundred dollars a year and will provide you with free licenses to a huge number of tools for software development. Telerik, IntelliJ, Autodesk, and many more give students a free non-commercial license to almost their entire catalog of tools and libraries. Granted, once you get to the launch stage you will need to buy real licenses for your tools, but it will still save you tons of money in the development phase. 
You might even learn something doing it.</p><h3 style="text-align: left;">Basic development tools <br /></h3><p>I believe that if code isn't checked into a source repository with change tracking it basically doesn't exist at all. So, the first thing to do when starting a project is to pick a source code repository. <a href="https://github.com/" target="_blank">GitHub</a> is the giant in the field and they are fantastic. Not only do they give you free private repositories they also give you 2000 minutes a month of build executions (GitHub Actions). If you are building open-source applications you even get unlimited build executions for free.</p><p>Next you probably want to choose a cloud provider. I would pick one of either AWS or Azure. If you can go Serverless then I would go with AWS since they have a perpetual free tier for everything you need to launch a Serverless service. If not, then <a href="https://www.microsoft.com/en-us/startups" target="_blank">Azure Bizspark</a> is a great program if you qualify. <a href="https://pages.awscloud.com/adoptf90d_GLOBAL_POC-credits.html" target="_blank">AWS also has a program</a> for $300 to spend getting your prototype ready. Another tip for getting started on AWS is to get a new account for any new project. This is because they have an additional massive free tier that only lasts for 1 year after opening the account. It is also generally best practice to only run 1 microservice per account. Once the freebies are over you can tie your accounts together using AWS Organizations and SSO to help you keep track of them all (Doing this will usually invalidate the free tiers so wait a year after account creation to do this).</p><p>You also likely need a web UI testing tool. I use <a href="https://www.cypress.io/" target="_blank">Cypress</a> which has a free tier and is overall very good. 
They only allow 500 test suites per month so you can't run canaries in the free tier, but it should be sufficient for any deployment-based testing. They also provide a dashboard where you can see which tests have succeeded and failed, with videos of the test execution so you can easily troubleshoot failures, something that is very useful when you integrate it into your CI/CD pipeline.<br /></p><h3 style="text-align: left;">How to build your software</h3><div style="text-align: left;"><p style="text-align: left;">The key thing you want to avoid if you are launching something on the cheap is fixed infrastructure. If possible, use serverless functions instead of hosts or containers to run your code. With some thought, almost everything you build can be run in a true pay-per-use manner. For instance, with AWS you should aim to use API Gateway, Lambda, SQS, and DynamoDB. As your service scales, you might consider moving off some of these for cost reasons, but these primitives are also able to scale to thousands of transactions per second without any change to infrastructure if done right, and none of them have a fixed cost. You generally don't want to use services such as Kinesis, ElastiCache, OpenSearch, relational databases, hosts, or containers, since these all come with minimum fixed costs even if your service has no usage.</p><h3 style="text-align: left;">Useful services with good free tier</h3><p style="text-align: left;">Here are a couple of other services worth noting with useful features and good free tiers.</p><ul style="text-align: left;"><li><a href="https://analytics.google.com/analytics/web/" target="_blank">Google Analytics</a> is <span style="font-family: "Times New Roman",serif; font-size: 12pt; line-height: 107%; mso-ansi-language: EN-US; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US;">ubiquitous
</span>for site analytics. It has run into privacy issues in the EU though, with several countries recently declaring its use illegal. Another option with more of a privacy focus, which I use, is <a href="https://clicky.com/">Clicky</a>.</li><li>Also useful from Google is <a href="https://firebase.google.com/" target="_blank">Firebase</a>, which provides a lot of features such as a basic user database, usage analytics, and monitoring, among others. It is a great choice if your primary use case is a mobile app. It is pretty inflexible for building complex applications or services though, and you probably want to go with a normal cloud provider for that.<br /></li><li><a href="http://cloudflare.com" target="_blank">Cloudflare</a> is a Web Application Firewall and has a very useful free tier. They also provide a privacy-focused and less annoying CAPTCHA service called <a href="https://www.cloudflare.com/products/turnstile/" target="_blank">Turnstile</a>.</li><li><a href="https://blogger.com/" target="_blank">Blogger</a> is a free blogging platform. It will generally not let you build your entire website like WordPress will, but if all you need is blogging it does that well and allows you to use custom domains for free.</li><li><a href="https://crisp.chat/en/" target="_blank">Crisp</a> is a great platform for providing support for your site, and they have a nice free tier for getting started.</li><li><a href="https://auth0.com/" target="_blank">Auth0</a> provides a platform for handling authentication of your users and has a decent free tier to get you started.</li><li>Most of the payment processors such as <a href="https://squareup.com/us/en" target="_blank">Square</a>, <a href="https://stripe.com/" target="_blank">Stripe</a>, and <a href="https://www.braintreepayments.com/" target="_blank">Braintree</a> only charge a percentage with no setup costs. 
Their fees are very similar; I prefer Stripe myself only because they have fantastic developer documentation.<br /></li></ul></div><h3 style="text-align: left;">Launching and running a service</h3><p>As you first start out, I tend not to think too much about schedules and
deliverables. The reason for this is that I do this for fun and the best
way to kill the fun is to start making yourself a slave to delivery
commitments and launch dates. That said as you get closer to launch I
really do think you need a way to keep track of remaining tasks and open
bugs etc. In my opinion, <a href="https://www.atlassian.com/software/jira" target="_blank">Jira from Atlassian</a>
is by far the best and most comprehensive tool for this and as long as
you have a small team everything you need is available for free.</p><p>You will need monitoring of your service before you go live. Both AWS and Azure have built-in monitoring tools and they work well. Also worth mentioning again in this space is <a href="https://firebase.google.com/" target="_blank">Firebase</a>, which does have some monitoring and analytics capabilities. Another service in this area with a good free tier is <a href="https://newrelic.com" target="_blank">New Relic</a>. One thing that neither AWS nor Azure has is paging for when things actually go wrong. The tool I found here with a very functional free tier is <a href="https://www.pagerduty.com/" target="_blank">Pager Duty</a>; that said, you are likely to want to upgrade from the free tier fairly soon as your service takes off, to have more control over your escalations.<br /></p><p>Your service will likely need a single place to aggregate everything that is going on, such as task completions, deployments, and any issues, and here <a href="https://slack.com/" rel="" target="_blank">Slack</a> is hard to beat and has a great free tier.</p><h3 style="text-align: left;">Be frugal, not cheap</h3><p>As a parting word, I would like to point out that although it is worth figuring out how to build and launch your service cheaply, don't let that stand in the way of building your service right. 
Never pick the cheap option over the correct option; you will always regret it in the end.</p><p>For me, one of the main reasons for building hobby projects frugally is that it allows me to have fun doing them longer, because I don't have the pressure of needing to be done and launched fast while bleeding money during the development phase.</p><p>Being frugal during the development phase might also allow you to retain a larger portion of your equity if you actually launch your service, because it will reduce the amount of help you will need to get started before you get a customer base. As an example, one of my previous projects, <a href="https://yoursharedsecret.com/" target="_blank">Your Shared Secret</a>, literally has $0 per month of fixed cost. My more recent project <a href="https://underscorebackup.com/" target="_blank">Underscore Backup</a> is not quite that cheap but has a fixed cost of less than $50 per month. Most of that cost is for CloudWatch alarms, KMS keys, and Dashboards.<br /></p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-61452087440161854792023-02-24T09:18:00.006-08:002023-02-24T20:38:45.203-08:00Started another blog<p>Created another blog at <a href="https://www.mauritz.dev">https://www.mauritz.dev</a> for shorter snippets of what I am working on right now. Really, it's just something to put on this domain, which I've had for a while now without doing anything with it.<br /></p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-87267748597375918492023-02-16T21:18:00.000-08:002023-02-16T21:18:06.491-08:00Launching Underscore Backup service and first beta of the version 2.0 of Underscore Backup application to use it<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEwNNA9xLtgGbTnkOeNqU09wRpt7pwi44hBtDcVEBZT0E405qpEzok5RdiKhuTNUn_jW5QpKjLdniEf07Hf9cl9ycM06WU8eSMbVqdcziOVvgOGLMLzRBR6Tj3EWjGrH9Isj4dMtMY0gkainvxfG_lMDjedwXHQHbU-je4w2MxaAZjtd1oPOMYWyu8/s1047/Underscore%20Backup%20Logo.png" style="clear: right; display: block; float: right; padding: 1em 8px; text-align: center;"><img alt="" margin="8px" data-original-height="81" data-original-width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEwNNA9xLtgGbTnkOeNqU09wRpt7pwi44hBtDcVEBZT0E405qpEzok5RdiKhuTNUn_jW5QpKjLdniEf07Hf9cl9ycM06WU8eSMbVqdcziOVvgOGLMLzRBR6Tj3EWjGrH9Isj4dMtMY0gkainvxfG_lMDjedwXHQHbU-je4w2MxaAZjtd1oPOMYWyu8/s320/Underscore%20Backup%20Logo.png" width="320" /></a>
</div>
<p>
Finally launched the first public release of the <a href="https://underscorebackup.com/">Underscore Backup</a> service, a backup service that is a companion to the Underscore Backup application I have been working on for a few years now. I am really excited about this, since adding a service component to the application solves a couple of user pain points with my previous releases, such as:
</p>
<ul>
<li>It's now easy to coordinate and keep track of multiple sources, so you can easily restore data from another computer you are backing up.</li>
<li>It comes with the ability to use the service for storage, so you don't have to deal with configuring S3 or something similar. The storage supports three regions, in the US, the EU, and the Asia Pacific region, and is priced lower than S3.</li>
<li>It makes it easy to set up sharing between users.</li>
<li>The service can provide optional secret key recovery.</li>
<li>You can easily keep track of where the administration interface of the application is available, even if the application is running in a context that does not have access to your desktop.</li>
</ul>
<p>
There are also a ton of other additional features included, such as improved password hashing algorithms, log rotation, an integrated password strength meter, and built-in new version notification.
</p>
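<p>As an aside on the password strength meter mentioned above: a strength gate can be as simple as scoring length and character-class variety and refusing setup below a threshold. Here is a deliberately crude, hypothetical sketch in Python (Underscore Backup's actual meter is more sophisticated; the scoring rules below are my own illustration):</p>

```python
import string

def strength_score(password):
    """Crude 0-4 score from character-class variety plus a length bonus."""
    classes = sum(any(c in cls for c in password) for cls in (
        string.ascii_lowercase,
        string.ascii_uppercase,
        string.digits,
        string.punctuation,
    ))
    length_bonus = 1 if len(password) >= 12 else 0
    return min(4, max(0, classes - 1) + length_bonus)

def acceptable(password, minimum=2):
    """Require at least an 'ok' score (here: 2 of 4) before allowing setup."""
    return strength_score(password) >= minimum
```

<p>Real meters (zxcvbn-style) also penalize dictionary words and common patterns, which a class-counting heuristic like this cannot catch.</p>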
<p>
You can head to the site to <a href="https://underscorebackup.com/downloads#beta">download the latest version</a> and <a href="https://underscorebackup.com/auth/signup">sign up for the service</a>. The service is entirely free; however, you do need a subscription to use storage, as <a href="https://underscorebackup.com/pricing">detailed on the pricing page</a>.
</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-85086770733797636972021-04-25T23:11:00.003-07:002021-04-26T11:58:23.537-07:00My quest for fiber provided by AT&T<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcnLjuNTyiSdsXA9vGwN3m_zQVVU_SSUoAbCLvEV3yuC9oH7neN8sJiT2-7yy0T9PG9fw3aAEBSmBsB1l1jGAxE_X6P8eh8vLa5dyBvxim39Qrw5BxTtSQ_7AlzI8FqCdc4Gg9bK-6bJU/s800/canstockphoto78085378.jpg" style="clear: right; display: block; float: right; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="600" data-original-width="800" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcnLjuNTyiSdsXA9vGwN3m_zQVVU_SSUoAbCLvEV3yuC9oH7neN8sJiT2-7yy0T9PG9fw3aAEBSmBsB1l1jGAxE_X6P8eh8vLa5dyBvxim39Qrw5BxTtSQ_7AlzI8FqCdc4Gg9bK-6bJU/s320/canstockphoto78085378.jpg" width="320" /><br />
© Can Stock Photo / kasezo</a>
</div>
<p>
When I moved to a new house about two years ago, I was disappointed to learn that there were no fiber-based internet options in the area, so I would have to step down to cable-based internet. Fortunately, I was pleasantly surprised in September of 2020 to find that AT&amp;T Fiber had added support for my area.
</p>
<h3>First try in September 2020</h3>
<p>I ordered it as soon as I discovered it, even though I was a bit hesitant about having people in my house as COVID cases were on the rise (I have a person in my household who is in a risk group). The person on the phone with AT&amp;T assured me that all AT&amp;T personnel involved with the install would be wearing masks, so I proceeded regardless.
</p>
<p>
The day of the appointment I was excited and had cleared my schedule. The first person to show up was not the technician but a salesperson who wanted to make sure I hadn't had any trouble creating my AT&amp;T account (which I had already set up days before, as per the instructions in the AT&amp;T communication). This person also assured me that they would get in trouble with AT&amp;T if they did not wear a mask, which felt reassuring to me.
</p>
<p>
About an hour later, still within the assigned service window, the installation technician showed up. We tried to figure out where the AT&amp;T connection at my house was and eventually found it. Unfortunately, there was no fiber pulled to my house, and it needed to be pulled around 100 feet from a neighbor's access point. He tried snaking the existing conduit but failed. He needed to call in a specialist who had better snaking equipment and who, if that failed, might have to do some digging to fix the conduit.
</p>
<p>
The second technician made an appointment and showed up around a week later with a helper. They spent a good hour trying to get through the conduit. They also failed, and I was told that they would now have to bring in a 3rd-party company that would try again and might potentially have to dig a new conduit.
</p>
<p>
The third technician just showed up, with no appointment. He also had no mask and did not have one to put on when I asked about it. I told him to come back when he had a mask, and at an appointed time. After this interaction I called AT&amp;T to complain and was told that the mask mandate only really applied to AT&amp;T employees, and since this person was a 3rd-party contractor there was nothing they could do. At that point I told the representative that I had only wanted this install after getting the express promise that everybody involved would wear a mask, and since that was not true, I now wanted to cancel the order. The AT&amp;T rep told me that they did not have the authority to cancel my order and instead had to transfer me to a loyalty specialist. I told them that they could either cancel the order or not, but that I would not open the door or let them on my property, and hung up.
</p>
<p>
About a week later another man showed up from AT&T, also not wearing a mask. Through the closed door, I told him that no, I had not ordered any AT&T Fiber, and that he should go away. A week after that an additional person showed up from AT&T, this time with a mask. I explained to him what was going on and he apologized and said that he knew how to make this issue go away for me. And in fact, it turned out that he did, because that was the last I heard from AT&T for the time being.
</p>
<p>
In total this first try involved 7 visits from AT&T with a total of 8 people visiting my house.
</p>
<h3>Second try in March 2021</h3>
<p>
Skip forward to March 2021: now vaccinated, I decided it was time for another try. I also happened upon an ad for AT&T Fiber with a good introductory offer, so I placed an order online on March 8th. I got an initial appointment for the morning of March 18th. About an hour after the appointed time with no visit, I called AT&T customer support. I was told not to worry; the technician was just running late and still on the way. After another 3 hours I called again and was then told that the technician had gone to the wrong house, and since nobody at that house had ordered internet, he had left. At no point during this had AT&T proactively reached out to me to let me know what was going on.
</p>
<p>
Slightly miffed, I rescheduled the appointment for a week later. That day came around and no technician showed up either. Around 2 hours after the appointment window ended, I called support again. The first person told me not to worry, the technician was just running late. I told them about my experience last time and was put on hold for a second operator. This operator said the same thing. At this point I told the operator that this was no problem; however, if the technician did in fact not show up, then they did not have to try again. At this point the agent transferred me again to a "Loyalty Specialist".
</p>
<p>
This third person that I spoke to did in fact do some digging and figured out that when the technician had gone to the wrong house and left, he had in fact cancelled the entire installation. And my rescheduling it with the support agent did not actually reopen that ticket, so there was no technician coming. He then proceeded to say that I shouldn't worry, he knew how to restart the process properly. At that point I said "Thank you, but no thank you. You gave it the old college try but couldn't even get a technician to my house in 2 tries, so I am done".
</p>
<h3>Third try using Sonic Internet</h3>
<p>
I had discovered that Sonic Internet also resold AT&T Fiber at my location and figured that at least then I would deal with a support department that was prompt and knowledgeable, even though I would still have to deal with AT&T for the actual installation. The same day that I cancelled the second try with AT&T, I ordered Fiber from Sonic instead.
</p>
<p>
The first appointment was scheduled for March 31st. Same as the original visit in September a year earlier, the conduit was broken and needed to be fixed. This technician managed to get the specialist team to do the second visit the same day though. They showed up as a 2-person crew and told me in refreshing detail what needed to happen next. First an underground survey needed to be performed, after which a digging crew would be dispatched to fix the conduit. I should expect the survey to happen within a few days and the digging crew to show up in a week or two.
</p>
<p>
On April 6th I had a second appointment scheduled, at which time an AT&T technician showed up to install my internet on the assumption that the fiber had by this time already been installed. Of course, the underground survey had not even happened yet, so he had to leave without anything done.
</p>
<p>
On April 8th, 2 big trucks with a team of 5 people showed up. They started by taking an hour lunch and after that got down to the work of digging my conduit. When I pointed out that the underground survey had not yet been done, they got a bit flummoxed and told me that unfortunately they could not do any digging until it had been completed. But the foreman told me that he had put in a rush order to make sure the survey would get done as soon as possible.
</p>
<p>
On April 12th I got an email from Sonic telling me that they had been instructed by AT&T to check that my internet was working correctly. At this point of course, there had still not been any actual work done by AT&T, so I sent an email to Sonic support letting them know this.
</p>
<p>
On April 15th I got a notification from Sonic telling me that the installation of my internet had been scheduled for April 19th. Since this sounded strange, I contacted Sonic support to tell them that I was not currently waiting for an AT&T technician, but for an underground survey. What I was told is that the last people who were here had marked the installation as complete (which is why I got a notification earlier in the week making sure my internet was working correctly), and because of that they now had to start over from the beginning. That means a person must first come out and assess that a dig needs to happen (so starting all the way from scratch again). The Sonic rep told me that they had gotten into a discussion with AT&T that got so heated that the AT&T rep hung up.
</p>
<p>
On April 16th I got a visit from a cheerful AT&T customer service rep asking me how I was enjoying my new AT&T Fiber internet. She got an earful of what I thought of AT&T at that moment.
</p>
<p>
April 19th comes around and I get a visit from another AT&T customer service rep to help me set up my AT&T account. I explain the situation to him, and he promises to get on the phone with his manager to see if there is anything he can do to help. While he does that, the AT&T installation technician shows up. The technician asks if I am speaking for Yvonne? I tell him that I have no idea what he is talking about, and he tells me that his work order says that he is there to install internet for an <b>Yvonne from Florida through the third-party provider Earthlink</b> (not Sonic). There is literally nothing in the work order that is correct except for my address. I do manage to get the technician on the phone with Sonic support and both escalate the issue to their managers. In the end there is nothing AT&T can do to have the technician do the work, even though he is here. He has to come back at a later date when the order has been corrected. At this point the AT&T service rep steps back in and says that he will take me under his wing and sort this out for me. I pointedly ask him if that means that I would become an AT&T customer instead of a Sonic one. He says yes and I politely refuse.
</p>
<p>
After AT&T leaves, I spend some more time with Sonic support. They promise to get back to me when this is sorted out. While this is happening, on April 22nd, another AT&T technician shows up to do a fiber install for Yvonne of Florida. Later that same day I hear back from Sonic support, and they tell me that they have sorted out the issue with AT&T and that I now have an appointment for April 27th (next Tuesday) <b>to get this process started</b>.
</p>
<h3>To sum up</h3>
<p>
So far AT&T have made 15 visits to my house with a total of 21 people. That does not include the visit they made to the wrong house in the second attempt with AT&T, or the 2 appointments they scheduled for that try. There has been no progress whatsoever toward actually installing fiber, and the person that is coming on the 27th is actually the first person AT&T is dispatching for this install from their perspective.
</p>
<p style="text-align: right;">To be continued...</p>
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com2tag:blogger.com,1999:blog-6440982972218468242.post-72946369568350285332021-03-28T09:28:00.001-07:002021-03-29T20:43:53.451-07:00Building for high availability: Measuring successAlthough high availability is easy to grasp conceptually, it can be quite hard to define in practice. To strive for higher and higher availability you will need to figure out how to measure it, and to measure it you will need to define exactly how to calculate it.
<p>
A typical API request path from client to service and back looks something like this: the request starts at a client and traverses the internet before it hits the boundary of your service, and then the response flows back the same way.
</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrArE5AbrUzn0mgpcfc8Hd_pwHy7qpg3MEFDmHsnq3fyOUy3ROPtlPoSDZCaU7r-bJg9S3KOWDOqBbEHieH3YuiB3EIQVE98pEKEEpt622gtZhcNuMjZWF_yU1IgNebPEAkVe5EDMstN4/s813/Measuring+Availability%25282%2529.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="195" data-original-width="813" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrArE5AbrUzn0mgpcfc8Hd_pwHy7qpg3MEFDmHsnq3fyOUy3ROPtlPoSDZCaU7r-bJg9S3KOWDOqBbEHieH3YuiB3EIQVE98pEKEEpt622gtZhcNuMjZWF_yU1IgNebPEAkVe5EDMstN4/w640-h154/Measuring+Availability%25282%2529.png" width="640" /></a></div>
<p>
It is important to realize that any part of this chain can fail, and if it does, it will lead to a drop in availability as perceived by your clients. A large part of this chain you have no control over, and it is also fiendishly hard to even measure. If you only measure availability for requests inside your service, you are missing a lot of potential failure modes: if one of your hosts goes bad, it might not be able to report the metrics of failing requests, or incoming network traffic might stop altogether.
</p>
<p>
It is often sufficient to measure availability from the first system that you have access to consistent logs from. This usually means either the gateway or, if you are not using one, a load balancer. If you are using <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html">Amazon API Gateway</a> it can give you excellent request logs that are very useful for measuring availability and latency, among other things. It will also emit Amazon CloudWatch Metrics that can measure availability directly, both for the entire API and for individual methods.</p><h3>How do you define availability?</h3>
<p>
The first thing you need to do is to separate out errors and faults in your metrics. An <b>error</b> is a request that could not be processed because of some problem with the contents of the request. A <b>fault</b> is a request failure caused by a problem either in the communication chain or in the implementation of the service. It is important to separate these out because as a service owner you have little to no control over <b>errors</b>; they are due to a mistake in the client that calls you. Faults, however, do reflect your availability and are not dependent on mistakes made in the calling client. Worth noting though is that even though errors generally do not count against availability, they can if they represent errors that should not happen because of a bug in your code. It is worth having some visibility into unusually high error rates.
</p>
<p>
If you are using HTTP to implement your API, errors should be any response with a status code between 400 and 499, and faults any response with a status code of 500 or above. Make sure that you implement your service to follow this pattern (basically, do not invent your own usage pattern for the HTTP status codes). If you are using Amazon API Gateway, you get a metric for 4xx responses and a separate metric for 5xx responses. If you need better visibility into exactly what kind of error you are receiving, you can also set up an <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html">Amazon CloudWatch Logs Metric Filter</a> on the request log from Amazon API Gateway.
</p>
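<p>As a minimal sketch, this is how the error/fault split above could be expressed in code (the function name and return labels are my own, not from any particular library):</p>

```python
def classify_status(status: int) -> str:
    """Classify an HTTP response status for availability accounting.

    4xx responses are client errors and do not count against availability;
    5xx responses are faults and do. Everything else counts as a success.
    """
    if 400 <= status <= 499:
        return "error"
    if status >= 500:
        return "fault"
    return "success"
```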
<h3>How to calculate availability</h3>
<p>
Usually, availability is calculated as a percentage representing the amount of traffic that is not faulty compared to the total request count. Exactly how this percentage is calculated, though, is not as easy as it might sound; more on that in a bit.
</p>
<p>
When it comes to picking a goal for availability, it is up to you as an engineer to come up with a goal that you are comfortable with. A common pattern is that once you have implemented a proper availability goal and have good visibility into it on an ongoing basis, you can always strive higher by improving your goal incrementally. As an example, most <a href="https://aws.amazon.com/legal/service-level-agreements/">Amazon Web Services have an availability Service Level Agreement (SLA)</a> of 99.95% or higher. Most services can probably make do with a lower goal if you implement appropriate retries in your clients. </p>
<h4>Simple Availability</h4>
<p>
The most obvious and simple way of defining availability is as the ratio of non-fault requests to the total number of requests. With this definition, a goal of 99.95% availability means that you should have at most 1 faulty request for every 2000 requests. The advantage of this approach is that the value generally comes right from your metrics and is super easy to monitor and calculate. Using Amazon API Gateway, this availability can be calculated directly from metrics emitted to Amazon CloudWatch Metrics. This is also a metric that is suitable for putting on a graph over time to visualize availability.
</p>
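<p>The simple definition above boils down to a one-line calculation; a sketch (with a defensive choice, my own, of reporting 100% when there is no traffic at all):</p>

```python
def simple_availability(total_requests: int, faults: int) -> float:
    """Simple availability: the share of requests that did not fault, in percent."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * (total_requests - faults) / total_requests
```

For example, 1 fault out of 2000 requests gives exactly the 99.95% goal from the text.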
<h4>Calculating Availability for a Service Level Agreement</h4>
<p>
This way of measuring availability has its issues though. With this definition, if you have otherwise perfect availability you can have an almost 4.5-hour-long outage in a year without breaking a 99.95% availability goal. But if you have a continuous background level of imperfect availability, that does not generally affect your consumers significantly, yet it will significantly reduce the time you can be down before you have broken your goal. This difference becomes increasingly important once you have an actual Service Level Agreement (SLA) for your service.
</p>
<p>
One way of addressing the shortcoming of the previous definition is to define your availability in the number of minutes you are above a certain minimum availability. An example of this definition would be to measure your availability as the number of minutes in which you had an average availability above 99.99%. You can now have an availability SLA of 99.95%, and in this case, if your availability normally stays above 99.99%, you get to use the full 4.5-hour-long outage before you start breaking that SLA over a year. The bad news is that there is no easy way of calculating this metric without looking at each individual availability data point for every minute during the period. The same method can also be used with any period other than a minute.
</p>
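<p>A sketch of the minute-based definition above, assuming you already have one availability data point per minute (the function name and default threshold are illustrative):</p>

```python
def available_minutes(per_minute_availability, threshold=99.99):
    """Share of minutes (in percent) whose availability met the threshold.

    Each element of per_minute_availability is that minute's simple
    availability in percent; a minute counts as available only if it
    reaches the per-minute threshold.
    """
    good = sum(1 for a in per_minute_availability if a >= threshold)
    return 100.0 * good / len(per_minute_availability)
```

A minute at 99.5% counts as fully unavailable here, which is exactly how this definition differs from the simple ratio.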
<h4>Optimizing for client experience</h4>
<p>
If you are looking for the best experience for your clients, though, the previous methods still have their shortcomings. To illustrate this, let us take an example where you introduce a bug that makes 100% of calls fail for 1% of your clients. In this example, the way your API is used, clients normally make an initial list request followed by 25 detail requests. But for the 1% of clients that get failing calls, the initial list call fails. So clients for whom the service works make on average 26 calls, while the failing clients only make a single call. In this case there are 99 * 26 successful requests for every 99 * 26 + 1 total requests, which translates to a simple availability of 99.96%. However, this hides the fact that 1% of your clients cannot use your service at all.
</p>
<p>
The way to measure availability to catch cases like this is to define your availability goal per time period and per client. As an example, you can define availability as the number of minutes during which 99.5% of your clients have more than 99.99% availability. In the example above only 99% of clients have any availability, which means that every minute is an unavailable minute by this metric until the bug is fixed. The bad news is that there really is no way of calculating this kind of availability without processing all your requests per minute to determine if you are in breach. So, it is by far the most complicated and expensive way of calculating availability. This method of calculation could potentially save you money on SLA refunds though, since if you apply it to SLA calculations you can keep track of which clients your service has breached the SLA for on a per-client basis, instead of the previous method which would apply equally to all clients once in breach.
</p>
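<p>A sketch of the per-client check for a single minute (thresholds and names are illustrative). Fed the 100-client example above, it flags the minute as unavailable even though simple availability would read 99.96%:</p>

```python
from collections import defaultdict

def minute_is_available(requests, client_threshold=99.99, fleet_threshold=99.5):
    """requests: iterable of (client_id, is_fault) pairs for one minute.

    The minute counts as available only if at least fleet_threshold
    percent of clients each had at least client_threshold percent
    of their own requests succeed.
    """
    totals = defaultdict(int)
    faults = defaultdict(int)
    for client, is_fault in requests:
        totals[client] += 1
        if is_fault:
            faults[client] += 1
    happy = sum(
        1 for c in totals
        if 100.0 * (totals[c] - faults[c]) / totals[c] >= client_threshold
    )
    return 100.0 * happy / len(totals) >= fleet_threshold
```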
<h3>How to detect network outages</h3>
<p>
There is a problem with measuring availability by simply instrumenting the boundary of the service: what if you encounter an issue outside of that boundary? If your internet service provider suffers an outage, it would stop all incoming traffic to your service. Your availability would still appear to be 100% because there are no failing requests that you are aware of; they fail before they even reach a point in the communication chain that you can measure.
</p>
<p>
The solution for this problem is to create a canary that makes at least a minimum number of requests to your service in a way that imitates real client scenarios as closely as possible. This can be as simple as creating an <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html">Amazon CloudWatch Events</a> rule that triggers an <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> function that generates traffic to your service. On top of this you need to add monitoring that alerts you when there is no traffic coming into your site. Ideally, as your service grows you can tune this alarm to alert you when the traffic pattern drops below anything abnormally low instead of close to 0. That way you can also detect partial outages that are normally out of your control to measure. Furthermore, make sure that your canary emits metrics on the success of the calls it is making. Your canary traffic metrics will represent a true measurement of availability and latency covering the entire communication chain. It only represents a small portion of all traffic, but it does properly measure all potential failures that a real client could encounter.
</p>
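<p>A minimal sketch of such a canary's core loop (the endpoint URL is a placeholder, and the injectable <code>fetch</code> parameter is my own testing convenience, not part of any AWS API):</p>

```python
import time
import urllib.request

# Hypothetical endpoint; substitute a real client scenario for your service.
ENDPOINT = "https://api.example.com/health"

def run_canary(fetch=None):
    """Make one canary request and report its success flag plus latency.

    fetch() returns an HTTP status code; by default it performs a real
    GET against ENDPOINT, but it can be injected for testing.
    """
    if fetch is None:
        fetch = lambda: urllib.request.urlopen(ENDPOINT, timeout=5).status
    start = time.monotonic()
    try:
        status = fetch()
        success = 200 <= status < 400
    except Exception:
        # Network-level failures count as unavailable, which is the whole
        # point of the canary: they never show up in service-side metrics.
        success = False
    latency_ms = 1000.0 * (time.monotonic() - start)
    # A real canary would emit these as metrics (for example via CloudWatch)
    # so alarms can fire on failed or, crucially, missing canary runs.
    return {"success": success, "latency_ms": latency_ms}
```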
<h3>Latency as an aspect of availability</h3>
<p>
Even though technically latency does not affect your availability, it is extremely important for a good client experience. Latency can be hard to visualize. You might be tempted to believe that just taking the average of your request times will give you a good idea of what the latency of your service looks like. However, latency tends to have a very long tail, and using the average is generally not best practice for ensuring that your clients have a good experience. As an example, below is the latency charted for a week of a sample service, aggregated as average, median (p50), p90 and p99. If you are unfamiliar with the pXX notation, it denotes the <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a>. The p99 graph represents the time within which 99% of requests completed; everything above it is the worst 1% of requests.
</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTpbpzxNDbsqvNIY5BP37YUOT_lxq-EZu1URs92y-V20LzRodIcSWw6TVBV-2SJJuLi569sj1fZkMOSAkm16TxpeMGryGmG2eqdNvD8gJJojRaXqXK5ezJ1_FIHqeLrfewYwaIBUOSzrU/s1151/Screenshot+from+2021-03-27+23-19-08.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="1092" data-original-width="1151" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTpbpzxNDbsqvNIY5BP37YUOT_lxq-EZu1URs92y-V20LzRodIcSWw6TVBV-2SJJuLi569sj1fZkMOSAkm16TxpeMGryGmG2eqdNvD8gJJojRaXqXK5ezJ1_FIHqeLrfewYwaIBUOSzrU/s600/Screenshot+from+2021-03-27+23-19-08.png" width="600" /></a></div>
<p>
As you can see in the example above, there is a big difference depending on how you measure latency. The graph for the maximum is cut off and goes all the way to 29 seconds in the worst case. In any environment with software-defined networking and a decently high load you will see strange outliers, so the maximum measurement is usually not very useful. Similarly, the average measurement can hide issues that affect a not insignificant amount of your traffic. Using the p99 measurement to visualize your latency performance is usually a good middle ground. It includes enough of your worst-behaving requests to show whether you have significant issues with outliers taking a long time, but also ignores the more egregious network blips that can give extremely rare but very high measurements, otherwise skewing your graph.
</p>
<p>
When measuring anything using p99 aggregation, another thing that is very important is the period over which you aggregate. You want to make sure that during the period you are measuring you have at least 100 measurements. If you do not, then p99 will be the same as the maximum, which leads to undesirable results. If you have at least 100 requests during the time period, you get to remove at least 1 anomalous request before it affects your p99 measurement. If you have a minimum call rate of 1 call per second, you will need to use a measurement period of at least 1 minute and 40 seconds or you will fall into this trap. Usually, you would use 5 minutes if you do not have enough traffic to measure p99 over 1 minute.
</p>
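<p>To see why the 100-measurement rule matters, here is a small sketch using the nearest-rank percentile convention (one of several ways p99 can be computed): with 51 samples, p99 is simply the maximum, while with 201 samples the single outlier no longer dominates.</p>

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value such that at least
    p percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceil(p/100 * n)
    return ordered[max(rank, 1) - 1]

# One 5-second outlier among fast 10 ms requests:
small = [10] * 50 + [5000]   # 51 samples: p99 == max == 5000
large = [10] * 200 + [5000]  # 201 samples: p99 == 10, outlier excluded
```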
<p>
Finally, it is worth pointing out that each hop in your service architecture will add latency. As with availability, it is important to measure latency as close to the client as possible. Apart from using canaries you can rarely measure it from the client, but usually the gateway is a good place to collect latency measurements that are a good representation of your general client experience.
</p>
<h3>Create an availability dashboard</h3>
<p>
Your goal should always be to strive for higher and higher availability. To reach for this goal, though, you need visibility into what your current availability actually is. At minimum this requires you to monitor the following on a continuous basis.
</p>
<ul>
<li>
<b>Availability</b> - The percentage of requests coming into your service that do not result in faults.
</li>
<li>
<b>Error rate</b> - The rate of invalid requests that you are receiving. Even though a change here can be a false alarm, an unexpected change in the rate can be an indication of a faulty deployment causing previously valid traffic to fail.
</li>
<li>
<b>Transactions per second</b> (TPS) - The number of requests coming into your service. The key thing you want to look at here is if there is a precipitous drop because that likely means a network failure that has occurred before you can measure it. A large, unexpected increase in traffic could also be an indication of a <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">denial of service attack</a>.
</li>
<li>
<b>Latency</b> - You should have goals for your latency and strive to decrease it. The way to keep these goals is to put them on a dashboard to make sure that you are aware of any changes in trends. If your service has different classes of operations with significantly different latency profiles, you might consider separating each one out as a separate graph.
</li>
</ul>
<p>
Below is an example dashboard that you can implement if you are using <a href="https://aws.amazon.com/api-gateway/">Amazon API Gateway</a> as the gateway for your API.
</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtz0qF4V9fHIYzmRTMgFo2CMWT3Abm6fvNJr43Pa-FwV2rc0U9fB5tsGO4JVxehAsTAEi6VohF6wuxSyV8MhDw-54WsDUNAnyoesbLNDQDouyDy7hXcSflKofSkbDXnDGjHoYOO28e3F0/s2048/Screenshot+from+2021-03-26+11-26-27.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="1152" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtz0qF4V9fHIYzmRTMgFo2CMWT3Abm6fvNJr43Pa-FwV2rc0U9fB5tsGO4JVxehAsTAEi6VohF6wuxSyV8MhDw-54WsDUNAnyoesbLNDQDouyDy7hXcSflKofSkbDXnDGjHoYOO28e3F0/s600/Screenshot+from+2021-03-26+11-26-27.png" width="600" /></a></div>
<p>
Here is the definition of this dashboard in <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html">Amazon CloudWatch Dashboards</a>. All you need to do is change the metric dimension of <code>ApiName</code> from <code>YourAwesomeApi</code> to whatever your API is called and reuse it. You might also need to tweak your minimum TPS limit and error rate amounts to something suitable for your traffic patterns.
</p>
<div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; width: 100%;">
<pre> {
"widgets": [
{
"height": 6,
"width": 12,
"y": 0,
"x": 0,
"type": "metric",
"properties": {
"metrics": [
[ { "expression": "100*(1-m1)",
"label": "Availability",
"id": "e1", "region": "us-east-1" } ],
[ "AWS/ApiGateway", "5XXError",
"ApiName", "YourAwesomeApi",
{ "id": "m1", "visible": false } ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "Average",
"period": 60,
"title": "API Availability",
"yAxis": { "left": {
"min": 99.7, "max": 100, "showUnits": false, "label": "%"
} },
"annotations": { "horizontal": [
{ "label": "Goal > 99.95%", "value": 99.95 }
] }
}
},
{
"height": 6,
"width": 12,
"y": 0,
"x": 12,
"type": "metric",
"properties": {
"metrics": [
[ { "expression": "m1 * 100",
"label": "Error Rate",
"id": "e1", "region": "us-east-1" } ],
[ "AWS/ApiGateway", "4XXError",
"ApiName", "YourAwesomeApi",
{ "id": "m1", "visible": false } ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "Average",
"period": 60,
"title": "Error Rate",
"yAxis": { "left": {
"min": 0, "max": 10, "label": "%"
} },
"annotations": { "horizontal": [
{ "label": "Error Rate < 5%", "value": 5 }
] }
}
},
{
"type": "metric",
"x": 0,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"metrics": [
[ { "expression": "m1 / PERIOD(m1)",
"label": "TPS", "id": "e1" } ],
[ "AWS/ApiGateway", "Count",
"ApiName", "YourAwesomeApi",
{ "id": "m1", "period": 60, "visible": false } ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "Sum",
"period": 300,
"title": "Request Rate",
"yAxis": { "left": {
"min": 0, "showUnits": false
} },
"annotations": { "horizontal": [
{ "label": "TPS > 20", "value": 20 }
] }
}
},
{
"type": "metric",
"x": 12,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"metrics": [
[ "AWS/ApiGateway", "Latency",
"ApiName", "YourAwesomeApi",
{ "label": "p99 Latency" } ]
],
"view": "timeSeries",
"stacked": false,
"region": "us-east-1",
"stat": "p99",
"period": 60,
"start": "-P7D",
"end": "P0D",
"title": "Latency",
"yAxis": { "left": {
"min": 0, "label": "Milliseconds", "showUnits": false
} },
"annotations": { "horizontal": [
{ "label": "Latency < 1s", "value": 1000 }
] }
}
}
]
}
</pre>
</div>
<h3>Summary</h3>
<h4>Do:</h4>
<ul><li>Count faults against your availability</li>
<li>Have a canary to always have some traffic</li>
<li>Measure availability and latency as close to the client as possible</li>
<li>Have a dashboard that shows at minimum faults, errors, requests over time, and p99 latency</li>
</ul>
<h4>Don't:</h4>
<ul>
<li>Count errors against your availability</li>
<li>Aggregate latency on average, max or median.</li>
<li>Measure availability or latency from your service implementation</li>
</ul>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-6187796336371886512021-03-24T21:38:00.000-07:002021-03-24T21:38:16.768-07:00Building for high availability: Security<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgroxFpuIlbqh2nE8foQQNfMzgdBMNYWdoC5r8ffHhnlj1i74BdwePMIxRHlGu8rwaGGf2H9cHbJJpuELPPs_Aqnk_uaHFfzvEI2hyGXOj5H5pGRfLhRUfXpvX6YId7Mn93-xVYmigS5Vw/s799/16042227002_1d00e0771d_c.jpg" style="clear: right; display: block; float: right; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="658" data-original-width="799" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgroxFpuIlbqh2nE8foQQNfMzgdBMNYWdoC5r8ffHhnlj1i74BdwePMIxRHlGu8rwaGGf2H9cHbJJpuELPPs_Aqnk_uaHFfzvEI2hyGXOj5H5pGRfLhRUfXpvX6YId7Mn93-xVYmigS5Vw/s320/16042227002_1d00e0771d_c.jpg" width="320" /><br />
Courtesy www.bluecoat.com</a>
I plan to do a series of posts concerning things to think about when designing, building, and operating systems and services with reliability and high availability in mind. I will focus specifically on building services on a cloud provider, and my examples will generally be AWS, because that is what I know best. But most of the general principles should translate to any cloud provider of sufficient minimum functionality.
<p>
It is worth pointing out that the advice here is specifically for reliability and high availability. If, for instance, your goal is rapid prototyping or being able to quickly go to market, the advice would be very different (perhaps I will do another series of posts on that once I am done with this topic). Sometimes it can be hard to explain to your Product Manager that even though somebody created a working prototype of something in less than a week, it will still take 2 months to create the real thing, and this is one of the reasons why. As a preview of the difference between the two: you can skip this entire section if you are only creating a prototype, because security really does not matter for that (but be wary of the risk of the prototype making it to production, because then you would not have wanted to skip it).
</p>
<p>
There are many different things that can affect the availability of a service or site that you are building, but probably the first and most important one is to make sure that your site is secure. Other failures, although severe, would not result in the kind of disaster that a security failure could lead to. Not only could your entire service be taken offline or deleted, but all data you have stored could also be let loose on the dark web.
</p>
<h3>Defense in depth</h3>
<p>
The key to designing for security is defense in depth. You should not assume that you can establish a perimeter around your service and trust everything inside it. Instead, you should consider how you can make each subcomponent as secure as possible. This means that if one of your components does get compromised, it will not necessarily mean that your entire service or all your data is compromised. Additionally, having each component always validating and logging access appropriately means that a potential breach of one component can be detected earlier, when an attacker unsuccessfully tries to extend the breach to other components.
</p>
<h3>The Least Privilege Principle</h3>
<p>
Each component should only have the minimum privileges needed to perform its job. If you have a component that needs to read a specific S3 bucket to perform its job, only grant read access to that specific bucket, not to any bucket in your account, and do not allow it to do anything but read from S3. The same goes for database access. This way, if a component does get compromised, only the data available to that component is potentially put at risk instead of all the data in your service.
</p>
<h3>Avoid fixed credentials</h3>
<p>
In AWS, most services allow you to grant permissions based on your execution environment such as EC2, ECS or Lambda execution roles without the need to distribute any credentials. This is a great feature that avoids the possibility of any credentials being lost in the wind and turning up in the wrong places.
</p>
<p>
If you do have to use fixed credentials, such as for an RDBMS, then make sure those credentials are rotated automatically and often so that, for instance, ex-employees do not accidentally retain access to your systems.
</p>
<p>
In the case of AWS, make sure you take advantage of the strong authentication options for the AWS console, and heed the advice to never use the root credentials for anything.
</p>
<h3>Limit your attack surface</h3>
<p>
Do not expose any component of your service to the internet that does not absolutely need to be. Usually this means only your public API and your website are accessible from the internet.
</p>
<p>
Make sure that all your internal components are only reachable by the other internal components that need to communicate with them. In AWS you can accomplish this either through internal APIs inside a VPC, or by using AWS secured primitives such as queues or event buses to communicate between components.
</p>
<p>
If you need access to the internal network for operational reasons, make sure all of it goes through a bastion host that is truly locked down. In AWS, consider not using a bastion host at all and instead relying on the <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/execute-remote-commands.html">Systems Manager Run Command</a> &amp; <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html">ECS Exec</a> functionality to avoid the bastion host altogether.
</p>
<h3>Avoid managing your own infrastructure and have a patching strategy</h3>
<p>
Using managed versions of almost any service means that when there is a problem with that service, it is not your problem to fix anymore; a specialist team is available to handle the issue and you can sit back and wait for it to be resolved. Granted, this means you lose some control, but that is generally outweighed by avoiding the headache of needing a specialist on hand for every component you use in a complex system. Managing components yourself also means that for every one of them you need a comprehensive upgrade and patching strategy. In today's environment you must be prepared to patch within hours of a critical vulnerability, if not sooner, or risk complete compromise of that component, as evidenced most recently by the massive <a href="https://krebsonsecurity.com/2021/03/at-least-30000-u-s-organizations-newly-hacked-via-holes-in-microsofts-email-software/">Exchange Server</a> hack that compromised at least 30k corporate email servers. If you are using managed services for your components, the headache of patching, especially for security vulnerabilities, is handled entirely for you.
</p>
<p>
This also extends to using alternative methods of compute such as <a href="https://aws.amazon.com/fargate/">AWS Fargate</a> and <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> to remove the burden of patching the OS you deploy your code on. That said, you are still responsible for patching your own code and making sure you are not relying on libraries with known vulnerabilities. Hosting your code on <a href="https://github.blog/2020-09-30-code-scanning-is-now-available/">GitHub</a> will provide automated vulnerability scanning if you use standard dependency managers.
</p>
<h3>Encrypt everything</h3>
<p>
Always encrypt everything you store, both in transit and at rest. Any inter-component communication should use TLS. Almost all AWS primitives that store data have an option to encrypt data at rest using your own KMS key, or at least a service-owned key. Quite often, though, this functionality needs to be turned on explicitly, so make sure you do. Furthermore, make sure that access to the keys for sensitive data is only granted to the components that need it; this is an extension of the Least Privilege Principle above. If an adversary does break into your system, this is another way to minimize the amount of data that is accessible and can be exfiltrated.
</p>
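For the in-transit half, a sketch of what a correctly configured client-side TLS setup looks like using nothing but the Python standard library (the specific minimum version chosen here is my assumption of a sensible baseline, not a universal requirement):

```python
import ssl

# A client-side TLS context with certificate and hostname verification
# enabled -- the baseline for any inter-component connection.
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

# create_default_context already enables these; asserting makes the
# intent explicit and guards against accidental downgrades elsewhere.
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# Refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2
```

The important part is never disabling verification "temporarily" for internal traffic; that is exactly the trust-the-perimeter assumption defense in depth tells you to avoid.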
<h3>Pick the right tool for the job</h3>
<p>
When building a new system, it is important to pick the right language and framework, because some are simply safer by design than others.
</p>
<p>
The first kind of language that is unsuitable is any language with unchecked primitives for direct memory access. This group includes languages such as C, C++ and obviously assembly. The main danger with these languages is that it is just too easy to make a mistake and create a buffer overflow.
</p>
<p>
The second kind of language or framework to avoid is one that does too much "magic" to help you be productive. Most frameworks involving Ruby or PHP fall into this category, in my opinion. Not only do these languages lead to hard-to-maintain code because it is very hard to understand the real ramifications of a change; with so much happening under the hood that you as a developer are probably not aware of, it is also very hard to ensure that the "magic" is not doing something that leads to a security vulnerability.
</p>
<p>
Languages that I generally find suitable for building internet facing services include Java, C#, Python and Typescript. This is not an exhaustive list though and there are many more.
</p>
<h4>Avoid SQL</h4>
<p>
This is a special case worth calling out in its own section. The tip to avoid RDBMSs will come up repeatedly during this series of blog posts because they are generally not suitable for building high availability systems, for many reasons. This specific tip, however, is not about RDBMSs but about using any kind of database with the SQL query language. When it comes to security, probably the most common cause of breaches today is still the SQL injection attack, and this kind of attack is only possible if your underlying database access language is SQL. There are almost always better database choices than SQL for your specific use case. Educate yourself on your options and <b>pick anything that is not SQL</b>. By doing this you also get the added benefit of removing even the possibility of being the target of this entire class of attacks.
</p>
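To make the class of attack concrete, here is a self-contained sketch using an in-memory SQLite database showing how string concatenation lets attacker-controlled input rewrite the query, while parameterization (the standard mitigation if you do end up using SQL) treats the same input as an inert value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

# Malicious "name" input crafted to widen the query.
evil = "nobody' OR '1'='1"

# Vulnerable: concatenation lets the input become part of the SQL,
# so the WHERE clause matches every row and the secret leaks.
leaked = conn.execute(
    "SELECT secret FROM users WHERE name = '" + evil + "'"
).fetchall()
print(leaked)  # [('hunter2',)]

# Parameterized: the driver passes the input as a value, not SQL,
# so the injection attempt matches nothing.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (evil,)
).fetchall()
print(safe)  # []
```

A non-SQL store removes even the vulnerable variant from the table, which is the point of the section above.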
<h3>Various other security related tips and tricks</h3>
<p>This section contains some additional tips and tricks that might be more AWS specific for helping you to build secure services.</p>
<h4>Be wary of deleting</h4>
<p>
Some cloud storage primitives, such as S3, allow you to make data impossible to delete or overwrite. If you enable versioning in S3, remove the permission to delete data altogether, and instead use lifecycle rules to expire data, you can remove the threat of ransomware entirely from that portion of your system. Similarly, enable deletion protection on all other parts of your infrastructure where it is available, such as CloudFormation stacks. This will protect you both from intentional vandalism and from unintentional accidents that could take down your service by deleting critical infrastructure.
</p>
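A sketch of what the deny-delete half of that setup could look like as a bucket policy document (the bucket name "example-archive" and the helper are hypothetical; this would be combined with versioning and lifecycle expiry rules, which are not shown):

```python
import json

# An S3-style bucket policy that denies object deletion to every
# principal, removing deletion as an attack (or accident) vector.
def deny_delete_policy(bucket: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }

print(json.dumps(deny_delete_policy("example-archive"), indent=2))
```

Denying `s3:DeleteObjectVersion` as well matters: with versioning on, plain deletes only add a delete marker, and it is the version delete that would destroy data.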
<h4>Safety of a crowd</h4>
<p>
When implementing your service perimeter, take advantage of managed components that sit between your service and the internet to protect yourself both against carefully crafted payloads designed to attack your service and against the massive load of a DDoS attack. Examples of these kinds of services are not just <a href="https://aws.amazon.com/waf/">AWS WAF</a>, but also services such as <a href="https://aws.amazon.com/s3/">Amazon S3</a>, <a href="https://aws.amazon.com/cloudfront/">Amazon CloudFront</a> and <a href="https://aws.amazon.com/api-gateway/">Amazon API Gateway</a>. This does not include simple load balancers, though, as these are generally provisioned to handle a single routing task explicitly; even though they do scale, it is at a slower rate, and they generally do not protect you against malicious payloads the way the other services can.
</p>
<h4>Limit internet access from your components</h4>
<p>
Assuming the worst, that an adversary has broken into your system, one way to limit the damage is to remove access to the internet from inside your system. Quite often a service only needs to be reachable from the internet through a load balancer, and the internal components only really need to talk to other services of your cloud provider. If this is the case for you, using <a href="https://aws.amazon.com/privatelink/">AWS PrivateLink</a> to access the AWS services you need, and otherwise having no internet connectivity from your internal service network, will make it much harder for an attacker to exfiltrate any data they may have gained access to.
</p>
<h3>Summary</h3>
<h4>Do:</h4>
<ul>
<li>Implement defense in depth</li>
<li>Encrypt everything</li>
<li>Limit attack surface</li>
<li>Use the right language and framework</li>
</ul>
<h4>Don't:</h4>
<ul>
<li>Manage your own infrastructure if you can avoid it</li>
<li>Use fixed credentials</li>
<li>Use SQL</li>
</ul>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-35482323716628173402020-03-03T00:58:00.000-08:002020-03-03T01:00:03.569-08:00Announcing the first public release of Underscore Backup<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1uFm9vrEdnsf0pxQd2btlERm4PjHGWkaCymYXlVgFzHdLeRJZSTQFAlqZWI8ZcPP5WOf6hbZwsmKb8CEmrn2g5oLE2FdykvT2R_IXYqB8g6LEyLBoDXaIsOlP9D9ypGU07Jf5buC8Cjk/s1600/cloud-3998880_1920.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1uFm9vrEdnsf0pxQd2btlERm4PjHGWkaCymYXlVgFzHdLeRJZSTQFAlqZWI8ZcPP5WOf6hbZwsmKb8CEmrn2g5oLE2FdykvT2R_IXYqB8g6LEyLBoDXaIsOlP9D9ypGU07Jf5buC8Cjk/s320/cloud-3998880_1920.jpg" width="320" height="173" data-original-width="1600" data-original-height="865" /></a>The first <a href="https://github.com/UnderscoreResearch/UnderscoreBackup">Underscore Backup pre-release</a> is available for immediate download from Github.
<ul>
<li>Public key based encryption allows continuously running backups that
can only be read with a key not available on the server running the
backup.</li>
<li>Pre-egress encryption means no proprietary data leaves your system
in a format where it can be compromised as long as your private key is
not compromised.</li>
<li>Runs entirely without a service component.</li>
<li>Designed from the ground up to manage very large backup sets with
multiple TB of data and millions of files in a single repository.</li>
<li>Multi-platform support based on Java 8.</li>
<li>Low resource requirements, runs efficiently with only 128MB of heap memory.</li>
<li>Efficient storage of both large and small files with built-in de-duplication of data.</li>
<li>Handles backing up large files with small changes in them efficiently.</li>
<li>Optional error correction to support unreliable storage destinations.</li>
<li>Encryption, error correction and destination IO are plugin based and easily extendable.</li>
<li>Currently supports local file and S3 for backup destinations.</li>
</ul>
<p>Best of all it is available as open source for free under a GPLv3 license.</p>
<p><strong>For now this software is still under heavy development and should not be relied upon to protect production data.</strong></p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com10tag:blogger.com,1999:blog-6440982972218468242.post-14504378770036536422019-04-15T22:46:00.002-07:002019-05-03T21:19:32.699-07:00Released Your Shared Secret ServiceI recently published <a href="https://yoursharedsecret.com">Your Shared Secret</a> service which allows you to safely and securely ensure that private information that you have is not lost if you are in any way incapacitated.
<p>The basic premise is that information is submitted through your browser, where it is encrypted before it is ever sent to the service. The key to decrypt the information never leaves your browser. The key is then chopped up into multiple pieces which are securely handed out to a number of people you choose to act on your behalf, and only by a group of them collaborating (you choose how many) can they assemble the key required to access your information. For a quick introduction you can check out this video.</p>
<div style="text-align:center; padding:15px">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/a7Ds2PWknpc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
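The key-splitting idea can be illustrated with a toy sketch, with the loud caveat that this is not the service's actual scheme: the version below is XOR-based n-of-n sharing, where <i>all</i> shares are needed to reconstruct, whereas a real threshold (k-of-n) scheme such as Shamir's secret sharing uses polynomial interpolation so that any k of n shares suffice:

```python
import secrets

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret: bytes, n: int) -> list:
    """Split a secret into n shares; ALL n are required to rebuild it."""
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = secret
    for share in shares:  # last = secret XOR share_0 XOR ... XOR share_{n-2}
        last = _xor(last, share)
    return shares + [last]

def combine(shares: list) -> bytes:
    """XOR all shares together; the random shares cancel out."""
    result = shares[0]
    for share in shares[1:]:
        result = _xor(result, share)
    return result

shares = split(b"my private key", 3)
assert combine(shares) == b"my private key"
```

Any subset smaller than the full set is statistically indistinguishable from random bytes, which is the property that lets shares be handed to caretakers safely.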
<p>I really went all out on the privacy aspect of this website and service and have gone out of my way to not collect any information not needed for its operation. The site has no third-party links except when collecting payments, and it does not collect any visitor analytics such as Google Analytics.</p>
<p>You have complete control over which of your caretakers are able to initiate accessing the information, and over how many of the total group of caretakers need to participate to access it. Even better, the service does not even need to know how to contact the caretakers. This information is only known by the unlocking caretaker and the owner of the information.</p>
<p>Furthermore, the act of one of your caretakers trying to assemble the key gives you, as the creator, a notification that allows you to cancel the unlocking or delete the information altogether within a 7-day quarantine period. For more information on how the service works, see the <a href="https://yoursharedsecret.com/#how">Usage</a> section on the website.</p>
<p>The entire service operates on a zero trust model where all the functionality is ensured with cryptographically strong primitives, the single exception being the 7-day quarantine period. There is plenty of <a href="https://yoursharedsecret.com/#secure">detailed information on how the encryption works</a> and <a href="https://yoursharedsecret.com/#dev">how the service has been built</a>. To verify that what is claimed on the site is actually what is happening, the source code to the entire service is published on <a href="https://github.com/UnderscoreResearch">GitHub</a>, and you can even run the entire website locally by cloning the <a href="https://github.com/UnderscoreResearch/secret-site">website repository</a> and running:</p>
<blockquote><code>npm start</code></blockquote>
<p>The service is available for an introductory price of $1, or if you want complete anonymity you can also pay with either <a href="https://yoursharedsecret.com/#pricing">Ethereum or Bitcoin</a>, although that is slightly more costly because of the value fluctuations of these currencies.</p>
<p>The service is available now so feel free to <a href="https://yoursharedsecret.com/#s">get started now</a> keeping your information safe without you.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com3tag:blogger.com,1999:blog-6440982972218468242.post-22439027129153859742018-12-06T00:34:00.001-08:002018-12-06T00:39:55.036-08:00How to maintain work life balance and sanity in the tech industry<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDxsqPe31_AnLu3P5iZgHjK2tQsYJJl9lDCyCZm1S1tFZIuUTnrDfDzNfFAl6mOoii-h8_TTPjY1ewnQkvRTn1zSIsHqYQ_1Gjw65lR3DTi2cGjYxAdlTfZ53X3NpLcU6c7AbdzbVD76U/s1600/work-life-balance.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDxsqPe31_AnLu3P5iZgHjK2tQsYJJl9lDCyCZm1S1tFZIuUTnrDfDzNfFAl6mOoii-h8_TTPjY1ewnQkvRTn1zSIsHqYQ_1Gjw65lR3DTi2cGjYxAdlTfZ53X3NpLcU6c7AbdzbVD76U/s320/work-life-balance.jpg" width="320" height="213" data-original-width="1200" data-original-height="799" /></a>As you collect experience and hopefully get more senior in your position you are at some point undoubtedly going to get to a point where this is something you will need to start thinking about.
<p>
I surprisingly often get asked by junior colleagues how I deal with this, with the implication (at least in my mind) that I seem to have gotten some things right in their eyes, so I figured I would try to share some of my insights, tips and tricks in this area.
</p>
<h3>Work with something you love</h3>
<p>
This might sound kind of obvious, but I have found that a lot of people don't like what they do. The key is that you want to work with something you don't mind doing when you have nothing else going on. The other part of this tip is that when you don't have anything else going on, have the discipline to actually work. What this allows you to do is not work when you do have other things you would like to do.
</p>
<p>
Make sure you also have a boss that realizes that as long as you get your stuff done, it doesn't matter how or when you do it. If this isn't the case for you, start looking for another place to work.
</p>
<h3>Set boundaries</h3>
<p>
In my entire life I don't think I have ever worked on a project that had enough time to get everything we wanted done in the time allotted. With that in mind, realize that very few people will tell you to work less; it is up to you to set the boundaries of how much you work. There is also a point of diminishing returns, where the quantity of work put in no longer increases your productive output.
</p>
<p>
You also shouldn't compare yourself to your colleagues too much, especially when it comes to the quantity of work put in. First of all, there have been tons of studies concluding that when people estimate how much work they have put in, they are <a href="https://www.bls.gov/opub/mlr/2011/06/art3full.pdf">usually overestimating it</a>. So when a coworker of yours tells you they have worked 80-hour weeks, take that with a grain of salt. Secondly, people might be putting in a lot of time without actually getting a lot done. Be the person who works smart instead of hard.
</p>
<p>
As an example, think of the person who works 80 hours a week, week after week, making sure that an unstable system stays healthy, constantly nursing it back to health when it isn't. Compare this to the person who instead figures out the few defects that cause the system to be unstable and fixes them, so the system now runs more or less by itself. Which of these two would you think is most valuable to the team?
</p>
<h3>Manage distractions</h3>
<p>
Another aspect of being able to manage your work life balance is to make sure that when you work, you are as productive as possible. In today's world there are so many things that are constantly trying to pull your attention away from what you are supposed to be working on. Especially as you become more senior it will get more important to manage your distractions effectively so that you can actually get things done.
</p>
<p>In my own case I have several ways in which people can connect with me and I control obsessively what methods are actually able to notify me immediately instead of me only seeing it when I check them.</p>
<ul>
<li>My pager (Piece of software on my phone). This is the only thing that I allow to wake me up if I sleep.</li>
<li>Phone calls and text messages. This is the only other thing on my phone that is allowed to either vibrate or make a sound. The one exception is that I allow my calendar to make a tiny chirp when I have a meeting. Apart from that my phone is silent.</li>
<li>Chat applications. I don't let these in any way interrupt me. This includes no visual popups or sounds in any way. They are the first things I check when I take a break from work to see if anybody needs anything from me though.</li>
<li>Email. It amazes me that almost everybody I know allows mail to both make a sound and show a popup on their computer. For me, email is something I check a few times a day, and my personal goal is to read every email within no more than a day of it being sent. If you need something faster from me you will need to ping me another way or just get lucky.</li>
</ul>
<p>
All this comes down to trying to get as many prolonged periods of time as possible in which you can actually focus on whatever problem you have at hand without being constantly interrupted. When you inevitably drift off and lose focus, that is when you go check whether anybody needs something from you. Not the other way around, where you lose focus because other people need your input (unless it is really important and time critical).
</p>
<h3>Don't hoard knowledge</h3>
<p>
An extension of managing distractions is to do your best not to hoard knowledge. First of all, if you are the only person who knows something, then you are guaranteeing that people will need to bug you to figure out how things work.
Secondly, you are by extension implying that you are sure your coworkers can in no way improve your work. Now, if that is true then I feel sorry for you, because that doesn't sound like a fun place to work.
</p>
<p>
I have met several people who do this more or less unconsciously, probably as a safety mechanism to improve job security. This is misguided, though, because it ensures that as you progress in your career you will spend more and more of your time maintaining old projects instead of working on new things and leaving the maintenance to others. I have hardly ever met an engineer who would not prefer working on new things over maintenance.
</p>
<p>
Also on this topic: be open to people coming to you with questions, and your goal should be to explain things well enough that they could answer the same question if somebody else asked them.
</p>
<p>
As an ending thought, I have a request for junior people reading this: try not to ask the same question too many times before you write the answer down so that you don't have to ask it again (again, just trying to manage your senior colleagues' distractions).
</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com4tag:blogger.com,1999:blog-6440982972218468242.post-67029938864272287642016-05-03T00:21:00.000-07:002018-03-01T22:23:09.504-08:00Comparing Macbook Pro to Windows 10 based laptop for software development<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHFXkst1TcFKIiNfDB71GwNc3FIp4icatuqAdCYSucV1e5ZdFns_sM6oTBhIG5fjl0_rQfZosC012gAi2metBsV04-si-fgjyfddBmNoScxjx28jPXoFI7bIiqGZsCiiinDj7IKY-sR24/s1600/Mac-vs-PC.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHFXkst1TcFKIiNfDB71GwNc3FIp4icatuqAdCYSucV1e5ZdFns_sM6oTBhIG5fjl0_rQfZosC012gAi2metBsV04-si-fgjyfddBmNoScxjx28jPXoFI7bIiqGZsCiiinDj7IKY-sR24/s200/Mac-vs-PC.jpg" /></a><p>My post from a few years ago about <a href="http://blog.henrik.org/2009/08/why-i-hate-mac-osx.html">Why I hate Mac and OSX</a> is by far the most read post I have ever posted on this blog (Somebody cross-posted it to an OSX Advocacy forum and the flame war was on). So it has been a few years, both OS X and Windows has moved on since 2009 and hardware has improved tremendously. I have also started a job which more or less requires me to use a Mac laptop so I have recently spent a lot of time again working with a Mac so I figured I would revisit the topic of what I prefer to work with.</p>
<p>The two laptops I will be comparing are a <a href="http://www.dell.com/us/business/p/precision-m7510-workstation/pd">Dell Precision 7510</a> running Windows 10 and a current <a href="https://support.apple.com/kb/SP719?locale=en_US">2015 Macbook Pro</a> running OSX El Capitan.</p>
<p>Before I start the comparison I'll describe how I use a computer. I'm a software developer who has been doing this for decades. I prefer to use my keyboard as much as possible; if there is a keyboard shortcut, I will probably pick it up pretty quickly. I tend to want to automate everything I do if I can. I have great eyesight, and pretty much the most important aspect of a laptop to me is a crisp, high resolution screen (preferably non-glossy), which translates to more lines of code on the screen at the same time. So with that in mind, let's get started.</p>
<h3>Screen</h3>
<p>This one is fortunately easy. For some bizarre reason OSX no longer allows you to run in native resolution without installing an add-on. Even with that add-on installed, the resolution is a paltry 2880 by 1800 compared to 3840 by 2160. That means that on my Dell I can fit almost twice as much text on the screen. Also, Macs are only available with a glossy screen, which is another strike against them. I don't care at all about color reproduction or anything like that, even though I hear the Mac is great at it (and so supposedly is the Dell).</p>
<p>Windows used to have pretty bad handling of multiple screens before Windows 10, especially at unusually high resolutions. This has gotten a lot better with Windows 10. That said, OSX has great handling of multiple screens, especially when you keep plugging in and out of a bunch of screens; things just seem to end up on the screen they are supposed to be on. Windows is much less reliable in this sense. Still, the better handling of multiple screens comes nowhere near making up for the disaster that is the OSX handling of native resolutions or the low resolution of the retina display.</p>
<p>Winner: <b>Windows</b></p>
<h3>Portability</h3>
<p>The PC is as a friend of mine referred to it "a tank". It is amazing how small and light the Macbook Pro is compared to everything that they crammed into it.</p>
<p>Winner: <b>OSX</b></p>
<h3>Battery Life</h3>
<p>I can go almost a full day on my Mac, my PC I can go a couple of hours. No contest here, the Macbook Pro has amazing battery life.</p>
<p>Winner: <b>OSX</b></p>
<h3>Input Devices</h3>
<p>Let me start off by saying that the track pad on the Mac is fantastic, definitely the best I have ever used on any computer in any category. That said, why can't you show me where the buttons are (I hate that), and the 3D touch feature is completely awful on a computer (I don't really like it on a phone either, but there it has its place). I started this review by saying that I use the keyboard a lot, and when it comes to productivity there is absolutely no substitute for a track point. This is that weird little stick in the middle of the keyboard that IBM invented. The reason it is superior is that when I need to use it I never have to move my fingers away from their typing position on the keyboard, so I don't lose my flow of typing if I have to do something quickly with the mouse.</p>
<p>In regards to keyboards, both the Macbook Pro and the Dell Precision laptops have great keyboards. However, for some weird reason Macbooks still don't have Page Up and Page Down keys. Not only are there no dedicated keys for this, there isn't even a default keyboard shortcut that does it (scroll up and scroll down, which are available, are not the same thing), so to get it at all you need to do some pretty tricky XML file editing. You also don't have dedicated Home and End keys on a Macbook Pro. Given how much space around the keyboard goes unused when a 15" Macbook Pro is open, I find that inexcusable.</p>
<p>Winner: <b>Windows</b></p>
<h3>Support</h3>
<p>With my Windows machine (and this is true for pretty much any tier 1 Windows laptop supplier) I call a number or open a chat, and 1 to 2 days later a guy shows up with the spare parts required to fix it. With Apple I take it to the store, and then they usually have to ship it somewhere, which takes a week or two... if you are lucky. For me that would mean not being able to work for those two weeks if I didn't have a large company with its own support department to provide a replacement where Apple falls short.</p>
<p>Winner: <b>Windows</b></p>
<h3>Extensibility</h3>
<p>I can open up my PC and do almost all service myself. Dell even publishes the handbook for doing it on their support site. Replacing the CPU would be very tricky because I think it is soldered to the motherboard, but everything else I can replace and upgrade myself. I also have 64GB of memory and two hard drives, and if I want to upgrade a component in a year or two it won't be a problem. The Macbook Pro has Thunderbolt 2, which is great (although the PC has a Thunderbolt 3 port), but that is pretty much it in regards to self-service upgrades.</p>
<p>Also my PC beats the Mac on pretty much any spec from HD speed, size, CPU, GPU, memory.</p>
<p>Winner: <b>Windows</b></p>
<h3>Price</h3>
<p>Everybody talks about the Apple tax. I don't find that to be very true. A good laptop (and don't get me wrong, both of these are great laptops) costs a lot of money, and my PC cost quite a bit more than the Macbook Pro did. Granted, it has better specs, but I don't think there is really any difference in price when you go high end with a laptop purchase.</p>
<p>Winner: <b>Tie</b></p>
<h3>Productivity</h3>
<p>For me productivity is synonymous with simplicity and predictability. Specifically, I move around a lot of different applications and I need to be able to get to them quickly, preferably through a keyboard shortcut, and I want to do it the same way every time. With that in mind, OSX is an unmitigated disaster in this area. First of all, you have to keep track of whether the window you want is in the same application or another one, and if it is in another application, you first have to swap to that application and then use a different keyboard shortcut to find the specific window within it. I do like that you can create multiple desktops and assign specific applications to a specific desktop (predictable!). However, when you go full-screen those windows move to another desktop, and that desktop has no predictability at all in where it is placed compared to the others; it is strictly the order in which they were created. Going on, I still don't understand how OSX doesn't have a Maximize button that takes the window and just makes it fill the screen. There are some third party tools that help you a bit with this madness (like being able to maximize windows without going full-screen, for instance). Regrettably, in my opinion this is an area where OSX is moving backwards; the original Exposé was actually pretty good compared to the current mess. Also, I don't like having the menu bar at the top of the screen because it is usually further away from where my mouse currently is, which means it takes longer to get there.</p>
<p>Meanwhile, Windows 10 took a huge leap in this area with snapping windows to the side and optionally letting you select another window to fill the other half. And you can easily switch to any window quickly using one keyboard shortcut, same as always.</p>
<p>A side note that doesn't affect me much but it does kind of need to be stated is that unsurprisingly Microsoft Office 2016 is just so much better on Windows than OSX.</p>
<p>Winner: <b>Windows</b></p>
<h3>Development Environment</h3>
<p>In regards to development environments, everything Java is available on both platforms, so this comes down to comparing Visual Studio to XCode as far as I'm concerned. Obviously this depends on whether you are developing in Swift or C#, but since Visual Studio has recently moved more and more into the multi-platform arena, this is more of a real choice every day.</p>
<p>XCode has improved in huge leaps and bounds since the original versions I worked with (I started working with it around version 3). However, there is simply no contest here. Visual Studio is <b>the best development environment</b> that I know. Both its native features and the 3rd party extension ecosystem that supports it are simply amazing. The only one that might possibly come close as far as I am concerned is IntelliJ.</p>
<p>Winner: <b>Windows</b></p>
<h3>Command Line Interface and Scripting</h3>
<p>This is also a very easy call. OSX is Unix based and has a real shell, Perl, and SSH installed with the OS. Sure, PowerShell is OK, but I just don't like it. I would argue that the terminal emulation in PuTTY seems a little bit better than Terminal's, but on the other hand it doesn't have tabs and it also isn't installed by default.</p>
<p>Winner: <b>OSX</b></p>
<h3>Software Availability</h3>
<p>This is a tricky category because there is obviously a lot more software available on Windows than OSX. However I find OSX has a lot of really good software that isn't available on Windows in similar quality. So I'm going to call this another tie.</p>
<p>Winner: <b>Tie</b></p>
<h3>Reliability</h3>
<p>You would think that this is an easy win for Mac. And for normal non power users I would say that is absolutely true. It is harder for a non technical user to mess up an OSX system than a Windows system, no question about it. I, however, tend to tinker with stuff that normal people wouldn't, and I have managed to mess up my Mac several times to the point where it will not boot and I have to completely reinstall the OS. That said, I think I have done the same thing even more times on Windows. I am also a little bit worried about Apple's general stance on solving security issues in a timely manner, something that Microsoft is actually really good at. So even though this is not as much of a slam dunk as you would think, I still have to give this to OSX.</p>
<p>Another thing I would like to add here is that with pretty much every PC I have bought, some part of the hardware did not quite live up to expectations. On my previous laptop, a Dell Precision M4800, it was the keyboard (in 2 years I replaced it 6 times); on this one I am still working with support on fixing some flakiness with the trackpoint. I have never had similar issues with any Apple computer (although I did have an iPad 4 where the screen just shattered when I placed it on a table for no reason).</p>
<p>Winner: <b>OSX</b></p>
<h3>Conclusion</h3>
<p>If you travel a lot and need to work on battery a lot I think you might want to give the Macbook a go. It's pretty neat.</p>
<p>That said, the clear winner for me when it comes to productivity, usability, and just raw performance for software development is going to be a Windows machine. The beauty of Windows is that since there are so many machines to choose from, you can usually find one that fits you exactly (there are obviously PCs that are very similar to the Macbook Pro; for instance the bezel-less <a href="http://www.dell.com/us/p/xps-15-9550-laptop/pd">Dell XPS 15</a> looks pretty sweet if you are looking for a PC equivalent of a Macbook Pro).</p>
<p>Winner: <b>Windows</b></p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com77tag:blogger.com,1999:blog-6440982972218468242.post-16121430722713849402016-04-27T00:55:00.001-07:002016-04-27T10:17:31.022-07:00How I studied for the AWS Certified Solutions Architect Professional exam<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvqLyi0VwtyuCOt7ASSpUunuSWtDHfQ47Y15GN5oMpCEHWQqXiWoA6dz-sNPfORskvcCGMoDSQriPuJD6EyabKeLIlphQLd4I9nDiKQtIS0ey0BAd4A8qQcpBf4kBEizdAVnq-Ug-vtVw/s1600/Solutions-Architect-Professional.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvqLyi0VwtyuCOt7ASSpUunuSWtDHfQ47Y15GN5oMpCEHWQqXiWoA6dz-sNPfORskvcCGMoDSQriPuJD6EyabKeLIlphQLd4I9nDiKQtIS0ey0BAd4A8qQcpBf4kBEizdAVnq-Ug-vtVw/s320/Solutions-Architect-Professional.png" /></a><p>I recently took (and passed) the AWS Certified Solutions Architect Professional exam and figured I would share how I studied for this test. When I took the associate level of this exam I only had 3 days to study and very little existing experience with AWS beforehand, and that is definitely not how I would recommend taking these exams. For the professional level exam I had around 3 months from when I started studying to when I had to pass the exam or my associate level exam would have expired.</p>
<p>If you are studying for the associate exam I think the study guide below would probably still work (Although it might be a bit of overkill), just skip the professional level white papers and courses on Linux Academy and Cloud Academy.</p>
<p>Full disclosure: as of a couple of months ago I work for Amazon Web Services, but the opinions expressed here are my own.</p>
<h3>Prerequisites</h3>
<p>Here are the things you should already have done and know before you start thinking about this exam.</p>
<ul>
<li>You will need broad general knowledge in IT. The associate level exam is more focused on AWS specific technology, so you can probably pass it without; the professional level exam assumes a general understanding of things like WAN routing and non-AWS enterprise software (for instance, do you know that Oracle RAC requires multicast, which EC2 does not support?).</li>
<li>You need to have passed the associate level exam within 2 years.</li>
<li>I would highly recommend that you have been using AWS for a while. This will make it easier to wrap your head around some of the AWS specific concepts that other services are built on.</li>
</ul>
<h3>Study Outline</h3>
<p>In short, here is what I did to study for this exam.</p>
<ol>
<li>Start by reading all the recommended white papers listed at <a href="https://aws.amazon.com/certification/certification-prep/">the official certification guide site</a>. I would recommend reading both the professional and associate level ones, because everything you knew when you took the associate level exam you will still need for the pro level one.</li>
<li>Sign up for <a href="https://linuxacademy.com/">Linux Academy</a> and start taking the classes for first the associate level course and then the professional level course. Don't forget to take the labs as well. Don't take the final quizzes yet (The ones per section are fine though).</li>
<li>Sign up for <a href="https://cloudacademy.com/">Cloud Academy</a> and take their classes for associate level and professional level courses. Same thing here, wait with the final quizzes.</li>
<li>Once I finished all the courses I read the recommended white papers again.</li>
<li>Do all the final quizzes from both Cloud and Linux Academy and make sure you get a passing grade. If there are sections that you are weak in then go back and study deeper in those areas, both Linux Academy and Cloud Academy have a lot of content aside from the lectures they recommend for the CSA certification so you don't have to just listen to the same lectures over and over.</li>
<li>Try the sample questions from <a href="https://aws.amazon.com/certification/certified-solutions-architect-professional/">Amazon</a>, you should be able to answer these by now. If you feel like shelling out some money for trying the sample exam go ahead. I skipped this step myself.</li>
<li>Sign up for the exam.</li>
<li>Read all the recommended white papers again the day before the exam.</li>
<li>Take the exam.</li>
</ol>
<p>Additional things you might want to consider.</p>
<ul>
<li>Amazon recommends taking the <a href="https://aws.amazon.com/training/course-descriptions/advanced-architecting/">Advanced Architecting on AWS</a> class. I took this class about 8 months before I took the exam, and even though it is a good class I don't think it is that useful for passing the exam.</li>
<li>Amazon sometimes holds AWS CSA Professional Readiness Workshops, and if you have the ability to go to one of these I would highly recommend it. I am not sure if these are held outside of the AWS re:Invent conference, though. For the associate level exam I know these workshops are held quite often, and they are great too.</li>
<li><a href="https://qwiklabs.com/">Qwiklabs</a> is a great resource for practicing your AWS skills. That said, if you have Linux Academy and/or Cloud Academy accounts, they have labs too that are included in your subscription. The Qwiklabs labs are better, though, if you can afford them.</li>
</ul>
<p>If you can, I would also recommend starting a study group and getting together once a week or so to do sample questions and discuss the answers from one of the sources listed above. I did this with some of my work colleagues and found it very helpful.</p>
<h3>Schedule</h3>
<p>I would recommend that you plan for studying for this to take at least 2 months. I did it in roughly 3 months, but I only studied actively for about 4 to 6 of those weeks. When I studied I spent roughly two to four hours every evening. Unless you are already a whizz at AWS I doubt you can cram this into a few days, which is very doable for the associate level exam. Roughly, I divided my time like this.</p>
<table>
<tr><td valign="top">10%</td><td>Initial studying of the white papers.</td></tr>
<tr><td valign="top">50%</td><td>Watching the training videos on Linux Academy and Cloud Academy.</td></tr>
<tr><td valign="top">15%</td><td>Taking labs.</td></tr>
<tr><td valign="top">10%</td><td>Doing quizzes.</td></tr>
<tr><td valign="top">10%</td><td>Additional revisions based on discovered deficiencies from the quizzes.</td></tr>
<tr><td valign="top">5%</td><td>Re-reading the white papers (The second and third time I skimmed through them a lot faster than the initial deep read).</td></tr>
</table>
<h3>Taking the exam</h3>
<p>Don't go until you feel you are ready, so don't schedule the exam until you feel done. At least where I live I could schedule the exam just one day out so you don't need to plan ahead for this.</p>
<p>I am usually a very fast test taker (I took the associate level exam in less than half the allotted time). However, time management is going to be important when you take this exam. When I took the test I finished all the questions with around 25 minutes to spare, and at that point I had roughly 30% of them marked to be revisited. After going through them all again I had less than two minutes left of my time. The description says that the test is 80 questions, but I only had 77 questions in mine. I'm guessing the number of questions varies slightly depending on how they are randomly selected.</p>
<h3>Cloud Academy vs Linux Academy</h3>
<p>Cloud Academy and Linux Academy have a lot of overlap, and I recommend subscribing to both of them for this. That said, here are the advantages of each of them as far as I experienced them.</p>
<ul>
<li>Linux Academy has more questions in the final quiz and vastly longer study material for the professional exam than Cloud Academy. The entire course on Linux Academy is around 30 hours long while the corresponding course on Cloud Academy is only around 3 hours, and this material is not something that can be covered in 3 hours. Their associate level courses are much more on par.</li>
<li>Cloud Academy has a much better interface for doing quizzes and reviewing: after each question it tells you the answer and a short extract of information about it, with links to the AWS documentation.</li>
<li>Cloud Academy allows you to set the playback speed of the training videos, which I like (I feel I can still assimilate the information when playing these at around 1.5x speed and it saves time). Linux Academy also had occasional streaming issues for me, sometimes requiring me to restart videos.</li>
<li>If you are a student or have an .edu address, Cloud Academy is a lot cheaper than Linux Academy at $9 per month. If you don't, on the other hand, Linux Academy is cheaper than Cloud Academy by a factor of 2.</li>
<li>Both services are very easy to cancel once you are done with your studying in case you don't feel you need them anymore.</li>
</ul>
<p>When all is said and done, though, I could probably have passed this with only Linux Academy, but Cloud Academy would not have been sufficient for me (especially since its training material for the professional level CSA is so short). That said, I still think that the Cloud Academy course provides a valuable alternative to Linux Academy, and especially if you can sign up as a student it is so cheap that there is pretty much no reason not to.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com61tag:blogger.com,1999:blog-6440982972218468242.post-65825937719829874402015-07-14T22:58:00.000-07:002015-07-14T23:03:14.648-07:00How to get the most out of your BizSpark Azure creditsBizSpark is arguably one of the best deals on the internet for startups. For me, the key benefit that it brings is the 5 x $150 per month of free Azure credits. That said, they are a little bit tricky to claim.
<p>The first thing you need to do is claim all your BizSpark accounts and then, from each of those accounts, claim your Azure credits. <a href="http://thesociablegeek.com/azure/adding-users-to-bizsparkazure/">This blog post</a> describes this process, so start by doing that.</p>
<p>After doing this you have 5 separate Azure accounts, each with $150 per month of usage. However, what we want is one Azure account where we can see services from all of these subscriptions at once, and that requires a couple more hoops to jump through. In the end you will end up with one account where you can see and create services from all 5 subscriptions without having to log in and out of the Azure management portal to switch between them.</p>
<ol>
<li>The first step is to pick the one account you want to use to administrate all the other accounts.</li>
<li>This is a bit counterintuitive, but you need to start by adding every other account as a co administrator to the account from the first step. Yes, you read that correctly: all the other accounts need to be added as administrators to the main admin account (don't worry, this is temporary).</li>
<li>The following steps need to be done for each of the accounts except for the main account from step 1.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit2DXlMZWjJ9c7KT9uC823w4W9QEx2YDw9EGEeZdey2qnx7caW9EbK-wGwpkiV8hQ5ufBcDEV3Bj1rk9NCFyIuKl6Az9UXW3PGC3A8iagJTOw0Nw11epxPoG4z4p6HfMvjitZys2xWpyA/s1600/ChangeSubscription.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit2DXlMZWjJ9c7KT9uC823w4W9QEx2YDw9EGEeZdey2qnx7caW9EbK-wGwpkiV8hQ5ufBcDEV3Bj1rk9NCFyIuKl6Az9UXW3PGC3A8iagJTOw0Nw11epxPoG4z4p6HfMvjitZys2xWpyA/s640/ChangeSubscription.png" /></a></div>
<ol>
<li>Log into the management console using one of the four auxiliary accounts and go to settings.</li>
<li>Make sure you are on the subscription tab.</li>
<li>Select the subscription that belongs to the account you are currently logged into. It will be the one that has the account administrator set to the account you are currently logged into. If you have done this correctly you should see two different subscriptions: one for the account you are logged in as and one from the account in step 1.</li>
<li>Click the Edit Directory button at the bottom.</li>
<li>In the image below make sure you select the directory of the main account from step 1. It shouldn't be hard because it will be the only account in the list and pre-selected. If you have already set up any co administrators to the account you will be warned that they will all be removed.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhftSY0uSzg9W8421vBU3-3URNcWCjlkXj9T5luWsgM6imm6SLSYPdoIDYT5M8KLha07W8rCbuSdwb6k72Q_iVALiNotb65kHLSune-zmfPW9AaYYzymz9MRi4MKSQz77Dfz5lRT_ph6L8/s1600/ChangeDirectory.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhftSY0uSzg9W8421vBU3-3URNcWCjlkXj9T5luWsgM6imm6SLSYPdoIDYT5M8KLha07W8rCbuSdwb6k72Q_iVALiNotb65kHLSune-zmfPW9AaYYzymz9MRi4MKSQz77Dfz5lRT_ph6L8/s640/ChangeDirectory.png" /></a></div>
<li>Add the account from step 1 as co administrator to this account as described in the article linked at the top of the post.</li>
<li>The last step is optional, but all the subscriptions will be called BizSpark and hard to tell apart, so you might want to rename them.</li>
<ol>
<li>To do this, go to the Azure account portal at <a href="https://account.windowsazure.com/Subscriptions">https://account.windowsazure.com/Subscriptions</a>. This page tends to be very slow, so be patient following links.</li>
<li>Click on the subscription name. Your screen might look different depending on how many subscriptions you have.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3IaVreogAf5SOVlsVbzm3OQlK1i_dbfa4ffgb1tR97R99QZgK9Bd5pdsgoVVCP-6LJB1Y4CSs9D6JeaGzv7tg5MecNbQHy0BaBx-O9i8sz7c40IQifz5IdUgsNkIC62r2_Ki49XH-elI/s1600/SelectSubscription.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3IaVreogAf5SOVlsVbzm3OQlK1i_dbfa4ffgb1tR97R99QZgK9Bd5pdsgoVVCP-6LJB1Y4CSs9D6JeaGzv7tg5MecNbQHy0BaBx-O9i8sz7c40IQifz5IdUgsNkIC62r2_Ki49XH-elI/s400/SelectSubscription.png" /></a></div>
<li>Click on the <b>Edit Subscription Details</b>.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigSfCxEE2S4pStpQs5d7k_J5c810SnfDRaZf6nvZofxKvjA9NvHoTSAdzwIf2cvebgm713fC97kf5bnRx8iW3nk8_zxCyu36Gho2cKTVz977J4UeRwqSQKJ3EzRnjWughvyEZvwtSx8OQ/s1600/EditSubscription.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigSfCxEE2S4pStpQs5d7k_J5c810SnfDRaZf6nvZofxKvjA9NvHoTSAdzwIf2cvebgm713fC97kf5bnRx8iW3nk8_zxCyu36Gho2cKTVz977J4UeRwqSQKJ3EzRnjWughvyEZvwtSx8OQ/s400/EditSubscription.png" /></a></div>
<li>Enter the new name in the dialog presented. You can also optionally change the administrator to the account from step 1, which will remove the owning account as an administrator altogether (although it is still responsible for billing).</li>
</ol>
</ol>
<li>You can now remove all the other accounts from being administrators to the main account that you added in step 2 if you want.</li>
</ol>
<p>If you follow all these steps when you log into the account from step 1 you should be able to see all of your subscriptions at the same time in the Azure management console like in the screenshot below.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOGxCxuSjxWy_Sa76bex2npSa9BjEXJk5aLYQOSJNRhf1dEa4D2mTO3lLu-DAzccmsBZgAq14-_Ugltv0cNk4_ToF0hGgJgzy8Qgx7uu4QKHXp0Rpc4Yy_PUwHW6IqwfxARVj1h5ccFC4/s1600/Result.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOGxCxuSjxWy_Sa76bex2npSa9BjEXJk5aLYQOSJNRhf1dEa4D2mTO3lLu-DAzccmsBZgAq14-_Ugltv0cNk4_ToF0hGgJgzy8Qgx7uu4QKHXp0Rpc4Yy_PUwHW6IqwfxARVj1h5ccFC4/s640/Result.png" /></a></div>
<p>Keep in mind this does not mean that you have $750 to spend as you want. Each subscription still has a separate limit of $150, and you have to puzzle together your services as you create them to keep all of the 5 limits from running out, but at least this way you have a much better overview of what services you have provisioned in one place.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com1tag:blogger.com,1999:blog-6440982972218468242.post-30318370780353916492015-07-09T22:17:00.000-07:002016-01-08T02:18:34.454-08:00Algorithm for distributed load balancing of batch processing<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0AlkO3o3SFfPk-DXGP9lTitGG1xf8qiSBgY1PARxhkhKcprPlHVwmRpcvmNLi2nU0HTZ-7InaSa6UEs2zqBQviS_TOpH236NmEzKL7-AikJ84k04cNPrJe9niR5JE5f8faKgr7uQc2I8/s1600/distribution.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0AlkO3o3SFfPk-DXGP9lTitGG1xf8qiSBgY1PARxhkhKcprPlHVwmRpcvmNLi2nU0HTZ-7InaSa6UEs2zqBQviS_TOpH236NmEzKL7-AikJ84k04cNPrJe9niR5JE5f8faKgr7uQc2I8/s320/distribution.jpg" /></a><b>Just for reference, this algorithm doesn't work in practice. The problem is that nodes under heavy load tend to be too slow to respond in time to hold on to their leases, causing partitions to jump between hosts. I have moved on to another algorithm that I might write up at some point if I get time. Just a fair warning to anybody who was thinking of implementing this.</b>
<p>I recently played around a little bit with the <a href="http://azure.microsoft.com/en-us/services/event-hubs/">Azure EventHub</a> managed service, which promises high throughput event processing at relatively low cost. At first it seems relatively easy to use in a distributed manner using the class <a href="https://code.msdn.microsoft.com/Service-Bus-Event-Hub-45f43fc3">EventProcessorHost</a>, and that is what all the online examples provided by Microsoft use too.</p>
<p>My experience is that the EventProcessorHost is basically useless. Not only does it lack any provision that I have found for providing a retry policy to make its API calls <a href="http://blog.henrik.org/2015/06/designing-for-failure.html">fault tolerant</a>, it is also designed to checkpoint its progress at relatively infrequent intervals, meaning that you have to design your application to work properly even if events are reprocessed (which is what will happen after a catastrophic failure). Worse than that, though, once you fire up more than one processing node it simply falls all over itself, constantly causing almost no processing to happen.</p>
<p>So if you want to use the EventHub managed service in any serious way you need to code directly to the EventHubClient interface which means that you have to figure out your own way of distributing its partitions over the available nodes.</p>
<p>This leads me to an interesting problem: how do you evenly balance a workload over a number of nodes (in the nomenclature below the work is split into one or more partitions), any of which can at any time have a catastrophic failure and stop processing, without a central orchestrator?</p>
<p>Furthermore I want the behavior that if the load is completely evenly distributed between the nodes the pieces of the load should be <i>sticky</i>, meaning that the partitions of work currently allocated to a node should stay allocated to that node.</p>
<p>The algorithm I have come up with requires a <a href="http://redis.io/">Redis</a> cache to handle the orchestration, and it uses only two hash keys and two subscriptions. But any key value store that provides publish and subscribe functionality should do.</p>
<p>The algorithm has five time spans that are important.</p>
<ul>
<li><b>Normal lease time</b>. I'm using 60 seconds for this. It is the normal time a partition will be leased without generally being challenged.</li>
<li><b>Maximum lease time</b>. Must be significantly longer than the normal lease time.</li>
<li><b>Maximum shutdown time</b>. The maximum time a processor has to shut down after it has lost a lease on a partition.</li>
<li><b>Minimum lease grab time</b>. Must be less than the normal lease time.</li>
<li><b>Current leases held delay</b>. Should be relatively short; a second should be plenty (I generally operate in the 100 to 500 millisecond range). This is multiplied by the number of currently processing partitions. It can't be too low, though, or you will run into scheduler based jitter with partitions jumping between nodes.</li>
</ul>
<p>Each node should also listen to two Redis subscriptions (basically notifications to all subscribers). Each notification sent out carries the partition being affected.</p>
<ul>
<li><b>Grab lease subscription</b>. Used to signal that the lease of a partition is being challenged.</li>
<li><b>Allocated lease subscription</b>. Used to signal that the lease of a partition has ended when somebody is waiting to start processing it.</li>
</ul>
<p>There are also two hash keys in use to keep track of things. Each one uses the partition as the hash field, with the name of the owning host as the value.</p>
<ul>
<li><b>Lease allocation</b>. Contains which node is currently actually processing each partition.</li>
<li><b>Lease grab</b>. Used to race and indicate which node won a challenge to take over processing of a partition.</li>
</ul>
<p>This is the general algorithm.</p>
<ol>
<li>Once per <b>normal lease time</b> each node will send out a <b>grab lease subscription</b> notification for each partition that:</li>
<ul>
<li>It does not yet own and which does not currently have any value set for the partition in the <b>lease grab</b> hash key.</li>
<li>If it has been more than the <b>maximum lease time</b> since the last time a lease grab was signaled for the partition (This is required for the case when a node dies somewhere after step 3 but before step 6 has completed). If this happens also clear the <b>lease allocation</b> and <b>lease grab</b> hash for the partition before raising the notification since it is an indication that a node has gone offline without cleaning up.</li>
</ul>
<li>Upon receipt of this notification the timer for this publication is reset (so generally only one publication per partition will be sent during the <b>normal lease time</b>, but it can happen twice if two nodes send them out at the same time). Also, when this is received each node will wait based on the following formula.</li>
<ul>
<li>If the node is already processing the partition, it will wait the number of partitions it currently has active, minus one, times the <b>current leases held delay</b> (so basically (<b>locally active partitions</b> - 1) * <b>current leases held delay</b>).</li>
<li>If the node is not currently processing the partition being grabbed, it should wait the number of locally active partitions plus one half, times the <b>current leases held delay</b> (in other words (<b>locally active partitions</b> + 0.5) * <b>current leases held delay</b>).</li>
</ul>
<li>Once the delay is done try to set the <b>lease grab</b> hash key for the partition with the conditional transaction parameter of it not being set.</li>
<ul>
<li>Generally the node with the lowest delay from step 2 will win this race, which also means that the partitions should distribute evenly among the active nodes: the more active partitions an individual node has, the longer it will wait in step 2 and the less likely it is to win the race to own the partition lease.</li>
<li>If a node is currently processing a partition but did not win the race, it should immediately signal its processing of the partition to gracefully shut down, and once it has shut down it should remove the <b>lease allocation</b> hash field for the partition. Once this is done it should also publish the <b>allocated lease subscription</b> notification. After that is completed this node should skip the rest of the steps.</li>
</ul>
<li>Check by reading the lease allocation hash value whether another node than the winner in step 3 is currently busy processing the partition. If this is the case, wait either for the <b>allocated lease subscription</b> notification signaling that the other node has finished (from step 3b), or, if this does not happen, wait for a maximum of the <b>maximum shutdown time</b> and start the partition anyway.</li>
<li>Mark the <b>lease allocation</b> hash with the new current node that is now processing this partition.</li>
<li>Also after the <b>minimum lease grab time</b> remove the winning indication in the <b>lease grab</b> hash key for the partition so that it can be challenged again from step 1.</li>
</ol>
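<p>The steps above boil down to a staggered race: each node sleeps for a delay proportional to how many partitions it already holds (step 2), then tries a conditional set that only one node can win (step 3). Here is a minimal sketch of that core in Python, using an in-memory stand-in for the Redis lease grab hash; all names are illustrative, and a real implementation would use something like Redis's HSETNX across machines instead of a local dict:</p>

```python
import threading


class LeaseStore:
    """In-memory stand-in for the Redis 'lease grab' hash.

    try_grab mimics HSETNX semantics: the set succeeds only for the
    first caller, which is exactly the race in step 3. This local
    version is only for illustrating the logic."""

    def __init__(self):
        self._grabs = {}
        self._lock = threading.Lock()

    def try_grab(self, partition, node):
        # Conditional set: only the first node to arrive wins the lease.
        with self._lock:
            if partition in self._grabs:
                return False
            self._grabs[partition] = node
            return True


def grab_delay(active_partitions, owns_partition, held_delay=0.2):
    """Step 2: how long a node sleeps before racing for a partition.

    Busier nodes wait longer, so lightly loaded nodes tend to win new
    partitions; a current owner waits a full delay-and-a-half less than
    a challenger with the same load, which keeps leases sticky."""
    if owns_partition:
        return (active_partitions - 1) * held_delay
    return (active_partitions + 0.5) * held_delay
```

<p>A nice property of these offsets is that a challenger only out-sleeps the current owner when it holds at least two fewer partitions, which is exactly the point where moving the partition improves balance; in the single-partition case the owner always re-wins, so the lease does not bounce between nodes.</p>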
<p>When I run this algorithm in my tests it works exactly as I want. Once a new node comes online, the workload has been distributed evenly among the new and old nodes within the <b>normal lease time</b>. Another important test is that if you only have one partition, the partition does not skip among the nodes but squarely lands on one node and stays there. And finally, if I kill a node without giving it any chance to do any cleanup, after roughly the <b>maximum lease time</b> the load is distributed out to the remaining nodes.</p>
<p>This algorithm does not in any way handle the case where the load on the different partitions is not uniform; in that case you could relatively easily tweak the formula in step 2 above and replace the <b>locally active partitions</b> with whatever measurement of load or performed work you wish. It will be tricky to keep the algorithm <i>sticky</i> with these changes, though.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com1tag:blogger.com,1999:blog-6440982972218468242.post-90146533619667738432015-06-25T02:23:00.002-07:002015-06-25T02:31:48.443-07:00Designing for failure<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI_v6Nn7TpO0mBXFVj5dOu-tqxhQsARtMpvSmot2LfE9siN2KFazpJCCrM0O747wRNB7qUEFtiw_PjPEfCoq3H1nHRSDfX2gubiZh2hEre9hfh1tTUGaV3iKI7d232jNKqWpE_igZ10Jc/s1600/die-cut-stickers.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI_v6Nn7TpO0mBXFVj5dOu-tqxhQsARtMpvSmot2LfE9siN2KFazpJCCrM0O747wRNB7qUEFtiw_PjPEfCoq3H1nHRSDfX2gubiZh2hEre9hfh1tTUGaV3iKI7d232jNKqWpE_igZ10Jc/s320/die-cut-stickers.png" /></a>One of the first things you hear when you learn about how to design for the cloud is that you should always <i>design for failure</i>. This generally means that any given piece of your cloud infrastructure can stop working at any given time, so you need to design for this when constructing your architecture and gain reliability by creating your application with redundancy, so that any given part of your application's infrastructure can fail without affecting the actual functionality of the website.
<p>Here is where it gets tricky though. Before I actually started running things in a cloud environment I assumed this meant that every once in a while a certain part of your infrastructure (For instance a VM) would go away and be replaced by another computer within a short time. <b>That is not what designing for failure means</b>. To be sure this happens too, but if that was the only problem you would encounter you could even design your application to deal with failures in a manual way once they happen. In my experience even in a relatively small cloud environment <b>you should expect random intermittent failures to happen at least once every few hours</b> and you really have to design every single piece of your code to handle failures automatically and work around them.</p>
<p>Every non local service you use, even the ones that are designed for ultra high reliability like Amazon S3 and Azure Blob Storage, can be assumed to fail a couple of times a day if you make a lot of calls to them. The same goes for any database access or any other API.</p>
<p>So what are you supposed to do about it? The key thing is that whenever you try to do anything with a remote service you need to verify that the call succeeded, and if it didn't, keep retrying. Most failures that I have encountered are transient and tend to pass within a minute or so at the most. The key is to design your application to be loosely coupled, and whenever a piece of the infrastructure experiences a hiccup you just keep retrying for a while and usually the issue will go away.</p>
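<p>As a rough sketch of that retry-until-success pattern (this is just an illustration, not code from any library; <b>CallRemoteService</b> is a hypothetical placeholder for whatever remote call you are making):</p>
<div style="width: 100%; background: #fff">
<pre>
// Hypothetical sketch of a retry loop with a growing delay between attempts.
for (int attempt = 1; ; attempt++)
{
    try
    {
        CallRemoteService(); // Placeholder for any remote call.
        break; // Succeeded, stop retrying.
    }
    catch (Exception)
    {
        if (attempt &gt;= 10)
            throw; // Give up after 10 attempts.
        // Assume the failure is transient; wait a bit longer each time.
        Thread.Sleep(TimeSpan.FromSeconds(attempt));
    }
}
</pre>
</div>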
<p>Microsoft has some code that will help you do this as well which is called <a href="https://msdn.microsoft.com/en-us/library/hh680934%28v=pandp.50%29.aspx">The Transient Fault Handling Block</a>. If you are using the Entity Framework everything is done for you and you just have to specify a <a href="https://msdn.microsoft.com/en-us/data/dn456835">Retry Execution Strategy</a> by creating a class like this.</p>
<div style="width: 100%; background: #fff">
<pre>
public class YourConfiguration : DbConfiguration
{
    public YourConfiguration()
    {
        SetExecutionStrategy("System.Data.SqlClient",
            () => new SqlAzureExecutionStrategy());
    }
}
</pre>
</div>
<p>Then all you have to do is add an attribute specifying to use the configuration on your Entity context class like so.</p>
<div style="width: 100%; background: #fff">
<pre>
[DbConfigurationType(typeof(YourConfiguration))]
public class YourContext : DbContext
{
}
</pre>
</div>
<p>It also comes with more generic code for retrying execution. However, I am not really happy with the interface of the retry policy functionality. Specifically, there is no way that I could figure out to create a generic log function that lets me log the failures so I can see what is actually requiring retries. I also don't want a gigantic log file just because for a while every SQL call takes 20 retries, each one being logged. I'd rather get one log message per call that indicates how many retries were required before it succeeded (or not).</p>
<p>So to that effect I created <a href="https://github.com/UnderscoreResearch/RetryExecution">this little library</a>. It is compatible with the transient block mentioned earlier in that you can reuse retry strategies and transient exception detection from that library. It does improve on logging though, as mentioned before. Here is some sample usage.</p>
<div style="width: 100%; background: #fff">
<pre>
RetryExecutor executor = new RetryExecutor();
executor.ExecuteAction(() =>
    { ... Do something ... });
var val = executor.ExecuteAction(() =>
    { ... Do something ...; return val; });
await executor.ExecuteAsync(async () =>
    { ... Do something async ... });
var val = await executor.ExecuteAsync(async () =>
    { ... Do something async ...; return val; });
</pre>
</div>
<p>By default only ApplicationExceptions are passed through without retries. The default retry strategy will try 10 times, waiting the number of previous tries in seconds before the next try (which means it will signal a failure after around 55 seconds in total). The logging will just write to standard output.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com3tag:blogger.com,1999:blog-6440982972218468242.post-78019028993950270612015-06-20T11:55:00.000-07:002015-06-20T21:57:45.827-07:00Simple Soap envelope parser and generator <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9R_Joq_KbQxlmlS7PJLBgydt0AUmHLo3ysUZbZxKAgOBpZd-uR2VQXjhSr2xVPS11sVPobXbFnXhihTgnCO3n2oHPRNGCn5fGwVZthdheaDcBfXJCZzwLHH-P_tI1jpADicDVRlm8im8/s1600/soap__large.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9R_Joq_KbQxlmlS7PJLBgydt0AUmHLo3ysUZbZxKAgOBpZd-uR2VQXjhSr2xVPS11sVPobXbFnXhihTgnCO3n2oHPRNGCn5fGwVZthdheaDcBfXJCZzwLHH-P_tI1jpADicDVRlm8im8/s320/soap__large.jpg" /></a>So as a follow-up to my previous post, here is a small sample project of the kind of thing I would love to find when searching online for a person who is looking for a job.
<p><a href="https://github.com/UnderscoreResearch/SoapParser">This library on GitHub</a> is just one class to help you generate and parse a SOAP envelope, something I was surprised to see wasn't actually available in the .Net framework as a stand alone class (Or at least I haven't found it).</p>
<p>Its use is very simple. To create a SOAP envelope you create an instance of the class <b>SoapEnvelope</b> and assign the <b>Headers</b> and <b>Body</b> properties (and possibly the <b>Exception</b> property if you want to signal an error), and then call the <b>ToDocument</b> method to generate the XML document for the SOAP envelope.</p>
<p>To read data, simply call the method <b>SoapEnvelope.FromStream</b> or <b>SoapEnvelope.FromRequest</b> and it will return the envelope it parsed from the stream or request. It also supports GZip content encoding in the request.</p>
<p>Here is a simple round trip example of its use (For more examples check out the tests).</p>
<div style="width: 100%; background: #fff">
<pre>
SoapEnvelope envelope = new SoapEnvelope();
envelope.Body = new[] { new XElement("BodyElement") };
envelope.Headers = new[] { new XElement("HeaderElement") };
XDocument doc = envelope.ToDocument();
MemoryStream stream = new MemoryStream();
doc.Save(stream);
stream.Seek(0, SeekOrigin.Begin);
SoapEnvelope otherEnvelope = SoapEnvelope.FromStream(stream);
</pre>
</div>
<p>To continue from the <a href="http://blog.henrik.org/2015/06/what-i-look-for-when-evaluating-future.html">previous post</a> from a few days ago. Even though this example is very short it does show a couple of things if I were to evaluate the author of something similar for a job.</p>
<ul>
<li>This is somebody who actually likes to code because otherwise why would he (Or she) even have taken the time to do this.</li>
<li>This is somebody that cares about the quality of their code because even though this is a super simple class it contains a small test suite to make sure that it works.</li>
<li>This person has at least a decent grasp of C# and the .Net framework and understands how to use inheritance and interfaces to create something new (if you are a coder and don't know, it is scary how few people who should know this stuff actually do).</li>
</ul>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com3tag:blogger.com,1999:blog-6440982972218468242.post-89701385333488461282015-06-18T16:19:00.001-07:002015-06-18T16:52:23.600-07:00What I look for when evaluating future hires<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHhwmO4dqgoGNIw7ogmWFQwwh28fgT6lQBQVn6GClKFO_7fvIw6HJrWhD8teA_Q8hw7-QzuHaQiY_2Fty4_ITTdwHwmlDdMMWhp49y3maSl2r5FmSw3tngeyqg-ztoo2YNYutoB8Ehho/s1600/hiring.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHhwmO4dqgoGNIw7ogmWFQwwh28fgT6lQBQVn6GClKFO_7fvIw6HJrWhD8teA_Q8hw7-QzuHaQiY_2Fty4_ITTdwHwmlDdMMWhp49y3maSl2r5FmSw3tngeyqg-ztoo2YNYutoB8Ehho/s200/hiring.jpg" /></a>Even though I am not a manager, and have made never becoming one a personal development goal, I quite often chime in on evaluating future hires, currently for permanent positions and in the past for consultancy contracts. There is one thing that a lot, if not most, software developers are not doing that I put a high premium on when evaluating new candidates.
<p>The first thing I do when I get a resume for a prospective candidate is go to Google and search for their name. If I can't find a single programming-related result from anything they've ever done online, that is a pretty big blotch on their record from my perspective.</p>
<p>My thinking is that if you are good at software development and like solving problems, then even if you are straight out of school you will have done at least one of the following.</p>
<ul>
<li>Asked a question you couldn't figure out, or even better provided an answer to a question for somebody else, on a site like <a href="http://stackoverflow.com/">Stack Overflow</a> or <a href="http://www.codeproject.com/">CodeProject</a>.</li>
<li>Created or participated in an open source project hosted on <a href="https://github.com/">GitHub</a> or <a href="http://sourceforge.net/">SourceForge</a>.</li>
<li>Created some weird obscure website somewhere (Doesn't really matter what it is or how much traffic it has).</li>
<li>Created a blog about something. It doesn't have to be old or very active, but at least you've tried.</li>
<li>Maintained some sort of presence I can find on social media, preferably with some comments related to software development. It doesn't matter if it is Twitter, Facebook, LinkedIn, Google+ or whatever.</li>
</ul>
<p>The more of these you can check off the better, but if I can't find you at all that is a <b>huge</b> red flag in my book, and you would be surprised at how common this is for would-be software developers.</p>
<p>The problem is that if you haven't gotten around to any of the above, to me that signifies that you aren't that into software development and it is just something <i>you do</i>. Generally, good coders really like to code and they do it because they like it. If I couldn't make a living from coding I would do it anyway, and most of my public presence online is based on work I've done when I wasn't collecting a paycheck for it (since most of the work you do when you get paid you can't just publish online).</p>
<p>So my advice to anybody who wants to get started in software development is to sign up for a free account on GitHub, find a small itch, and write an open source application to scratch it. Make sure the repository is associated with your real name so that when I or any other person involved in hiring searches for you, we will find it. I can almost guarantee that it will be worth your time.</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com18tag:blogger.com,1999:blog-6440982972218468242.post-39051121193627641922015-05-07T01:00:00.000-07:002015-05-07T01:02:09.279-07:00C# Task scheduling and concurrency<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguT9G-VozuSyzBYIT_zSqQJrjbeSN6jFWuyc3h31euSs3zP1aGAGs5rRjiOsm9VGsx7e_aY3mjmVyQ-yGOp_AwQK2KlAvm04JNrMqL1Sf0V-Vq2oiI4NZQGea4oM4aUUrRDl2oiT0EufM/s1600/parallel_tasks.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguT9G-VozuSyzBYIT_zSqQJrjbeSN6jFWuyc3h31euSs3zP1aGAGs5rRjiOsm9VGsx7e_aY3mjmVyQ-yGOp_AwQK2KlAvm04JNrMqL1Sf0V-Vq2oiI4NZQGea4oM4aUUrRDl2oiT0EufM/s200/parallel_tasks.png" /></a>It is very hard to figure out how the new async Task API for handling threading and concurrency works in .Net 4.5. I have dug around a lot to try and find documentation on this topic and have mostly failed, so I decided to simply figure it out by writing some test applications that checked how it actually behaved. It is important to note that this is how threading works in a console .Net 4.5 application on Windows 8.1. I would not be surprised if specific numbers of the thread model were different in a server setting, a different OS version or even another .Net version. So without further ado, here are my findings.
<p>First of all, if you simply start a lot of Tasks that all run for a long time, you quickly notice that by default the .Net runtime will allocate <b>a minimum of 8 threads</b> to run tasks. Then it gets interesting, because for <b>every second that the task queue remains full another thread is added</b>. This keeps going all the way up to <b>a maximum of 1023 threads</b>. After 1023 threads have been allocated no more threads will be allocated for any reason, so any remaining tasks will wait to start until a previous task has completed. <b>If a thread executes no tasks at all for 20 seconds it will be removed</b> from the thread pool.</p>
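<p>For reference, the kind of probe used to observe this behavior can be sketched roughly like this (my reconstruction, not the exact test application; the numbers printed will vary by machine, OS and .Net version):</p>
<div style="width: 100%; background: #fff">
<pre>
int started = 0;
// Queue far more long-running tasks than the initial thread pool size.
for (int i = 0; i &lt; 200; i++)
{
    Task.Run(() =>
    {
        Interlocked.Increment(ref started);
        Thread.Sleep(Timeout.Infinite); // Hold on to the thread forever.
    });
}
// Watch the pool grow by roughly one thread per second.
for (int second = 0; second &lt; 30; second++)
{
    Thread.Sleep(1000);
    Console.WriteLine("{0}s: {1} tasks started", second, started);
}
</pre>
</div>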
<p>There are also odd things happening with the order in which tasks are scheduled. For instance, if you run the code below it will run very slowly, because no tasks from the second for loop will be scheduled until the thread pool has expanded enough to run all tasks from the first loop concurrently (so for almost 100 seconds no processing will happen).</p>
<div style="width: 100%; background: #fff">
<pre>
for (int i = 0; i &lt; 100; i++)
{
    int thread = i;
    firstTasks.Add(Task.Run(() =>
    {
        Thread.Sleep(100);
        // Do something else
        secondTasks[thread].Wait();
    }));
}

for (int i = 0; i &lt; 100; i++)
{
    secondTasks.Add(Task.Run(() =>
    {
        // Do something in the background.
    }));
}
</pre>
</div>
<p>In fact, if you increase the upper bound of <i>i</i> from 100 to 1024 this example will never finish, since all 1023 available threads will be taken up by the initial tasks waiting for second tasks that will never be scheduled for execution because of thread exhaustion.</p>
<p>This might seem like a contrived example, but it is actually not that uncommon to end up in a similar scenario if you use non-async code inside a task in a complicated multithreaded application. If you instead write the code as below, it will complete almost immediately and not have any issues regardless of how many iterations of the loop you make. This is because when a thread waits on a task it created, and that task has not yet been scheduled to run on another thread, the task is executed immediately, inline on the waiting thread.</p>
<div style="width: 100%; background: #fff">
<pre>
for (int i = 0; i &lt; 100; i++)
{
    Task.Run(() =>
    {
        Thread.Sleep(100);
        Task secondTask = Task.Run(() =>
        {
            // Do something in the background.
        });
        // Do something else
        secondTask.Wait();
    });
}
</pre>
</div>
<p>One last thing you have to be very careful about when it comes to tasks, especially when using the async syntax: once you await something, there is absolutely no guarantee that execution continues on the same thread. For instance, this code is just waiting to create a deadlock that will be really hard to track down.</p>
<div style="width: 100%; background: #fff">
<pre>
object lockObj = new object();

Monitor.Enter(lockObj);
await MethodAsync();
// Execution may resume on a different thread here, so this Exit
// can throw or leave the lock held by a thread that no longer runs.
Monitor.Exit(lockObj);
</pre>
</div>
<p>There really is no safe way to handle this kind of locking, but if you absolutely need to lock a resource while doing async coding you can use semaphores, which do not require being released from the same thread that acquired them. This generally doesn't lead to good code though, and usually if you think about where your synchronization code is you can avoid holding locks over awaits, although it might take a little bit of extra work.</p>
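<p>As an illustration of the semaphore approach (a sketch of my own, using <b>SemaphoreSlim</b> from .Net 4.5; <b>MethodAsync</b> stands in for any async call):</p>
<div style="width: 100%; background: #fff">
<pre>
private static readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);

public static async Task DoWorkAsync()
{
    // Unlike Monitor, SemaphoreSlim does not care which thread
    // releases it, so it is safe to hold across an await.
    await _mutex.WaitAsync();
    try
    {
        await MethodAsync(); // May resume on a different thread.
    }
    finally
    {
        _mutex.Release();
    }
}
</pre>
</div>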
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com1tag:blogger.com,1999:blog-6440982972218468242.post-35023756249082891852015-01-25T19:15:00.000-08:002015-02-19T10:45:34.096-08:00Choosing your cloud provider<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHeEF8PuE_b6h5XwUt6GRrCGIX2_awwtrolOygRM0lsojFVzFmxSt1CuQRuPkbzGr8ofk1icTnIVMk5otslzhyphenhyphenl7aSdeaXpdNXV3ytqinGYHF54n7ndKd_cCtYLs8g_MwodQ7H66tB6Is/s1600/cloud.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHeEF8PuE_b6h5XwUt6GRrCGIX2_awwtrolOygRM0lsojFVzFmxSt1CuQRuPkbzGr8ofk1icTnIVMk5otslzhyphenhyphenl7aSdeaXpdNXV3ytqinGYHF54n7ndKd_cCtYLs8g_MwodQ7H66tB6Is/s320/cloud.jpg" /></a>When you start any coding project you generally need some sort of server capability, even if the application you're building is not a web site. When choosing your cloud provider there are several different things to think about.
<p>First of all, if what you need is very basic and will not require a high <abbr title="Service Level Agreement">SLA</abbr> or the ability to grow with usage, you are probably better off choosing a virtual private server provider. If you are fine with a Linux box these can be had extremely cheap. I used to be a customer of <a href="http://www.letbox.com/">Letbox</a> and at the time they provided me with a virtual private server for $5/month, a price that is hard to beat. It is however important to realize that this is not a true VM; it is a specialized version of Linux, similar to doing a chroot but with quotas on memory and CPU usage. This means that these VMs can only run Linux. That said, the price is simply in a league of its own, usually cheaper than even spot instances of AWS EC2.</p>
<p>However, once you have something slightly more complicated to run you probably want to go with a "real" cloud provider. These come in two kinds. At the first level are companies providing infrastructure as a service (IaaS), which basically means virtual machines, plus storage and networking for them. It is up to you to build everything you need to run off of these primitives. Companies that offer only this kind of computing include <a href="http://www.skytap.com">Skytap</a>, <a href="http://www.rackspace.com">Rackspace</a> (although Rackspace does have some platform services) and many more.</p>
<p>The next level up is the companies that provide platform as a service (PaaS). All of these companies provide the infrastructure as well if you need it, but on top of this they offer managed services that make creating, deploying and running your applications easier. At a minimum these usually include:</p>
<ul>
<li>Load balancing of traffic to multiple servers.</li>
<li>Auto scaling of new servers to handle varying load.</li>
<li>A fast and scalable NoSQL key value store.</li>
<li>A managed transactional database.</li>
<li>Web hosting.</li>
</ul>
<p>There are, as I see it, three main players in this space: <a href="http://aws.amazon.com">Amazon Web Services</a> (AWS), <a href="http://azure.microsoft.com/">Microsoft Azure</a> and <a href="https://cloud.google.com/appengine/docs">Google App Engine</a>.</p>
<p>Of these Amazon is by far the largest. AWS started out as mainly an infrastructure as a service offering, but now has one of the most complete sets of managed services, and they have by far the largest set of data centers located all around the world, plus one region qualified for US government workloads (having an account on it requires you to be a US citizen, so I cannot use it). Their infrastructure is truly top notch, but their development tools are not great. Only a few languages have an official SDK (I myself have been missing one for Perl).</p>
<p>Microsoft approached this space from the opposite direction from Amazon: they started out by offering specific platform solutions and tightly integrating the development and deployment of Azure applications into their development tool, Visual Studio. It is the only cloud provider I am aware of that for a time did not provide IaaS at all (although they do now). The SDK and tooling for all of their products are truly excellent, especially if you are a .NET C# developer, but many other languages are supported as well. They do, unfortunately and understandably, run most of their infrastructure on Windows, which simply is not as solid as other hypervisors out there. If you are building a solution that requires reliably quick processing this can be a problem, especially if you have a cluster of synchronized machines. These synchronization issues usually only occur a few times a month, though, as the service is migrated to new machines while all the VMs running it undergo the monthly Windows patching. As long as your application does not rely on tight synchronization between several systems you are unlikely to notice it.</p>
<p>Finally there is Google. Like Amazon's, their solution is something that has grown out of their own business, and several of their offerings are obviously surfacings of their internal operations, for instance <a href="https://cloud.google.com/bigquery/what-is-bigquery">BigQuery</a>. Google's infrastructure is fantastic with regard to reliability and performance. They do, though, in my opinion offer the narrowest platform solution of the big three. What they do provide is truly top notch, and unfortunately it is priced accordingly.</p>
<p>Price-wise the big three are relatively similar. If your application can take advantage of AWS spot pricing you can get away with really cheap solutions, though. Google is usually the most expensive (I say usually since prices change all the time for cloud services). One thing worth investigating is whether you qualify for a <a href="http://www.microsoft.com/bizspark/">Microsoft BizSpark</a> membership, because if you do you will receive $150/month of free credits to use for almost anything in Microsoft Azure (and it also includes licenses to almost every product in Microsoft's very extensive portfolio).</p>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-9443898344327167182015-01-24T14:05:00.001-08:002015-01-24T14:05:42.352-08:00How to get more free build or test minutes with your Visual Studio Online accountIf you are one of the lucky ones who has an <a href="https://msdn.microsoft.com/subscriptions/">MSDN</a> or a <a href="http://www.microsoft.com/bizspark/">BizSpark</a> subscription (one of the best deals around on the internet) and use the hosted build environment of <b>Visual Studio Online</b>, it is limiting that you only get 60 minutes of free build time a month if you want to do continuous integration (which you should!) using it. However, I just discovered, by accident, a trick to get around this limit.
<ol>
<li>First of all log into your <a href="https://manage.windowsazure.com/">Azure management</a> console and then go to the tab for <b>Visual Studio Online</b> subscriptions.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvTC8-_uuibYa1GDxELZcQr4WCUIY4OKGpyJ_ckC_bgiby_fwvzNus7GSM5QiUq3tKOUM7Bq8JAkwnLzMxQSD28FrQW7OPbU4NX8mBZHxO3NdcjiqzHbm-SQT4bTFjgiwJuIkJhyBlAgM/s1600/unlink.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvTC8-_uuibYa1GDxELZcQr4WCUIY4OKGpyJ_ckC_bgiby_fwvzNus7GSM5QiUq3tKOUM7Bq8JAkwnLzMxQSD28FrQW7OPbU4NX8mBZHxO3NdcjiqzHbm-SQT4bTFjgiwJuIkJhyBlAgM/s640/unlink.png" /></a></div>
<li>Then, with the subscription you want new build or test minutes for selected, click on the unlink button at the bottom. You will get a warning about losing any licenses you have purchased through your Azure subscription for <b>Visual Studio Online</b>, so if you have purchased any you can't use this trick.</li>
<li>Then click on new at the bottom left of the management screen to link your <b>Visual Studio Online</b> account back to your Azure subscription.</li>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqHLpTy1-ycpSazV4MNSbwx4as-FW1yywruC1QjkKd_1fvaq1xcTRMrPcBI43g77sw2NIvuioZCEvF02FbY_xvZD-RuBCh-ffAEOZYotY7GDcd_XYQoPDDWmMcrXFevSl4x4A_hXfhZJw/s1600/Link.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqHLpTy1-ycpSazV4MNSbwx4as-FW1yywruC1QjkKd_1fvaq1xcTRMrPcBI43g77sw2NIvuioZCEvF02FbY_xvZD-RuBCh-ffAEOZYotY7GDcd_XYQoPDDWmMcrXFevSl4x4A_hXfhZJw/s640/Link.png" /></a></div>
<li>Select the <b>Visual Studio Online</b> account you unlinked earlier and make sure you have the correct subscription selected in the drop-down (it defaults to the pay-as-you-go subscription, so you will need to change this).</li>
<li>Press the link button in the lower right.</li>
</ol>
<p>
That's it. If you go back to your home page on <b>Visual Studio Online</b> you should see that you have a new allotment of build and test minutes.
</p>
<i>DISCLAIMER: You might be violating your terms of service with Microsoft by doing this, and I also expect Microsoft to fix this at some point, so use it at your own risk.</i>Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-31518549853651873962015-01-22T21:07:00.000-08:002015-05-08T00:33:25.687-07:00Problems with singletons<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUvuQgV5nHgunZneiTuMPlMPpv45CyT9Ynmi_8-aXAm1ImWn84uOrAWWn1bRfS0DkJYM9uniKxmsZPjjVTTGNTEhXcnhFr-y3FnQXwCTnXFQhsb6ucQJQdvJVxkoZAEjRNKc5YOs5JIhQ/s1600/Singleton.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUvuQgV5nHgunZneiTuMPlMPpv45CyT9Ynmi_8-aXAm1ImWn84uOrAWWn1bRfS0DkJYM9uniKxmsZPjjVTTGNTEhXcnhFr-y3FnQXwCTnXFQhsb6ucQJQdvJVxkoZAEjRNKc5YOs5JIhQ/s200/Singleton.png" /></a>One of the most basic software design patterns is the <a href="http://en.wikipedia.org/wiki/Singleton_pattern">singleton pattern</a>, and you'd think this wouldn't be one to cause you problems, but in C# it can be surprisingly tricky; I just spent a couple of hours tracking down a bug because I hadn't implemented one properly. The code in question used the first implementation below and was accessed from multiple threads that all started at the same time.
<p>This is the simple pattern and it <i>almost always</i> works; the exception is when the first access comes from multiple threads at the same time, and when it fails it can be a really hard bug to find.</p>
<pre style="background-color: white; clear:both;">
internal class Simple
{
    private static Simple instance;

    public static Simple Instance
    {
        get
        {
            // Not thread safe: two threads can both observe null here
            // and end up creating two different instances.
            if (instance == null)
                instance = new Simple();
            return instance;
        }
    }
}
</pre>
<p>It should be pretty obvious that this class would have problems with concurrency so the simple solution is to just add a lock around the whole thing.</p>
<pre style="background-color: white;">
internal class Lock
{
    private static readonly object lockObj = new object();
    private static Lock instance;

    public static Lock Instance
    {
        get
        {
            // Thread safe, but every access pays for taking the lock.
            lock (lockObj)
            {
                if (instance == null)
                    instance = new Lock();
            }
            return instance;
        }
    }
}
</pre>
<p>This class is simple and does work, but getting the lock has a performance penalty which makes it useful to keep looking.</p>
<pre style="background-color: white;">
internal class DoubleLock
{
    private static readonly object lockObj = new object();
    // volatile keeps the write of the reference from being reordered
    // with the construction of the object, which this pattern relies on.
    private static volatile DoubleLock instance;

    public static DoubleLock Instance
    {
        get
        {
            if (instance == null)
            {
                lock (lockObj)
                {
                    if (instance == null)
                        instance = new DoubleLock();
                }
            }
            return instance;
        }
    }
}
</pre>
<p>This class is a little more complicated, but it has the advantage that no locking is required except around the very first initialization. It does rely on the assignment of a reference to a variable being atomic, which is fortunately a valid assumption in C#. The instance field should also be declared volatile so that the write of the reference cannot be reordered with the construction of the object.</p>
<p>However, you can also use the C# runtime to help you create the singleton using a static field initializer.</p>
<pre style="background-color: white;">
internal class Static
{
    // The runtime guarantees this initializer runs exactly once.
    private static readonly Static instance = new Static();

    public static Static Instance
    {
        get { return instance; }
    }
}
</pre>
<p>This is pretty much as efficient as it gets; you even got rid of the null check, and it is thread safe as well. It does have the disadvantage that the singleton instance is created upon the first access of anything in the class, which might not be what you want if the class has other static members. The following class is based on the same concept but does not create the singleton until the first time the instance is accessed.</p>
<pre style="background-color: white;">
internal class DoubleLazy
{
    // The nested class is not initialized until its static field is
    // first read, which gives lazy creation without any locking code.
    private static class LazyLoader
    {
        public static readonly DoubleLazy instance = new DoubleLazy();
    }

    public static DoubleLazy Instance
    {
        get { return LazyLoader.instance; }
    }
}
</pre>
<p>The nested class will not be initialized until you read the instance. If you are running C# 4.0 or later there is a helper class, Lazy&lt;T&gt;, that makes this easy to do using a lambda expression.</p>
<pre style="background-color: white;">
internal class NewLazy
{
    private static readonly Lazy&lt;NewLazy&gt; instance =
        new Lazy&lt;NewLazy&gt;(() => new NewLazy());

    public static NewLazy Instance
    {
        get { return instance.Value; }
    }
}
</pre>
<p>This method also allows you to check whether you have instantiated the singleton or not, via its IsValueCreated property (you can still do that with the first implementations, but it is not possible with any of the ones that use a static initializer). So which one should you choose? It might depend on different aspects, but if the only thing you care about is performance, I made some relatively unscientific measurements and came up with the following list.</p>
<ul>
<li>The simple static initializer is the absolute fastest implementation.</li>
<li>The nested static initializer is only slightly slower.</li>
<li>The simple non-thread-safe solution is slightly slower.</li>
<li>The double lock solution is only slightly slower than the previous three.</li>
<li>The lazy lambda expression solution takes roughly 50% longer to run than any of the previous solutions.</li>
<li>The lock solution is roughly 150% slower than any of the first 4 solutions.</li>
</ul>
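<p>The "relatively unscientific measuring" above can be sketched along these lines. This is my own harness, not the one used for the numbers in the list; the nested Sample class is a stand-in for the implementations above, and the delegate call adds the same constant overhead to every variant so the relative ordering still holds.</p>

```csharp
using System;
using System.Diagnostics;

internal static class SingletonBenchmark
{
    // Stand-in singleton; swap in Lock, DoubleLock, etc. to compare them.
    private class Sample
    {
        private static readonly Sample instance = new Sample();
        public static Sample Instance
        {
            get { return instance; }
        }
    }

    // Times repeated accesses through a delegate and returns the rate.
    public static double AccessesPerSecond(Func<object> access, int iterations)
    {
        access(); // force initialization and JIT compilation outside the timed loop
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            access();
        watch.Stop();
        return iterations / watch.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        double rate = AccessesPerSecond(() => Sample.Instance, 10000000);
        Console.WriteLine("{0:N0} accesses/second", rate);
    }
}
```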
<p>That said even the slowest solution can still perform roughly <b>40 million accesses</b> to the singleton per second from a single thread on my laptop so unless you access it a lot it really doesn't matter.</p>
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-87757405392930561182015-01-21T02:12:00.000-08:002015-02-19T10:44:24.016-08:00Caching in distributed applicationsIn its basic concept, caching of data is really simple: you have a small, fast storage medium of limited size, and in it you save a subset of items from a larger, slower storage medium that you are likely to use often. Typical examples are the on-die cache in any modern CPU and disk caching in any modern operating system. Yet even here it starts getting complicated once you add multiple cores and need to make sure data from one CPU isn't read by another CPU while the new value is only available in the first CPU's on-die cache.
<p>When you start developing distributed applications this problem becomes incredibly complicated, and you need to really think about what kind of data you are dealing with at every moment to make sure you end up with a high-performance final application. Data tends to fall into one of several categories.</p>
<ul>
<li>Static data that never changes, but is too large to keep in memory on every instance needing it. This is obviously the easiest kind of data since you can keep as many items as possible in local memory and let old ones slip from memory once you run low or after they haven't been used for a certain time.</li>
<li>Seldom-changing data where it isn't critical that it is always completely up to date. This can also be cached locally, but you have to throw away cached entries after a certain amount of time to make sure your data doesn't become too stale. Changes can be written directly to a database since they happen relatively seldom.</li>
<li>Seldom-changing data that always needs to be read correctly in proper transactions. Since this data changes seldom you could simply not cache it. Alternatively you could use a central in-memory cache like <a href="http://redis.io/">Redis</a> or <a href="http://memcached.org/">Memcached</a>; you just have to make sure every place that accesses it uses the same method. It is also somewhat tricky to deal with transactional changes in in-memory databases, but it can be done.</li>
<li>Rapidly changing data where it isn't critical that it is always completely up to date. This works pretty much the same as the seldom-changing case, except that you probably want some sort of in-memory cache for the changes so that you don't have to post every change to a normal database. You can use the in-memory database to build batches of changes that you post periodically instead of posting every change.</li>
<li>Rapidly changing data that always needs to be read atomically. This one is tricky and there isn't really any good way of dealing with it, unless you can arrange for messages that need the same data to end up on the same node for processing every time. Usually this can be done by routing messages based on a specific key and then caching locally based on that key. Since all messages that need the data will always end up on one node this is safe. You do need to handle the case when a processing node goes away, though.</li>
</ul>
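<p>For the staleness-tolerant categories above, a local cache is little more than a dictionary with expiry timestamps. Here is a minimal sketch; the class and member names are my own invention, not from any particular library.</p>

```csharp
using System;
using System.Collections.Concurrent;

// Minimal local cache with a time-to-live, suitable for the
// "seldom changing, staleness tolerated" category above.
internal class TtlCache<TKey, TValue>
{
    private class Entry
    {
        public TValue Value;
        public DateTime Expires;
    }

    private readonly ConcurrentDictionary<TKey, Entry> entries =
        new ConcurrentDictionary<TKey, Entry>();
    private readonly TimeSpan ttl;

    public TtlCache(TimeSpan ttl)
    {
        this.ttl = ttl;
    }

    public TValue GetOrLoad(TKey key, Func<TKey, TValue> load)
    {
        Entry entry;
        if (entries.TryGetValue(key, out entry) && entry.Expires > DateTime.UtcNow)
            return entry.Value; // fresh enough, serve from local memory

        // Stale or missing: hit the backing store and refresh the entry.
        // Concurrent callers may load the same key twice, which is
        // acceptable for staleness-tolerant data.
        var value = load(key);
        entries[key] = new Entry { Value = value, Expires = DateTime.UtcNow + ttl };
        return value;
    }
}
```

<p>A hypothetical <code>LoadUserFromDatabase</code> could then be wrapped as <code>cache.GetOrLoad(42, LoadUserFromDatabase)</code>, hitting the database at most once per key per TTL window.</p>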
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0tag:blogger.com,1999:blog-6440982972218468242.post-67268889417748784862015-01-11T13:22:00.000-08:002015-01-11T13:24:30.469-08:00Check out this bag that my mom and wife made from old Dunkin Donuts coffee bags<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiFToR6YIB4RZH9v4VxlaAIM7Ltfm_msC3wp7KxY2BFzuAzxE7FV8ZsMlGP-JCumDqXiku5zAWNXihcMS_ldeHcNSKVhoeIYLRDaneiO2SriPWvqbkVHgVypWtf5OH75pqjn5jXKixH5I/s1600/IMG_20150110_145441620.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiFToR6YIB4RZH9v4VxlaAIM7Ltfm_msC3wp7KxY2BFzuAzxE7FV8ZsMlGP-JCumDqXiku5zAWNXihcMS_ldeHcNSKVhoeIYLRDaneiO2SriPWvqbkVHgVypWtf5OH75pqjn5jXKixH5I/s320/IMG_20150110_145441620.jpg" /></a></div>Check out this awesome bag that my mom and wife made from used Dunkin Donuts coffee bags! I have some more pictures over <a href="https://plus.google.com/photos/+HenrikJohnson/albums/6102851742501194817">here</a>.
This might take some explaining. When applications are reporting information, I would like it to at least not be possible to guess an identifier (say, by starting at 1 and counting up) and have the data end up on some other user's account. My goal isn't so much to guard against somebody who is intentionally misreporting, but to make it hard enough for all but the most determined attackers.
<p>So how do you do this? One way is to just use GUIDs for every identifier, but I have always hated that and it leads to bad database design, at least in my opinion. My suggestion is instead to use a simple integer identifier counting upwards internally. Whenever that identifier is displayed to an end user, I take the ID and encrypt it using a secret key with AES-256. This results in a pretty much random 16-byte block that you then encode using base64 and present to the user. When anything is reported back, you simply reverse the process on the base64-encoded encrypted value. This makes it almost impossible to guess a valid identifier from the outside, while internally you can still deal with regular integers of varying size for everything.</p>
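<p>A minimal sketch of this scheme in C#: the class name is mine and the all-zero key is a placeholder, since a real deployment would load a secret 32-byte key from configuration. The 8-byte integer is padded to a single 16-byte AES block, so no chaining mode or padding scheme is needed and the token is deterministic and reversible.</p>

```csharp
using System;
using System.Security.Cryptography;

internal static class IdCodec
{
    // Placeholder AES-256 key; load the real secret from configuration.
    private static readonly byte[] Key = new byte[32];

    public static string Encode(long id)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = Key;
            aes.Mode = CipherMode.ECB;      // a single block, so no chaining is needed
            aes.Padding = PaddingMode.None; // we always supply exactly 16 bytes
            var block = new byte[16];
            BitConverter.GetBytes(id).CopyTo(block, 0);
            using (var encryptor = aes.CreateEncryptor())
                return Convert.ToBase64String(
                    encryptor.TransformFinalBlock(block, 0, block.Length));
        }
    }

    public static long Decode(string token)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = Key;
            aes.Mode = CipherMode.ECB;
            aes.Padding = PaddingMode.None;
            var block = Convert.FromBase64String(token);
            using (var decryptor = aes.CreateDecryptor())
                return BitConverter.ToInt64(
                    decryptor.TransformFinalBlock(block, 0, block.Length), 0);
        }
    }
}
```

<p>The same ID always produces the same 24-character token under a given key, and <code>Decode</code> reverses it, so the database itself never sees anything but the plain integer.</p>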
<p>The performance hit should be negligible since AES is implemented in hardware in recent CPUs, and even without hardware support AES is really fast.</p>
Henrik "Mauritz" Johnsonhttp://www.blogger.com/profile/06863085797054446164noreply@blogger.com0