Wednesday, March 24, 2021

Building for high availability: Security

I plan to write a series of posts on things to think about when designing, building, and operating systems and services with reliability and high availability in mind. I will focus specifically on building services on a cloud platform, and my examples will generally be from AWS, because that is what I know best. Most of the general principles should translate to any cloud provider with sufficient functionality.

It is worth pointing out that the advice here is specifically for reliability and high availability. If, for instance, your goal is rapid prototyping or getting to market quickly, the advice would be very different (perhaps I will do another series of posts on that once I am done with this topic). Sometimes it can be hard to explain to your Product Manager that even though somebody created a working prototype of something in less than a week, it will still take 2 months to create the real thing, and this is one of the reasons why. As a preview of the difference between the two: you can skip this entire section if you are only creating a prototype, because security really does not matter for that (but be wary of the risk of the prototype making it to production, because then you will wish you had not skipped it).

There are many different things that can affect the availability of a service or site that you are building, but probably the first and most important one is to make sure that your site is secure. Other failures, although severe, would not result in the kind of disaster that a security failure could lead to. Not only could your entire service be taken offline or deleted, but all the data you have stored could also be let loose on the dark web.

Defense in depth

The key to designing for security is defense in depth. You should not assume that you can establish a perimeter around your service and trust everything inside it. Instead, you should consider how you can make each subcomponent as secure as possible. This means that if one of your components does get compromised, it does not necessarily mean that your entire service or all your data is compromised. Additionally, having each component always validate and log access appropriately means that a breach in one component can be detected earlier, when an attacker unsuccessfully tries to extend the breach to other components.

The Least Privilege Principle

Each component should only have the minimum privileges needed to perform its job. If you have a component that needs permission to read a specific S3 bucket to perform its job, only grant read access to that specific bucket, not to every bucket in your account, and do not allow it to do anything other than read from S3. The same goes for database access. This way, if a component does get compromised, only the data available to that component is put at risk instead of all the data in your service.
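As a rough sketch of what such a narrowly scoped permission could look like, the Python snippet below builds an IAM policy document that only allows listing and reading a single bucket. The function name and the bucket name `example-reports-bucket` are made up for illustration, and the exact set of actions your component needs may differ.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows reading one bucket and nothing else."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Sid": "ListOneBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading is granted on the objects inside that bucket only.
                "Sid": "ReadObjectsInOneBucket",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-reports-bucket" is a hypothetical bucket name.
    print(json.dumps(read_only_bucket_policy("example-reports-bucket"), indent=2))
```

Note that there is no wildcard in the bucket part of the resource and no write or delete action anywhere in the document, so a compromised component holding this policy can read that one bucket and nothing more.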

Avoid fixed credentials

In AWS, most services allow you to grant permissions based on your execution environment, such as EC2, ECS, or Lambda execution roles, without the need to distribute any credentials. This is a great feature that avoids the possibility of credentials getting lost in the wind and turning up in the wrong places.

If you do have to use fixed credentials, such as for an RDBMS, make sure that these credentials are automatically rotated often so that, for instance, ex-employees do not accidentally retain credentials to your systems.

In the case of AWS, make sure you take advantage of the strong authentication options for the AWS console, and heed the advice to never use the root credentials for anything.

Limit your attack surface

Do not make any component of your service available from the internet unless it absolutely needs to be. Usually this means that only your public API and your website are accessible from the internet.

Make sure that all your internal components are only available to the other internal components that need to communicate with them. In AWS you can accomplish this either through internal APIs inside a VPC, or you can use AWS secured primitives such as queues or event buses to communicate between components.

If you need access to the internal network for operational reasons, make sure that all of this access goes through a bastion host that is truly locked down. In AWS, consider relying on Systems Manager Run Command and ECS Exec instead, which lets you avoid the bastion host altogether.

Avoid managing your own infrastructure and have a patching strategy

Using managed versions of almost any service means that when there is a problem with that service, it is no longer your problem to fix. Instead, there is a specialist team available to handle the issue, and you can just sit back and wait for it to be resolved. Granted, it does mean that you lose some control. But in general that is preferable to the headache of needing to have a specialist on hand for every component you use in a complex system. Managing a component yourself also means that you need a comprehensive upgrade and patching strategy for it. In today's environment you must be prepared to patch within hours of a critical vulnerability being disclosed, if not sooner, or risk complete compromise of that component, as evidenced most recently by the massive Exchange Server hack that compromised at least 30k corporate email servers. If you are using managed services for your components, the headache of patching, especially for security vulnerabilities, is entirely handled for you.

This also extends to using alternative methods of compute such as AWS Fargate and AWS Lambda to remove the burden of patching the OS that you deploy your code on. That said, you are still responsible for patching your own code and making sure you are not relying on libraries that have known vulnerabilities. If you host your code on GitHub and use standard dependency managers, it will provide you with automated vulnerability scanning of your dependencies.

Encrypt everything

Always encrypt everything, both in transit and at rest. Any communication between components should always use TLS. Almost all AWS primitives that store data have an option to encrypt data at rest using a KMS key that you provide or at least a service-owned key. Quite often, though, this functionality needs to be turned on explicitly, so make sure you do this. Furthermore, make sure that access to the keys for sensitive data is only granted to the components that need it. This is an extension of the Least Privilege Principle above. If an adversary does break into your system, this is another way to minimize the amount of data that is accessible and can be exfiltrated.
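For the in-transit half, here is a minimal sketch of a client-side TLS configuration using only the Python standard library. The function name is made up for illustration, and the TLS 1.2 floor is an assumption; pick whatever minimum your own policy requires.

```python
import ssl


def strict_tls_client_context() -> ssl.SSLContext:
    """Return a client TLS context that verifies the peer and refuses old protocol versions."""
    # create_default_context() turns on certificate verification and hostname checking.
    context = ssl.create_default_context()
    # Assumed floor: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = strict_tls_client_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```

A context like this can be handed to standard library clients, for example `http.client.HTTPSConnection(host, context=ctx)`, so that calls between components fail loudly instead of silently falling back to an unverified or outdated connection.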

Pick the right tool for the job

When building a new system, it is important to pick the right language and framework because some are simply safer by design than others.

The first kind of language that is unsuitable is any language that contains unchecked primitives for direct memory access. This group includes languages such as C, C++ and obviously assembly language. The main danger with these kinds of languages is that it is just too easy to make a mistake and create a buffer overflow issue.

The second kind of language or framework to avoid is one that does too much "magic" to help you be productive. Most frameworks that involve Ruby or PHP fall into this category, in my opinion. Not only do these lead to hard-to-maintain code, because it is very hard to understand the real ramifications of a change, but because so much is happening under the hood that you as a developer are probably not aware of, it is also very hard to ensure that this "magic" is not doing something that will lead to a security vulnerability.

Languages that I generally find suitable for building internet-facing services include Java, C#, Python, and TypeScript. This is not an exhaustive list, though, and there are many more.

Avoid SQL

This is really a special case to call out in this section. The tip to avoid RDBMSs will come up repeatedly during this series of blog posts because they are generally not suitable for building high availability systems, for many reasons. However, this tip is not specifically about RDBMSs but about using any kind of database with the SQL query language. Regarding security, probably the most common cause of security breaches today is still SQL injection attacks, and this kind of attack is only possible if your underlying database access language is SQL. There are almost always better choices for databases than SQL for your specific use case. Educate yourself on your options and pick anything that is not SQL. By doing this you also get the added benefit of removing even the possibility of being the target of this entire class of attacks.
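To make the class of attack concrete, here is a small, deliberately vulnerable sketch using Python's built-in `sqlite3` module with an in-memory database. The table, the user names, and the function name are all made up for illustration. The last lines use a plain dictionary as a stand-in for a key-value store, where the input is only ever used as a key and never parsed as a query.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    """Deliberately vulnerable lookup: the input is pasted straight into the SQL text."""
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


# Hypothetical data set in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# A normal lookup returns the one matching row.
normal = find_user_unsafe(conn, "alice")

# A crafted input rewrites the WHERE clause so that it matches every row.
injected = find_user_unsafe(conn, "nobody' OR '1'='1")

# Stand-in for a key-value store: the same input is just a key that does not exist.
store = {"alice": {"name": "alice"}, "bob": {"name": "bob"}}
kv_result = store.get("nobody' OR '1'='1")

if __name__ == "__main__":
    print(normal, sorted(injected), kv_result)
```

The crafted input turns the query into `WHERE name = 'nobody' OR '1'='1'`, which is true for every row, so the caller gets the whole table back. With the key lookup there is no query text for the input to change.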

Various other security related tips and tricks

This section contains some additional tips and tricks, some of them more AWS specific, to help you build secure services.

Be wary of deleting

Some cloud primitives for storing data, such as S3, can be configured so that data cannot be deleted or overwritten. If you enable versioning in S3, remove the permission to delete data altogether, and instead use lifecycle rules to expire data, you can remove the threat of ransomware altogether from that portion of your system. Similarly, enable deletion protection on all other parts of your infrastructure where it is available, such as CloudFormation stacks. This will protect you both from intentional vandalism and from accidents that could take down your service by deleting critical infrastructure.
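A sketch of the two S3 pieces described above, expressed as plain Python dictionaries: a bucket policy that denies deletes for everyone, and a lifecycle configuration that expires data instead. The function names, the rule ID, and the retention periods are assumptions for illustration, and the lifecycle structure is meant to follow the shape of the S3 PutBucketLifecycleConfiguration request, so check it against the current documentation before relying on it.

```python
import json


def deny_delete_bucket_policy(bucket_name: str) -> dict:
    """Bucket policy that denies deleting objects or object versions for every principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllDeletes",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


def expiry_lifecycle_rules(current_days: int, noncurrent_days: int) -> dict:
    """Lifecycle configuration that expires data after a retention period."""
    return {
        "Rules": [
            {
                "ID": "expire-old-data",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
                "Expiration": {"Days": current_days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


if __name__ == "__main__":
    # Bucket name and retention periods are made-up example values.
    print(json.dumps(deny_delete_bucket_policy("example-archive-bucket"), indent=2))
    print(json.dumps(expiry_lifecycle_rules(365, 30), indent=2))
```

With versioning enabled, an overwrite only adds a new version, and the explicit deny covers both the current object and its older versions, so expiry is left to the lifecycle rules.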

Safety of a crowd

When implementing your service perimeter, take advantage of managed components that sit between your service and the internet, both to protect yourself against carefully crafted payloads designed to attack your service and to weather the massive load of a DDoS attack. Examples of these kinds of services are not just AWS WAF, but also Amazon S3, Amazon CloudFront, and Amazon API Gateway. This does not include simple load balancers, though, as these are generally provisioned to handle a single routing task, and even though they do scale, they do so at a slower rate. They also generally do not protect you against malicious payloads the way the other services might.

Limit internet access from your components

Assuming the worst, that an adversary has broken into your system, one way to limit the damage is to remove access to the internet from inside your system. Quite often a service only needs to be accessible from the internet through a load balancer, and all the internal components only really need to talk to other services of your cloud provider. If this is the case for you, using AWS PrivateLink to access the AWS services you need and otherwise having no internet connectivity from your internal service network will make it much harder for an attacker to exfiltrate any data that they may have gained access to.
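One way to keep yourself honest about this is a small check that flags any outbound rule reaching beyond your own network. The sketch below uses only the Python standard library and assumes you have already collected the egress destination CIDRs of your internal components into a list. The function name and the `10.0.0.0/16` VPC range are made up for illustration.

```python
import ipaddress


def egress_outside_vpc(egress_cidrs: list, vpc_cidr: str) -> list:
    """Return the egress destinations that are not fully contained in the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    offending = []
    for cidr in egress_cidrs:
        network = ipaddress.ip_network(cidr)
        # subnet_of() only compares networks of the same IP version,
        # so a different version is treated as outside the VPC.
        if network.version != vpc.version or not network.subnet_of(vpc):
            offending.append(cidr)
    return offending


if __name__ == "__main__":
    # Hypothetical rules: one internal subnet and two "anywhere" destinations.
    rules = ["10.0.1.0/24", "0.0.0.0/0", "::/0"]
    print(egress_outside_vpc(rules, "10.0.0.0/16"))
```

Anything the function returns is a path out of your network that an attacker could use. This simple check does not understand destinations that are expressed as something other than a CIDR, so treat it as a starting point rather than a complete audit.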



Do:

  • Implement defense in depth
  • Encrypt everything
  • Limit attack surface
  • Use the right language and framework

Don't:

  • Manage your own infrastructure if you can avoid it
  • Use fixed credentials
  • Use SQL
