Wednesday, January 21, 2015

Caching in a distributed applications

In the basic concept caching of data is really simple. You have a small fast storage medium of limited size and in it you save a subset of items from a larger slower storage medium that you are likely to use often. The typical example would be CPU on die cache which is in any modern CPU or disk caching in any modern operating system. However even in this case it starts getting complicated when you start adding multiple cores and need to make sure data from one CPU isn't being accessed from another CPU while a new value is only available in the first CPU's on die cache.

When you start developing distributed application this problem becomes incredibly complicated and you need to really think about what kind of data you are dealing with at every moment to make sure you end up with a high performance final application. Data tend to fall into one of several different categories.

  • Static data that never changes, but is too large to keep in memory on every instance needing it. This is obviously the easiest kind of data since you can just keep as many as possible available in local memory and start letting old ones slip from memory once you run low or after it hasn't been used for a certain time.
  • Seldom changing data that isn't critical that it is always completely up to date. This can also be cached locally, but you have to make sure you throw away the cache after a certain amount of time to make sure your data doesn't become too stale. Changes can be written directly to a database since they happen relatively seldom.
  • Seldom changing data that need to always be read in proper transactions correctly. Since this data is changed seldom you could just not cache this data. Alternatively you could use an in memory central cache like Redis or Memcached, you just have to make sure every place that access it uses the same method, it is also kind of tricky to deal with transactional changes on in memory databases, but it can be done.
  • Rapidly changing data that isn't critical that it is always completely up to date. Works pretty much the same as seldom changing data except that you probably want some sort of in memory cache for the changes so that you don't have to post every change to a normal database. You can use the in memory database to help you create batches of changes that you post periodically instead of with every change.
  • Rapidly changing data that need to always be read atomically correct. This one is tricky and there isn't really any good way of dealing with this except if you can figure out a way in which messages that need to access the same data ends up on the same node for processing every time. Usually this can be done by routing data based on a specific key and then cache locally based on that. Since you know that all messages that need the data will always end up on one node this should be safe. You do need to make sure you properly handle the case when a processing node goes away though.

No comments: