Todo:

Go through the following resources:

Good list for non-functional requirements

  1. The CAP Theorem (during a network partition, choose between consistency and availability)
  2. Scalability
    1. Anything special that needs to be extra scalable at certain times or days?
    2. Burst scaling, user surges, etc.
    3. Is the system read-heavy or write-heavy? Plan accordingly.
  3. Environment constraints
    1. Will your system be impaired by the environment in any way?
      1. Example: Google Maps when you’re driving through a dead zone. You can decide to show an offline version of the map, cache data locally, etc.
    2. Does the hardware have limitations (memory, battery, bandwidth, etc.)?
  4. Security
    1. Knowing how your data is handled and secured
    2. GDPR, PII, etc.
    3. Health data, for example.
  5. Latency
    1. Important to user experience
    2. Message queues and pub/sub systems (e.g. Kafka) can help decrease perceived latency by moving work off the request path
  6. Durability
    1. How important is it that your data isn’t lost?
    2. Data replication, backups
  7. Partition Tolerance

Intro

When prepping for system design, I feel the best way to decide what to use and when is to weigh the benefits and trade-offs of each technology. I’ve decided to attempt a simple list of the various options and when to use them.

I’ll use the following to describe each item:

  • Short description of how it works
  • When to consider it
  • Any limitations that exist with it
  • Any extra notes about it

The list

  • Relational / SQL databases
    • Description:
      • An organized way of storing relational data. Allows for table structures that represent the records you insert / change.
      • Offers ACID guarantees (atomicity, consistency, isolation, durability) via transactions, locking, etc.
      • Offers query performance speed ups through index creation
    • When:
      • You have highly relational data
      • You need consistency guarantees (a read after a write returns the most up to date data)
      • You need to perform joins
    • Limitations:
      • Indexes incur latency at write time because each write needs to update the index’s data as well.
    • Extra:
      • They’re traditionally viewed as CA systems, but distributed options sacrifice the A for P, so they can also be CP systems.
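The ACID and indexing points above can be sketched with Python’s built-in `sqlite3` module (any relational database behaves similarly). The table and column names here are made up for illustration; the `with conn:` block opens a transaction that commits only if every statement succeeds.

```python
import sqlite3

# In-memory SQLite database; any relational DB would behave similarly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 0)")

# An index speeds up reads on `balance`, at the cost of extra work per write.
conn.execute("CREATE INDEX idx_balance ON accounts (balance)")

# Atomicity: both updates commit together, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure, neither row was changed

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# → [(1, 70), (2, 30)]
```

If either UPDATE threw mid-transaction (say, a constraint violation), the rollback would leave both balances untouched, which is exactly the consistency guarantee the bullet points describe.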
  • NoSQL databases
    • Description:
      • Often essentially a giant key-value store (document, column-family, and graph variants exist too)
      • Usually distributed, and allows for really good scaling.
    • When:
      • You need to store data but maybe don’t have the relational structure
      • You need to quickly serialize/deserialize data
      • You have a high number of writes and/or a large volume of data in general
    • Limitations:
      • You often don’t get strong consistency with a NoSQL database; many are AP systems. This is because each of the nodes needs to be updated with the most up-to-date data, which requires some “async” style workflows. (Some NoSQL databases, or certain configurations of them, are CP instead.)
    • Extra:
      • I know AWS DynamoDB supports partition (primary) and sort keys, which correlate to how data is stored on disk. Can make reads faster, similar to indexes.
      • It also supports secondary indexes, but they likely fall into the same camp as the SQL ones (extra cost at write time).
      • Can also support locking via optimistic locking. Version numbers are used for this. This doesn’t make the system CP, though, since it just helps with consistency at write time.
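The optimistic-locking idea above can be shown with a toy in-memory key-value store. This is a sketch of the version-number pattern (DynamoDB implements it with conditional writes); the class and key names here are invented for illustration.

```python
# Toy key-value store with optimistic locking via version numbers.
# A write only succeeds if the item's version hasn't changed since it was read.

class VersionConflict(Exception):
    pass

class Store:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def get(self, key):
        return self._data.get(key, (None, 0))

    def put(self, key, value, expected_version):
        _, current = self.get(key)
        if current != expected_version:
            raise VersionConflict(f"expected v{expected_version}, found v{current}")
        self._data[key] = (value, current + 1)

store = Store()
value, version = store.get("user:1")
store.put("user:1", {"name": "Ada"}, version)      # v0 -> v1, succeeds

try:
    store.put("user:1", {"name": "Bob"}, version)  # stale v0: another write won
except VersionConflict as e:
    print("conflict:", e)                          # rejected; re-read and retry
```

As the bullet notes, this only protects individual writes from clobbering each other; it doesn’t make the overall system CP.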
  • Load Balancer
    • Description:
      • A machine that maintains a public IP that can receive requests, and then forwards them to relevant privately routed servers.
      • Generally, it evenly balances the traffic across multiple servers, without exposing those servers publicly.
    • When:
      • You expect a lot of traffic, and you need to balance it out horizontally
      • You need high availability
      • A small latency hit is okay
    • Limitations:
      • Your load balancer can be a single point of failure. Consider adding multiple, similar to how databases manage multiple nodes.
    • Extra:
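The balancing behavior described above can be sketched as a simple round-robin dispatcher. This is a minimal illustration with made-up server addresses; real load balancers (nginx, HAProxy, cloud load balancers) add health checks, TLS termination, and smarter algorithms like least-connections.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Forwards each incoming request to the next server in rotation."""

    def __init__(self, servers):
        self._servers = cycle(servers)  # endless rotation over the server pool

    def route(self, request):
        server = next(self._servers)
        return f"{server} handled {request}"

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for i in range(4):
    print(lb.route(f"req-{i}"))
# the fourth request wraps back around to 10.0.0.1
```

The clients only ever see the balancer’s address; the pool of private servers behind it can grow or shrink without clients noticing.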
  • Database Read Replicas
    • Description:
      • Instead of having one DB system that handles all read and write requests, we consider creating some read-only copies of the original DB.
      • Generally a main DB will get the writes, copy the data over to the replicas, and the replicas will serve the reads.
    • When:
      • You have a read-heavy system. You don’t want your write DB to be overloaded.
      • You need availability. One of the replicas can take the place of the main, or replicas can be replaced.
    • Limitations:
      • Can impact consistency. If we write to the main DB and then we read immediately after, we don’t have a guarantee that the read replica has the updated data unless we explicitly build for that.
    • Extra:
      • There are probably multiple versions of this, with a leader/follower approach being a main one, but there may also be multi-leader or pooled variants. Not sure which one is industry standard.
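The consistency limitation above is easiest to see in code. This sketch models replication as an explicit step so the window between a write and its arrival at the replica is visible; in a real system that copy happens asynchronously.

```python
class ReplicatedDB:
    """Primary takes writes; a single read replica serves reads."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value        # writes go to the primary only

    def read(self, key):
        return self.replica.get(key)     # reads go to the replica

    def replicate(self):
        self.replica.update(self.primary)  # async in a real system

db = ReplicatedDB()
db.write("user:1", "Ada")
print(db.read("user:1"))   # None -- the replica hasn't caught up yet
db.replicate()
print(db.read("user:1"))   # "Ada" -- visible once replication completes
```

That `None` in the middle is exactly the read-after-write gap the limitation bullet warns about; “read your own writes” requires explicitly routing such reads to the primary or waiting on replication.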
  • Cache
    • Description:
      • Very similar to a NoSQL database in structure, as a key-value store, but typically stores frequently accessed keys somewhere in memory so responses are faster.
      • Generally placed in front of your DB layer, so it can process the requests first, in case a hit exists.
      • Can generally be easily scaled, making it a great option for distributed systems.
      • There are different types of caches:
        • Write-around: write to the DB, then read from the cache. If the value isn’t there, read it from the DB and write it to the cache.
        • Write-back: write to the cache, then eventually (e.g. in batches) write back to the DB.
        • Write-through: write to the cache and immediately write to the DB.
    • When:
      • If you have some subset of data that is frequently accessed, you should use a cache.
      • If you need to prioritize latency, caches can help.
      • If you need to reduce strain on a database, you can use a cache as an alternative to DB replication.
      • If you have a read-heavy system, caches are good.
    • Limitations:
      • If your data accesses are uniform (no hot subset), the cache can actually hurt your latency due to the extra lookup step on frequent misses. Eviction policies might help with this.
      • Caches are usually in-memory (this is why they speed things up). This means they’re not durable. If something happens to the cache, you lose your cache data.
      • Caches may cause consistency concerns
      • Can be a single point of failure. You can replicate caches to help with this.
    • Extra:
      • Try to keep expiry values on cache items so that they properly get removed after some amount of time.
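The read path and expiry advice above can be sketched as a cache-aside lookup with per-item TTLs. `slow_db_read` stands in for a real database call, and the names are illustrative.

```python
import time

DB = {"user:1": "Ada"}          # stand-in for the real database

def slow_db_read(key):
    return DB.get(key)

class TTLCache:
    """In-memory key-value cache where each entry expires after `ttl` seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self._items[key]  # expired entries are evicted lazily
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=60)

def read(key):
    value = cache.get(key)
    if value is None:             # cache miss: fall back to the DB...
        value = slow_db_read(key)
        cache.set(key, value)     # ...and populate the cache for next time
    return value

print(read("user:1"))  # miss -> hits the DB, returns "Ada"
print(read("user:1"))  # hit  -> served from memory, no DB call
```

Being in-memory is what makes the second read fast, and it’s also why the limitation bullets apply: restart the process and `_items` is gone.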
  • CDN (Content Delivery Network)
    • Description:
      • Like a cache, but a set of servers that are geographically spread out and serve static content.
      • Leverages server geolocation to provide speed ups.
      • When a user requests certain data, the CDN is hit first. If the CDN has a cache miss, it fetches the data from your server and caches it; if another user requests that same data, it’s served from the CDN. (A TTL can be sent from the server to the CDN.)
    • When:
      • You have users across the world
      • You have to serve a bunch of static content (think videos)
      • You need to prioritize latency
      • You need to reduce strain on your own servers
    • Limitations:
      • Can be expensive
      • A long cache expiry time means that you may have changes to your website or content that don’t get reflected to users because your cache is technically outdated.
      • Clients may need to handle direct routing if the CDN is down
    • Extra:
      • Dynamic CDNs exist that can serve HTML based on request params.
      • I think some CDNs also let you push content to them ahead of time (origin push) rather than relying purely on cache-miss fills (origin pull)
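The miss-then-fill flow described in this section can be sketched as a single edge node pulling from an origin. This is a toy model with invented names; real CDNs layer in geographic routing, TTL-based expiry, and many edge locations.

```python
class Origin:
    """Your own server: authoritative source, also supplies a TTL per asset."""

    def fetch(self, path):
        return {"body": f"<contents of {path}>", "ttl": 3600}

class Edge:
    """One CDN edge server: serves from its cache, pulls from origin on a miss."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}  # path -> (body, ttl); expiry handling omitted here

    def get(self, path):
        if path not in self.cache:                 # cache miss
            resp = self.origin.fetch(path)         # pull from the origin
            self.cache[path] = (resp["body"], resp["ttl"])
            return self.cache[path][0], "MISS"
        return self.cache[path][0], "HIT"

edge = Edge(Origin())
print(edge.get("/logo.png"))   # first user: MISS, origin is contacted
print(edge.get("/logo.png"))   # next user: HIT, origin never sees the request
```

The second request never reaching the origin is the “reduce strain on your own servers” benefit; the stored TTL is what bounds how stale that cached copy can get.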