System design learning note 1: Design a system that scales to millions of users on AWS

Qi Hu
3 min read · Dec 24, 2021

This is a learning note from this link.

System Design Steps

EC2 + MySQL DB

Vertical scaling

Basic monitoring: CPU, memory, I/O, network

EC2 with public static IP (AWS Elastic IP)

DNS (Route 53) to map the domain to the public IP
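The Route 53 mapping boils down to an A record pointing the domain at the Elastic IP. A minimal sketch of the change request, assuming a placeholder domain and IP; in practice this dict is what gets passed to boto3's `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`:

```python
# Build the ChangeBatch that maps a domain to an Elastic IP via an A record.
# "example.com." and the IP below are placeholders.

def a_record_change(domain: str, elastic_ip: str, ttl: int = 300) -> dict:
    """UPSERT an A record so `domain` resolves to `elastic_ip`."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update if it exists
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": elastic_ip}],
                },
            }
        ]
    }

batch = a_record_change("example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"]["Type"])  # A
```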

Security

Allow incoming requests only on:

  1. 80 for HTTP
  2. 443 for HTTPS
  3. 22 for SSH (ideally restricted to trusted IPs)

Block outbound connections that are not needed.
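A sketch of the ingress rules for the ports above, in the shape boto3's `ec2.authorize_security_group_ingress(..., IpPermissions=...)` accepts. The SSH CIDR here is a placeholder for a trusted range:

```python
# Security group ingress rules: HTTP/HTTPS open to the world, SSH
# restricted. The CIDR "203.0.113.0/24" is a placeholder trusted range.

def ingress_rules(ssh_cidr: str = "203.0.113.0/24") -> list:
    def rule(port: int, cidr: str) -> dict:
        return {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
    return [
        rule(80, "0.0.0.0/0"),   # HTTP from anywhere
        rule(443, "0.0.0.0/0"),  # HTTPS from anywhere
        rule(22, ssh_cidr),      # SSH only from a trusted range
    ]

print([r["FromPort"] for r in ingress_rules()])  # [80, 443, 22]
```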

DNS -> Web Server -> MySQL + Object Store

Add an Object Store (e.g. S3) to store static content

DNS, CDN, Load Balancer -> Web servers, Application Servers -> MySQL (master-slave), Object Store

Horizontal scaling

  1. Multiple Servers across multiple AZs
  2. Multiple DBs in master-slave failover mode

Load Balancer

  1. AWS ELB is highly available
  2. Terminate SSL on the LB to reduce the pressure on backend servers

Application Servers separate from Web Servers

  1. web servers can run as a reverse proxy
  2. some app servers process write APIs, some process read APIs
  3. they scale independently

Add CDN such as CloudFront

DNS, CDN, Load Balancer -> Web servers, Application Servers -> MySQL (master-slave), MySQL Read replicas, Memory Cache, Object Store

First, configure the MySQL DB cache to see if it is sufficient; if not, use a memory cache (e.g. Memcached or Redis) to store:

  1. frequently accessed content from mysql
  2. session data
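The usual pattern here is cache-aside: check the memory cache first and fall back to MySQL on a miss. A minimal sketch, where a dict stands in for Memcached/Redis and `query_mysql` is a stand-in for the real DB call:

```python
# Cache-aside: read from the memory cache first, fall back to MySQL on
# a miss, and populate the cache on the way out.

cache: dict = {}

def query_mysql(user_id: int) -> dict:
    # Placeholder for a real SELECT against MySQL.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    if key in cache:            # cache hit: skip the DB entirely
        return cache[key]
    row = query_mysql(user_id)  # cache miss: read from MySQL...
    cache[key] = row            # ...and fill the cache for next time
    return row

get_user(42)                    # miss, fills the cache
get_user(42)                    # hit, served from memory
```

The same shape works for session data, with a TTL on each entry in a real cache.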

Add read replicas for mysql to reduce load on write master

  1. add an LB in front of the read replicas
  2. most services are read-heavy rather than write-heavy
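With replicas in place, the application has to route reads and writes to different endpoints. A sketch with placeholder hostnames, using round-robin where a real LB would sit in front of the replicas:

```python
# Route SELECTs to read replicas and everything else to the write
# master. Hostnames are placeholders for the real DB endpoints.

import itertools

MASTER = "mysql-master.internal"
REPLICAS = ["mysql-replica-1.internal", "mysql-replica-2.internal"]
_replica_cycle = itertools.cycle(REPLICAS)  # stand-in for an LB

def endpoint_for(sql: str) -> str:
    """Send reads to a replica, writes to the master."""
    if sql.lstrip().upper().startswith("SELECT"):
        return next(_replica_cycle)
    return MASTER

assert endpoint_for("SELECT * FROM users") in REPLICAS
assert endpoint_for("UPDATE users SET name = 'x'") == MASTER
```

One caveat worth remembering: replication lag means a read replica may briefly return stale data after a write.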

More server instances

DNS, CDN, Load Balancer -> Web servers, Application Servers -> MySQL (master-slave), MySQL Read replicas, Memory Cache, Object Store

Add auto-scaling

  1. AWS AutoScaling
  2. one group per app server type/web server type; place each group in multiple AZs
  3. set up min/max number of instances
  4. scale up/down through CloudWatch alarms, using metrics like CPU, latency, network traffic, or a custom metric
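For the CPU case above, a target-tracking policy is the simplest setup. A sketch of the policy in the shape boto3's `autoscaling.put_scaling_policy` accepts; the group name and target value are placeholder choices:

```python
# Target-tracking scaling policy: keep the group's average CPU near a
# target; the ASG scales out above it and in below it.

def cpu_target_policy(asg_name: str, target_cpu: float = 50.0) -> dict:
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }

policy = cpu_target_policy("web-servers-asg")
print(policy["PolicyType"])  # TargetTrackingScaling
```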

Automate DevOps

  1. Chef
  2. Puppet
  3. Ansible

Monitor metrics

  1. host level: single EC2 instance metrics
  2. aggregate level: LB stats
  3. log analysis: Splunk, CloudWatch, CloudTrail
  4. external site performance: New Relic
  5. incidents: PagerDuty
  6. error reporting: Sentry

DNS, CDN, Load Balancer -> Web servers, Application Servers -> MySQL (master-slave), MySQL Read replicas, Memory Cache, Object Store, NoSQL

Consider using a data warehouse to store long-lived data if the DB grows too large.

  1. Redshift can comfortably handle a load of 1 TB of new content per month

Scale memory cache if we reach 40k reads/s
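One common way to scale a cache tier past a single node is to shard keys across several cache nodes by hash. A sketch with placeholder node names; note that plain modulo remaps most keys when a node is added or removed, which is why consistent hashing is often used instead:

```python
# Shard cache keys across nodes by hashing the key. Node names are
# placeholders for real Memcached/Redis endpoints.

import hashlib

CACHE_NODES = ["cache-1", "cache-2", "cache-3"]

def node_for_key(key: str) -> str:
    # md5 spreads keys evenly; any stable hash works here.
    digest = hashlib.md5(key.encode()).hexdigest()
    return CACHE_NODES[int(digest, 16) % len(CACHE_NODES)]

# Every key deterministically maps to exactly one node.
assert node_for_key("user:42") == node_for_key("user:42")
assert node_for_key("user:42") in CACHE_NODES
```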

Think about other scaling patterns for DBs

  1. federation
  2. sharding
  3. denormalization
  4. SQL tuning
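Sharding, for example, spreads rows across several MySQL instances by a partition key. A minimal sketch assuming hash-by-user-id and placeholder shard hostnames (federation, by contrast, would split by function: a users DB, a products DB, and so on):

```python
# Hash-based sharding: each user id maps to exactly one MySQL shard.
# Hostnames are placeholders for the real shard endpoints.

SHARDS = [
    "mysql-shard-0.internal",
    "mysql-shard-1.internal",
    "mysql-shard-2.internal",
    "mysql-shard-3.internal",
]

def shard_for_user(user_id: int) -> str:
    """Pick the shard that owns this user's rows."""
    return SHARDS[user_id % len(SHARDS)]

assert shard_for_user(7) == "mysql-shard-3.internal"
assert shard_for_user(8) == "mysql-shard-0.internal"
```

The trade-off: queries touching a single user stay fast, but cross-shard joins and re-sharding become application-level problems.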

Some data can be moved to a NoSQL DB such as DynamoDB

Processes that do not need to happen in real time can be handled asynchronously with queues and workers

  1. SQS + Lambda
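The pattern: the request path enqueues a job and returns immediately, and a worker drains the queue later. A sketch where the stdlib queue stands in for SQS and the worker loop for a Lambda consumer:

```python
# Queue-and-worker decoupling: enqueue on the request path, process
# later. queue.Queue stands in for SQS; worker_drain for a Lambda.

import queue

jobs: "queue.Queue[dict]" = queue.Queue()
processed = []

def handle_request(user_id: int) -> str:
    # e.g. resizing an uploaded image need not block the HTTP response
    jobs.put({"task": "resize_image", "user_id": user_id})
    return "accepted"           # respond before the work is done

def worker_drain() -> None:
    while not jobs.empty():
        processed.append(jobs.get())  # a real worker would do the task

handle_request(1)
handle_request(2)
worker_drain()
assert len(processed) == 2
```

With SQS, a real consumer would also delete each message after processing and rely on the visibility timeout for retries.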
