High Availability

This guide covers deploying FireBackup Enterprise in a highly available configuration with redundancy, failover, and disaster recovery capabilities.

Architecture Overview

High Availability Components

Availability Targets

Component	Target Uptime	RTO	RPO
API	99.99%	5 min	0
Workers	99.9%	15 min	0
Database	99.99%	1 min	1 min
Redis	99.9%	1 min	5 min

Load Balancer Configuration

Cloud Load Balancers

AWS Application Load Balancer

# Terraform example
resource "aws_lb" "firebackup" {
  name               = "firebackup-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]

  enable_deletion_protection = true
  enable_http2              = true
}

resource "aws_lb_target_group" "api" {
  name     = "firebackup-api"
  port     = 4000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health"
    port                = "traffic-port"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
  }
}

Google Cloud Load Balancing

# GKE Ingress with Cloud Load Balancing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: firebackup
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "firebackup-ip"
    networking.gke.io/managed-certificates: "firebackup-cert"
spec:
  rules:
    - host: firebackup.example.com
      http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: firebackup-api
                port:
                  number: 4000

Self-Hosted Load Balancers

HAProxy Configuration

global
    maxconn 50000
    log /dev/log local0
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend https
    bind *:443 ssl crt /etc/haproxy/certs/firebackup.pem
    http-request set-header X-Forwarded-Proto https
    default_backend api_servers

frontend http
    bind *:80
    redirect scheme https code 301

backend api_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200

    server api1 10.0.1.10:4000 check inter 5s fall 3 rise 2
    server api2 10.0.1.11:4000 check inter 5s fall 3 rise 2
    server api3 10.0.1.12:4000 check inter 5s fall 3 rise 2

API Server High Availability

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: firebackup-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: firebackup-api
  template:
    metadata:
      labels:
        app: firebackup-api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - firebackup-api
              topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - firebackup-api
                topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: firebackup-api
      containers:
        - name: api
          image: firebackup/api:latest
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
            limits:
              cpu: 4000m
              memory: 4Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5

Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: firebackup-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: firebackup-api

Database High Availability

PostgreSQL with Patroni

# Docker Compose for Patroni cluster
services:
  patroni1:
    image: patroni:latest
    environment:
      PATRONI_NAME: patroni1
      PATRONI_POSTGRESQL_CONNECT_ADDRESS: patroni1:5432
      PATRONI_POSTGRESQL_DATA_DIR: /data/postgres
      PATRONI_ETCD_HOSTS: etcd1:2379,etcd2:2379,etcd3:2379
      PATRONI_REPLICATION_USERNAME: replicator
      PATRONI_REPLICATION_PASSWORD: ${REPLICATION_PASSWORD}
      PATRONI_SUPERUSER_USERNAME: postgres
      PATRONI_SUPERUSER_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres1_data:/data/postgres

  patroni2:
    image: patroni:latest
    environment:
      PATRONI_NAME: patroni2
      PATRONI_POSTGRESQL_CONNECT_ADDRESS: patroni2:5432
      # ... same as patroni1
    volumes:
      - postgres2_data:/data/postgres

  patroni3:
    image: patroni:latest
    environment:
      PATRONI_NAME: patroni3
      # ... same as patroni1
    volumes:
      - postgres3_data:/data/postgres

AWS RDS Multi-AZ

# Terraform
resource "aws_db_instance" "firebackup" {
  identifier           = "firebackup-primary"
  engine               = "postgres"
  engine_version       = "16"
  instance_class       = "db.r5.large"
  allocated_storage    = 100
  storage_type         = "gp3"
  storage_encrypted    = true

  multi_az             = true

  db_name              = "firebackup"
  username             = "admin"
  password             = var.db_password

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.firebackup.name

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  performance_insights_enabled = true
  monitoring_interval         = 60
  monitoring_role_arn         = aws_iam_role.rds_monitoring.arn

  deletion_protection = true
}

PostgreSQL Read Replicas

# For read-heavy workloads
resource "aws_db_instance" "firebackup_replica" {
  count = 2

  identifier          = "firebackup-replica-${count.index}"
  replicate_source_db = aws_db_instance.firebackup.identifier
  instance_class      = "db.r5.large"

  auto_minor_version_upgrade = true

  vpc_security_group_ids = [aws_security_group.db.id]
}

Redis High Availability

Redis Sentinel

# docker-compose.yml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis-master:/data

  redis-replica1:
    image: redis:7-alpine
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD} --replicaof redis-master 6379 --masterauth ${REDIS_PASSWORD}
    volumes:
      - redis-replica1:/data

  redis-replica2:
    image: redis:7-alpine
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD} --replicaof redis-master 6379 --masterauth ${REDIS_PASSWORD}
    volumes:
      - redis-replica2:/data

  sentinel1:
    image: redis:7-alpine
    command: redis-sentinel /etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/redis/sentinel.conf

  sentinel2:
    image: redis:7-alpine
    command: redis-sentinel /etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/redis/sentinel.conf

  sentinel3:
    image: redis:7-alpine
    command: redis-sentinel /etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/redis/sentinel.conf

Sentinel configuration (sentinel.conf):

sentinel monitor mymaster redis-master 6379 2
sentinel auth-pass mymaster ${REDIS_PASSWORD}
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

Redis Cluster

# Kubernetes Redis Cluster
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: firebackup-redis
spec:
  clusterSize: 3
  clusterVersion: v7
  persistenceEnabled: true
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1000m
        memory: 1Gi
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 10Gi

Worker Autoscaling

Kubernetes HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: firebackup-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: firebackup-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: External
      external:
        metric:
          name: redis_queue_length
          selector:
            matchLabels:
              queue: backups
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

KEDA Scaling

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: firebackup-worker-scaler
spec:
  scaleTargetRef:
    name: firebackup-worker
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: bull:backups:waiting
        listLength: "5"

Multi-Region Deployment

Active-Passive Setup

Cross-Region Database Replication

# AWS RDS Cross-Region Read Replica
resource "aws_db_instance" "firebackup_cross_region" {
  provider = aws.us_west_2

  identifier          = "firebackup-cross-region"
  replicate_source_db = "arn:aws:rds:us-east-1:123456789:db:firebackup-primary"
  instance_class      = "db.r5.large"

  vpc_security_group_ids = [aws_security_group.db_west.id]
  db_subnet_group_name   = aws_db_subnet_group.west.name

  backup_retention_period = 7
}

Backup & Disaster Recovery

Database Backup Strategy

# Kubernetes CronJob for database backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: firebackup-db-backup
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16-alpine
              command:
                - /bin/sh
                - -c
                - |
                  TIMESTAMP=$(date +%Y%m%d_%H%M%S)
                  pg_dump $DATABASE_URL | gzip > /backup/db_${TIMESTAMP}.sql.gz
                  aws s3 cp /backup/db_${TIMESTAMP}.sql.gz s3://firebackup-dr/backups/
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: firebackup-db
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup
              emptyDir:
                sizeLimit: 10Gi

Point-in-Time Recovery

For PostgreSQL with WAL archiving:

-- postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://firebackup-wal/%f'
archive_timeout = 60

Recovery:

# Create recovery.conf
restore_command = 'aws s3 cp s3://firebackup-wal/%f %p'
recovery_target_time = '2024-01-15 10:30:00'

Health Checks & Monitoring

Application Health Endpoints

// Health check implementation
@Controller('health')
export class HealthController {
  @Get()
  async healthCheck() {
    return {
      status: 'healthy',
      timestamp: new Date().toISOString(),
      checks: {
        database: await this.checkDatabase(),
        redis: await this.checkRedis(),
        storage: await this.checkStorage()
      }
    };
  }
}

Monitoring Stack

# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: firebackup
spec:
  selector:
    matchLabels:
      app: firebackup
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: firebackup-alerts
spec:
  groups:
    - name: firebackup.rules
      rules:
        - alert: FireBackupAPIDown
          expr: up{job="firebackup-api"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "FireBackup API is down"
            description: "API has been down for more than 1 minute"

        - alert: FireBackupHighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"

        - alert: FireBackupBackupQueueBacklog
          expr: redis_queue_length{queue="backups"} > 100
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Backup queue has large backlog"

Failover Procedures

Automatic Failover

Most HA components handle failover automatically:

Component	Failover Mechanism	Time
Load Balancer	Health checks	10-30s
API Pods	Kubernetes	30-60s
PostgreSQL	Patroni/RDS Multi-AZ	30-60s
Redis	Sentinel	5-30s

Manual Failover

For controlled failovers:

# PostgreSQL - Promote replica
pg_ctl promote -D /var/lib/postgresql/data

# Redis Sentinel - Force failover
redis-cli -p 26379 SENTINEL FAILOVER mymaster

# Kubernetes - Drain node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

Capacity Planning

Sizing Guidelines

Workload	API Pods	Workers	PostgreSQL	Redis
Small (5 projects)	2	2	db.t3.medium	2 GB
Medium (25 projects)	3	5	db.r5.large	5 GB
Large (100 projects)	5+	10+	db.r5.xlarge	10 GB
Enterprise (500+)	10+	20+	db.r5.2xlarge	20 GB

Scaling Triggers

CPU > 70% for 5 minutes → Scale up
Memory > 80% → Scale up
Queue length > 50 → Scale workers
Response time > 500ms → Investigate

System Requirements - Hardware requirements
Kubernetes Deployment - K8s configuration
Security Hardening - Security configuration
Upgrade Guide - Version upgrades

Next: Upgrade Guide - Version upgrade procedures.

Architecture Overview​

High Availability Components​

Availability Targets​

Load Balancer Configuration​

Cloud Load Balancers​

AWS Application Load Balancer​

Google Cloud Load Balancing​

Self-Hosted Load Balancers​

HAProxy Configuration​

API Server High Availability​

Kubernetes Deployment​

Pod Disruption Budget​

Database High Availability​

PostgreSQL with Patroni​

AWS RDS Multi-AZ​

PostgreSQL Read Replicas​

Redis High Availability​

Redis Sentinel​

Redis Cluster​

Worker Autoscaling​

Kubernetes HPA​

KEDA Scaling​

Multi-Region Deployment​

Active-Passive Setup​

Cross-Region Database Replication​

Backup & Disaster Recovery​

Database Backup Strategy​

Point-in-Time Recovery​

Health Checks & Monitoring​

Application Health Endpoints​

Monitoring Stack​

Alerting Rules​

Failover Procedures​

Automatic Failover​

Manual Failover​

Capacity Planning​

Sizing Guidelines​

Scaling Triggers​

Related​