High Availability
This guide covers deploying FireBackup Enterprise in a highly available configuration with redundancy, failover, and disaster recovery capabilities.
Architecture Overview
High Availability Components
A highly available FireBackup deployment combines redundant load balancers, multiple API replicas, autoscaled workers, a replicated PostgreSQL cluster, and Redis with automatic failover. Each layer is covered below.
Availability Targets
RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a failure; RPO (Recovery Point Objective) is the maximum acceptable window of data loss.
| Component | Target Uptime | RTO | RPO |
|---|---|---|---|
| API | 99.99% | 5 min | 0 |
| Workers | 99.9% | 15 min | 0 |
| Database | 99.99% | 1 min | 1 min |
| Redis | 99.9% | 1 min | 5 min |
Load Balancer Configuration
Cloud Load Balancers
AWS Application Load Balancer
# Terraform example
resource "aws_lb" "firebackup" {
name = "firebackup-alb"
internal = false
load_balancer_type = "application"
subnets = var.public_subnets
security_groups = [aws_security_group.alb.id]
enable_deletion_protection = true
enable_http2 = true
}
resource "aws_lb_target_group" "api" {
name = "firebackup-api"
port = 4000
protocol = "HTTP"
vpc_id = var.vpc_id
health_check {
enabled = true
path = "/health"
port = "traffic-port"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 10
}
stickiness {
type = "lb_cookie"
cookie_duration = 86400
}
}
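The ALB also needs a listener that terminates TLS and forwards to the target group. A minimal sketch, assuming the ACM certificate ARN is supplied via a `var.certificate_arn` variable:
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.firebackup.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}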
Google Cloud Load Balancing
# GKE Ingress with Cloud Load Balancing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: firebackup
annotations:
kubernetes.io/ingress.class: "gce"
kubernetes.io/ingress.global-static-ip-name: "firebackup-ip"
networking.gke.io/managed-certificates: "firebackup-cert"
spec:
rules:
- host: firebackup.example.com
http:
paths:
- path: /*
pathType: ImplementationSpecific
backend:
service:
name: firebackup-api
port:
number: 4000
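The `networking.gke.io/managed-certificates` annotation refers to a ManagedCertificate object that must exist separately. A minimal sketch for the `firebackup-cert` name assumed above:
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: firebackup-cert
spec:
  domains:
    - firebackup.example.com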
Self-Hosted Load Balancers
HAProxy Configuration
global
maxconn 50000
log /dev/log local0
chroot /var/lib/haproxy
user haproxy
group haproxy
daemon
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout http-keep-alive 10s
timeout check 10s
maxconn 3000
frontend https
bind *:443 ssl crt /etc/haproxy/certs/firebackup.pem
http-request set-header X-Forwarded-Proto https
default_backend api_servers
frontend http
bind *:80
redirect scheme https code 301
backend api_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
server api1 10.0.1.10:4000 check inter 5s fall 3 rise 2
server api2 10.0.1.11:4000 check inter 5s fall 3 rise 2
server api3 10.0.1.12:4000 check inter 5s fall 3 rise 2
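A single HAProxy instance is itself a single point of failure. One common pattern is to run two HAProxy nodes sharing a virtual IP via keepalived (VRRP); a minimal sketch, assuming interface eth0 and a free address 10.0.1.100 on the load balancer subnet:
# /etc/keepalived/keepalived.conf
vrrp_script chk_haproxy {
    script "pidof haproxy"   # node is eligible only while HAProxy is running
    interval 2
}
vrrp_instance VI_1 {
    state MASTER             # use BACKUP with a lower priority on the peer node
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.0.1.100
    }
    track_script {
        chk_haproxy
    }
}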
API Server High Availability
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: firebackup-api
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: firebackup-api
template:
metadata:
labels:
app: firebackup-api
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- firebackup-api
topologyKey: kubernetes.io/hostname
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- firebackup-api
topologyKey: topology.kubernetes.io/zone
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: firebackup-api
containers:
- name: api
image: firebackup/api:latest # pin a specific version tag in production
resources:
requests:
cpu: 1000m
memory: 1Gi
limits:
cpu: 4000m
memory: 4Gi
livenessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 5
periodSeconds: 5
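The Ingress and probes above assume a `firebackup-api` Service in front of the Deployment. A minimal sketch matching the pod labels and port used above:
apiVersion: v1
kind: Service
metadata:
  name: firebackup-api
spec:
  selector:
    app: firebackup-api
  ports:
    - name: http
      port: 4000
      targetPort: 4000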
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: firebackup-api-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: firebackup-api
Database High Availability
PostgreSQL with Patroni
# Docker Compose for Patroni cluster
services:
patroni1:
image: patroni:latest
environment:
PATRONI_NAME: patroni1
PATRONI_POSTGRESQL_CONNECT_ADDRESS: patroni1:5432
PATRONI_POSTGRESQL_DATA_DIR: /data/postgres
PATRONI_ETCD_HOSTS: etcd1:2379,etcd2:2379,etcd3:2379
PATRONI_REPLICATION_USERNAME: replicator
PATRONI_REPLICATION_PASSWORD: ${REPLICATION_PASSWORD}
PATRONI_SUPERUSER_USERNAME: postgres
PATRONI_SUPERUSER_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- postgres1_data:/data/postgres
patroni2:
image: patroni:latest
environment:
PATRONI_NAME: patroni2
PATRONI_POSTGRESQL_CONNECT_ADDRESS: patroni2:5432
# ... same as patroni1
volumes:
- postgres2_data:/data/postgres
patroni3:
image: patroni:latest
environment:
PATRONI_NAME: patroni3
# ... same as patroni1
volumes:
- postgres3_data:/data/postgres
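The Patroni containers point at a three-node etcd cluster (`PATRONI_ETCD_HOSTS`) that the Compose file must also define. A sketch of one member, assuming etcd2 and etcd3 differ only in name and advertised URLs:
  etcd1:
    image: quay.io/coreos/etcd:v3.5.9
    command: >
      etcd --name etcd1
      --listen-peer-urls http://0.0.0.0:2380
      --listen-client-urls http://0.0.0.0:2379
      --initial-advertise-peer-urls http://etcd1:2380
      --advertise-client-urls http://etcd1:2379
      --initial-cluster etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
      --initial-cluster-state new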
AWS RDS Multi-AZ
# Terraform
resource "aws_db_instance" "firebackup" {
identifier = "firebackup-primary"
engine = "postgres"
engine_version = "16"
instance_class = "db.r5.large"
allocated_storage = 100
storage_type = "gp3"
storage_encrypted = true
multi_az = true
db_name = "firebackup"
username = "admin"
password = var.db_password
vpc_security_group_ids = [aws_security_group.db.id]
db_subnet_group_name = aws_db_subnet_group.firebackup.name
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
deletion_protection = true
}
PostgreSQL Read Replicas
# For read-heavy workloads
resource "aws_db_instance" "firebackup_replica" {
count = 2
identifier = "firebackup-replica-${count.index}"
replicate_source_db = aws_db_instance.firebackup.identifier
instance_class = "db.r5.large"
auto_minor_version_upgrade = true
vpc_security_group_ids = [aws_security_group.db.id]
}
Redis High Availability
Redis Sentinel
# docker-compose.yml
services:
redis-master:
image: redis:7-alpine
command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
volumes:
- redis-master:/data
redis-replica1:
image: redis:7-alpine
command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD} --replicaof redis-master 6379 --masterauth ${REDIS_PASSWORD}
volumes:
- redis-replica1:/data
redis-replica2:
image: redis:7-alpine
command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD} --replicaof redis-master 6379 --masterauth ${REDIS_PASSWORD}
volumes:
- redis-replica2:/data
sentinel1:
image: redis:7-alpine
command: redis-sentinel /etc/redis/sentinel.conf
volumes:
- ./sentinel.conf:/etc/redis/sentinel.conf
sentinel2:
image: redis:7-alpine
command: redis-sentinel /etc/redis/sentinel.conf
volumes:
- ./sentinel.conf:/etc/redis/sentinel.conf
sentinel3:
image: redis:7-alpine
command: redis-sentinel /etc/redis/sentinel.conf
volumes:
- ./sentinel.conf:/etc/redis/sentinel.conf
Sentinel configuration (sentinel.conf). Note that Sentinel does not expand environment variables, so substitute the real password before startup (for example with envsubst), and give each instance its own writable copy of the file, since Sentinel rewrites it at runtime:
sentinel resolve-hostnames yes   # required: the master is addressed by hostname, not IP
sentinel monitor mymaster redis-master 6379 2
sentinel auth-pass mymaster ${REDIS_PASSWORD}
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
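Application clients must discover the current master through Sentinel rather than connecting to `redis-master` directly, so failover stays transparent. A sketch using ioredis, assuming the service names above:
import Redis from 'ioredis';

// Ask the Sentinels for the current master of "mymaster";
// the client reconnects automatically after a failover
const redis = new Redis({
  sentinels: [
    { host: 'sentinel1', port: 26379 },
    { host: 'sentinel2', port: 26379 },
    { host: 'sentinel3', port: 26379 },
  ],
  name: 'mymaster',
  password: process.env.REDIS_PASSWORD,
});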
Redis Cluster
# Kubernetes Redis Cluster
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
name: firebackup-redis
spec:
clusterSize: 3
clusterVersion: v7
persistenceEnabled: true
securityContext:
runAsUser: 1000
fsGroup: 1000
kubernetesConfig:
image: quay.io/opstree/redis:v7.0.5
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
memory: 1Gi
storage:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 10Gi
Worker Autoscaling
Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: firebackup-worker
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: firebackup-worker
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
- type: External
external:
metric:
name: redis_queue_length
selector:
matchLabels:
queue: backups
target:
type: AverageValue
averageValue: "5"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
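The External metric is not built into Kubernetes; it assumes a metrics adapter exposes `redis_queue_length` through the external metrics API. A rough sketch of a prometheus-adapter rule, assuming an exporter already publishes that series to Prometheus:
externalRules:
  - seriesQuery: 'redis_queue_length{queue!=""}'
    metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'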
KEDA Scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: firebackup-worker-scaler
spec:
scaleTargetRef:
name: firebackup-worker
pollingInterval: 15
cooldownPeriod: 300
minReplicaCount: 2
maxReplicaCount: 50
triggers:
- type: redis
metadata:
address: redis:6379
listName: bull:backups:wait # Bull keeps waiting jobs in the <prefix>:<queue>:wait list
listLength: "5"
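If Redis requires a password (as in the Sentinel setup above), KEDA can read it from a Secret through a TriggerAuthentication. A minimal sketch, assuming a Secret named firebackup-redis with a redis-password key:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: firebackup-redis-auth
spec:
  secretTargetRef:
    - parameter: password
      name: firebackup-redis
      key: redis-password
Reference it from the trigger by adding authenticationRef with name firebackup-redis-auth.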
Multi-Region Deployment
Active-Passive Setup
Cross-Region Database Replication
# AWS RDS Cross-Region Read Replica
resource "aws_db_instance" "firebackup_cross_region" {
provider = aws.us_west_2
identifier = "firebackup-cross-region"
replicate_source_db = "arn:aws:rds:us-east-1:123456789012:db:firebackup-primary"
instance_class = "db.r5.large"
vpc_security_group_ids = [aws_security_group.db_west.id]
db_subnet_group_name = aws_db_subnet_group.west.name
backup_retention_period = 7
}
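In an active-passive setup, traffic is steered to the standby region by DNS failover once the replica is promoted. A sketch of the primary Route 53 record, assuming a hosted zone ID in `var.zone_id` and a Route 53 health check against the primary region (the SECONDARY record mirrors it, pointing at the standby load balancer):
resource "aws_route53_record" "api_primary" {
  zone_id        = var.zone_id
  name           = "firebackup.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.api.id

  alias {
    name                   = aws_lb.firebackup.dns_name
    zone_id                = aws_lb.firebackup.zone_id
    evaluate_target_health = true
  }
}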
Backup & Disaster Recovery
Database Backup Strategy
# Kubernetes CronJob for database backups
apiVersion: batch/v1
kind: CronJob
metadata:
name: firebackup-db-backup
spec:
schedule: "0 */4 * * *" # Every 4 hours
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:16-alpine # note: stock postgres images lack the AWS CLI; use an image that bundles both pg_dump and aws
command:
- /bin/sh
- -c
- |
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
pg_dump $DATABASE_URL | gzip > /backup/db_${TIMESTAMP}.sql.gz
aws s3 cp /backup/db_${TIMESTAMP}.sql.gz s3://firebackup-dr/backups/
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: firebackup-db
key: url
volumeMounts:
- name: backup
mountPath: /backup
restartPolicy: OnFailure
volumes:
- name: backup
emptyDir:
sizeLimit: 10Gi
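Restoring is the mirror image of the backup job; rehearse it regularly, since an untested backup is not a backup. A sketch against an empty target database (the object key is illustrative):
# Stream the dump from S3 straight into psql
aws s3 cp s3://firebackup-dr/backups/db_20240115_100000.sql.gz - \
  | gunzip \
  | psql "$DATABASE_URL"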
Point-in-Time Recovery
For PostgreSQL with WAL archiving:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://firebackup-wal/%f'
archive_timeout = 60
Recovery:
# PostgreSQL 12+ removed recovery.conf: set these parameters in
# postgresql.conf, then create an empty recovery.signal file in the
# data directory before starting the server
restore_command = 'aws s3 cp s3://firebackup-wal/%f %p'
recovery_target_time = '2024-01-15 10:30:00'
Health Checks & Monitoring
Application Health Endpoints
// Health check implementation (NestJS)
import { Controller, Get } from '@nestjs/common';

@Controller('health')
export class HealthController {
  @Get()
  async healthCheck() {
    // Run dependency checks in parallel and report each result;
    // reflect failures in the status instead of always claiming health
    const [database, redis, storage] = await Promise.all([
      this.checkDatabase(),
      this.checkRedis(),
      this.checkStorage(),
    ]);
    return {
      status: database && redis && storage ? 'healthy' : 'degraded',
      timestamp: new Date().toISOString(),
      checks: { database, redis, storage },
    };
  }

  // Illustrative stubs: replace with real connectivity checks
  private async checkDatabase(): Promise<boolean> { return true; }
  private async checkRedis(): Promise<boolean> { return true; }
  private async checkStorage(): Promise<boolean> { return true; }
}
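Because the Deployment above points both probes at /health, a failing dependency would restart the pod rather than just pull it out of rotation. A sketch of a separate readiness endpoint that returns 503 while a dependency is down (the check methods are illustrative stubs, as above):
import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';

@Controller('ready')
export class ReadinessController {
  @Get()
  async ready() {
    const [database, redis] = await Promise.all([
      this.checkDatabase(),
      this.checkRedis(),
    ]);
    if (!database || !redis) {
      // 503 fails the readiness probe, so the pod is removed from
      // Service endpoints until the dependency recovers
      throw new ServiceUnavailableException({ database, redis });
    }
    return { status: 'ready' };
  }

  // Illustrative stubs: replace with real connectivity checks
  private async checkDatabase(): Promise<boolean> { return true; }
  private async checkRedis(): Promise<boolean> { return true; }
}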
Monitoring Stack
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: firebackup
spec:
selector:
matchLabels:
app: firebackup
endpoints:
- port: metrics
interval: 15s
path: /metrics
Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: firebackup-alerts
spec:
groups:
- name: firebackup.rules
rules:
- alert: FireBackupAPIDown
expr: up{job="firebackup-api"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "FireBackup API is down"
description: "API has been down for more than 1 minute"
- alert: FireBackupHighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "5xx responses exceed 0.1 per second (5m rate)"
- alert: FireBackupBackupQueueBacklog
expr: redis_queue_length{queue="backups"} > 100
for: 15m
labels:
severity: warning
annotations:
summary: "Backup queue has large backlog"
Failover Procedures
Automatic Failover
Most HA components handle failover automatically:
| Component | Failover Mechanism | Time |
|---|---|---|
| Load Balancer | Health checks | 10-30s |
| API Pods | Kubernetes | 30-60s |
| PostgreSQL | Patroni/RDS Multi-AZ | 30-60s |
| Redis | Sentinel | 5-30s |
Manual Failover
For controlled failovers:
# PostgreSQL - promote a replica (standalone setups; under Patroni, use a switchover instead)
pg_ctl promote -D /var/lib/postgresql/data
# Patroni - controlled switchover to a healthy replica
patronictl -c /etc/patroni.yml switchover
# Redis Sentinel - Force failover
redis-cli -p 26379 SENTINEL FAILOVER mymaster
# Kubernetes - Drain node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
Capacity Planning
Sizing Guidelines
| Workload | API Pods | Workers | PostgreSQL | Redis |
|---|---|---|---|---|
| Small (5 projects) | 2 | 2 | db.t3.medium | 2 GB |
| Medium (25 projects) | 3 | 5 | db.r5.large | 5 GB |
| Large (100 projects) | 5+ | 10+ | db.r5.xlarge | 10 GB |
| Enterprise (500+) | 10+ | 20+ | db.r5.2xlarge | 20 GB |
Scaling Triggers
- CPU > 70% for 5 minutes → Scale up
- Memory > 80% → Scale up
- Queue length > 50 → Scale workers
- Response time > 500ms → Investigate
Related
- System Requirements - Hardware requirements
- Kubernetes Deployment - K8s configuration
- Security Hardening - Security configuration
- Upgrade Guide - Version upgrades
Next: Upgrade Guide - Version upgrade procedures.