Nginx High-Availability Load Balancing: A Hands-On Zero-Downtime Architecture, from Single Point of Failure to 99.99% Availability
Applicable Scenarios & Prerequisites
Applicable scenarios:
• Web/API services that need high-availability guarantees (SLA > 99.9%)
• Backend services that must scale horizontally (3+ instances)
• Zero-downtime updates and automatic failover requirements
• Multi-datacenter / multi-availability-zone deployments
Prerequisites:
• 2+ Nginx nodes (active-standby or multi-active)
• Keepalived 1.4+ or a cloud provider SLB/ELB
• At least 3 backend instances (with health-check support)
• OS: RHEL 8+ / Ubuntu 22.04+, with root privileges
Environment & Version Matrix
• OS: RHEL 8 / Ubuntu 22.04
• Nginx: 1.24 (official stable repo)
• Keepalived: 1.4+ (tested with 2.2)
Quick Checklist
• [ ] Step 1: Plan the HA architecture (active-standby / dual-active / multi-tier LB)
• [ ] Step 2: Install Nginx and configure basic load balancing
• [ ] Step 3: Configure upstream health checks and failover
• [ ] Step 4: Deploy Keepalived for VIP failover
• [ ] Step 5: Configure session persistence and load-balancing algorithms
• [ ] Step 6: Implement SSL/TLS offloading and certificate management
• [ ] Step 7: Set up monitoring, alerting, and log collection
• [ ] Step 8: Run failure drills and prepare rollback playbooks
Implementation Steps
Step 1: Plan the HA Architecture
Goal: Design an HA topology that meets the business SLA.
Recommended architecture (active-standby + VIP)
                ┌───────────────┐
                │ Floating VIP  │
                │ 192.168.1.100 │
                └───────┬───────┘
                        │ (Keepalived VRRP)
            ┌───────────┴───────────┐
            │                       │
    ┌───────▼───────┐       ┌───────▼───────┐
    │ Nginx Master  │       │ Nginx Backup  │
    │ 192.168.1.10  │       │ 192.168.1.11  │
    │   (MASTER)    │       │   (BACKUP)    │
    └───────┬───────┘       └───────┬───────┘
            │                       │
            └───────────┬───────────┘
                        │
        ┌───────────────┼───────────────┐
        │               │               │
 ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
 │  Backend-1  │ │  Backend-2  │ │  Backend-3  │
 │    :8080    │ │    :8080    │ │    :8080    │
 └─────────────┘ └─────────────┘ └─────────────┘
Step 2: Install and Configure Nginx
Goal: Deploy a standardized Nginx service.
RHEL/CentOS installation
# Add the official Nginx repo
cat <<EOF > /etc/yum.repos.d/nginx.repo
[nginx-stable]
name=nginx stable repo
baseurl=http://nginx.org/packages/rhel/\$releasever/\$basearch/
gpgcheck=1
enabled=1
gpgkey=https://nginx.org/keys/nginx_signing.key
EOF
# Install Nginx
yum install -y nginx
# Enable at boot and start
systemctl enable --now nginx
systemctl status nginx
Ubuntu/Debian installation
# Add the official apt repository
apt update
apt install -y curl gnupg2 ca-certificates lsb-release
curl -fsSL https://nginx.org/keys/nginx_signing.key | gpg --dearmor > /usr/share/keyrings/nginx-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] \
http://nginx.org/packages/ubuntu $(lsb_release -cs) nginx" > /etc/apt/sources.list.d/nginx.list
# Install
apt update
apt install -y nginx
# Start
systemctl enable --now nginx
Verify the installation:
nginx -v
# Output: nginx version: nginx/1.24.0
# Test the configuration
nginx -t
# Output:
# nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
# nginx: configuration file /etc/nginx/nginx.conf test is successful
# Access test
curl -I http://localhost
# Output: HTTP/1.1 200 OK
Step 3: Configure Upstreams and Health Checks
Goal: Load-balance backend services and automatically eject failed nodes.
Basic upstream configuration
# /etc/nginx/conf.d/upstream.conf
upstream backend_pool {
    # Load-balancing algorithm (round robin by default)
    # least_conn;           # least connections
    # ip_hash;              # IP hash (session persistence)
    # hash $request_uri;    # URL hash

    # Backend server list
    server 192.168.1.21:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.22:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.23:8080 weight=3 max_fails=3 fail_timeout=10s backup; # standby node

    # Connection pool settings
    keepalive 128;              # keep 128 idle connections
    keepalive_requests 1000;    # up to 1000 requests per connection
    keepalive_timeout 60s;      # keep idle connections for 60s
}
server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend_pool;

        # Pass through request headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Connection reuse (required for HTTP/1.1 upstream keepalive)
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    # Health-check endpoint (used by Keepalived)
    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}
Parameter notes:
• max_fails=3: mark a server as unavailable after 3 failures
• fail_timeout=10s: retry a failed node after 10 seconds
• backup: used only when all primary nodes have failed
• weight: higher-weight nodes receive proportionally more traffic
Reload the configuration:
nginx -t && nginx -s reload
# Check basic status (requires the stub_status module, included in official packages; see the sketch below)
curl http://localhost/nginx_status
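If no status location exists yet, a minimal sketch can be added inside the server block above (the /nginx_status path here is illustrative; Step 7 configures /stub_status on port 8080 instead):
# Minimal stub_status location — add inside the existing server block
location /nginx_status {
    stub_status;        # exposes active connections, accepts, handled, requests
    access_log off;
    allow 127.0.0.1;    # localhost only
    deny all;
}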
Active Health Checks (NGINX Plus / Open-Source Alternative)
The max_fails/fail_timeout mechanism above is passive: it only reacts to failures of real client requests. Active probing requires NGINX Plus (the health_check directive) or a third-party module.
Option 1: nginx_upstream_check_module (open-source module)
# Download the Nginx source and the health-check module
cd /tmp
wget http://nginx.org/download/nginx-1.24.0.tar.gz
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Build and install
tar xf nginx-1.24.0.tar.gz
cd nginx-1.24.0
patch -p1 < /tmp/nginx_upstream_check_module/check_1.20.1+.patch
./configure --prefix=/etc/nginx \
    --add-module=/tmp/nginx_upstream_check_module \
    --with-http_ssl_module \
    --with-http_v2_module \
    --with-stream
make && make install
Health-check configuration:
upstream backend_pool {
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
server {
    location /upstream_status {
        check_status;
        access_log off;
    }
}
Verify the health checks:
curl http://localhost/upstream_status
# Output:
# Upstream 'backend_pool':
#   server 192.168.1.21:8080 status: up   total: 1234 success: 1230 failed: 4
#   server 192.168.1.22:8080 status: down total: 567  success: 540  failed: 27
Step 4: Deploy Keepalived for the VIP
Goal: Automatic active-standby failover of Nginx via the VRRP protocol.
Install Keepalived
# RHEL/CentOS
yum install -y keepalived
# Ubuntu/Debian
apt install -y keepalived
# Start the service
systemctl enable --now keepalived
Master node configuration
# /etc/keepalived/keepalived.conf (Master: 192.168.1.10)
global_defs {
    router_id NGINX_MASTER
    vrrp_skip_check_adv_addr
    # vrrp_strict    # strict RFC mode rejects authentication; leave disabled when auth_pass is used
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}
vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2     # check every 2 seconds
    weight -20     # lower priority by 20 on failure
    fall 2         # 2 consecutive failures to trigger
    rise 1         # 1 success to recover
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100    # priority (Master higher than Backup)
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass SecurePass2024
    }
    virtual_ipaddress {
        192.168.1.100/24    # VIP address
    }
    track_script {
        check_nginx
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
Backup node configuration
# /etc/keepalived/keepalived.conf (Backup: 192.168.1.11)
# Identical to the Master configuration, except:
# router_id NGINX_BACKUP
# state BACKUP
# priority 90 (lower than the Master)
Health-check script
#!/bin/bash
# /etc/keepalived/check_nginx.sh
# Check the Nginx process and port
pgrep nginx > /dev/null 2>&1 || exit 1
nc -zv localhost 80 > /dev/null 2>&1 || exit 1
# Check the Nginx health endpoint
curl -sf http://localhost/health > /dev/null 2>&1 || exit 1
exit 0
chmod +x /etc/keepalived/check_nginx.sh
# Test the script
/etc/keepalived/check_nginx.sh && echo "OK" || echo "FAIL"
State-transition notification script
#!/bin/bash
# /etc/keepalived/notify.sh
TYPE=$1
DATE=$(date '+%Y-%m-%d %H:%M:%S')
case "$TYPE" in
    MASTER)
        echo "$DATE - Transition to MASTER" >> /var/log/keepalived-state.log
        # Optional: send an alert to Slack/DingTalk (see the sketch below)
        ;;
    BACKUP)
        echo "$DATE - Transition to BACKUP" >> /var/log/keepalived-state.log
        ;;
    FAULT)
        echo "$DATE - Fault detected" >> /var/log/keepalived-state.log
        ;;
esac
chmod +x /etc/keepalived/notify.sh
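To act on the optional Slack/DingTalk alert noted in the MASTER branch, a sketch of a webhook call could be dropped in (WEBHOOK_URL is a placeholder, not a real endpoint):
# Hypothetical webhook notification — replace WEBHOOK_URL with your own endpoint
WEBHOOK_URL="https://hooks.example.com/keepalived"
curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"$DATE - $(hostname) transitioned to $TYPE\"}" \
    "$WEBHOOK_URL" > /dev/null 2>&1 || true    # never let the notify hook itself fail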
Start Keepalived:
systemctl restart keepalived
# Verify the VIP (on the Master node)
ip addr show eth0 | grep 192.168.1.100
# Output:
#   inet 192.168.1.100/24 scope global secondary eth0
# Test VIP reachability
curl -I http://192.168.1.100
# Output: HTTP/1.1 200 OK
Failover test:
# Stop Nginx on the Master node
systemctl stop nginx
# After 2-3 seconds, check the VIP on the Backup node
ip addr show eth0 | grep 192.168.1.100
# The VIP should have floated to the Backup node
# Recover the Master
systemctl start nginx
# The VIP fails back to the Master automatically (higher priority) — see below for disabling preemption
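Automatic failback means a second brief interruption when the Master recovers. If that is undesirable, VRRP preemption can be disabled; a sketch, assuming both nodes are reconfigured as non-preemptive (nopreempt requires state BACKUP on both nodes):
# Non-preemptive variant (apply on both nodes): the VIP stays where it is
# until the current holder actually fails
vrrp_instance VI_1 {
    state BACKUP     # both nodes start as BACKUP in non-preemptive mode
    nopreempt        # do not reclaim the VIP after recovery
    priority 100     # keep distinct priorities (e.g. 100 / 90)
    # ... everything else as in the configuration above
}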
Step 5: Session Persistence and Load-Balancing Algorithms
Goal: Pick the load-balancing strategy that matches the business scenario.
IP hash (session persistence)
upstream backend_pool {
    ip_hash;    # the same client IP always routes to the same backend
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
Use case: stateful applications (sessions not shared between instances)
Drawbacks: sessions are lost when a backend fails, and load can skew (a single IP behind NAT may carry heavy traffic)
Consistent hashing (URL/Cookie)
upstream backend_pool {
    hash $request_uri consistent;    # hash on the request URL
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
# Or hash on a cookie:
# hash $cookie_jsessionid consistent;
Use case: caching (the same resource always routes to the same backend, improving cache hit rates)
Least connections (least_conn)
upstream backend_pool {
    least_conn;    # pick the backend with the fewest active connections
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
Use case: backends with widely varying processing times (long-lived connections / WebSocket)
Weighted round robin (default)
upstream backend_pool {
    server 192.168.1.21:8080 weight=5;    # 50% of traffic
    server 192.168.1.22:8080 weight=3;    # 30% of traffic
    server 192.168.1.23:8080 weight=2;    # 20% of traffic
}
Use case: heterogeneous backend hardware; distribute traffic according to capacity
Step 6: SSL/TLS Offloading and Certificate Management
Goal: Terminate HTTPS at the Nginx layer and use plain HTTP to backends to reduce overhead.
Obtain a Let's Encrypt certificate
# Install Certbot
# RHEL/CentOS
yum install -y certbot python3-certbot-nginx
# Ubuntu/Debian
apt install -y certbot python3-certbot-nginx
# Request a certificate (configures Nginx automatically)
certbot --nginx -d api.example.com -d www.example.com
# Verify the certificate files
ls -l /etc/letsencrypt/live/api.example.com/
# fullchain.pem privkey.pem chain.pem cert.pem
# Test automatic renewal
certbot renew --dry-run
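Renewal only replaces the files on disk; Nginx must be reloaded to serve the new certificate. One way to wire that up (certbot's --deploy-hook flag is standard; the hook command is a suggestion):
# Reload Nginx after every successful renewal
certbot renew --deploy-hook "nginx -t && nginx -s reload"
# Confirm the distro package installed a renewal timer
systemctl list-timers | grep certbot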
Nginx HTTPS configuration
# /etc/nginx/conf.d/ssl.conf
server {
    listen 80;
    server_name api.example.com;
    # Force redirect to HTTPS
    return 301 https://$server_name$request_uri;
}
server {
    listen 443 ssl http2;
    server_name api.example.com;

    # SSL certificate
    ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    # SSL protocols and cipher suites (Mozilla Intermediate profile)
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;

    # SSL session cache
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_session_tickets off;

    # OCSP stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    resolver 8.8.8.8 8.8.4.4 valid=300s;

    # Security headers
    add_header Strict-Transport-Security "max-age=63072000" always;
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;

    location / {
        proxy_pass http://backend_pool;
        # ... (other proxy settings as before)
    }
}
Verify the SSL configuration:
# Test the SSL handshake
openssl s_client -connect api.example.com:443 -servername api.example.com
# Online test (SSL Labs)
# https://www.ssllabs.com/ssltest/analyze.html?d=api.example.com
# Check the certificate validity period
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates
# Output:
# notBefore=Oct 15 00:00:00 2025 GMT
# notAfter=Jan 13 23:59:59 2026 GMT
Step 7: Monitoring, Alerting, and Log Collection
Prometheus + nginx-prometheus-exporter
# Install nginx-prometheus-exporter
wget https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v0.11.0/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
tar xf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
sudo cp nginx-prometheus-exporter /usr/local/bin/
# Enable Nginx stub_status
cat <<EOF > /etc/nginx/conf.d/status.conf
server {
    listen 8080;
    location /stub_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
EOF
nginx -s reload
# Start the exporter (see the systemd unit sketch below for a production setup)
nohup nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status &
# Verify the metrics
curl http://localhost:9113/metrics | grep nginx_
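nohup does not survive reboots; for production, running the exporter under systemd is more robust. A minimal sketch (the unit name nginx-exporter.service and the User are illustrative):
# /etc/systemd/system/nginx-exporter.service
[Unit]
Description=NGINX Prometheus Exporter
After=network.target nginx.service

[Service]
ExecStart=/usr/local/bin/nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status
Restart=always
User=nobody

[Install]
WantedBy=multi-user.target

# Activate it
systemctl daemon-reload && systemctl enable --now nginx-exporter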
Key PromQL Queries
Note: the stub_status-based exporter only exposes basic counters and connection gauges; the status-label, histogram, and upstream metrics below assume extended metrics (NGINX Plus API or a log-based exporter).
# Nginx request rate
rate(nginx_http_requests_total[1m])
# Backend response time, P99
histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m]))
# Error rate (5xx), percent
rate(nginx_http_requests_total{status=~"5.."}[1m])
/
rate(nginx_http_requests_total[1m]) * 100
# Active upstream connections
nginx_upstream_server_connections{state="active"}
Log collection (JSON format)
# /etc/nginx/nginx.conf
http {
    log_format json_combined escape=json
        '{'
        '"time_local":"$time_local",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr",'
        '"http_referer":"$http_referer",'
        '"http_user_agent":"$http_user_agent"'
        '}';
    access_log /var/log/nginx/access.log json_combined;
}
Log analysis examples:
# Top 10 requested URIs
cat /var/log/nginx/access.log | jq -r '.request' | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
# Average response time
cat /var/log/nginx/access.log | jq -r '.request_time' | awk '{sum+=$1; count++} END {print sum/count}'
# 5xx errors
cat /var/log/nginx/access.log | jq -r 'select(.status >= 500) | .request'
Monitoring and Alerting
Grafana dashboards
Recommended dashboard:
• Nginx Prometheus Exporter overview (ID: 12708)
Core panels:
• Requests/sec (grouped by status code)
• Upstream response time (P50/P95/P99)
• Active connections / waiting connections
• Upstream server health (up/down)
Alert rules
# prometheus-alerts.yaml
groups:
  - name: nginx_alerts
    rules:
      - alert: NginxDown
        expr: up{job="nginx"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nginx instance {{ $labels.instance }} is unreachable"
      - alert: Nginx5xxHigh
        expr: rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Nginx 5xx error rate > 5%"
      - alert: UpstreamDown
        expr: nginx_upstream_server_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backend server {{ $labels.server }} is unhealthy"
      - alert: NginxHighLatency
        expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nginx P99 latency > 1s"
Performance and Capacity
Benchmarking
# Throughput test with wrk
wrk -t4 -c1000 -d60s --latency http://192.168.1.100/
# Expected output (2C4G Nginx):
#   Requests/sec: 15000+
#   Latency (P99): <50ms
#   Transfer/sec: 10MB
# Test SSL performance
wrk -t4 -c1000 -d60s --latency https://api.example.com/
# TLS typically costs 20-30% of throughput
System tuning
# /etc/sysctl.d/99-nginx-tuning.conf
# Network-layer tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# File descriptors
fs.file-max = 2097152
sysctl -p /etc/sysctl.d/99-nginx-tuning.conf
# Nginx worker file-descriptor limit
ulimit -n 100000
# Persist it: edit /etc/security/limits.conf (see the sketch below)
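To make the limit persistent, a sketch of the limits.conf entries, plus the systemd override that actually governs a systemd-managed nginx service (limits.conf only applies to login sessions):
# /etc/security/limits.conf
nginx soft nofile 100000
nginx hard nofile 100000
# For the systemd service, set it in an override instead:
#   systemctl edit nginx
# [Service]
# LimitNOFILE=100000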
Nginx worker configuration:
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;    # autodetect the number of CPU cores
worker_rlimit_nofile 100000;
events {
    use epoll;
    worker_connections 10000;    # max connections per worker
    multi_accept on;
}
Capacity planning
Theoretical capacity of a single Nginx instance:
- Concurrent connections = worker_processes × worker_connections
- QPS ceiling ≈ worker_connections / average response time (s)
Example (4C8G, 100 ms response time):
- Concurrent connections: 4 × 10000 = 40000
- QPS ceiling: 10000 / 0.1 = 100000 QPS (theoretical)
- Practical target: 30000 QPS (keep ~70% headroom)
Security and Compliance
DDoS protection
# Rate-limiting configuration
http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
    server {
        location /api/ {
            limit_req zone=api_limit burst=20 nodelay;
            limit_conn conn_limit 10;    # at most 10 connections per IP
            # other settings...
        }
    }
}
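By default Nginx answers throttled requests with 503; returning 429 makes rate limiting distinguishable from real backend overload (both directives are standard Nginx):
# Return 429 Too Many Requests instead of the default 503
limit_req_status 429;
limit_conn_status 429;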
Access control
# IP allowlist
location /admin/ {
    allow 192.168.1.0/24;
    deny all;
}
# HTTP Basic Auth
location /private/ {
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
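The password file can be created with htpasswd from httpd-tools / apache2-utils (the username admin is illustrative):
# RHEL: yum install -y httpd-tools    Ubuntu: apt install -y apache2-utils
htpasswd -c /etc/nginx/.htpasswd admin    # -c creates the file; omit it when adding more users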
Best Practices (10)
1. Layered health checks: Keepalived checks the Nginx process + Nginx checks backend endpoints + backends run their own self-checks
2. Always configure connection pooling: upstream keepalive ≥ number of backend instances × 32
3. Three-stage timeouts: connect (5s) + send (10s) + read (10s), so slow requests cannot block everything
4. JSON logs: easy to ingest into ELK/Loki; always include request_time/upstream_response_time
5. SSL performance: enable http2, ssl_session_cache, and OCSP stapling
6. Layered rate limiting: global + per-API + business-logic limits
7. Canary releases: split traffic with split_clients or upstream weights (see the sketch after this list)
8. The monitoring trio: QPS, 5xx rate, P99 latency; base alert thresholds on historical P95 values
9. Automatic certificate renewal: Certbot renew + systemd timer, with an alert 30 days before expiry
10. Regular drills: monthly exercises covering Nginx failover, backend removal, and certificate expiry
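A minimal split_clients sketch for practice 7 (the canary_pool upstream and the 10% split are illustrative):
# http context: route ~10% of clients to a canary pool, the rest to stable
split_clients "${remote_addr}${http_user_agent}" $upstream_choice {
    10%    canary_pool;     # hypothetical upstream running the new version
    *      backend_pool;
}
server {
    listen 80;
    location / {
        # a variable in proxy_pass is resolved against defined upstream blocks first
        proxy_pass http://$upstream_choice;
    }
}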
Appendix: Complete Configuration Sample
Production-grade nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    use epoll;
    worker_connections 10000;
    multi_accept on;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format json_combined escape=json '{'
        '"time":"$time_iso8601",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr"'
        '}';
    access_log /var/log/nginx/access.log json_combined;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    server_tokens off;
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    include /etc/nginx/conf.d/*.conf;
}
Tested on: 2025-10, RHEL 8, Nginx 1.24, Keepalived 2.2