Nginx High-Availability Load Balancing: A Hands-On Zero-Downtime Architecture, from Single Point of Failure to 99.99% Availability
Applicable Scenarios & Prerequisites
Applicable scenarios:
• Web/API services that need high-availability guarantees (SLA > 99.9%)
• Backend services that must scale horizontally (3+ instances)
• Zero-downtime updates and automatic failover requirements
• Multi-datacenter / multi-availability-zone deployments
Prerequisites:
• 2+ Nginx nodes (active-standby or multi-active)
• Keepalived 1.4+ or a cloud provider SLB/ELB
• At least 3 backend instances (with health-check support)
• OS: RHEL 8+ / Ubuntu 22.04+, with root privileges
Environment & Version Matrix
• OS: RHEL 8 / Ubuntu 22.04
• Nginx: 1.24 (official stable repo)
• Keepalived: 1.4+ (tested with 2.2)
Quick Checklist
• [ ] Step 1: Plan the HA architecture (active-standby / dual-active / multi-tier LB)
• [ ] Step 2: Install Nginx and configure basic load balancing
• [ ] Step 3: Configure upstream health checks and failover
• [ ] Step 4: Deploy Keepalived for VIP failover
• [ ] Step 5: Configure session persistence and load-balancing algorithms
• [ ] Step 6: Implement SSL/TLS offloading and certificate management
• [ ] Step 7: Set up monitoring, alerting, and log collection
• [ ] Step 8: Run failure drills and prepare rollback playbooks
Implementation Steps
Step 1: Plan the HA Architecture
Goal: Design an HA topology that meets the business SLA.
Recommended architecture (active-standby + VIP)
                ┌───────────────┐
                │ Floating VIP  │
                │ 192.168.1.100 │
                └───────┬───────┘
                        │ (Keepalived VRRP)
            ┌───────────┴───────────┐
            │                       │
    ┌───────▼───────┐       ┌───────▼───────┐
    │ Nginx Master  │       │ Nginx Backup  │
    │ 192.168.1.10  │       │ 192.168.1.11  │
    │   (MASTER)    │       │   (BACKUP)    │
    └───────┬───────┘       └───────┬───────┘
            │                       │
            └───────────┬───────────┘
                        │
        ┌───────────────┼───────────────┐
        │               │               │
 ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
 │  Backend-1  │ │  Backend-2  │ │  Backend-3  │
 │    :8080    │ │    :8080    │ │    :8080    │
 └─────────────┘ └─────────────┘ └─────────────┘
Step 2: Install and Configure Nginx
Goal: Deploy a standardized Nginx service.
RHEL/CentOS installation
# Add the official Nginx repo
cat <<EOF > /etc/yum.repos.d/nginx.repo
[nginx-stable]
name=nginx stable repo
baseurl=http://nginx.org/packages/rhel/\$releasever/\$basearch/
gpgcheck=1
enabled=1
gpgkey=https://nginx.org/keys/nginx_signing.key
EOF
# Install Nginx
yum install -y nginx
# Enable at boot and start
systemctl enable --now nginx
systemctl status nginx
Ubuntu/Debian installation
# Add the official apt repository
apt update
apt install -y curl gnupg2 ca-certificates lsb-release
curl -fsSL https://nginx.org/keys/nginx_signing.key | gpg --dearmor > /usr/share/keyrings/nginx-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] \
http://nginx.org/packages/ubuntu $(lsb_release -cs) nginx" > /etc/apt/sources.list.d/nginx.list
# Install
apt update
apt install -y nginx
# Start
systemctl enable --now nginx
Verify the installation:
nginx -v
# Output: nginx version: nginx/1.24.0
# Test the configuration
nginx -t
# Output:
# nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
# nginx: configuration file /etc/nginx/nginx.conf test is successful
# Access test
curl -I http://localhost
# Output: HTTP/1.1 200 OK
Step 3: Configure Upstreams and Health Checks
Goal: Load-balance backend services and automatically eject failed nodes.
Basic upstream configuration
# /etc/nginx/conf.d/upstream.conf
upstream backend_pool {
    # Load-balancing algorithm (round robin by default)
    # least_conn;           # least connections
    # ip_hash;              # IP hash (session persistence)
    # hash $request_uri;    # URL hash

    # Backend server list
    server 192.168.1.21:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.22:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.23:8080 weight=3 max_fails=3 fail_timeout=10s backup; # standby node

    # Connection pool settings
    keepalive 128;              # keep 128 idle connections
    keepalive_requests 1000;    # up to 1000 requests per connection
    keepalive_timeout 60s;      # keep idle connections for 60s
}
server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend_pool;

        # Pass through request headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Connection reuse (required for HTTP/1.1 upstream keepalive)
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    # Health-check endpoint (used by Keepalived)
    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}
Parameter notes:
• max_fails=3: mark a server as unavailable after 3 failures
• fail_timeout=10s: retry a failed node after 10 seconds
• backup: used only when all primary nodes have failed
• weight: higher-weight nodes receive proportionally more traffic
Reload the configuration:
nginx -t && nginx -s reload
# Check basic status (requires the stub_status module, included in official packages; see the sketch below)
curl http://localhost/nginx_status
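If no status location exists yet, a minimal sketch can be added inside the server block above (the /nginx_status path here is illustrative; Step 7 configures /stub_status on port 8080 instead):
# Minimal stub_status location — add inside the existing server block
location /nginx_status {
    stub_status;        # exposes active connections, accepts, handled, requests
    access_log off;
    allow 127.0.0.1;    # localhost only
    deny all;
}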
Active Health Checks (NGINX Plus / Open-Source Alternative)
The max_fails/fail_timeout mechanism above is passive: it only reacts to failures of real client requests. Active probing requires NGINX Plus (the health_check directive) or a third-party module.
Option 1: nginx_upstream_check_module (open-source module)
# Download the Nginx source and the health-check module
cd /tmp
wget http://nginx.org/download/nginx-1.24.0.tar.gz
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Build and install
tar xf nginx-1.24.0.tar.gz
cd nginx-1.24.0
patch -p1 < /tmp/nginx_upstream_check_module/check_1.20.1+.patch
./configure --prefix=/etc/nginx \
    --add-module=/tmp/nginx_upstream_check_module \
    --with-http_ssl_module \
    --with-http_v2_module \
    --with-stream
make && make install
Health-check configuration:
upstream backend_pool {
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
server {
    location /upstream_status {
        check_status;
        access_log off;
    }
}
Verify the health checks:
curl http://localhost/upstream_status
# Output:
# Upstream 'backend_pool':
#   server 192.168.1.21:8080 status: up   total: 1234 success: 1230 failed: 4
#   server 192.168.1.22:8080 status: down total: 567  success: 540  failed: 27
Step 4: Deploy Keepalived for the VIP
Goal: Automatic active-standby failover of Nginx via the VRRP protocol.
Install Keepalived
# RHEL/CentOS
yum install -y keepalived
# Ubuntu/Debian
apt install -y keepalived
# Start the service
systemctl enable --now keepalived
Master node configuration
# /etc/keepalived/keepalived.conf (Master: 192.168.1.10)
global_defs {
    router_id NGINX_MASTER
    vrrp_skip_check_adv_addr
    # vrrp_strict    # strict RFC mode rejects authentication; leave disabled when auth_pass is used
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}
vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2     # check every 2 seconds
    weight -20     # lower priority by 20 on failure
    fall 2         # 2 consecutive failures to trigger
    rise 1         # 1 success to recover
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100    # priority (Master higher than Backup)
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass SecurePass2024
    }
    virtual_ipaddress {
        192.168.1.100/24    # VIP address
    }
    track_script {
        check_nginx
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
Backup node configuration
# /etc/keepalived/keepalived.conf (Backup: 192.168.1.11)
# Identical to the Master configuration, except:
# router_id NGINX_BACKUP
# state BACKUP
# priority 90 (lower than the Master)
Health-check script
#!/bin/bash
# /etc/keepalived/check_nginx.sh
# Check the Nginx process and port
pgrep nginx > /dev/null 2>&1 || exit 1
nc -zv localhost 80 > /dev/null 2>&1 || exit 1
# Check the Nginx health endpoint
curl -sf http://localhost/health > /dev/null 2>&1 || exit 1
exit 0
chmod +x /etc/keepalived/check_nginx.sh
# Test the script
/etc/keepalived/check_nginx.sh && echo "OK" || echo "FAIL"
State-transition notification script
#!/bin/bash
# /etc/keepalived/notify.sh
TYPE=$1
DATE=$(date '+%Y-%m-%d %H:%M:%S')
case "$TYPE" in
    MASTER)
        echo "$DATE - Transition to MASTER" >> /var/log/keepalived-state.log
        # Optional: send an alert to Slack/DingTalk (see the sketch below)
        ;;
    BACKUP)
        echo "$DATE - Transition to BACKUP" >> /var/log/keepalived-state.log
        ;;
    FAULT)
        echo "$DATE - Fault detected" >> /var/log/keepalived-state.log
        ;;
esac
chmod +x /etc/keepalived/notify.sh
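To act on the optional Slack/DingTalk alert noted in the MASTER branch, a sketch of a webhook call could be dropped in (WEBHOOK_URL is a placeholder, not a real endpoint):
# Hypothetical webhook notification — replace WEBHOOK_URL with your own endpoint
WEBHOOK_URL="https://hooks.example.com/keepalived"
curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"$DATE - $(hostname) transitioned to $TYPE\"}" \
    "$WEBHOOK_URL" > /dev/null 2>&1 || true    # never let the notify hook itself fail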
Start Keepalived:
systemctl restart keepalived
# Verify the VIP (on the Master node)
ip addr show eth0 | grep 192.168.1.100
# Output:
#   inet 192.168.1.100/24 scope global secondary eth0
# Test VIP reachability
curl -I http://192.168.1.100
# Output: HTTP/1.1 200 OK
Failover test:
# Stop Nginx on the Master node
systemctl stop nginx
# After 2-3 seconds, check the VIP on the Backup node
ip addr show eth0 | grep 192.168.1.100
# The VIP should have floated to the Backup node
# Recover the Master
systemctl start nginx
# The VIP fails back to the Master automatically (higher priority) — see below for disabling preemption
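Automatic failback means a second brief interruption when the Master recovers. If that is undesirable, VRRP preemption can be disabled; a sketch, assuming both nodes are reconfigured as non-preemptive (nopreempt requires state BACKUP on both nodes):
# Non-preemptive variant (apply on both nodes): the VIP stays where it is
# until the current holder actually fails
vrrp_instance VI_1 {
    state BACKUP     # both nodes start as BACKUP in non-preemptive mode
    nopreempt        # do not reclaim the VIP after recovery
    priority 100     # keep distinct priorities (e.g. 100 / 90)
    # ... everything else as in the configuration above
}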
Step 5: Session Persistence and Load-Balancing Algorithms
Goal: Pick the load-balancing strategy that matches the business scenario.
IP hash (session persistence)
upstream backend_pool {
    ip_hash;    # the same client IP always routes to the same backend
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
Use case: stateful applications (sessions not shared between instances)
Drawbacks: sessions are lost when a backend fails, and load can skew (a single IP behind NAT may carry heavy traffic)
Consistent hashing (URL/Cookie)
upstream backend_pool {
    hash $request_uri consistent;    # hash on the request URL
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
# Or hash on a cookie:
# hash $cookie_jsessionid consistent;
Use case: caching (the same resource always routes to the same backend, improving cache hit rates)
Least connections (least_conn)
upstream backend_pool {
    least_conn;    # pick the backend with the fewest active connections
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}
Use case: backends with widely varying processing times (long-lived connections / WebSocket)
Weighted round robin (default)
upstream backend_pool {
    server 192.168.1.21:8080 weight=5;    # 50% of traffic
    server 192.168.1.22:8080 weight=3;    # 30% of traffic
    server 192.168.1.23:8080 weight=2;    # 20% of traffic
}
Use case: heterogeneous backend hardware; distribute traffic according to capacity
Step 6: SSL/TLS Offloading and Certificate Management
Goal: Terminate HTTPS at the Nginx layer and use plain HTTP to backends to reduce overhead.
Obtain a Let's Encrypt certificate
# Install Certbot
# RHEL/CentOS
yum install -y certbot python3-certbot-nginx
# Ubuntu/Debian
apt install -y certbot python3-certbot-nginx
# Request a certificate (configures Nginx automatically)
certbot --nginx -d api.example.com -d www.example.com
# Verify the certificate files
ls -l /etc/letsencrypt/live/api.example.com/
# fullchain.pem privkey.pem chain.pem cert.pem
# Test automatic renewal
certbot renew --dry-run
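Renewal only replaces the files on disk; Nginx must be reloaded to serve the new certificate. One way to wire that up (certbot's --deploy-hook flag is standard; the hook command is a suggestion):
# Reload Nginx after every successful renewal
certbot renew --deploy-hook "nginx -t && nginx -s reload"
# Confirm the distro package installed a renewal timer
systemctl list-timers | grep certbot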
Nginx HTTPS configuration
# /etc/nginx/conf.d/ssl.conf
server {
    listen 80;
    server_name api.example.com;
    # Force redirect to HTTPS
    return 301 https://$server_name$request_uri;
}
server {
    listen 443 ssl http2;
    server_name api.example.com;

    # SSL certificate
    ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    # SSL protocols and cipher suites (Mozilla Intermediate profile)
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;

    # SSL session cache
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_session_tickets off;

    # OCSP stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    resolver 8.8.8.8 8.8.4.4 valid=300s;

    # Security headers
    add_header Strict-Transport-Security "max-age=63072000" always;
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;

    location / {
        proxy_pass http://backend_pool;
        # ... (other proxy settings as before)
    }
}
Verify the SSL configuration:
# Test the SSL handshake
openssl s_client -connect api.example.com:443 -servername api.example.com
# Online test (SSL Labs)
# https://www.ssllabs.com/ssltest/analyze.html?d=api.example.com
# Check the certificate validity period
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates
# Output:
# notBefore=Oct 15 00:00:00 2025 GMT
# notAfter=Jan 13 23:59:59 2026 GMT
Step 7: Monitoring, Alerting, and Log Collection
Prometheus + nginx-prometheus-exporter
# Install nginx-prometheus-exporter
wget https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v0.11.0/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
tar xf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
sudo cp nginx-prometheus-exporter /usr/local/bin/
# Enable Nginx stub_status
cat <<EOF > /etc/nginx/conf.d/status.conf
server {
    listen 8080;
    location /stub_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
EOF
nginx -s reload
# Start the exporter (see the systemd unit sketch below for a production setup)
nohup nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status &
# Verify the metrics
curl http://localhost:9113/metrics | grep nginx_
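nohup does not survive reboots; for production, running the exporter under systemd is more robust. A minimal sketch (the unit name nginx-exporter.service and the User are illustrative):
# /etc/systemd/system/nginx-exporter.service
[Unit]
Description=NGINX Prometheus Exporter
After=network.target nginx.service

[Service]
ExecStart=/usr/local/bin/nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status
Restart=always
User=nobody

[Install]
WantedBy=multi-user.target

# Activate it
systemctl daemon-reload && systemctl enable --now nginx-exporter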
Key PromQL Queries
Note: the stub_status-based exporter only exposes basic counters and connection gauges; the status-label, histogram, and upstream metrics below assume extended metrics (NGINX Plus API or a log-based exporter).
# Nginx request rate
rate(nginx_http_requests_total[1m])
# Backend response time, P99
histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m]))
# Error rate (5xx), percent
rate(nginx_http_requests_total{status=~"5.."}[1m])
/
rate(nginx_http_requests_total[1m]) * 100
# Active upstream connections
nginx_upstream_server_connections{state="active"}
Log collection (JSON format)
# /etc/nginx/nginx.conf
http {
    log_format json_combined escape=json
        '{'
        '"time_local":"$time_local",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr",'
        '"http_referer":"$http_referer",'
        '"http_user_agent":"$http_user_agent"'
        '}';
    access_log /var/log/nginx/access.log json_combined;
}
Log analysis examples:
# Top 10 requested URIs
cat /var/log/nginx/access.log | jq -r '.request' | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
# Average response time
cat /var/log/nginx/access.log | jq -r '.request_time' | awk '{sum+=$1; count++} END {print sum/count}'
# 5xx errors
cat /var/log/nginx/access.log | jq -r 'select(.status >= 500) | .request'
Monitoring and Alerting
Grafana dashboards
Recommended dashboard:
• Nginx Prometheus Exporter overview (ID: 12708)
Core panels:
• Requests/sec (grouped by status code)
• Upstream response time (P50/P95/P99)
• Active connections / waiting connections
• Upstream server health (up/down)
Alert rules
# prometheus-alerts.yaml
groups:
  - name: nginx_alerts
    rules:
      - alert: NginxDown
        expr: up{job="nginx"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nginx instance {{ $labels.instance }} is unreachable"
      - alert: Nginx5xxHigh
        expr: rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Nginx 5xx error rate > 5%"
      - alert: UpstreamDown
        expr: nginx_upstream_server_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backend server {{ $labels.server }} is unhealthy"
      - alert: NginxHighLatency
        expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nginx P99 latency > 1s"
Performance and Capacity
Benchmarking
# Throughput test with wrk
wrk -t4 -c1000 -d60s --latency http://192.168.1.100/
# Expected output (2C4G Nginx):
#   Requests/sec: 15000+
#   Latency (P99): <50ms
#   Transfer/sec: 10MB
# Test SSL performance
wrk -t4 -c1000 -d60s --latency https://api.example.com/
# TLS typically costs 20-30% of throughput
System tuning
# /etc/sysctl.d/99-nginx-tuning.conf
# Network-layer tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# File descriptors
fs.file-max = 2097152
sysctl -p /etc/sysctl.d/99-nginx-tuning.conf
# Nginx worker file-descriptor limit
ulimit -n 100000
# Persist it: edit /etc/security/limits.conf (see the sketch below)
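To make the limit persistent, a sketch of the limits.conf entries, plus the systemd override that actually governs a systemd-managed nginx service (limits.conf only applies to login sessions):
# /etc/security/limits.conf
nginx soft nofile 100000
nginx hard nofile 100000
# For the systemd service, set it in an override instead:
#   systemctl edit nginx
# [Service]
# LimitNOFILE=100000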
Nginx worker configuration:
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;    # autodetect the number of CPU cores
worker_rlimit_nofile 100000;
events {
    use epoll;
    worker_connections 10000;    # max connections per worker
    multi_accept on;
}
Capacity planning
Theoretical capacity of a single Nginx instance:
- Concurrent connections = worker_processes × worker_connections
- QPS ceiling ≈ worker_connections / average response time (s)
Example (4C8G, 100 ms response time):
- Concurrent connections: 4 × 10000 = 40000
- QPS ceiling: 10000 / 0.1 = 100000 QPS (theoretical)
- Practical target: 30000 QPS (keep ~70% headroom)
Security and Compliance
DDoS protection
# Rate-limiting configuration
http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
    server {
        location /api/ {
            limit_req zone=api_limit burst=20 nodelay;
            limit_conn conn_limit 10;    # at most 10 connections per IP
            # other settings...
        }
    }
}
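By default Nginx answers throttled requests with 503; returning 429 makes rate limiting distinguishable from real backend overload (both directives are standard Nginx):
# Return 429 Too Many Requests instead of the default 503
limit_req_status 429;
limit_conn_status 429;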
Access control
# IP allowlist
location /admin/ {
    allow 192.168.1.0/24;
    deny all;
}
# HTTP Basic Auth
location /private/ {
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
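The password file can be created with htpasswd from httpd-tools / apache2-utils (the username admin is illustrative):
# RHEL: yum install -y httpd-tools    Ubuntu: apt install -y apache2-utils
htpasswd -c /etc/nginx/.htpasswd admin    # -c creates the file; omit it when adding more users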
Best Practices (10)
1. Layered health checks: Keepalived checks the Nginx process + Nginx checks backend endpoints + backends run their own self-checks
2. Always configure connection pooling: upstream keepalive ≥ number of backend instances × 32
3. Three-stage timeouts: connect (5s) + send (10s) + read (10s), so slow requests cannot block everything
4. JSON logs: easy to ingest into ELK/Loki; always include request_time/upstream_response_time
5. SSL performance: enable http2, ssl_session_cache, and OCSP stapling
6. Layered rate limiting: global + per-API + business-logic limits
7. Canary releases: split traffic with split_clients or upstream weights (see the sketch after this list)
8. The monitoring trio: QPS, 5xx rate, P99 latency; base alert thresholds on historical P95 values
9. Automatic certificate renewal: Certbot renew + systemd timer, with an alert 30 days before expiry
10. Regular drills: monthly exercises covering Nginx failover, backend removal, and certificate expiry
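A minimal split_clients sketch for practice 7 (the canary_pool upstream and the 10% split are illustrative):
# http context: route ~10% of clients to a canary pool, the rest to stable
split_clients "${remote_addr}${http_user_agent}" $upstream_choice {
    10%    canary_pool;     # hypothetical upstream running the new version
    *      backend_pool;
}
server {
    listen 80;
    location / {
        # a variable in proxy_pass is resolved against defined upstream blocks first
        proxy_pass http://$upstream_choice;
    }
}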
Appendix: Complete Configuration Sample
Production-grade nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    use epoll;
    worker_connections 10000;
    multi_accept on;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format json_combined escape=json '{'
        '"time":"$time_iso8601",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr"'
        '}';
    access_log /var/log/nginx/access.log json_combined;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    server_tokens off;
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    include /etc/nginx/conf.d/*.conf;
}
Tested on: 2025-10, RHEL 8, Nginx 1.24, Keepalived 2.2