From Outage to Recovery: A Distributed Database Troubleshooting Record

老苏畅谈运维
2025-03-23

1. Symptoms

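I had previously built a GBase 8c distributed database on virtual machines (see the earlier article "GBase 8C 集群安装部署全攻略,轻松上手!"). The hosts were not shut down cleanly (they lost power abruptly), which left the cluster in an abnormal state, as shown below: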

-- Check the cluster status
[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.34:2379
{
    "ret":80000301,
    "msg":"Transport endpoint unreach"
}
[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.36:2379
+----+-------------+-------------+-------+---------+--------+
| No |     name    |     host    |  port |  state  | leader |
+----+-------------+-------------+-------+---------+--------+
| 0  | gha_server1 | 10.10.10.34 | 20001 | running |  True  |
+----+-------------+-------------+-------+---------+--------+
+----+------+-------------+------+---------------------------+---------+---------+
| No | name |     host    | port |          work_dir         |  state  |   role  |
+----+------+-------------+------+---------------------------+---------+---------+
| 0  | gtm1 | 10.10.10.34 | 6666 | /home/gbase/data/gtm/gtm1 | running | primary |
+----+------+-------------+------+---------------------------+---------+---------+
+----+------+-------------+------+----------------------------+---------+---------+
| No | name |     host    | port |          work_dir          |  state  |   role  |
+----+------+-------------+------+----------------------------+---------+---------+
| 0  | cn1  | 10.10.10.34 | 5432 | /home/gbase/data/coord/cn1 | running | primary |
+----+------+-------------+------+----------------------------+---------+---------+
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| No | group |  name |     host    |  port |          work_dir          |  state  |   role  |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| 0  |  dn1  | dn1_1 | 10.10.10.35 | 15432 | /home/gbase/data/dn1/dn1_1 | running | primary |
| 1  |  dn2  | dn2_1 | 10.10.10.36 | 20010 | /home/gbase/data/dn2/dn2_1 | running | primary |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
+----+-------------------------+--------+-----------+----------+
| No |           url           |  name  |   state   | isLeader |
+----+-------------------------+--------+-----------+----------+
| 0  | http://10.10.10.36:2379 | node_2 |  healthy  |  False   |
| 1  | http://10.10.10.34:2379 | node_0 | unhealthy |  False   |
| 2  | http://10.10.10.35:2379 | node_1 |  healthy  |   True   |
+----+-------------------------+--------+-----------+----------+

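Querying through 10.10.10.34:2379 failed outright, while the query through 10.10.10.36:2379 showed the etcd member on 10.10.10.34 (node_0) as unhealthy. The problem is therefore on the 10.10.10.34 node.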
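
The manual retry above (first 10.10.10.34, then 10.10.10.36) can be sketched as a small loop. This is only a minimal sketch and not part of the GBase tooling: the function name `first_reachable_endpoint` and the `MONITOR_CMD` override are hypothetical, and the failure test assumes, based solely on the output shown above, that a failed call prints a JSON body containing `"ret":`.

```shell
#!/bin/sh
# Try each etcd endpoint in turn and print the first one through which
# `gha_ctl monitor all` returns a table instead of a JSON error.
# MONITOR_CMD is a hypothetical override so the loop can be exercised
# without a cluster; by default it is the real gha_ctl binary.
MONITOR_CMD=${MONITOR_CMD:-gha_ctl}

first_reachable_endpoint() {
    for url in "$@"; do
        # A non-zero exit (including "command not found") counts as a failure.
        out=$("$MONITOR_CMD" monitor all -H -l "$url" 2>&1) || continue
        # Assumption: an unreachable endpoint yields a body with "ret":<code>,
        # as in the 80000301 example above.
        case $out in
            *'"ret":'*) continue ;;
            '')         continue ;;
        esac
        printf '%s\n' "$url"
        return 0
    done
    return 1
}
```

A call such as `first_reachable_endpoint http://10.10.10.34:2379 http://10.10.10.35:2379 http://10.10.10.36:2379` prints the first usable URL, which can then be passed to `gha_ctl monitor all -H -l`.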

2. Troubleshooting

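Based on the error above, { "ret":80000301, "msg":"Transport endpoint unreach" }, I checked the following:

(1) Whether the clocks are synchronized: is there any time difference between the three machines, and is the ntpd service running?

(2) Whether any machine's IP address has changed, and whether network communication is normal.

(3) Whether the etcd service is running.

The first two checks turned up nothing abnormal. The etcd check did.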
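
For the etcd part of this checklist, one option is to query the `/health` endpoint that an etcd server exposes on its client URL. The following is a minimal sketch under stated assumptions: `curl` is installed, the function names `is_healthy_body` and `probe_endpoint` are hypothetical, and the exact spacing and quoting of the JSON reply may differ between etcd releases, so the match is deliberately tolerant.

```shell
#!/bin/sh
# Classify the JSON body returned by etcd's /health endpoint.
# Whitespace after the colon and optional quotes around true are tolerated.
is_healthy_body() {
    printf '%s' "$1" | grep -Eq '"health"[[:space:]]*:[[:space:]]*"?true"?'
}

# Probe one client URL; an unreachable endpoint counts as unhealthy.
probe_endpoint() {
    body=$(curl -s --max-time 3 "$1/health") || body=''
    if is_healthy_body "$body"; then
        printf '%s healthy\n' "$1"
    else
        printf '%s unhealthy\n' "$1"
    fi
}

# Usage on this cluster (not executed here):
#   for url in http://10.10.10.34:2379 http://10.10.10.35:2379 http://10.10.10.36:2379; do
#       probe_endpoint "$url"
#   done
```

Probing every client URL directly, rather than asking one member about the others, avoids depending on the very endpoint that may be down.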

2.1 Checking the etcd service

-- Check the etcd service status on every node
# systemctl status etcd
● etcd.service - Etcd Server
   Active: activating (start) since Thu 2025-03-22 23:03:26 CST; 48s ago
   [...]
Mar 22 23:03:34 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out

The service was healthy on the other nodes, but on 10.10.10.34 the status output showed the following errors:

[root@gbase8c1 member]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2025-03-20 23:03:26 CST; 48s ago
 Main PID: 10363 (etcd)
   CGroup: /docker/e7ff60899b159f0e16156801bae5649ccb06983ff489abe0cdd252941cb2fcfa/system.slice/etcd.service
           └─10363 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379

Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream MsgApp v2 writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:03:34 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out, possibly due to connection lost
Mar 22 23:03:41 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:03:48 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:03:55 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:04:02 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:04:09 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out

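Judging from the errors, this etcd member times out when it tries to publish its own member information to the cluster, and the service stays in the "activating" state. Possible causes include network problems, a misconfigured node, or insufficient resources.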

2.2 Checking the system log

Check the system log on 10.10.10.34:

[root@gbase8c1 ~]# tail -f /var/log/messages
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 reader)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream MsgApp v2 writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream Message writer)
Mar 22 23:12:24 gbase8c1 etcd: started streaming with peer e0dea71e4a2e0936 (stream Message reader)
Mar 22 23:12:24 gbase8c1 etcd: raft.node: 9c5365ebdda29888 elected leader e0dea71e4a2e0936 at term 173
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream Message reader)
Mar 22 23:12:24 gbase8c1 etcd: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:12:31 gbase8c1 etcd: publish error: etcdserver: request timed out, possibly due to connection lost
Mar 22 23:12:38 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:45 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:52 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:59 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:06 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:13 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:20 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:27 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:34 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:41 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:48 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:54 gbase8c1 systemd: etcd.service start operation timed out. Terminating.
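
In this excerpt a publish attempt fails every 7 seconds until systemd terminates the start job at 23:13:54, roughly 90 seconds after the first log lines of this start attempt, which matches systemd's default start timeout. A quick way to condense such a log is sketched below. The function name `summarize_etcd_log` is hypothetical, and the two patterns are taken verbatim from the messages above, so they may need adjusting for other etcd or systemd versions.

```shell
#!/bin/sh
# Summarise an etcd log excerpt read from stdin: how many publish attempts
# timed out, and whether systemd gave up on the start job.
summarize_etcd_log() {
    awk '
        /publish error: etcdserver: request timed out/ { timeouts++ }
        /etcd\.service start operation timed out/      { killed = 1 }
        END {
            printf "publish_timeouts=%d start_timed_out=%s\n", timeouts, (killed ? "yes" : "no")
        }
    '
}

# Usage (not executed here):
#   tail -n 200 /var/log/messages | summarize_etcd_log
#   journalctl -u etcd --no-pager | summarize_etcd_log
```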


2.3 Checking the etcd configuration

-- Check the etcd configuration on every node

[root@gbase8c1 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_0"
ETCD_LISTEN_PEER_URLS="http://10.10.10.34:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.34:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.34:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.34:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"
[root@gbase8c2 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_1"
ETCD_LISTEN_PEER_URLS="http://10.10.10.35:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.35:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.35:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.35:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"
[root@gbase8c3 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_2"
ETCD_LISTEN_PEER_URLS="http://10.10.10.36:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.36:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.36:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.36:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"

Notes on the relevant settings:

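ETCD_NAME
# Name of this member within the etcd cluster. Any value works as long as it is distinguishable and unique.
ETCD_LISTEN_PEER_URLS
# URLs this member listens on for peer traffic; several can be given. Members use them for internal exchanges such as leader election and data synchronization.
ETCD_INITIAL_ADVERTISE_PEER_URLS
# Peer URLs this member advertises to the rest of the cluster; other members contact it at these addresses.
ETCD_LISTEN_CLIENT_URLS
# URLs this member listens on for client traffic; several can be given here as well.
ETCD_ADVERTISE_CLIENT_URLS
# Client URLs this member advertises; etcd proxies and other members use them to reach this node.
ETCD_INITIAL_CLUSTER_TOKEN
# Token for the cluster. With it set, the cluster generates a unique cluster ID and a unique ID for each member, so a second cluster started from the same configuration file does not interfere with this one as long as its token differs.
ETCD_INITIAL_CLUSTER
# The INITIAL_ADVERTISE_PEER_URLS of every member in the cluster, as name=URL pairs.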

The etcd configuration is correct on all nodes.
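
Comparing three files by eye is error-prone, so the same check can be scripted. The sketch below assumes the three `etcd.conf` files have been copied to one machine. The helper names `conf_value`, `check_member_entry` and `same_initial_cluster` are hypothetical, and the parser only handles the simple `KEY="value"` lines shown above (with or without spaces around `=`).

```shell
#!/bin/sh
# Print the value of one KEY from an etcd.conf-style file.
# Tolerates optional spaces around "=" and optional double quotes.
conf_value() {
    sed -n "s/^[[:space:]]*${1}[[:space:]]*=[[:space:]]*\"\{0,1\}\([^\"]*\)\"\{0,1\}[[:space:]]*\$/\1/p" "$2"
}

# Check one node's file: its ETCD_NAME must appear in ETCD_INITIAL_CLUSTER
# paired with its own ETCD_INITIAL_ADVERTISE_PEER_URLS.
check_member_entry() {
    name=$(conf_value ETCD_NAME "$1")
    peer=$(conf_value ETCD_INITIAL_ADVERTISE_PEER_URLS "$1")
    cluster=$(conf_value ETCD_INITIAL_CLUSTER "$1")
    case ",$cluster," in
        *",$name=$peer,"*) echo "ok $name" ;;
        *)                 echo "mismatch $name"; return 1 ;;
    esac
}

# Succeed only if every file given carries the same ETCD_INITIAL_CLUSTER.
same_initial_cluster() {
    first=$(conf_value ETCD_INITIAL_CLUSTER "$1")
    for f in "$@"; do
        [ "$(conf_value ETCD_INITIAL_CLUSTER "$f")" = "$first" ] || return 1
    done
}
```

With copies named, for example, `node_0.conf`, `node_1.conf` and `node_2.conf`, running `check_member_entry` on each file and `same_initial_cluster` on all three covers the consistency points checked manually above.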

3. Fixing the fault

3.1 Restarting the etcd service

-- Restart the etcd service

# systemctl restart etcd

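I first restarted the etcd service on 10.10.10.34, which had no effect: the errors persisted. I then restarted etcd on the other nodes as well, again with no effect.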

3.2 Clearing the faulty node's data

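Because the host went down abruptly, I suspected that the etcd data on this node had become inconsistent with the rest of the cluster. I therefore tried removing the faulty node's data so that it would resynchronize from its peers: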

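# Stop the etcd service
[root@gbase8c1 ~]# systemctl stop etcd

# To be safe, move the data directory elsewhere with mv instead of deleting it, so it can be restored if something goes wrong
[root@gbase8c1 ~]# mv /var/lib/etcd/default.etcd /tmp

# Start the etcd service again
[root@gbase8c1 ~]# systemctl start etcd

# Check the etcd status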
[root@gbase8c1 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2025-03-22 23:40:55 CST; 3min 57s ago
 Main PID: 277 (etcd)
   CGroup: /docker/e7ff60899b159f0e16156801bae5649ccb06983ff489abe0cdd252941cb2fcfa/system.slice/etcd.service
           └─277 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379

Mar 22 23:40:55 gbase8c1 etcd[277]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message reader)
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 [term: 186] received a MsgVote message with higher term from e0dea71e4a2e0936 [term: 187]
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 became follower at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 [logterm: 186, index: 724331, vote: 0] cast MsgVote for e0dea71e4a2e0936 [logterm: 186, index: 724331] at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: raft.node: 9c5365ebdda29888 elected leader e0dea71e4a2e0936 at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:40:55 gbase8c1 etcd[277]: published {Name:node_0 ClientURLs:[http://10.10.10.34:2379]} to cluster 3503f38b8057518f
Mar 22 23:40:55 gbase8c1 etcd[277]: ready to serve client requests
Mar 22 23:40:55 gbase8c1 etcd[277]: serving insecure client requests on 10.10.10.34:2379, this is strongly discouraged!
Mar 22 23:40:55 gbase8c1 systemd[1]: Started Etcd Server.

The etcd service is now running normally and no longer reports errors.
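
Moving the data directory aside was enough in this case. For comparison, the etcd documentation describes replacing a member whose data is lost by removing it from the cluster and adding it back before starting it with an empty data directory. The sketch below only prints that command sequence for review and executes nothing. It rests on several assumptions: `etcdctl` with the v3 API is available, 9c5365ebdda29888 is node_0's member ID (it is the local ID in the logs above), the unit file on this deployment honours `ETCD_INITIAL_CLUSTER_STATE`, and the function name `print_rejoin_plan` is hypothetical. It has not been verified against the etcd bundled with GBase 8c.

```shell
#!/bin/sh
# Print (do not run) the commands that replace one etcd member.
#   $1 healthy endpoint, $2 member ID, $3 member name, $4 peer URL, $5 data dir
print_rejoin_plan() {
    cat <<EOF
export ETCDCTL_API=3
systemctl stop etcd
etcdctl --endpoints=$1 member list
etcdctl --endpoints=$1 member remove $2
etcdctl --endpoints=$1 member add $3 --peer-urls=$4
mv $5 $5.bak
# assumption: set ETCD_INITIAL_CLUSTER_STATE="existing" in /etc/etcd/etcd.conf on the rejoining node
systemctl start etcd
etcdctl --endpoints=$1 member list
EOF
}

print_rejoin_plan http://10.10.10.35:2379 9c5365ebdda29888 node_0 \
    http://10.10.10.34:2380 /var/lib/etcd/default.etcd
```

Printing the plan first keeps the destructive steps (`member remove`, `mv`) under manual control; the `systemctl` and `mv` lines are meant for the faulty node, the `etcdctl` lines for any node that can reach a healthy endpoint.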

3.3 Verification

Check the cluster status:

[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.35:2379
+----+-------------+-------------+-------+---------+--------+
| No |     name    |     host    |  port |  state  | leader |
+----+-------------+-------------+-------+---------+--------+
| 0  | gha_server1 | 10.10.10.34 | 20001 | running |  True  |
+----+-------------+-------------+-------+---------+--------+
+----+------+-------------+------+---------------------------+---------+---------+
| No | name |     host    | port |          work_dir         |  state  |   role  |
+----+------+-------------+------+---------------------------+---------+---------+
| 0  | gtm1 | 10.10.10.34 | 6666 | /home/gbase/data/gtm/gtm1 | running | primary |
+----+------+-------------+------+---------------------------+---------+---------+
+----+------+-------------+------+----------------------------+---------+---------+
| No | name |     host    | port |          work_dir          |  state  |   role  |
+----+------+-------------+------+----------------------------+---------+---------+
| 0  | cn1  | 10.10.10.34 | 5432 | /home/gbase/data/coord/cn1 | running | primary |
+----+------+-------------+------+----------------------------+---------+---------+
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| No | group |  name |     host    |  port |          work_dir          |  state  |   role  |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| 0  |  dn1  | dn1_1 | 10.10.10.35 | 15432 | /home/gbase/data/dn1/dn1_1 | running | primary |
| 1  |  dn2  | dn2_1 | 10.10.10.36 | 20010 | /home/gbase/data/dn2/dn2_1 | running | primary |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
+----+-------------------------+--------+---------+----------+
| No |           url           |  name  |  state  | isLeader |
+----+-------------------------+--------+---------+----------+
| 0  | http://10.10.10.36:2379 | node_2 | healthy |  False   |
| 1  | http://10.10.10.34:2379 | node_0 | healthy |  False   |
| 2  | http://10.10.10.35:2379 | node_1 | healthy |   True   |
+----+-------------------------+--------+---------+----------+

All nodes in the cluster are now healthy.
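
The same verification can be scripted by filtering the monitor output for rows whose state is neither `running` nor `healthy`, the only two good states seen in this article. This is a rough sketch: the function name `abnormal_rows` is hypothetical, and it assumes every table row is pipe-delimited in the same way as its header row. A JSON error body such as the 80000301 reply is not a table and produces no output, so it has to be detected separately.

```shell
#!/bin/sh
# Read `gha_ctl monitor all` output on stdin and print every table row
# whose "state" column is neither "running" nor "healthy".
abnormal_rows() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /^\+/ { next }                      # skip border lines
        {
            # Header row: remember which column holds "state".
            for (i = 1; i <= NF; i++) if (trim($i) == "state") { col = i; next }
            if (col && NF >= col && trim($col) != "running" && trim($col) != "healthy") print
        }
    '
}

# Usage (not executed here):
#   gha_ctl monitor all -H -l http://10.10.10.35:2379 | abnormal_rows
```

Empty output from the filter, combined with a successful call, corresponds to the all-healthy result shown above.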
