Let's use strace to display all calls that write and sync to disk, from any MongoDB server thread:
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Adding replicas for high availability
I did the above on a single server, started with the Atlas CLI. Next, we'll do the same on a replica set of three servers, which I start with the following Docker Compose file:
services:
  mongo-1:
    image: mongo:8.0.10
    ports:
      - "27017:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-1:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-2:
    image: mongo:8.0.10
    ports:
      - "27018:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-2:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-3:
    image: mongodb/mongodb-community-server:latest
    ports:
      - "27019:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-3:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  init-replica-set:
    image: mongodb/mongodb-community-server:latest
    depends_on:
      - mongo-1
      - mongo-2
      - mongo-3
    entrypoint: |
      bash -xc '
      sleep 10
      mongosh --host mongo-1 --eval "
       rs.initiate( {_id: \"rs0\", members: [
        {_id: 0, priority: 3, host: \"mongo-1:27017\"},
        {_id: 1, priority: 2, host: \"mongo-2:27017\"},
        {_id: 2, priority: 1, host: \"mongo-3:27017\"}]
       });
      "
      '
    networks:
      - mongoha
volumes:
  mongo-data-1:
  mongo-data-2:
  mongo-data-3:
networks:
  mongoha:
    driver: bridge
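Note the member priorities passed to rs.initiate: among reachable, electable members, the one with the highest priority is preferred as primary. A minimal sketch of that selection rule (my own illustration, not MongoDB code):

```javascript
// Hypothetical illustration of priority-based primary preference:
// among reachable, electable members, the highest priority wins.
const members = [
  { host: "mongo-1:27017", priority: 3 },
  { host: "mongo-2:27017", priority: 2 },
  { host: "mongo-3:27017", priority: 1 },
];
function preferredPrimary(reachable) {
  // sort a copy descending by priority and take the first member
  return reachable.slice().sort((a, b) => b.priority - a.priority)[0].host;
}
console.log(preferredPrimary(members));                             // mongo-1:27017
console.log(preferredPrimary(members.filter(m => m.priority < 3))); // mongo-2:27017
```

With priority 3, mongo-1 is normally elected primary, which matches what the replica set reports later in this post.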
I started it with docker compose up -d ; sleep 10 and then ran the strace command. I connected to the primary with docker compose exec -it mongo-1 mongosh.
Running some transactions
I performed the same operations as in the previous post, with ten writes to the collection:
db.mycollection.drop();
db.mycollection.insertOne( { _id: 1, num: 0 });
for (let i = 1; i <= 10; i++) {
print(` ${i} ${new Date()}`)
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} ${new Date()}`)
}
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
Here is the strace output during these updates:
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid 8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>
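A side observation from the trace (my own reading of the offsets, not something strace reports directly): each update appends one fixed-size record to the WiredTiger journal on each node, so consecutive pwrite64 offsets advance by a constant amount, 512 bytes on the primary (pid 8786) and 384 bytes on each secondary:

```javascript
// pwrite64 offsets copied from the trace above
const primaryOffsets   = [61184, 61696, 62208, 62720, 63232, 63744, 64256, 64768, 65280, 65792]; // pid 8786
const secondaryOffsets = [55808, 56192, 56576, 56960, 57344, 57728, 58112, 58496, 58880, 59264]; // pid 8736
// deltas between consecutive offsets = size of each journal record
const deltas = xs => xs.slice(1).map((o, i) => o - xs[i]);
console.log(deltas(primaryOffsets).every(d => d === 512));   // true
console.log(deltas(secondaryOffsets).every(d => d === 384)); // true
```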
I can see the writes and syncs from three processes. Let's check which process belongs to which container:
for pid in 8736 8786 8889; do
cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')
svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)
echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"
done
PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2
This confirms that, by default, the WAL is synced to disk on commit on every replica, not only on the primary.
Simulating one node failure
[pid 8786] is mongo-1, which is my primary:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
...
mongo-1:27017
I paused one replica:
docker compose pause mongo-3
[+] Pausing 1/0
✔ Container pgbench-mongo-mongo-3-1 Paused
I ran the updates again; they are unaffected by having one replica down:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
...
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
Simulating two node failures
I paused another replica:
docker compose pause mongo-2
[+] Pausing 1/0
✔ Container demo-mongo-2-1 Paused
As there is no longer a quorum (only one member of the three-member replica set remains), the primary steps down and can no longer serve reads or updates:
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary
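The step-down follows from the election rule: a replica set keeps (or elects) a primary only while a majority of voting members, floor(n/2)+1, is reachable. A sketch of that rule, assuming all members vote:

```javascript
// Majority rule for replica set elections (simplified; assumes every
// member is a voting member).
function majorityNeeded(n) {
  return Math.floor(n / 2) + 1;
}
function canHavePrimary(reachable, total) {
  return reachable >= majorityNeeded(total);
}
console.log(majorityNeeded(3));    // 2
console.log(canHavePrimary(2, 3)); // true: one node paused is tolerated
console.log(canHavePrimary(1, 3)); // false: two paused, the primary steps down
```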
Reading from a secondary
The remaining node is now a secondary, and it exposes the last writes that were acknowledged by a majority:
rs0 [direct: secondary] test> db.mycollection.find()
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
If the other nodes restart but are isolated from this secondary, the secondary still shows the same timeline-consistent, but stale, reads.
I simulated this by disconnecting this node from the network and unpausing the other two:
docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1
Since the other two nodes form a quorum, there is a primary that accepts writes:
-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
rs0 [direct: primary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 30 } ]
rs0 [direct: primary] test>
Meanwhile, the disconnected secondary still shows the last state it knows about:
rs0 [direct: primary] test> exit
-bash-4.2# docker compose exec -it mongo-1 mongosh
Current Mongosh Log ID: 6862654959d1c9bbde69e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test>
To avoid reading a stale state during a network failure, set the read preference to primary. This ensures you read the latest data from the primary. Getting the last consistent state on all secondaries would require stopping writes on the primary during the network failure, sacrificing availability.
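For example, a driver connection string that pins reads to the primary could look like this (hostnames taken from this demo; readPreference=primary is already the default in MongoDB drivers):

```
mongodb://mongo-1:27017,mongo-2:27017,mongo-3:27017/?replicaSet=rs0&readPreference=primary
```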
Most distributed databases use a similar replication strategy: writes are synced to a quorum of nodes to ensure durability and resilience to failures. Waiting for acknowledgment from a quorum keeps the system highly available while guaranteeing that data is safely stored on multiple nodes, with committed changes synced to persistent storage.
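The acknowledgment rule can be sketched as follows (a simplified model for illustration, not MongoDB internals):

```javascript
// Simplified model: a w:"majority" write is acknowledged once
// floor(n/2)+1 members (primary included) have synced it to their journal.
function ackWithMajority(journaledMembers, totalMembers) {
  return journaledMembers >= Math.floor(totalMembers / 2) + 1;
}
console.log(ackWithMajority(2, 3)); // true: primary plus one secondary suffice
console.log(ackWithMajority(1, 3)); // false: the primary alone cannot acknowledge
```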

