Let's use strace to display all calls that write and sync to disk, from any MongoDB server thread:
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Adding replicas for high availability
I did the above on a single server, started with the Atlas CLI. Next, we'll do the same on a replica set of three servers, which I start with the following Docker Compose file:
services:
  mongo-1:
    image: mongo:8.0.10
    ports:
      - "27017:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-1:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-2:
    image: mongo:8.0.10
    ports:
      - "27018:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-2:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-3:
    image: mongodb/mongodb-community-server:latest
    ports:
      - "27019:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-3:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  init-replica-set:
    image: mongodb/mongodb-community-server:latest
    depends_on:
      - mongo-1
      - mongo-2
      - mongo-3
    entrypoint: |
      bash -xc '
      sleep 10
      mongosh --host mongo-1 --eval "
       rs.initiate( {_id: \"rs0\", members: [
        {_id: 0, priority: 3, host: \"mongo-1:27017\"},
        {_id: 1, priority: 2, host: \"mongo-2:27017\"},
        {_id: 2, priority: 1, host: \"mongo-3:27017\"}]
       });
      "
      '
    networks:
      - mongoha
volumes:
  mongo-data-1:
  mongo-data-2:
  mongo-data-3:
networks:
  mongoha:
    driver: bridge
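Note the member priorities passed to rs.initiate: among reachable, electable members, the one with the highest priority is preferred as primary. A minimal sketch of that selection rule (my own illustration, not MongoDB code):

```javascript
// Hypothetical illustration of priority-based primary preference:
// among reachable, electable members, the highest priority wins.
const members = [
  { host: "mongo-1:27017", priority: 3 },
  { host: "mongo-2:27017", priority: 2 },
  { host: "mongo-3:27017", priority: 1 },
];
function preferredPrimary(reachable) {
  // sort a copy descending by priority and take the first member
  return reachable.slice().sort((a, b) => b.priority - a.priority)[0].host;
}
console.log(preferredPrimary(members));                             // mongo-1:27017
console.log(preferredPrimary(members.filter(m => m.priority < 3))); // mongo-2:27017
```

With priority 3, mongo-1 is normally elected primary, which matches what the replica set reports later in this post.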
I started it with docker compose up -d ; sleep 10 and then ran the strace command. I connected to the primary with docker compose exec -it mongo-1 mongosh.
Running some transactions
I performed the same operations as in the previous post, with ten writes to the collection:
db.mycollection.drop();
db.mycollection.insertOne( { _id: 1, num: 0 });
for (let i = 1; i <= 10; i++) {
print(` ${i} ${new Date()}`)
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} ${new Date()}`)
}
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
Here is the strace output during these updates:
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid 8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>
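A side observation from the trace (my own reading of the offsets, not something strace reports directly): each update appends one fixed-size record to the WiredTiger journal on each node, so consecutive pwrite64 offsets advance by a constant amount, 512 bytes on the primary (pid 8786) and 384 bytes on each secondary:

```javascript
// pwrite64 offsets copied from the trace above
const primaryOffsets   = [61184, 61696, 62208, 62720, 63232, 63744, 64256, 64768, 65280, 65792]; // pid 8786
const secondaryOffsets = [55808, 56192, 56576, 56960, 57344, 57728, 58112, 58496, 58880, 59264]; // pid 8736
// deltas between consecutive offsets = size of each journal record
const deltas = xs => xs.slice(1).map((o, i) => o - xs[i]);
console.log(deltas(primaryOffsets).every(d => d === 512));   // true
console.log(deltas(secondaryOffsets).every(d => d === 384)); // true
```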
I can see the writes and syncs from three processes. Let's check which process belongs to which container:
for pid in 8736 8786 8889; do
cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')
svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)
echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"
done
PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2
This confirms that, by default, the WAL is synced to disk on commit on every replica, not only on the primary.
Simulating one node failure
[pid 8786] is mongo-1, which is my primary:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
...
mongo-1:27017
I paused one replica:
docker compose pause mongo-3
[+] Pausing 1/0
✔ Container pgbench-mongo-mongo-3-1 Paused
I ran the updates again; they are unaffected by having one replica down:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
...
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
Simulating two node failures
I paused another replica:
docker compose pause mongo-2
[+] Pausing 1/0
✔ Container demo-mongo-2-1 Paused
As there is no longer a quorum (only one member of the three-member replica set remains), the primary steps down and can no longer serve reads or updates:
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary
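The step-down follows from the election rule: a replica set keeps (or elects) a primary only while a majority of voting members, floor(n/2)+1, is reachable. A sketch of that rule, assuming all members vote:

```javascript
// Majority rule for replica set elections (simplified; assumes every
// member is a voting member).
function majorityNeeded(n) {
  return Math.floor(n / 2) + 1;
}
function canHavePrimary(reachable, total) {
  return reachable >= majorityNeeded(total);
}
console.log(majorityNeeded(3));    // 2
console.log(canHavePrimary(2, 3)); // true: one node paused is tolerated
console.log(canHavePrimary(1, 3)); // false: two paused, the primary steps down
```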
Reading from a secondary
The remaining node is now a secondary, and it exposes the last writes that were acknowledged by a majority:
rs0 [direct: secondary] test> db.mycollection.find()
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
If the other nodes restart but are isolated from this secondary, the secondary still shows the same timeline-consistent, but stale, reads.
I simulated this by disconnecting this node from the network and unpausing the other two:
docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1
Since the other two nodes form a quorum, there is a primary that accepts writes:
-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
rs0 [direct: primary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 30 } ]
rs0 [direct: primary] test>
Meanwhile, the disconnected secondary still shows the last state it knows about:
rs0 [direct: primary] test> exit
-bash-4.2# docker compose exec -it mongo-1 mongosh
Current Mongosh Log ID: 6862654959d1c9bbde69e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test>
To avoid reading a stale state during a network failure, set the read preference to primary. This ensures you read the latest data from the primary. Getting the last consistent state on all secondaries would require stopping writes on the primary during the network failure, sacrificing availability.
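For example, a driver connection string that pins reads to the primary could look like this (hostnames taken from this demo; readPreference=primary is already the default in MongoDB drivers):

```
mongodb://mongo-1:27017,mongo-2:27017,mongo-3:27017/?replicaSet=rs0&readPreference=primary
```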
Most distributed databases use a similar replication strategy: writes are synced to a quorum of nodes to ensure durability and resilience to failures. Waiting for acknowledgment from a quorum keeps the system highly available while guaranteeing that data is safely stored on multiple nodes, with committed changes synced to persistent storage.
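The acknowledgment rule can be sketched as follows (a simplified model for illustration, not MongoDB internals):

```javascript
// Simplified model: a w:"majority" write is acknowledged once
// floor(n/2)+1 members (primary included) have synced it to their journal.
function ackWithMajority(journaledMembers, totalMembers) {
  return journaledMembers >= Math.floor(totalMembers / 2) + 1;
}
console.log(ackWithMajority(2, 3)); // true: primary plus one secondary suffice
console.log(ackWithMajority(1, 3)); // false: the primary alone cannot acknowledge
```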

