A thread on Twitter was full of misconceptions about MongoDB, spreading fear, uncertainty, and doubt (FUD). It led one user to ask whether MongoDB acknowledges write operations before they are actually written to disk:

Doesn't MongoDB acknowledge writes before they are actually flushed to disk?

MongoDB, like many databases, uses journaling, also known as write-ahead logging (WAL), to provide both high performance and durability (the D in ACID). Writes are safely recorded in the journal, and the journal is flushed to disk before the commit is acknowledged. For more details, see the documentation under "Write Concern and Journaling".
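The journaling principle can be sketched outside of any database. The following Python sketch illustrates the general WAL idea only, not MongoDB's actual implementation; the file name and function are made up:

```python
import os
import tempfile

# Illustrative sketch of the WAL idea, not MongoDB's code:
# a write is acknowledged only after its journal record is on disk.
def committed_write(journal_fd: int, record: bytes) -> bool:
    os.write(journal_fd, record)   # append the record to the journal
    os.fdatasync(journal_fd)       # block until the data reaches the disk
    return True                    # only now is the write acknowledged

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
acknowledged = committed_write(fd, b'{"op": "update", "num": 1}\n')
os.close(fd)
print(acknowledged)
```

The key property is the ordering: the acknowledgment happens after the flush, never before.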
You can follow the steps below to test this in a lab with Linux strace and GDB, and debunk those myths.
Start the lab
I created a local MongoDB server. Here I used a single-node local Atlas deployment, but you can do the same on a replica set:
atlas deployments setup atlas --type local --port 27017 --force
If it is stopped, start it, and connect with MongoDB Shell:
atlas deployment start atlas
mongosh
Trace the system calls with strace
In another terminal, I used strace to show the system calls (-e trace) that write (pwrite64) and sync (fdatasync) files, along with the file names (-yy), executed by the MongoDB server process (-p $(pgrep -d, mongod)) and its threads (-f), with timestamps and call durations (-tT):
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Some writes and syncs happen in the background:
[pid 2625869] 08:26:13 fdatasync(11</data/db/WiredTiger.wt>) = 0 <0.000022>
[pid 2625869] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19072) = 384 <0.000024>
[pid 2625869] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002123>
[pid 2625868] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 128, 19456) = 128 <0.000057>
[pid 2625868] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002192>
[pid 2625868] 08:26:23 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19584) = 384 <0.000057>
[pid 2625868] 08:26:23 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002068>
[pid 2625868] 08:26:33 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19968) = 384 <0.000061>
[pid 2625868] 08:26:33 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002747>
[pid 2625868] 08:26:43 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20352) = 384 <0.000065>
[pid 2625868] 08:26:43 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.003008>
[pid 2625868] 08:26:53 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20736) = 384 <0.000075>
[pid 2625868] 08:26:53 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002092>
[pid 2625868] 08:27:03 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 21120) = 384 <0.000061>
[pid 2625868] 08:27:03 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002527>
[pid 2625869] 08:27:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.000033>
Write to a collection
In the MongoDB shell, I created a collection and ran ten updates:
db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });
for (let i = 1; i <= 10; i++) {
  print(` ${i} ${new Date()}`)
  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} ${new Date()}`)
}
strace outputs the following while the loop of ten updates runs:
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76288) = 512 <0.000066>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001865>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76800) = 512 <0.000072>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77312) = 512 <0.000056>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001641>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77824) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78336) = 512 <0.000175>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001944>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78848) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001829>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79360) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001917>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79872) = 512 <0.000050>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002260>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80384) = 512 <0.000035>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001940>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80896) = 512 <0.000054>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001984>
Each write to the journal file (pwrite64) is followed by a sync to disk (fdatasync). This system call is well documented:
FSYNC(2) Linux Programmer's Manual FSYNC(2)
NAME
fsync, fdatasync - synchronize a file's in-core state with storage device
DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to
the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
...
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification
...
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
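The syscall pair captured in the trace can be reproduced in isolation. A small Python sketch (the file name is made up; Python's os.pwrite and os.fdatasync map to the pwrite64 and fdatasync calls seen in strace):

```python
import os
import tempfile

# Reproduce the syscall pair from the strace output: a positional
# write of a 512-byte journal record, followed by a data sync that
# blocks until the device confirms the transfer.
path = os.path.join(tempfile.mkdtemp(), "WiredTigerLog.demo")  # made-up name
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
written = os.pwrite(fd, b"\x00" * 512, 0)  # like pwrite64(fd, ..., 512, 0)
os.fdatasync(fd)                           # like fdatasync(fd) = 0
os.close(fd)
print(written)  # 512
```

Running this under strace -e trace=pwrite64,fdatasync shows the same pair of calls as the MongoDB journal writes above.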
Since I displayed both the timestamps from the loop and from the system call trace, you can see that they match. Here is the loop output corresponding to the trace above:
1 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
2 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
3 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
4 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
5 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
6 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
7 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
8 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
9 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
10 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
Multi-document transactions
The example above ran ten autocommit updates, each triggering one sync to disk.

In general, good document data modeling should match a document to a business transaction. However, multi-document transactions are available, and they are ACID-compliant (atomic, consistent, isolated, and durable). Using them can also reduce sync latency, since each transaction performs only one sync, at commit time.
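The batching effect can be sketched in a few lines of Python (an illustration of the principle, not MongoDB's code; all names are made up):

```python
import os
import tempfile

# Illustrative sketch (not MongoDB's code): batching the writes of a
# transaction means one fdatasync per commit instead of one per write.
class Journal:
    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.syncs = 0
    def write(self, record: bytes) -> None:
        os.write(self.fd, record)     # buffered in the OS page cache
    def commit(self) -> None:
        os.fdatasync(self.fd)         # one flush covers the whole batch
        self.syncs += 1

journal = Journal(os.path.join(tempfile.mkdtemp(), "journal.log"))
for i in range(5):                    # five transactions...
    journal.write(b"update\n")        # ...two writes each...
    journal.write(b"insert\n")
    journal.commit()                  # ...one sync per commit
print(journal.syncs)  # 5 syncs instead of 10
```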
I ran the following five transactions, each performing one update and one insert:
const session = db.getMongo().startSession();
for (let i = 1; i <= 5; i++) {
  session.startTransaction();
  const sessionDb = session.getDatabase(db.getName());
  sessionDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} updated ${new Date()}`)
  sessionDb.mycollection.insertOne( { answer: 42 });
  print(` ${i} inserted ${new Date()}`)
  session.commitTransaction();
  print(` ${i} committed ${new Date()}`)
}
Strace still shows ten pwrite64 and fdatasync calls. I used this multi-document transaction to go further and show that the commit not only triggers the sync to disk, but also waits for its acknowledgment before returning success to the application.
Inject some latency with gdb
To show that the commit waits for the fdatasync acknowledgment, I set a GDB breakpoint on the fdatasync call.

I stopped strace, and started GDB with a script that adds a five-second delay to fdatasync:
cat > gdb_slow_fdatasync.gdb <<GDB
break fdatasync
commands
shell sleep 5
continue
end
continue
GDB
gdb --batch -x gdb_slow_fdatasync.gdb -p $(pgrep mongod)
I ran the five transactions, each with its two writes. GDB shows when the breakpoint is hit:
Thread 31 "JournalFlusher" hit Breakpoint 1, 0x0000ffffa6096eec in fdatasync () from target:/lib64/libc.so.6
My GDB script automatically waits five seconds and resumes the program, until the next call to fdatasync.

Here is the output of my loop of five transactions:
1 updated Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 inserted Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 committed Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 updated Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 inserted Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 committed Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 updated Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 inserted Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 committed Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 updated Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 inserted Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 committed Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 updated Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 inserted Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
The inserts and updates happen immediately, but each commit waits five seconds because of the delay I injected with GDB. This demonstrates that the commit waits for fdatasync, ensuring that the changes are flushed to persistent storage. For this demo I used all default settings in MongoDB 8.0, but the behavior can still be tuned with write concern and journaling configuration.
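What the GDB experiment demonstrates can be modeled in a few lines of Python (an illustration only, with a 50 ms delay standing in for the 5 s GDB sleep): the buffered write returns immediately, while the commit path absorbs the injected fdatasync latency.

```python
import os
import tempfile
import time

DELAY = 0.05  # 50 ms stand-in for the 5 s sleep injected with GDB

def slow_fdatasync(fd: int) -> None:
    time.sleep(DELAY)     # injected latency, like the GDB breakpoint
    os.fdatasync(fd)

fd = os.open(os.path.join(tempfile.mkdtemp(), "journal.log"),
             os.O_WRONLY | os.O_CREAT | os.O_APPEND)

start = time.monotonic()
os.write(fd, b"update\n")          # buffered write: returns immediately
write_time = time.monotonic() - start

start = time.monotonic()
slow_fdatasync(fd)                 # commit path: waits for the (slow) flush
commit_time = time.monotonic() - start
os.close(fd)

print(write_time < commit_time)    # the commit, not the write, pays the delay
```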
I used GDB so that I could also examine the call stack. Alternatively, you can inject the delay with strace by adding the option -e inject=fdatasync:delay_enter=5000000.
Look at the open-source code
An error can occur during the fdatasync call, and continuing to use the file descriptor afterwards can compromise durability (remember PostgreSQL's fsyncgate). MongoDB uses the open-source WiredTiger storage engine, which implements the same solution PostgreSQL adopted to avoid this: panic rather than retry. You can check the os_fs.c code to verify it.
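The policy can be sketched as follows (a simplified illustration of the behavior described above, not WiredTiger's actual code): after a failed fdatasync, the kernel may already have dropped the dirty pages, so a "successful" retry proves nothing, and the safe reaction is to abort and recover from the journal.

```python
import os
import sys
import tempfile

# Simplified illustration of the "panic, don't retry" policy:
# a failed fdatasync may mean dirty pages were silently dropped,
# so the process aborts and relies on journal recovery at restart.
def sync_or_panic(fd: int) -> None:
    try:
        os.fdatasync(fd)
    except OSError as err:
        sys.exit(f"PANIC: fdatasync failed ({err}); aborting for recovery")

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
os.write(fd, b"record\n")
sync_or_panic(fd)   # succeeds here; a real EIO would abort the process
os.close(fd)
print("synced")
```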
The fdatasync call happens in the JournalFlusher thread; here is the backtrace:
#0 0x0000ffffa0b5ceec in fdatasync () from target:/lib64/libc.so.6
#1 0x0000aaaadf5312c0 in __posix_file_sync ()
#2 0x0000aaaadf4f53c8 in __log_fsync_file ()
#3 0x0000aaaadf4f58d4 in __wt_log_force_sync ()
#4 0x0000aaaadf4fb8b8 in __wt_log_flush ()
#5 0x0000aaaadf588348 in __session_log_flush ()
#6 0x0000aaaadf41b878 in mongo::WiredTigerSessionCache::waitUntilDurable(mongo::OperationContext*, mongo::WiredTigerSessionCache::Fsync, mongo::WiredTigerSessionCache::UseJournalListener) ()
#7 0x0000aaaadf412358 in mongo::WiredTigerRecoveryUnit::waitUntilDurable(mongo::OperationContext*) ()
#8 0x0000aaaadfbe855c in mongo::JournalFlusher::run() ()
If you want to look at the code behind this, here are some entry points:

- JournalFlusher, which calls the POSIX system call via WiredTiger's os_fs.c: journal_flusher.cpp
- waitForWriteConcern, which invokes the JournalFlusher from the connection thread: write_concern.cpp
Base your opinions on facts, not myths
MongoDB started as a NoSQL database that prioritized availability and low latency over strong consistency. But that was more than a decade ago. As technology evolves, experts who refuse to keep learning risk outdated knowledge, eroding skills, and damaged credibility.

Today, MongoDB is a general-purpose database that provides transaction atomicity, consistency, isolation, and durability, whether a transaction involves a single document or multiple documents.

The next time someone ignorant or critical claims that MongoDB is not consistent or does not flush committed changes to disk, you can confidently debunk the myth by pointing to the official documentation, the open-source code, and your own experiments. MongoDB behaves like PostgreSQL here: writes are buffered, and the WAL is synced to disk at commit.

