
Notes on using Telegraf to fix a high disk IO problem in InfluxDB v2

David跨境日记
2025-10-21
Overview: when an InfluxDB v2 bucket runs into high series cardinality and high disk IO, the fix is to optimize the application code and add Telegraf as a middleware layer. This article records a problem we hit at my company and how we resolved it.

I. The problem


A developer reported that queries against the production InfluxDB server through the web UI had become extremely slow.

The Alibaba Cloud disk IO monitoring chart (below) showed that disk read IO was constantly overloaded, peaking at an average of about 150 MB/s.

This heavy read IO affected the production bucket named jt: writes were delayed and some data was lost, which seriously hurt data completeness and availability.


II. Analysis


1. Application logs

The bucket with delayed and lost writes is named jt, and the writing client is a program written in Go. Its logs (below) were full of write errors.

After stopping the Go program, disk IO load immediately returned to normal,

which confirmed that the problem lay either in the program or in the jt bucket itself.

{"level":"error","timestamp":"2025-10-08T00:12:55.659+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:13:10.292+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:13:25.992+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:13:42.502+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:14:00.116+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:14:15.766+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:14:27.656+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:14:39.615+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T00:14:50.310+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:25:26.512+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:31:58.866+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:32:09.139+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:35:59.374+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:36:09.923+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:36:20.206+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-08T16:36:30.885+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:10:16.687+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:10:28.356+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:10:44.493+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:10:58.603+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:11:12.604+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: 
timeout"}{"level":"error","timestamp":"2025-10-09T00:11:29.668+0800","caller":"jtdata/jtdata.go:79","msg":"internal error: unexpected error writing points to database: timeout"}{"level":"error","timestamp":"2025-10-09T00:11:50.172+0800","caller":"jtdata/jtdata.go:79","msg":"Post \"http://localhost:8086/api/v2/write?bucket=jt&org=GTEX&precision=ns\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

2. Series cardinality and the harm of high cardinality

2.1 What is series cardinality?

First, it helps to understand exactly what a series is.

In InfluxDB, a series is uniquely identified by the combination of a measurement, a tag set (the full set of tag key-value pairs), and a field key.

Series cardinality is simply the number of unique series in the database.

For example, suppose you have a measurement named weather_data that stores weather information.

Tags: city, sensor_id

Fields: temperature, humidity

If you have:

3 cities: beijing, shanghai, guangzhou

2 sensors per city: sensor_1, sensor_2

2 fields: temperature, humidity

Then the total series count works out as follows:

A series is determined by measurement + tag set + field.

For the temperature field there are 3 cities * 2 sensors = 6 possible tag combinations.

For the humidity field there are likewise 3 cities * 2 sensors = 6.

Total series cardinality = 6 (temperature) + 6 (humidity) = 12, which is a very healthy number.
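To make the counting concrete, here is a minimal Go sketch (my own illustration, not code from the article) that enumerates every measurement + tag set + field combination for this example and prints the resulting series cardinality:

```go
package main

import "fmt"

func main() {
	cities := []string{"beijing", "shanghai", "guangzhou"}
	sensors := []string{"sensor_1", "sensor_2"}
	fields := []string{"temperature", "humidity"}

	// A series is the unique combination of measurement, tag set, and field.
	count := 0
	for _, city := range cities {
		for _, sensor := range sensors {
			for _, field := range fields {
				fmt.Printf("weather_data,city=%s,sensor_id=%s field=%s\n", city, sensor, field)
				count++
			}
		}
	}
	fmt.Println("series cardinality:", count) // prints 12
}
```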

2.2 How high cardinality arises and why it is harmful

How does high cardinality arise? It usually comes from turning a dimension with a huge number of unique values into a tag.

Continuing the example above, if you mistakenly make something like request_id (different on every request) or user_id (a huge number of users) a tag, cardinality explodes.

Suppose 100,000 distinct users each report temperature data:

Series count = 100,000 users * 1 field = 100,000

With a more complex schema, cardinality easily reaches millions or even tens of millions.
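As an illustration of where the mistake happens (a sketch of my own, assuming the official influxdb-client-go v2 library, not code from the incident), the difference is simply which map the high-cardinality value ends up in:

```go
package main

import (
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	userID := "u-102938" // hypothetical value

	// Anti-pattern: user_id as a tag creates one new series per user.
	bad := influxdb2.NewPoint("weather_data",
		map[string]string{"city": "beijing", "user_id": userID}, // tag set grows with every user
		map[string]interface{}{"temperature": 21.5},
		time.Now())

	// Better: keep user_id as a field, so the series key stays bounded by city.
	good := influxdb2.NewPoint("weather_data",
		map[string]string{"city": "beijing"},
		map[string]interface{}{"temperature": 21.5, "user_id": userID},
		time.Now())

	_ = bad
	_ = good
}
```

The trade-off is that fields are not indexed, so filtering by user_id becomes a scan; that is usually still preferable to an exploding series index.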

Why high cardinality is harmful: high cardinality is lethal for InfluxDB because it directly stresses the core storage component, the TSM (Time-Structured Merge Tree) engine.

Huge memory consumption: every unique series has an index entry in memory. The more series, the more memory is used, eventually causing OOM (out-of-memory) errors and process crashes.

Sharply degraded query performance: queries must scan a very large series index and become extremely slow, especially with GROUP BY or WHERE conditions on tags.

Increased disk I/O: data is spread across many more small files, so compression and compaction become less efficient and take longer, causing sustained high I/O pressure.

Slower writes: every write has to update the huge in-memory index, which drives up write latency.

Investigation showed that the biggest contributor to the jt bucket's high cardinality is its schema:

the application defines 3 tags: DeviceID, SensorID, Manufactor,

and 8 fields: acc_x, acc_z, pressure, rssi, source, status, temperature, voltage.

DeviceID has roughly 5,000 distinct values, SensorID roughly 600, and Manufactor roughly 3.

Estimate: 5,000 × 600 × 3 tag combinations × 8 fields = 72,000,000 series (an upper bound, assuming every combination occurs),

which is on the same order as the roughly 63 million series shown in the bucket cardinality report below.



3. Bucket cardinality report

The report below shows that the cardinality of the jt bucket has already reached more than 63 million,

far beyond the official recommendation of keeping cardinality under 100,000.

influxd inspect report-db --db-path /var/lib/influxdb/.influxdbv2/engine/data/d8390b10d2ccb963

bucket               retention policy   measurement   series
------               ----------------   -----------   ------
"d8390b10d2ccb963"   "autogen"          "Data"        63488581
"d8390b10d2ccb963"   "autogen"          "GNSS"        32893
"d8390b10d2ccb963"   "autogen"          "STATE"       43830
"d8390b10d2ccb963"   "autogen"                        63519728
"d8390b10d2ccb963"                                    63519728
Total (est.)                                          63519728

4. Conclusion

After reading a good deal of material and the InfluxDB v2 official documentation, and combining that with the analysis above, the conclusion is that the jt bucket's cardinality is far too high, which causes too many TSM files to be generated.

When TSM files are produced too quickly, InfluxDB's background compaction kicks in; during compaction it reads a large number of TSM files and writes the merged data into temporary TSM files, so both read and write IO rise.

Because there is so much data, the cache also consumes a lot of memory; on a server with little or limited RAM, data is constantly swapped between memory and disk, which adds even more read pressure until disk read IO is saturated.

Meanwhile data keeps streaming in and the bucket's cardinality keeps growing, so the whole thing turns into a vicious cycle.

III. Solutions


1. Application level

Optimize the Go program's source code to drop the following tag and fields:

1 tag: Manufactor

4 fields: acc_x, acc_z, source, status

This shrinks the cardinality product (roughly 5,000 × 600 × 4 fields = 12,000,000 in the worst case, down from 72,000,000); a sketch of the reduced write is shown below.
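The original client code is not shown in the article, so this is a hedged illustration only (assuming the official influxdb-client-go v2 library and hypothetical DeviceID/SensorID values) of what a point with the reduced schema might look like:

```go
package main

import (
	"fmt"
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	// Reduced schema: the Manufactor tag and the acc_x/acc_z/source/status fields are
	// dropped, so series cardinality is now bounded by DeviceID x SensorID x 4 fields.
	p := influxdb2.NewPoint("Data",
		map[string]string{"DeviceID": "dev-0001", "SensorID": "sen-001"}, // tags kept
		map[string]interface{}{ // fields kept
			"pressure":    101.3,
			"rssi":        -67,
			"temperature": 23.8,
			"voltage":     3.71,
		},
		time.Now())

	// The point is written exactly as before (e.g. via WriteAPIBlocking);
	// only the tag and field sets are smaller.
	fmt.Println("reduced point prepared for measurement:", p.Name())
}
```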

2. Architecture level

2.1 Introduce Telegraf as middleware to smooth out write peaks: incoming data is buffered and then committed in batches, which results in fewer, larger TSM files; fewer TSM files mean fewer compactions, and therefore less IO.

2.2 Use Telegraf for a dual-write setup, writing to a primary and a backup instance at the same time. The primary's jt bucket keeps only the most recent 7 days of data, while the backup keeps data permanently. When business users need to query InfluxDB they use the backup instance's web UI, so queries do not affect the primary's IO or the stability of the write path.

2.3 Setting up Telegraf

Install on Ubuntu:

curl --silent --location -O https://repos.influxdata.com/influxdata-archive.key
gpg --show-keys --with-fingerprint --with-colons ./influxdata-archive.key 2>&1 \
| grep -q '^fpr:\+24C975CBA61A024EE1B631787C3D57159FC2F927:$' \
&& cat influxdata-archive.key \
| gpg --dearmor \
| sudo tee /etc/apt/keyrings/influxdata-archive.gpg > /dev/null \
&& echo 'deb [signed-by=/etc/apt/keyrings/influxdata-archive.gpg] https://repos.influxdata.com/debian stable main' \
| sudo tee /etc/apt/sources.list.d/influxdata.list
sudo apt-get update && sudo apt-get install telegraf

Install on RedHat / CentOS / Rocky:

cat <<EOF | sudo tee /etc/yum.repos.d/influxdata.repo
[influxdata]
name = InfluxData Repository - Stable
baseurl = https://repos.influxdata.com/stable/\$basearch/main
enabled = 1
gpgcheck = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-influxdata
EOF
sudo yum install telegraf

Edit the configuration file to set up buffering, batched submission, and dual write:

cd /etc/telegraf
vim double_wirte_test.conf

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true
  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000
  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000
  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"
  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"
  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""
  ## Log at debug level.
  #debug = true
  ## Log only error level messages.
  # quiet = false
  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"
  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""
  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"
  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"
  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5
  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true

[[inputs.influxdb_v2_listener]]
  ## Address and port to host InfluxDB listener on
  ## (Double check the port. Could be 9999 if using OSS Beta)
  service_address = ":8087"
  ## Maximum allowed HTTP request body size in bytes.
  ## 0 means to use the default of 32MiB.
  # max_body_size = "32MiB"
  ## Optional tag to determine the bucket.
  ## If the write has a bucket in the query string then it will be kept in this tag name.
  ## This tag can be used in downstream outputs.
  ## The default value of nothing means it will be off and the database will not be recorded.
  # bucket_tag = ""
  ## Set one or more allowed client CA certificate file names to
  ## enable mutually authenticated TLS connections
  # tls_allowed_cacerts = ["/etc/telegraf/clientca.pem"]
  ## Add service certificate and key
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Optional token to accept for HTTP authentication.
  ## You probably want to make sure you have TLS configured above for this.
  #token = "bwRR7x-Xw2gZEKuU7U2Xs8siwEr6AQYOZXyrmgPOeAQZFnH2kGZ53hCrROpZq_xUIv8j_nomYgnDn4R3egkhMQ=="
  token = "To23iYgAsdALSlYRseYZijj3LR7IQatDQRbL-VoQXVGiNwfMcWqc2GEcmbGFhQQ9Cl8n8nM-LKjpbr3bN0jWxw=="
  ## Influx line protocol parser
  ## 'internal' is the default. 'upstream' is a newer parser that is faster
  ## and more memory efficient.
  parser_type = "upstream"

[[outputs.influxdb_v2]]
  # Primary InfluxDB v2 instance
  urls = ["http://<primary-influxdb-ip>:8086"] # replace with your primary instance address
  token = "pDbRtCHdCuleCSzgyQ_kxJc2KxGcOcnbkemaq1Cq3AHwueOhQ6QxbMbeJNH11vbg5DSY5CBjRgb54ab9Mn2vFg=="
  organization = "GTEX"
  bucket = "jt"
  metric_batch_size = 10000     # larger batch size
  metric_buffer_limit = 1000000 # larger buffer
  flush_interval = "15s"        # flush interval for this output
  content_encoding = "gzip"     # enable gzip compression
  concurrent_writes = 4         # enable concurrent writes
  influx_uint_support = true    # enable unsigned integer field support
  [outputs.influxdb_v2.tags]
    influxdb_instance = "primary"

[[outputs.influxdb_v2]]
  # Backup (secondary) InfluxDB v2 instance
  urls = ["http://<backup-influxdb-ip>:28086"] # replace with your backup instance address
  token = "OlXD8YHa4ZJgk5KgGoqkBPaMKXIMB_SPppgSirgfNaB9ciU6kfot-NTk2NtDjkxxqBf1YNyKtNenv9HD12kCbQ=="
  organization = "GTEX"
  bucket = "jt"
  metric_batch_size = 10000     # larger batch size
  metric_buffer_limit = 1000000 # larger buffer
  flush_interval = "15s"        # flush interval for this output
  content_encoding = "gzip"     # enable gzip compression
  concurrent_writes = 4         # enable concurrent writes
  influx_uint_support = true    # enable unsigned integer field support
  [outputs.influxdb_v2.tags]
    influxdb_instance = "secondary"

Start Telegraf:

cd /etc/telegraf
nohup telegraf --config /etc/telegraf/double_wirte_test.conf &

After startup, Telegraf listens on port 8087. Client programs can point their writes directly at http://<influxdb-server-ip>:8087; Telegraf buffers the incoming line-protocol data and submits it in batches to the InfluxDB instances declared in the outputs.influxdb_v2 sections.
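On the client side, the only change needed is the URL the program writes to. This is a hedged sketch (assuming the influxdb-client-go v2 library; the real program's code is not shown in the article): the write goes to Telegraf's influxdb_v2_listener on port 8087 instead of InfluxDB on 8086, and the token must match the one configured in the listener.

```go
package main

import (
	"context"
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	// Point the client at Telegraf (:8087) rather than InfluxDB (:8086).
	// Replace host and token with your own values.
	client := influxdb2.NewClient("http://localhost:8087", "TELEGRAF_LISTENER_TOKEN")
	defer client.Close()

	writeAPI := client.WriteAPIBlocking("GTEX", "jt")
	p := influxdb2.NewPoint("Data",
		map[string]string{"DeviceID": "dev-0001", "SensorID": "sen-001"},
		map[string]interface{}{"temperature": 23.8, "voltage": 3.71},
		time.Now())

	// Telegraf acknowledges quickly, buffers the point, and flushes it in
	// batches to both outputs.influxdb_v2 instances (primary and backup).
	if err := writeAPI.WritePoint(context.Background(), p); err != nil {
		panic(err)
	}
}
```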


3. Choosing a different time-series database

If the measures above still cannot cope with an extremely large bucket cardinality, it is time to consider switching to a different time-series database. Some recommendations follow.

InfluxDB 3.0 (recommended: heavily optimized for high-cardinality workloads)

InfluxDB 3.0 is a rewrite built on the open-source InfluxDB IOx project. It is designed specifically for high-cardinality, high-write workloads and is currently one of the most mature options for handling very large cardinality.

Key techniques:

Columnar + partitioned storage:

A columnar storage engine (based on Apache Arrow) stores different tags and metrics as separate columns. For high-cardinality tags, only the unique values are stored and they are referenced through dictionary encoding, drastically reducing duplicated storage (for example, 1 million device IDs are stored once as unique values and referenced by integer IDs); a toy sketch of this idea appears after this list.

Data is also partitioned by time range (e.g. per hour), so queries only scan the target time window and avoid full-table scans.

No inverted index + predicate pushdown:

It abandons the inverted index used by traditional time-series databases (avoiding index bloat). Instead, predicate pushdown filters out non-matching tag values while the data is being scanned; combined with the efficient batch processing of columnar storage, queries stay fast even at high cardinality.

Automatic cardinality-reduction optimizations:

Frequently repeated tag combinations are aggregated automatically to reduce storage and compute pressure, and the encoding can be adjusted dynamically (e.g. bitmap indexes for low-cardinality tags, hash maps for high-cardinality ones).
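The dictionary-encoding idea mentioned above is easy to picture with a toy Go sketch (my own illustration, not how InfluxDB 3.0 is actually implemented): each distinct device ID string is stored once, and the column itself holds only small integer codes.

```go
package main

import "fmt"

// A toy dictionary encoder: every distinct string is stored once in the
// dictionary, and the encoded column only stores compact integer codes.
type dictEncoder struct {
	dict  []string       // unique values, stored once
	codes map[string]int // value -> integer code
}

func newDictEncoder() *dictEncoder {
	return &dictEncoder{codes: make(map[string]int)}
}

func (e *dictEncoder) encode(v string) int {
	if code, ok := e.codes[v]; ok {
		return code
	}
	code := len(e.dict)
	e.dict = append(e.dict, v)
	e.codes[v] = code
	return code
}

func main() {
	enc := newDictEncoder()
	var column []int
	// A million rows may repeat the same few device IDs over and over;
	// each string is kept once, the column stores only integers.
	for _, deviceID := range []string{"dev-0001", "dev-0002", "dev-0001", "dev-0003", "dev-0002"} {
		column = append(column, enc.encode(deviceID))
	}
	fmt.Println("dictionary:", enc.dict) // [dev-0001 dev-0002 dev-0003]
	fmt.Println("column:    ", column)   // [0 1 0 2 1]
}
```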


TDengine (recommended: purpose-built for IoT, with tiered tag optimization)

TDengine is a time-series database designed specifically for IoT and naturally handles device-style high-cardinality workloads (such as millions of sensors or device IDs).

Key techniques:

Super table + sub-table architecture:

It introduces the concept of a super table (STable): devices of the same type (e.g. temperature sensors) are modeled as one super table, and each individual device becomes a sub-table. Tags are split into static tags (e.g. device model and manufacturer, stored in the super table metadata) and dynamic tags (e.g. live status); static tags are stored only once, avoiding duplication and greatly easing cardinality pressure.

Pre-encoded tag values:

Frequently occurring tag values (such as device IDs) are pre-encoded as integer mappings, so writes and queries operate on the encoded values, cutting string-comparison overhead and speeding up processing.

Time-series aggregation engine:

A built-in aggregation engine can automatically aggregate data along tag dimensions (e.g. region or model), so queries on high-cardinality tags can return aggregated results directly without scanning every sub-table.


A special note:

If you have been using InfluxDB v2, prefer replacing it with InfluxDB 3, because InfluxDB 3 is fully compatible with the v2 client API, meaning the client program code does not need to change; this greatly reduces the migration workload.









          


VSP

WeChat: shao5621404

Blog | http://vincent.shaopengtrusit.top


