TDengine (taos) database crash

【TDengine environment】
Test

【TDengine version】

3.3.7.0

【Operating system and version】

CentOS 7

【Deployment method】Container / non-container

Container

【Number of cluster nodes】

1

【Number of cluster replicas】

1

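【Business impact】

【Reproduction path】Which operations led to the problem

【Problem encountered: symptoms and impact】

The database crashed suddenly and can no longer be accessed.

【Resource configuration】

【Full screenshot of the error】(Do not paste large blocks of error output; raw error output is hard to read on the forum)

The error log is as follows: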

double free or corruption (!prev)
11/14 16:44:33.239643 00000338 DB  ERROR QID:0x4000000000000066 latency:16.251324664s, sql:insert into km_c497725d28c8_6043 using keeper_monitor tags ('c497725d28c8:6043') values ( now,  0.016667, 0.103304, 0), err:Post "http://127.0.0.1:6041/rest/sql/log?req_id=8210669156717035604": read tcp 127.0.0.1:60986->127.0.0.1:6041: read: connection reset by peer
11/14 16:44:33.239679 00000338 MON ERROR execute sql:insert into km_c497725d28c8_6043 using keeper_monitor tags ('c497725d28c8:6043') values ( now,  0.016667, 0.103304, 0), error:Post "http://127.0.0.1:6041/rest/sql/log?req_id=8210669156717035604": read tcp 127.0.0.1:60986->127.0.0.1:6041: read: connection reset by peer
/usr/bin/entrypoint.sh: line 116:   295 Aborted                 (core dumped) taosadapter
11/14 16:44:43.002434 00000338 DB  ERROR QID:0x100000000014b00 latency:408.673µs, sql:insert into log.taosd_cluster_basic_6133841669712502475 using taosd_cluster_basic tags ('6133841669712502475') values (1763109883001, 'localhost:6030', 1, '3.3.7.0') , err:Post "http://127.0.0.1:6041/rest/sql/log?req_id=8210669156823990357": dial tcp 127.0.0.1:6041: connect: connection refused
11/14 16:44:43.002487 00000338 GEN ERROR QID:0x100000000014b00 insert taosd_cluster_basic error, msg:Post "http://127.0.0.1:6041/rest/sql/log?req_id=8210669156823990357": dial tcp 127.0.0.1:6041: connect: connection refused
11/14 16:44:43.004218 00000338 WEB ERROR
2025/11/14 16:44:43 [Recovery] 2025/11/14 - 16:44:43 panic recovered:
runtime error: invalid memory address or nil pointer dereference
/root/.gvm/gos/go1.24.2/src/runtime/panic.go:262 (0x47daf8)
/root/.gvm/gos/go1.24.2/src/runtime/signal_unix.go:925 (0x47dac8)
/data/release/main/TDinternal/community/tools/keeper/api/gen_metric.go:303 (0xb8976e)
/data/release/main/TDinternal/community/tools/keeper/api/gen_metric.go:262 (0xb88c1c)
/data/release/main/TDinternal/community/tools/keeper/api/gen_metric.go:217 (0xb88244)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0x95566e)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 (0x95565b)
/data/release/main/TDinternal/community/tools/keeper/infrastructure/log/web.go:53 (0xa9817b)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xa97b86)
/data/release/main/TDinternal/community/tools/keeper/infrastructure/log/web.go:24 (0xa97b6d)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0x9544ad)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 (0x954144)
/root/.gvm/pkgsets/go1.24.2/global/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 (0x953c89)
/root/.gvm/gos/go1.24.2/src/net/http/server.go:3301 (0x71c44d)
/root/.gvm/gos/go1.24.2/src/net/http/server.go:2102 (0x6f9e84)
/root/.gvm/gos/go1.24.2/src/runtime/asm_amd64.s:1700 (0x483880)

This is a memory error. Please generate a core file and then use gdb to print the backtrace.

Run the following command on the host to enable core file generation:

 sudo tee -a /etc/sysctl.conf <<< 'kernel.core_pattern = /var/crash/core.%e.%p.%h.%t'  \
     && sudo mkdir -p /var/crash \
     && sudo chmod 777 /var/crash \
     && sudo sysctl -p

When starting the container, add the extra parameters --privileged --ulimit core=-1, and mount the host's /var/crash directory to /var/crash inside the container.

For example:

docker run -itd --privileged --ulimit core=-1 -v /var/crash:/var/crash

When the memory error occurs, a core file is written to the /var/crash folder. Load that file with gdb:

gdb taosadapter core.xxxxx

Then run bt
and post a screenshot of the bt output.

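After following the steps above, no core file was generated, but I can reproduce the bug:

I. Prerequisites:

1. Use the WebSocket driver of taos-jdbcdriver:3.7.1. The RESTful driver has the same problem.

2. Create a connection pool with Hikari, configured as follows: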

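import java.util.Properties;
import com.taosdata.jdbc.TSDBDriver;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// batchSize, cacheSize and writeThreadNum are application-defined values not shown in this post.
Properties properties = new Properties();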
properties.setProperty(TSDBDriver.PROPERTY_KEY_ASYNC_WRITE, "stmt");
properties.setProperty(TSDBDriver.PROPERTY_KEY_BATCH_SIZE_BY_ROW, String.valueOf(batchSize));
properties.setProperty(TSDBDriver.PROPERTY_KEY_CACHE_SIZE_BY_ROW, String.valueOf(cacheSize));
properties.setProperty(TSDBDriver.PROPERTY_KEY_BACKEND_WRITE_THREAD_NUM, String.valueOf(writeThreadNum));
properties.setProperty(TSDBDriver.PROPERTY_KEY_ENABLE_AUTO_RECONNECT, "true");
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:TAOS-WS://10.216.36.90:6041/mdc_db?characterEncoding=UTF-8&useUnicode=true&useSSL=false&zeroDateTimeBehavior=convertToNull&serverTimezone=Asia/Shanghai&useInformationSchema=true");
config.setUsername("root");
config.setPassword("taosdata");
config.setMinimumIdle(10);
config.setMaximumPoolSize(20);
config.setConnectionTimeout(30000);
config.setMaxLifetime(0);
config.setIdleTimeout(0);
config.setConnectionTestQuery("SELECT 1");
config.setDataSourceProperties(properties);
HikariDataSource dataSource = new HikariDataSource(config);

II. Reproduction steps
1. Create a super table

CREATE STABLE IF NOT EXISTS mdc_db.test (ts TIMESTAMP, p1 binary (64),p2 binary (64),p3 binary (64)) TAGS (group_id binary(64));

2. Add a column to this table

ALTER STABLE mdc_db.test ADD COLUMN p4 binary (64);

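3. Obtaining a database connection from the connection pool now produces the error above, and the database then crashes.

4. Even after the database is restarted, obtaining a connection from the pool produces the error above again, and the database crashes.

Both operations above (creating the super table and adding the column) were executed with a PreparedStatement on a connection obtained from the connection pool.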
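The sequence described above can be sketched roughly as follows. This is only an illustration of the reported steps, not the original application code: the class ReproSketch and its methods are hypothetical names, the pool is passed in as a plain javax.sql.DataSource, and borrowing a fresh connection for each statement is an assumption, since the post does not say whether one connection was reused.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

/** Hypothetical helper (not from the original post) that replays steps 1 to 3. */
final class ReproSketch {

    // The two DDL statements from steps 1 and 2 (trailing semicolons dropped).
    static final List<String> STATEMENTS = List.of(
        "CREATE STABLE IF NOT EXISTS mdc_db.test (ts TIMESTAMP, p1 binary (64),p2 binary (64),p3 binary (64)) TAGS (group_id binary(64))",
        "ALTER STABLE mdc_db.test ADD COLUMN p4 binary (64)");

    private ReproSketch() {}

    /**
     * Steps 1 and 2: run each DDL statement through a PreparedStatement
     * on a pooled connection. Assumption: one borrowed connection per
     * statement, returned to the pool when the try block ends.
     */
    static void runSchemaChange(DataSource pool) throws SQLException {
        for (String sql : STATEMENTS) {
            try (Connection conn = pool.getConnection();
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.execute();
            }
        }
    }

    /**
     * Step 3: borrow one more connection from the pool. According to the
     * post, this is the point at which the error appears.
     */
    static boolean borrowAndValidate(DataSource pool, int timeoutSeconds) throws SQLException {
        try (Connection conn = pool.getConnection()) {
            return conn.isValid(timeoutSeconds);
        }
    }
}
```

With the HikariDataSource from the prerequisites, this would be invoked as ReproSketch.runSchemaChange(dataSource) followed by ReproSketch.borrowAndValidate(dataSource, 5), where the 5-second timeout is an arbitrary choice.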

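With the reproduction steps above, the database crash can be reproduced every time.

This problem seems fairly serious, and I hope you can help resolve it. (After configuring core file generation as described above, the configuration appeared to succeed and the container shows unlimited, but no core file was generated when the crash occurred.)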

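Could you provide complete reproduction code with sensitive data removed?

I ran it again a few days later and could no longer reproduce the problem. I will keep observing.

This needs case-specific investigation. If you run into the same problem, please open a new post.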