TDengine: the entire cluster went down

【TDengine environment】
Production

【TDengine version】
https://ask.taosdata.com/
【Operating system and version】

【Deployment method】Container / non-container deployment
Non-container deployment
【Number of cluster nodes】
5
【Number of replicas】
3
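【Business impact】
One machine went down, and it happened to be the firstEp node. After that, the entire cluster went down.
【Steps to reproduce】Which operations led to the problem
No special operations were performed.
【Problem encountered: symptoms and impact】
Problem description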

  • One of my TDengine machines (server101:6030) has failed. All the other machines show as offline, and restarting them does not bring them back.
    On server104:
taos> show mnodes;
     id      |            endpoint            |      role      |   status    |       create_time       |        role_time        |
==================================================================================================================================
           1 | server101:6030                 | offline        | offline     | 2025-05-14 11:52:56.064 | 1970-01-01 08:00:00.000 |
           4 | server103:6030                 | offline        | offline     | 2025-05-14 11:55:20.981 | 1970-01-01 08:00:00.000 |
           5 | server104:6030                 | leader         | ready       | 2025-05-14 11:55:22.533 | 2025-07-18 15:36:12.886 |
Query OK, 3 row(s) in set (0.003116s)

taos> show dnodes;
     id      |            endpoint            | vnodes | support_vnodes |    status    |       create_time       |       reboot_time       |              note              |
=============================================================================================================================================================================
           1 | server101:6030                 |      4 |              0 | offline      | 2025-05-14 11:52:56.061 | 1970-01-01 08:00:00.000 | status not received            |
           2 | server100:6030                 |      5 |              0 | offline      | 2025-05-14 11:53:58.688 | 1970-01-01 08:00:00.000 | status not received            |
           3 | server102:6030                 |      4 |              0 | offline      | 2025-05-14 11:53:58.696 | 1970-01-01 08:00:00.000 | status not received            |
           4 | server103:6030                 |      4 |              0 | offline      | 2025-05-14 11:53:58.703 | 1970-01-01 08:00:00.000 | status not received            |
           5 | server104:6030                 |      4 |            197 | ready        | 2025-05-14 11:54:00.429 | 2025-06-06 22:18:33.106 |                                |
Query OK, 5 row(s) in set (0.003546s)
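
When output like this has to be compared across several nodes, it can help to reduce it to a list of offline endpoints. The following is a minimal sketch, assuming the pipe-delimited layout printed by the taos shell above; the function names `parse_taos_table` and `offline_endpoints` are hypothetical helpers, not part of TDengine.

```python
from __future__ import annotations


def parse_taos_table(output: str) -> list[dict[str, str]]:
    """Parse pipe-delimited taos shell output into a list of row dicts."""
    header = None
    rows = []
    for line in output.splitlines():
        if "|" not in line:
            continue  # skips the prompt, the ==== rule and the "Query OK" line
        line = line.rstrip()
        if line.endswith("|"):
            line = line[:-1]  # drop the closing pipe so it adds no extra cell
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def offline_endpoints(output: str) -> list[str]:
    """Return the endpoints whose status column reads 'offline'."""
    return [
        row["endpoint"]
        for row in parse_taos_table(output)
        if row.get("status") == "offline"
    ]
```

Applied to the `show dnodes` output above, this would list server101:6030, server100:6030, server102:6030 and server103:6030, leaving server104:6030 as the only node reporting ready.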

After machine 101 crashed (probably a hardware problem), we were trying to rescue it, and then we found the cluster in this state.

On 103, I changed firstEp to 104, but the node still could not be brought up. The output pasted for 103 was identical to the show mnodes / show dnodes output above.

  • Operations on 103
    • Changed firstEp and secondEp, both to 104
    • It still fails; the client cannot connect
[root@server103 taos]# tail -f /var/log/taos/taoslog0.0 
07/22 11:17:54.210488 00006128 C TSC ep:0, server104:6030
07/22 11:17:54.210494 00006128 C TSC ep:1, server103:6030
07/22 11:17:59.234488 00006602 C RPC WARN TSC msg need retry,QID:0x3467e52343b50001:0x6645030234300002
07/22 11:18:00.212448 00006602 C RPC ERROR TSC conn 0x7f25100095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:00.212843 00006602 C RPC WARN TSC msg need retry,QID:0x3467e52343b50001:0x6645030234300002
07/22 11:18:10.241562 00006128 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:10.241601 00006128 C TSC WARN failed to remove request from hash, code:Invalid parameters
07/22 11:18:12.415889 00006128 C TSC connecting to server, numOfEps:2 inUse:0 user:root db:(null) key:root:dcc5bed04851fec854c035b2e40263b6:(null):0
07/22 11:18:12.415902 00006128 C TSC ep:0, server104:6030
07/22 11:18:12.415907 00006128 C TSC ep:1, server103:6030
07/22 11:18:17.422189 00006603 C RPC WARN TSC msg need retry,QID:0x3467e5238abf0002:0x6645030238a00003
07/22 11:18:18.250608 00006603 C RPC ERROR TSC conn 0x7f25080095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:18.250928 00006603 C RPC WARN TSC msg need retry,QID:0x3467e5238abf0002:0x6645030238a00003
07/22 11:18:28.281729 00006128 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:28.281766 00006128 C TSC WARN failed to remove request from hash, code:Invalid parameters
07/22 11:18:30.419961 00006122 C TSC connecting to server, numOfEps:2 inUse:0 user:root db:(null) key:root:dcc5bed04851fec854c035b2e40263b6:(null):0
07/22 11:18:30.419978 00006122 C TSC ep:0, server104:6030
07/22 11:18:30.419983 00006122 C TSC ep:1, server103:6030
07/22 11:18:35.426332 00006604 C RPC WARN TSC msg need retry,QID:0x3467e523d1140003:0x664503023d100004
07/22 11:18:36.286870 00006604 C RPC ERROR TSC conn 0x7f25000095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:36.287294 00006604 C RPC WARN TSC msg need retry,QID:0x3467e523d1140003:0x664503023d100004
07/22 11:18:46.319480 00006122 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:46.319510 00006122 C TSC WARN failed to remove request from hash, code:Invalid parameters
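
The log above lists server104:6030 and server103:6030 as ep:0 and ep:1, yet the connection errors are all for server101:6030. A small sketch for pulling that comparison out of a longer log follows; it assumes the exact message wording seen in this log, and `summarize_client_log` is a hypothetical helper name.

```python
from __future__ import annotations

import re
from collections import Counter

# Patterns copied from the message wording in the log excerpt above.
EP_LINE = re.compile(r"\bTSC ep:(\d+), (\S+)")
UNREACHABLE = re.compile(r"failed to connect to (\S+) since host is unreachable")


def summarize_client_log(text: str) -> tuple[set[str], Counter]:
    """Return (endpoints the client was configured with, unreachable-host counts)."""
    configured: set[str] = set()
    unreachable: Counter = Counter()
    for line in text.splitlines():
        ep = EP_LINE.search(line)
        if ep:
            configured.add(ep.group(2))
        failed = UNREACHABLE.search(line)
        if failed:
            unreachable[failed.group(1)] += 1
    return configured, unreachable


def unexpected_targets(text: str) -> set[str]:
    """Endpoints the client failed to reach that are not in its configured list."""
    configured, unreachable = summarize_client_log(text)
    return set(unreachable) - configured
```

Run against the excerpt above, `unexpected_targets` would contain only server101:6030, which matches what can be read off the log by eye: the client is still being sent to the failed node even though it is no longer listed as firstEp or secondEp.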

【Resource configuration】

【Full screenshot of the error】

Is the taosd process failing to start on every node now?

Or does it start, but you cannot connect to it?

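Machine 101 has a problem, but the other machines are fine. Even so, the entire cluster is down.

This is a serious problem. When the firstEp node goes down, the whole cluster is gone. That is not reasonable.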

Clear the logs on the other nodes, set debugFlag to 135, restart, and after 5 minutes upload the logs, or send them to me on WeChat: a15652223354
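
For the debugFlag step, the change amounts to one `debugFlag 135` line in taos.cfg on each node. A minimal sketch of that edit is below, assuming the usual `key value` line format of taos.cfg; `set_debug_flag` is a hypothetical helper, and clearing the logs, restarting taosd and collecting the files are deliberately left out.

```python
from __future__ import annotations

import re

# Matches an active (uncommented) debugFlag line; commented lines are left alone.
ACTIVE_DEBUG_FLAG = re.compile(r"^[ \t]*debugFlag[ \t]+\S+.*$", re.MULTILINE)


def set_debug_flag(cfg_text: str, level: int = 135) -> str:
    """Return the config text with an active 'debugFlag <level>' line."""
    new_line = f"debugFlag {level}"
    if ACTIVE_DEBUG_FLAG.search(cfg_text):
        # Rewrite every active debugFlag line so no stale value remains.
        return ACTIVE_DEBUG_FLAG.sub(new_line, cfg_text)
    # No active setting yet: append one, keeping the file newline-terminated.
    separator = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return f"{cfg_text}{separator}{new_line}\n"
```

In practice this would be applied to the contents of the node's taos.cfg (commonly /etc/taos/taos.cfg, which is an assumption about this installation) and written back before restarting; once the logs have been collected, the setting should be reverted, since debug logging is verbose.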
