Cluster nodes offline

Problem description

  • One of my TDengine machines (server101:6030) has failed. The other machines now all show as offline, and restarting them does not bring them back.
    On server104:
taos> show mnodes;
     id      |            endpoint            |      role      |   status    |       create_time       |        role_time        |
==================================================================================================================================
           1 | server101:6030                 | offline        | offline     | 2025-05-14 11:52:56.064 | 1970-01-01 08:00:00.000 |
           4 | server103:6030                 | offline        | offline     | 2025-05-14 11:55:20.981 | 1970-01-01 08:00:00.000 |
           5 | server104:6030                 | leader         | ready       | 2025-05-14 11:55:22.533 | 2025-07-18 15:36:12.886 |
Query OK, 3 row(s) in set (0.003116s)

taos> show dnodes;
     id      |            endpoint            | vnodes | support_vnodes |    status    |       create_time       |       reboot_time       |              note              |
=============================================================================================================================================================================
           1 | server101:6030                 |      4 |              0 | offline      | 2025-05-14 11:52:56.061 | 1970-01-01 08:00:00.000 | status not received            |
           2 | server100:6030                 |      5 |              0 | offline      | 2025-05-14 11:53:58.688 | 1970-01-01 08:00:00.000 | status not received            |
           3 | server102:6030                 |      4 |              0 | offline      | 2025-05-14 11:53:58.696 | 1970-01-01 08:00:00.000 | status not received            |
           4 | server103:6030                 |      4 |              0 | offline      | 2025-05-14 11:53:58.703 | 1970-01-01 08:00:00.000 | status not received            |
           5 | server104:6030                 |      4 |            197 | ready        | 2025-05-14 11:54:00.429 | 2025-06-06 22:18:33.106 |                                |
Query OK, 5 row(s) in set (0.003546s)
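
For anyone comparing similar output, the tables above can be read mechanically. The following is a minimal Python sketch, not a TDengine tool: `parse_taos_table`, `offline_endpoints` and `has_majority` are hypothetical helper names, and the sample is the `show mnodes` output above trimmed to its first four columns. The majority check assumes a strict-majority rule as in Raft-style replication; whether that rule explains the state reported here is not established by this post.

```python
def parse_taos_table(text):
    """Parse pipe-delimited taos CLI output into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the prompt, the '====' rule and 'Query OK'
        cells = [c.strip() for c in line.split("|")]
        if cells[-1] == "":
            cells = cells[:-1]  # drop the empty cell after the trailing '|'
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def offline_endpoints(rows):
    """Endpoints whose status column is anything other than 'ready'."""
    return [r["endpoint"] for r in rows if r.get("status") != "ready"]


def has_majority(rows):
    """Assumption: a strict majority of nodes must be 'ready'."""
    ready = sum(1 for r in rows if r.get("status") == "ready")
    return ready > len(rows) // 2


# Sample copied from the post, trimmed to the first four columns.
SAMPLE_MNODES = """\
taos> show mnodes;
     id      |            endpoint            |      role      |   status    |
==============================================================================
           1 | server101:6030                 | offline        | offline     |
           4 | server103:6030                 | offline        | offline     |
           5 | server104:6030                 | leader         | ready       |
Query OK, 3 row(s) in set (0.003116s)
"""

if __name__ == "__main__":
    mnodes = parse_taos_table(SAMPLE_MNODES)
    print("offline:", offline_endpoints(mnodes))
    print("majority ready:", has_majority(mnodes))
```

On the sample this reports server101:6030 and server103:6030 as offline and no majority of mnodes in the ready state. The same parser works on the `show dnodes` table, since it also has `endpoint` and `status` columns.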

Machine 101 crashed (probably a hardware problem). While we were trying to rescue 101, we found the cluster in this state.

On 103, I changed firstEp to 104, but the node still cannot be started. The log on 103 is as follows:
  • Steps taken on 103
    • Changed firstEp and secondEp so that both point to 104
    • Errors persist and the connection still fails
[root@server103 taos]# tail -f /var/log/taos/taoslog0.0 
07/22 11:17:54.210488 00006128 C TSC ep:0, server104:6030
07/22 11:17:54.210494 00006128 C TSC ep:1, server103:6030
07/22 11:17:59.234488 00006602 C RPC WARN TSC msg need retry,QID:0x3467e52343b50001:0x6645030234300002
07/22 11:18:00.212448 00006602 C RPC ERROR TSC conn 0x7f25100095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:00.212843 00006602 C RPC WARN TSC msg need retry,QID:0x3467e52343b50001:0x6645030234300002
07/22 11:18:10.241562 00006128 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:10.241601 00006128 C TSC WARN failed to remove request from hash, code:Invalid parameters
07/22 11:18:12.415889 00006128 C TSC connecting to server, numOfEps:2 inUse:0 user:root db:(null) key:root:dcc5bed04851fec854c035b2e40263b6:(null):0
07/22 11:18:12.415902 00006128 C TSC ep:0, server104:6030
07/22 11:18:12.415907 00006128 C TSC ep:1, server103:6030
07/22 11:18:17.422189 00006603 C RPC WARN TSC msg need retry,QID:0x3467e5238abf0002:0x6645030238a00003
07/22 11:18:18.250608 00006603 C RPC ERROR TSC conn 0x7f25080095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:18.250928 00006603 C RPC WARN TSC msg need retry,QID:0x3467e5238abf0002:0x6645030238a00003
07/22 11:18:28.281729 00006128 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:28.281766 00006128 C TSC WARN failed to remove request from hash, code:Invalid parameters
07/22 11:18:30.419961 00006122 C TSC connecting to server, numOfEps:2 inUse:0 user:root db:(null) key:root:dcc5bed04851fec854c035b2e40263b6:(null):0
07/22 11:18:30.419978 00006122 C TSC ep:0, server104:6030
07/22 11:18:30.419983 00006122 C TSC ep:1, server103:6030
07/22 11:18:35.426332 00006604 C RPC WARN TSC msg need retry,QID:0x3467e523d1140003:0x664503023d100004
07/22 11:18:36.286870 00006604 C RPC ERROR TSC conn 0x7f25000095a0 failed to connect to server101:6030 since host is unreachable
07/22 11:18:36.287294 00006604 C RPC WARN TSC msg need retry,QID:0x3467e523d1140003:0x664503023d100004
07/22 11:18:46.319480 00006122 C TSC ERROR failed to connect to server, reason: Sync leader is unreachable
07/22 11:18:46.319510 00006122 C TSC WARN failed to remove request from hash, code:Invalid parameters
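
The log excerpt above repeats the same pattern every few seconds, so it may help to summarise it. Below is a minimal Python sketch, assuming log lines in exactly the format shown; `summarize_taos_log` and `unconfigured_targets` are hypothetical helper names, not part of TDengine. It only counts what the log says and does not establish why the client is directed to a particular host.

```python
import re
from collections import Counter

# Patterns are based only on the log lines quoted in this post.
CONFIGURED_EP = re.compile(r"\bep:(\d+), (\S+:\d+)")
CONNECT_FAIL = re.compile(r"failed to connect to (\S+:\d+) since (.+)$")


def summarize_taos_log(lines):
    """Return (configured endpoints by index, failure counts by target)."""
    configured = {}
    failures = Counter()
    for line in lines:
        m = CONFIGURED_EP.search(line)
        if m:
            configured[int(m.group(1))] = m.group(2)
        m = CONNECT_FAIL.search(line)
        if m:
            failures[(m.group(1), m.group(2).strip())] += 1
    return configured, failures


def unconfigured_targets(configured, failures):
    """Failed targets that are not among the endpoints the client lists."""
    return {ep for ep, _ in failures} - set(configured.values())


# A few lines copied verbatim from the log above.
SAMPLE_LOG = [
    "07/22 11:17:54.210488 00006128 C TSC ep:0, server104:6030",
    "07/22 11:17:54.210494 00006128 C TSC ep:1, server103:6030",
    "07/22 11:18:00.212448 00006602 C RPC ERROR TSC conn 0x7f25100095a0 "
    "failed to connect to server101:6030 since host is unreachable",
    "07/22 11:18:10.241562 00006128 C TSC ERROR failed to connect to "
    "server, reason: Sync leader is unreachable",
    "07/22 11:18:18.250608 00006603 C RPC ERROR TSC conn 0x7f25080095a0 "
    "failed to connect to server101:6030 since host is unreachable",
]

if __name__ == "__main__":
    configured, failures = summarize_taos_log(SAMPLE_LOG)
    print("configured endpoints:", configured)
    print("connection failures:", dict(failures))
    print("failed but not configured:",
          unconfigured_targets(configured, failures))
```

On these sample lines the client lists server104:6030 and server103:6030 as its endpoints, while every connection failure targets server101:6030, which is not in that list.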

TDengine version in use: TDengine-server-3.3.5.8

If you run into the same problem, please start a new thread.