[TDengine environment]
Production
[TDengine version]
3.3.6.0
[Operating system and version]
k8s cluster:
Client Version: v1.23.17
Server Version: v1.23.17
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
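[Deployment method] Container / non-container
Container
[Number of cluster nodes]
3
[Number of cluster replicas]
3
[Business impact]
Restarting a follower node leaves the service unavailable for at least 20 seconds, which does not meet the cluster's high-availability requirement.
[Steps to reproduce] Which operations led to the problem
Restart a follower node.
[Problem encountered: symptoms and impact]
After a follower node is restarted, the service is unavailable for at least 20 seconds, which does not meet the cluster's high-availability requirement.
[Resource configuration]
[Full error screenshots] (Do not paste large blocks of error output; raw error text is hard to read on the forum.)
After deployment there are 3 pods: tdengine-0, tdengine-1, and tdengine-2.
Once startup completes, tdengine-0 is the default mnode. Its log is shown below; it contains ERROR entries, and I do not know whether they matter.
Inside the tdengine-0 container I ran the taos command to create mnodes on tdengine-1 and tdengine-2 as well. tdengine-0 is the leader, as shown in the screenshot below.
I ran kubectl delete pods -n iiott tdengine-0. After the pod restarted, I entered the tdengine-2 container and ran commands: tdengine-2 was now the leader, but commands either failed or took a long time to respond, as shown in the screenshot below.
I then ran kubectl delete pods -n iiott tdengine-0 again. After the restart, commands run inside the tdengine-2 container took nearly 30 seconds to respond, as shown in the screenshot below.
The tdengine-2 log is shown below (an excerpt of the part that contains errors).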
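The delay described above can be measured rather than estimated. The following is a minimal sketch, not part of the original deployment: measure_recovery is a hypothetical helper that retries any command once per second and reports how many seconds passed before it first succeeded.

```shell
#!/bin/sh
# Sketch: retry a command once per second until it succeeds,
# then report how long that took.
# Usage: measure_recovery <max_seconds> <command...>
# (measure_recovery is a hypothetical helper name, not a TDengine tool.)
measure_recovery() {
    max=$1
    shift
    start=$(date +%s)
    while :; do
        # Success is defined purely by the exit status of the command.
        if "$@" >/dev/null 2>&1; then
            echo "recovered after $(( $(date +%s) - start ))s"
            return 0
        fi
        if [ $(( $(date +%s) - start )) -ge "$max" ]; then
            echo "still failing after ${max}s"
            return 1
        fi
        sleep 1
    done
}
```

One possible use, inside the tdengine-2 container right after the pod deletion: measure_recovery 120 sh -c 'taos -s "show mnodes;" | grep -q leader'. This assumes that the show mnodes output contains the word leader once an mnode leader has been elected. The reported time also includes however long each taos call itself blocks.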
apiVersion: v1
kind: ConfigMap
metadata:
  name: entrypoint
data:
  entrypoint.sh: |
    #!/bin/sh
    set -e
    # for TZ awareness
    if [ "$TZ" != "" ]; then
      ln -sf /usr/share/zoneinfo/$TZ /etc/localtime
      echo $TZ >/etc/timezone
    fi
    # option to disable taosadapter, default is no
    DISABLE_ADAPTER=${TAOS_DISABLE_ADAPTER:-0}
    unset TAOS_DISABLE_ADAPTER
    DISABLE_KEEPER=${TAOS_DISABLE_KEEPER:-0}
    unset TAOS_DISABLE_KEEPER
    DISABLE_EXPLORER=${TAOS_DISABLE_EXPLORER:-0}
    unset TAOS_DISABLE_EXPLORER
    # Get DATA_DIR from taosd -C
    DATA_DIR=$(taosd -C | awk '/^(default|cfg_file)[[:space:]]+dataDir[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    DATA_DIR=${DATA_DIR:-/var/lib/taos}
    # Get FQDN from taosd -C
    #FQDN=$(taosd -C | awk '/^(default|cfg_file)[[:space:]]+fqdn[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    FQDN=$(taosd -C | awk '/^(default|cfg_file|env_var)[[:space:]]+fqdn[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    # ensure the fqdn is resolved as localhost
    grep "$FQDN" /etc/hosts >/dev/null || echo "127.0.0.1 $FQDN" >>/etc/hosts
    # Get first ep from taosd -C
    #FIRSET_EP=$(taosd -C | awk '/^(default|cfg_file)[[:space:]]+firstEp[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    FIRSET_EP=$(taosd -C | awk '/^(default|cfg_file|env_var)[[:space:]]+firstEp[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    # parse first ep host and port
    FIRST_EP_HOST=${FIRSET_EP%:*}
    FIRST_EP_PORT=${FIRSET_EP#*:}
    # in case of custom server port
    SERVER_PORT=$(taosd -C | awk '/^(default|cfg_file)[[:space:]]+serverPort[[:space:]]+/ {print $NF; exit}' | sed 's|/*$||')
    SERVER_PORT=${SERVER_PORT:-6030}
    set +e
    ulimit -c unlimited
    # set core files pattern, maybe failed
    sysctl -w kernel.core_pattern=/corefile/core-$FQDN-%e-%p >/dev/null 2>&1
    set -e
    # if dnode has been created or has mnode ep set or the host is first ep or not for cluster, just start.
    if [ -f "$DATA_DIR/dnode/dnode.json" ] ||
      [ -f "$DATA_DIR/dnode/mnodeEpSet.json" ] ||
      [ "$TAOS_FQDN" = "$FIRST_EP_HOST" ]; then
      $@ &
    # others will first wait the first ep ready.
    else
      if [ "$TAOS_FIRST_EP" = "" ]; then
        echo "run TDengine with single node."
        $@ &
      fi
      while true; do
        es=$(taos -h $FIRST_EP_HOST -P $FIRST_EP_PORT --check | grep "^[0-9]*:")
        echo ${es}
        if [ "${es%%:*}" -eq 2 ]; then
          echo "execute create dnode"
          taos -h $FIRST_EP_HOST -P $FIRST_EP_PORT -s "create dnode \"$FQDN:$SERVER_PORT\";"
          break
        fi
        sleep 1s
      done
      if ps aux | grep -v grep | grep -v entrypoint.sh | grep taosd > /dev/null; then
        echo "TDengine is running"
      else
        $@ &
      fi
    fi
    if [ "$DISABLE_ADAPTER" = "0" ]; then
      which taosadapter >/dev/null && taosadapter &
      # wait for 6041 port ready
      for _ in $(seq 1 20); do
        nc -z localhost 6041 && break
        sleep 0.5
      done
    fi
    if [ "$DISABLE_KEEPER" = "0" ]; then
      sleep 3
      which taoskeeper >/dev/null && taoskeeper &
      # wait for 6043 port ready
      for _ in $(seq 1 20); do
        nc -z localhost 6043 && break
        sleep 0.5
      done
    fi
    which taos-explorer >/dev/null && taos-explorer
    # wait for 6060 port ready
    for _ in $(seq 1 20); do
      nc -z localhost 6060 && break
      sleep 0.5
    done
---
apiVersion: v1
kind: Service
metadata:
  name: "taosd"
  labels:
    app: "tdengine"
spec:
  ports:
    - name: tcp6030
      protocol: "TCP"
      port: 6030
      nodePort: 32367
    - name: tcp6041
      protocol: "TCP"
      port: 6041
      nodePort: 32368
  type: NodePort
  selector:
    app: "tdengine"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "tdengine"
  labels:
    app: "tdengine"
spec:
  serviceName: "taosd"
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: "tdengine"
  template:
    metadata:
      name: "tdengine"
      labels:
        app: "tdengine"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # POD_NAME for FQDN config
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # SERVICE_NAME and NAMESPACE for fqdn resolve
            - name: SERVICE_NAME
              value: "taosd"
            - name: STS_NAME
              value: "tdengine"
            - name: STS_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            # TZ for timezone settings, we recommend to always set it.
            - name: TZ
              value: "Asia/Shanghai"
            # Environment variables with prefix TAOS_ will be parsed and converted into corresponding parameter in taos.cfg. For example, serverPort in taos.cfg should be configured by TAOS_SERVER_PORT when using K8S to deploy
            - name: TAOS_SERVER_PORT
              value: "6030"
            # Must set if you want a cluster.
            - name: TAOS_FIRST_EP
              value: "$(STS_NAME)-0.$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local:$(TAOS_SERVER_PORT)"
            # TAOS_FQDN should always be set in k8s env.
            - name: TAOS_FQDN
              value: "$(POD_NAME).$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local"
            - name: TAOS_ADAPTER_SML_AUTO_CREATE_DB
              value: "true"
            - name: TAOS_ADAPTER_LOG_ROTATION_COUNT
              value: "7"
            - name: TAOS_DISABLE_KEEPER
              value: "1"
            - name: TAOS_LOGKEEPDAYS
              value: "7"
            - name: TAOS_ADAPTER_LOG_ROTATION_SIZE
              value: "100MB"
            - name: TAOS_ADAPTER_LOG_SQL_ROTATION_SIZE
              value: "100MB"
          volumeMounts:
            - name: taosdata
              mountPath: /var/lib/taos
            - name: entrypoint
              mountPath: /usr/bin/entrypoint.sh
              subPath: entrypoint.sh
          startupProbe:
            exec:
              command:
                - taos-check
            failureThreshold: 360
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - taos-check
            initialDelaySeconds: 5
            timeoutSeconds: 5000
          livenessProbe:
            exec:
              command:
                - taos-check
            initialDelaySeconds: 15
            periodSeconds: 20
      volumes:
        - name: entrypoint
          configMap:
            name: entrypoint
            defaultMode: 0755
      nodeSelector:
        project: iiott
  volumeClaimTemplates:
    - metadata:
        name: taosdata
      spec:
        accessModes:
          - "ReadWriteOnce"
        storageClassName: "iai-ceph-block"
        resources:
          requests:
            storage: "50Gi"
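With all 3 nodes started normally, can each of the 3 containers ping the FQDNs of all 3 TDengine nodes?
First make sure every ping succeeds, then test stopping a node.
The containers do not have the ping command, but running getent hosts tdengine-2.taosd.iiott.svc.cluster.local inside one of them does resolve to the IP of another node.
You need to test all 3 FQDNs from every node, 9 checks in total. Do they all succeed?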
tdengine-0.taosd.iiott.svc.cluster.local
tdengine-1.taosd.iiott.svc.cluster.local
tdengine-2.taosd.iiott.svc.cluster.local
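The nine checks can be scripted so that nothing is skipped. The sketch below is an illustration only: check_fqdns is a hypothetical helper, and it assumes the namespace iiott and the service name taosd used in this thread. It uses getent hosts because ping is not available in the containers.

```shell
#!/bin/sh
# Sketch: resolve every cluster FQDN from inside one pod.
# Assumption: namespace "iiott" and service "taosd", as in this thread.
# (check_fqdns is a hypothetical helper name.)
check_fqdns() {
    failed=0
    for i in 0 1 2; do
        fqdn="tdengine-$i.taosd.iiott.svc.cluster.local"
        # getent prints "<ip> <name>" and exits non-zero when lookup fails.
        if ip=$(getent hosts "$fqdn"); then
            echo "OK   $fqdn -> ${ip%% *}"
        else
            echo "FAIL $fqdn"
            failed=1
        fi
    done
    return "$failed"
}
```

Running check_fqdns once inside each of the three pods covers all nine combinations. Note that name resolution alone does not prove that port 6030 is reachable, and that the entrypoint script maps a pod's own FQDN to 127.0.0.1 in /etc/hosts, so a pod may report 127.0.0.1 for its own name.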
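These FQDNs resolve to pod IPs only when a headless service exists.
But you are using a NodePort service, so DNS resolution is incorrect.
A StatefulSet must use a headless service (clusterIP: None).
A NodePort service cannot be a headless service.
The two serve completely different purposes.
You have mixed the two together, so DNS is in disarray and the cluster is unstable.
I have changed it to a headless service: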
apiVersion: v1
kind: Service
metadata:
  name: "taosd"
  labels:
    app: "tdengine"
spec:
  ports:
    - name: tcp6030
      protocol: "TCP"
      port: 6030
    - name: tcp6041
      protocol: "TCP"
      port: 6041
  clusterIP: None
  selector:
    app: "tdengine"
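It may be worth confirming that the change actually took effect in the cluster. The sketch below is a hypothetical helper (is_headless is not a kubectl feature) that reads spec.clusterIP with kubectl and checks for the literal value None.

```shell
#!/bin/sh
# Sketch: confirm that a Service is headless, i.e. spec.clusterIP is "None".
# Usage: is_headless <namespace> <service>
# (is_headless is a hypothetical helper name.)
is_headless() {
    cluster_ip=$(kubectl get svc "$2" -n "$1" -o jsonpath='{.spec.clusterIP}') || return 2
    if [ "$cluster_ip" = "None" ]; then
        echo "$2 is headless"
    else
        echo "$2 is NOT headless (clusterIP=$cluster_ip)"
        return 1
    fi
}
```

For this thread the call would be is_headless iiott taosd. Because spec.clusterIP cannot be changed on an existing Service, switching from NodePort to headless generally means deleting the Service and creating it again; if the check still reports an IP address, the old Service is probably still in place.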
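After startup, mnodes were created on all 3 pods (tdengine-0, tdengine-1, tdengine-2), with tdengine-0 as the leader. After delete pod tdengine-0, tdengine-2 became the leader. After another delete pod tdengine-0, running taos> show mnodes; inside pod tdengine-2 keeps timing out, with the error shown below.
The pod tdengine-0 log is as follows.
The pod tdengine-1 log is essentially the same as that of tdengine-0.
The pod tdengine-2 log is as follows.
After waiting a while, does show mnodes still fail?
Filter the taosdlog files on each node and check:
grep 'vgId:1' taosdlog* | grep -v heartb | grep ERROR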
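To collect the same filter output from all three pods in one pass, something like the following sketch could be used. grep_vg1_errors is a hypothetical helper, and the log directory /var/log/taos is an assumption (TDengine's default logDir); adjust the path if logDir is configured differently.

```shell
#!/bin/sh
# Sketch: run the log filter from the reply in every pod of the StatefulSet.
# Assumption: logs are in /var/log/taos (the default logDir).
# Usage: grep_vg1_errors <namespace>
# (grep_vg1_errors is a hypothetical helper name.)
grep_vg1_errors() {
    ns=$1
    for i in 0 1 2; do
        echo "== tdengine-$i =="
        # kubectl exec returns the exit status of the remote pipeline,
        # which is non-zero when grep finds no matching lines.
        kubectl exec -n "$ns" "tdengine-$i" -- sh -c \
            "grep 'vgId:1' /var/log/taos/taosdlog* | grep -v heartb | grep ERROR" ||
            echo "(no matching lines, or the command failed)"
    done
}
```

In this thread the call would be grep_vg1_errors iiott.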

This is a k8s configuration problem. Fix pod-to-pod connectivity and DNS first, and the TDengine problem will be resolved.