Kubernetes Common Troubleshooting and Fixes (kubelet has no disk pressure)

suiw9 2024-10-31 16:04

Troubleshooting commands and methods

kubectl get pods
kubectl describe pods my-pod
kubectl logs my-pod
kubectl exec -it my-pod -- /bin/bash    (enter the container and investigate from inside)

Log files on the host:
/var/log/pods/*
/var/log/containers/*

1 Pod troubleshooting

1. How to check:

Start with the following command

kubectl get pods -n namespace

The STATUS column of the output shows the state of each pod's containers

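2. Check the STATUS column

The possible status values are:

Running, Succeeded, Waiting, ContainerCreating, Failed, Pending, Terminating, Unknown, CrashLoopBackOff, ErrImagePull, ImagePullBackOff

If a pod shows an abnormal status, inspect its details and events

kubectl describe pod pod-name -n namespace

Then check the State field of each container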
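
As a rough illustration of scanning the STATUS column, the sketch below filters `kubectl get pods` table output for pods that are neither Running nor Completed (the STATUS shown for pods whose phase is Succeeded). The function name `unhealthy_pods` and the sample data are made up for this example, and the column layout NAME READY STATUS RESTARTS AGE is assumed (it differs when `-A` or `-o wide` is used).

```shell
# Print "name status" for pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods` table output on stdin (STATUS assumed in column 3).
unhealthy_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Against a live cluster (needs kubectl and cluster access):
#   kubectl get pods -n namespace | unhealthy_pods

# Offline demonstration with made-up sample output.
sample_output='NAME     READY   STATUS             RESTARTS   AGE
web-1    1/1     Running            0          2d
web-2    0/1     CrashLoopBackOff   12         2d
job-1    0/1     Completed          0          1d
db-1     0/1     ImagePullBackOff   0          5m'

printf '%s\n' "$sample_output" | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod` and `kubectl logs`.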

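3. Check the Conditions

True means the condition is met, False means it is not.
Initialized: the pod's init containers have completed
Ready: the pod can serve requests
ContainersReady: all containers in the pod are ready
PodScheduled: the pod has been scheduled; once a suitable node is found the pod is bound to it and the binding is written to etcd
Unschedulable: the pod cannot be scheduled because no suitable node was found (reported as the reason when PodScheduled is False)

If any condition is False, check the Events section

If Reason shows Unhealthy, read the error message that follows carefully and fix the specific problem it describes

4. Common Events errors: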

(1)

Failed to pull image "xxx":
Error: image xxx not found

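Cause: the image pull failed because the image could not be found
Fix (depending on the case):
Find a reachable image address and the correct tag, and update the manifest
If you are not logged in to the image registry, log in
If K8s has no permission to pull the image, grant the permission and pull again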

(2)

Warning FailedSync Error syncing pod, skipping: failed to with RunContainerError: "GenerateRunContainerOptions: XXX not found"

Cause: the object named XXX that this pod refers to cannot be found in the namespace
Fix:
Recreate the pod: kubectl replace --force -f pod.yaml

(3)

Warning FailedSync Error syncing pod, skipping: failed to
"StartContainer" for "XXX" with RunContainerError:
"GenerateRunContainerOptions: configmaps \"XXX\" not found"

Cause: no ConfigMap named XXX exists in the namespace
Fix:
Create the ConfigMap again: kubectl create -f configmap.yaml

(4)

Warning FailedMount MountVolume.SetUp failed for volume
"kubernetes.io/secret/ " (spec.Name: "XXXsecret") pod with: secrets
"XXXsecret" not found

Cause: the Secret is missing
Fix:
Create the Secret
kubectl create secret docker-registry secret-name --docker-server=registry-url --docker-username=xxx --docker-password=xxx -n namespace
For the items below, changes to the YAML file take effect only after running kubectl apply -f pod.yaml to recreate the pod

(5)

Normal Killing Killing container with docker id XXX: pod
"XXX" container "XXX" is unhealthy, it will be killed and re-created.
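The container's liveness probe failed, so Kubernetes is killing the problem container

Cause: the probe is misconfigured (for example the health check URL is wrong), or the application is not responding
Fix:
Correct the health check URL if it is wrong; otherwise increase periodSeconds and the related probe values (such as initialDelaySeconds, timeoutSeconds and failureThreshold) in the YAML file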
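
To make the probe fields concrete, here is a minimal sketch that writes a Pod manifest with a relaxed liveness probe to a temporary file. The pod name, image, probe path and the specific numbers are placeholders chosen for illustration, not values from the original article; tune them to how long the application really takes to start and respond.

```shell
# Write a hypothetical Pod manifest with a relaxed liveness probe.
probe_manifest="${TMPDIR:-/tmp}/liveness-probe-example.yaml"

cat > "$probe_manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo              # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25           # placeholder image
    livenessProbe:
      httpGet:
        path: /                 # must be a URL the application really serves
        port: 80
      initialDelaySeconds: 30   # wait before the first probe
      periodSeconds: 20         # probe interval (default is 10)
      timeoutSeconds: 5         # per-probe timeout (default is 1)
      failureThreshold: 5       # consecutive failures before a restart (default is 3)
EOF

echo "wrote $probe_manifest; apply it with: kubectl apply -f $probe_manifest"
```

Raising these values only hides the symptom if the application is genuinely hung, so it is worth checking the container logs first.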

(6)

Warning FailedCreate Error creating: pods "XXXX" is forbidden:
[maximum memory usage per Pod is XXX, but request is XXX, maximum
memory usage per Container is XXX, but request is XXX.]

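Cause: the memory limit configured in K8s (for example a LimitRange on the namespace) is smaller than the memory the pod requests
Fix:
Raise the K8s memory limit, or reduce the memory requested by the pod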

(7)

pod (XXX) failed to fit in any node
fit failure on node (XXX): Insufficient cpu

Cause: no node has enough free CPU for the pod
Fix:

Reduce the amount of CPU the pod requests by editing the YAML file
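
Errors (6) and (7) both come down to the resources section of the container spec. The sketch below writes a sample manifest showing where requests and limits go; the name, image and quantities are illustrative assumptions, not values from the original article.

```shell
# Write a hypothetical Pod manifest with explicit CPU and memory settings.
resources_manifest="${TMPDIR:-/tmp}/resources-example.yaml"

cat > "$resources_manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo          # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25           # placeholder image
    resources:
      requests:                 # what the scheduler reserves on a node
        cpu: 250m
        memory: 128Mi
      limits:                   # hard ceiling enforced at runtime
        cpu: 500m
        memory: 256Mi
EOF

echo "wrote $resources_manifest"
```

The scheduler places a pod only on a node whose unreserved CPU and memory cover the requests, so lowering requests is what resolves "Insufficient cpu".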

(8)

FailedMount Unable to mount volumes for pod "XXX": timeout expired
waiting for volumes to attach/mount for pod "XXX"/"fail". list of
unattached/unmounted volumes=XXX
FailedSync Error syncing pod, skipping: timeout expired waiting for
volumes to attach/mount for pod "XXX"/"fail". list of
unattached/unmounted volumes=XXX

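Cause: mounting a volume for pod XXX failed
Fix:
Check whether the volume has been created and whether the volume's mountPath directory is correct

Create the volume and mount it through a YAML file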

(9)

FailedMount Failed to attach volume "XXX" on node "XXX" with: GCE
persistent disk not found: diskName="XXX disk" zone=""

Fix:

Check that the persistent disk was created correctly

A PersistentVolume for the disk can be defined in a YAML file
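
As a sketch of what such a YAML file might look like, the script below writes a PersistentVolume backed by a GCE persistent disk plus a claim bound to it. It assumes the in-tree `gcePersistentDisk` volume source, which matches the error message above but is deprecated in favor of the CSI driver on newer Kubernetes versions. The names `demo-pv`, `demo-pvc` and `demo-disk` and the size are hypothetical, and the disk itself must already exist in the node's zone.

```shell
# Write a hypothetical PersistentVolume and PersistentVolumeClaim pair.
pv_manifest="${TMPDIR:-/tmp}/persistent-volume-example.yaml"

cat > "$pv_manifest" <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv                 # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""          # empty class: bind statically, no dynamic provisioning
  gcePersistentDisk:
    pdName: demo-disk           # must match an existing disk name
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc                # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ""
  volumeName: demo-pv           # bind to the volume defined above
  resources:
    requests:
      storage: 10Gi
EOF

echo "wrote $pv_manifest"
```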

(10)

error: error validating "XXX.yaml": error validating data: found
invalid field resources for PodSpec; if you choose to ignore
these errors, turn validation off with --validate=false

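Cause: the YAML file is wrong, usually because extra or missing spaces put a field at the wrong indentation level
Fix:
Validate the YAML file
The kubeval tool can be used to validate it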

(11) Container image is not updated

Fix:

Force a pull on every start by setting imagePullPolicy: Always in the Deployment

(12)

(combined from similar events): Readiness probe failed: calico/node
is not ready: BIRD is not ready: BGP not established with: Number of
node(s) with BGP peering established = 0

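Cause: Calico networking on the affected node is down
Fix:
Check that the calico images were pulled successfully and that the calico-node container started normally. If both the images and the container are fine, reset K8s on that node and rejoin the cluster

kubeadm reset
kubeadm join ip:6443 --token XXXXX.XXXXXXXXX --discovery-token-ca-cert-hash sha256:XXXXXXXXXXXXXXXXXXX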

(13)

RunPodSandbox from runtime service failed: rpc error: code = Unknown
desc = failed pulling image "gcr.io/google_containers/pause-amd64:":
Get https://gcr.io/v1/_ping: dial tcp :443: i/o timeout

Cause: gcr.io is blocked by the GFW (Great Firewall)
Fix:
Find a usable mirror of the image, for example from Aliyun or googlecontainer
Then docker tag it as gcr.io/google_containers/pause-amd64

(14)

Warning FailedCreatePodSandBox 3m (x13 over 3m) kubelet, Failed create pod sandbox
Running journalctl -xe | grep cni
shows: failed to find plugin "loopback" in path [/opt/loopback/bin /usr/local/bin]

Fix:

Copy the loopback plugin binary into /usr/local/bin

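2 Node troubleshooting

kubectl get nodes

Check the node status: STATUS Ready is normal, NotReady is not

Note that the VERSION column should be kept consistent across nodes

If a node is NotReady, restart kubelet on that node or restart docker. If that does not help, reset the node and have K8s join it again

View node details

Run kubectl describe node node-name. If it reports "node ip" not found, check whether the node IP responds to ping; the error is caused by the node IP or the VIP being down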
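
The two checks above (NotReady nodes and mismatched versions) can be sketched as small filters over `kubectl get nodes` output. The function names and the sample data are invented for this example, and the default column layout NAME STATUS ROLES AGE VERSION is assumed.

```shell
# Print "name status" for nodes whose STATUS does not start with Ready.
# "Ready,SchedulingDisabled" (a cordoned node) still counts as Ready here.
not_ready_nodes() {
    awk 'NR > 1 { split($2, s, ","); if (s[1] != "Ready") print $1, $2 }'
}

# Print the distinct kubelet versions; more than one line means version skew.
node_versions() {
    awk 'NR > 1 { print $5 }' | sort -u
}

# Against a live cluster (needs kubectl and cluster access):
#   kubectl get nodes | not_ready_nodes
#   kubectl get nodes | node_versions

# Offline demonstration with made-up sample output.
sample_nodes='NAME      STATUS                     ROLES           AGE   VERSION
master1   Ready                      control-plane   90d   v1.28.2
node1     NotReady                   <none>          90d   v1.28.2
node2     Ready,SchedulingDisabled   <none>          30d   v1.27.6'

printf '%s\n' "$sample_nodes" | not_ready_nodes
printf '%s\n' "$sample_nodes" | node_versions
```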

Common node errors and how to handle them:

1.

The connection to the server localhost:8080 was refused - did you specify the right host or port?
This error appears when running kubectl get XXX, for example
kubectl get nodes

Cause: the node has no admin.conf
Fix:
Copy admin.conf from the master to the node
On the node, run echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile

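2.

Kubernetes NodePort is not reachable

Cause: usually iptables or SELinux
Fix:
Disable SELinux enforcement and flush the iptables rules (this removes all existing firewall rules), then restart docker
setenforce 0
iptables --flush
iptables -t nat --flush
service docker restart
iptables -P FORWARD ACCEPT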

3.

Failed to start inotify_add_watch /sys/fs/cgroup/blkio: no space left on device
or
Failed to start inotify_add_watch /sys/fs/cgroup/cpu,cpuacct: no space left on device

Cause: either the disk is full or the inotify watch limit (a kernel parameter) is too low
Fix:
Check whether any disk is 100% full
Check the current limit: cat /proc/sys/fs/inotify/max_user_watches
Raise it: sysctl fs.inotify.max_user_watches=1048576
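
A small helper can make the limit check repeatable. This is only a sketch: the function name `check_inotify_watches` is made up, and the second argument exists so the logic can be exercised against a scratch file instead of the real procfs entry.

```shell
# Report whether the inotify watch limit is below a target value.
# $1: target limit
# $2: file holding the current limit (defaults to the real procfs path)
# Returns 0 if the limit is high enough, 1 if it is too low, 2 if unreadable.
check_inotify_watches() {
    target=$1
    limit_file=${2:-/proc/sys/fs/inotify/max_user_watches}
    current=$(cat "$limit_file") || return 2
    if [ "$current" -lt "$target" ]; then
        echo "max_user_watches=$current is below $target; raise it with:"
        echo "  sysctl -w fs.inotify.max_user_watches=$target"
        return 1
    fi
    echo "max_user_watches=$current is at or above $target"
}

# On a real node:
#   check_inotify_watches 1048576

# Offline demonstration against a scratch file holding a made-up value.
demo_limit_file="${TMPDIR:-/tmp}/max_user_watches.demo"
echo 8192 > "$demo_limit_file"
check_inotify_watches 1048576 "$demo_limit_file" || true
```

A value set with sysctl is lost on reboot; to keep it, the same key is normally also added to /etc/sysctl.conf.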

4.

Failed to start reboot.target: Connection timed out

Cause: unknown; the reboot reports a timeout
Fix:
Run systemctl --force --force reboot

5.

System OOM encountered

Cause: when memory usage exceeds the limit, Kubernetes may kill the container (OOMKilled)
Fix:
Adjust the memory settings and allocate memory sensibly

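6.

Unable to register node "" with API server: Post
https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused

Cause: the node cannot reach the master, or the master refuses the connection
Fix:
Restart kubelet on the node. If that does not help, check CPU, memory, disk and other resources on the node server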

7.

Pod stays in Terminating
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded

Cause: possibly a bug in dockerd version 17
Fix:
systemctl daemon-reexec
systemctl restart docker
If that does not help, upgrade docker to version 18

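8.

Container runtime is down,PLEG is not healthy: pleg was last seen active 10m ago; threshold is 3m0s

Cause: the Pod Lifecycle Event Generator (PLEG) timed out. Either the container runtime responded too slowly to RPC calls, or the node runs so many pods that relist cannot finish within 3 minutes
Fix:
systemctl daemon-reload
systemctl daemon-reexec
systemctl restart docker
Reboot the node server
If none of the above helps, upgrade docker to the latest version
If the problem still persists, upgrade Kubernetes to 1.16 or later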

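9.

No valid private key and/or certificate found, reusing existing private key or creating a new one

Cause: after kubelet starts on a node it requests a certificate from the master through a CSR; the certificate has not been found
Fix:
Approve the certificate request on the master (kubectl get csr, then kubectl certificate approve csr-name)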

10.

failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps containe

Cause: swap is enabled
Fix:
Turn off the swap partition (swapoff -a), then restart kubelet: systemctl restart kubelet

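11.

The node was low on resource: [DiskPressure]

Log in to the node and check its disk usage

Cause: the kubelet on each node periodically collects resource usage data and compares it with preset thresholds. When a threshold is exceeded, the kubelet evicts some pods to reclaim the resource
Fix:
Edit /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf and add the parameter --eviction-hard=nodefs.available<5% to this line:
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Then clean up the disk and restart kubelet (systemctl daemon-reload, then systemctl restart kubelet)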
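
For the "check disk usage" step, the sketch below lists filesystems at or above a usage percentage from `df -P` output. A threshold of 95 roughly mirrors nodefs.available<5%. The function name and the sample data are made up, and mount points containing spaces are not handled.

```shell
# List "mountpoint usage" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 85).
full_filesystems() {
    awk -v limit="${1:-85}" 'NR > 1 {
        use = $5                # Capacity column, e.g. "96%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# On a real node:
#   df -P | full_filesystems 95

# Offline demonstration with made-up sample output.
sample_df='Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51474912 48901166   2573746      96% /
/dev/sdb1        103081248 41232499  61848749      40% /var/lib/docker
tmpfs              8167848        0   8167848       0% /dev/shm'

printf '%s\n' "$sample_df" | full_filesystems 95
```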

12.

Node status is Unknown

Listing processes on the node fails with -bash: fork: Cannot allocate memory
Check whether any memory is still free
Check whether /proc/sys/kernel/pid_max is too small

Fix:

Add memory, or raise /proc/sys/kernel/pid_max

13.

provided port is not in the valid range. The range of valid ports is 30000-32767

Cause: the port is outside the NodePort range; by default a NodePort must fall within 30000-32767
Fix:
Edit /etc/kubernetes/manifests/kube-apiserver.yaml, set --service-node-port-range= to the range you need, and restart the apiserver

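14.

1 node(s) had taints that the pod didn't tolerate

Cause: the node is not schedulable; by default the master is not schedulable
Fix:
Check the node state with kubectl describe nodes
Remove the NoSchedule taint from the node with kubectl taint nodes node-name key:NoSchedule-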

3 Master troubleshooting

Common errors:

1.

unable to fetch the kubeadm-config ConfigMap: failed to get
configmap: Unauthorized

Cause: the token has expired; a token is valid for 24 hours by default
Fix:
Generate a new token on the master node and join the node again
kubeadm token create
The value for --discovery-token-ca-cert-hash can be computed with
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

2.

Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Cause: an authentication error; follow the hints printed on the console
Fix:
As the console output suggests
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

3.

Unable to update cni config: No networks found in /etc/cni/net
Container runtime network not ready: NetworkReady=false
reason:NetworkPluginNotReady message

Cause: no CNI network was found
Fix:
sysctl net.bridge.bridge-nf-call-iptables=1
Install the flannel or calico network

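4.

coredns stays in Pending or ContainerCreating

Cause: a network problem
Fix:
Install the flannel or calico network
If the error is plugin flannel does not support config version, open /etc/cni/net.d/10-flannel.conflist and check whether the cniVersion values are consistent; if they are not, make them consistent or change them to a version that the current K8s supports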

5.

WARNING IsDockerSystemdCheck
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/

Cause: the Docker cgroup driver is not set to systemd
Fix:
Edit or create /etc/docker/daemon.json and add "exec-opts": ["native.cgroupdriver=systemd"], then restart docker
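
For reference, a complete daemon.json containing only this setting could look like the file written below. The sketch writes to a temporary path rather than /etc/docker so that an existing configuration is not overwritten; if daemon.json already has other keys, merge the "exec-opts" entry into it by hand.

```shell
# Write an example daemon.json that switches Docker to the systemd cgroup driver.
daemon_json="${TMPDIR:-/tmp}/daemon.json.example"

cat > "$daemon_json" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF

echo "review $daemon_json, install it as /etc/docker/daemon.json, then run:"
echo "  systemctl restart docker"
echo "  docker info | grep -i cgroup    # should now report systemd"
```

The kubelet's cgroup driver has to match the one Docker uses, so it is worth checking both after the change.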

6.

WARNING FileExisting-socat
[WARNING FileExisting-socat]: socat not found in system path

Cause: socat is not installed
Fix:
yum install socat

7.

Permission denied
cannot create /var/log/fluentd.log: Permission denied

Cause: permission denied
Fix:
This is caused by SELinux. In /etc/selinux/config change SELINUX=enforcing to disabled. If that does not solve it, grant write permission on the directory

8.

The apiserver fails to start and reports an error on every start

Fix:

A ServiceAccount needs to be configured
Create it from a YAML file

9.

repository does not exist or may require 'docker login': denied:
requested access to the resource is denied

Cause: the node has no permission to pull images from harbor
Fix:
Grant access from the master node with kubectl create secret (see pod error (4) for the full command)

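10. etcd fails to start

etcd: raft save state and entries error: open
/var/lib/etcd/default.etcd/member/wal/xxx.tmp: is a directory

Cause: a problem with the files in the etcd member directory
Fix:
Delete the offending tmp files and directories, then restart the etcd service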

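11. etcd node failure

Running etcdctl cluster-health shows an unhealthy member

Cause: etcd on that node has failed
Fix:
Log in to the problem node
systemctl stop etcd
systemctl restart etcd
If it is still unhealthy, delete the data with rm -rf /var/lib/etcd/default.etcd/member/* (remember to back it up first) and restart etcd

To avoid unnecessary problems, operations and development staff should follow agreed conventions when using a K8s cluster, so that failures caused by poor design or improper use are kept to a minimum. The guidelines below are a reference: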

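Kubernetes usage guidelines

Worker nodes are highly available by design, so users only need to plan for master high availability
Enterprises should use a dual-master or multi-master architecture to avoid a single point of failure on the master
NTP time must be synchronized on every node of the cluster
OVS or calico networking is recommended; flannel is not
Use a recent stable version, which has fewer bugs: at least 1.12, which offers ipvs mode instead of iptables only, and that matters for performance
Adopt naming conventions for Namespace, master, node, pod, service and ingress to avoid confusion
Prefer Deployment over RC, since it supports version rollback and similar features; run pods with multiple replicas (set replicas above 1) and release with rolling updates
Manage K8s through YAML files or the dashboard as far as possible; do not rely on ad hoc commands in the long run
Limit each pod's CPU, memory, storage and other resources through YAML files
Avoid exposing pod ports directly on the node; go through a service instead
In the cloud, use a load balancer for service load balancing; a self-hosted K8s can introduce ingress
K8s containers must be monitored; kube-prometheus is recommended
Deploy an agent-based logging service in which a node agent collects logs centrally; do not rely on native K8s logs, and ideally use a sidecar alongside each microservice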
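
The Deployment guideline (multiple replicas, rolling updates, managed through YAML) can be sketched as the manifest written below. The name `web-demo`, the image and the replica and surge numbers are illustrative assumptions, not recommendations from the original article.

```shell
# Write a hypothetical Deployment with several replicas and a rolling update strategy.
deployment_manifest="${TMPDIR:-/tmp}/deployment-example.yaml"

cat > "$deployment_manifest" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-demo                # hypothetical name
spec:
  replicas: 3                   # more than one replica
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1         # at most one pod down during a rollout
      maxSurge: 1               # at most one extra pod during a rollout
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      containers:
      - name: web
        image: nginx:1.25       # placeholder image
        ports:
        - containerPort: 80
EOF

echo "wrote $deployment_manifest; apply it with: kubectl apply -f $deployment_manifest"
```

With a Deployment in place, a bad release can be reverted with kubectl rollout undo deployment/web-demo.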
