troubleshooting-graph
refer

  • error code
kubernetes:
    node:
      - TerminatedAllPods       # Terminated All Pods      (information)
      - RegisteredNode          # Node Registered          (information)*
      - RemovingNode            # Removing Node            (information)*
      - DeletingNode            # Deleting Node            (information)*
      - DeletingAllPods         # Deleting All Pods        (information)
      - TerminatingEvictedPod   # Terminating Evicted Pod  (information)*
      - NodeReady               # Node Ready               (information)*
      - NodeNotReady            # Node not Ready           (information)*
      - NodeSchedulable         # Node is Schedulable      (information)*
      - NodeNotSchedulable      # Node is not Schedulable  (information)*
      - CIDRNotAvailable        # CIDR not Available       (information)*
      - CIDRAssignmentFailed    # CIDR Assignment Failed   (information)*
      - Starting                # Starting Kubelet         (information)*
      - KubeletSetupFailed      # Kubelet Setup Failed     (warning)*
      - FailedMount             # Volume Mount Failed      (warning)*
      - NodeSelectorMismatching # Node Selector Mismatch   (warning)*
      - InsufficientFreeCPU     # Insufficient Free CPU    (warning)*
      - InsufficientFreeMemory  # Insufficient Free Mem    (warning)*
      - OutOfDisk               # Out of Disk              (information)*
      - HostNetworkNotSupported # Host Ntw not Supported   (warning)*
      - NilShaper               # Undefined Shaper         (warning)*
      - Rebooted                # Node Rebooted            (warning)*
      - NodeHasSufficientDisk   # Node Has Sufficient Disk (information)*
      - NodeOutOfDisk           # Node Out of Disk Space   (information)*
      - InvalidDiskCapacity     # Invalid Disk Capacity    (warning)*
      - FreeDiskSpaceFailed     # Free Disk Space Failed   (warning)*
    pod:
      - Pulling           # Pulling Container Image          (information)
      - Pulled            # Ctr Img Pulled                   (information)
      - Failed            # Ctr Img Pull/Create/Start Fail   (warning)*
      - InspectFailed     # Ctr Img Inspect Failed           (warning)*
      - ErrImageNeverPull # Ctr Img NeverPull Policy Violate (warning)*
      - BackOff           # Back Off Ctr Start, Image Pull   (warning)
      - Created           # Container Created                (information)
      - Started           # Container Started                (information)
      - Killing           # Killing Container                (information)*
      - Unhealthy         # Container Unhealthy              (warning)
      - FailedSync        # Pod Sync Failed                  (warning)
      - FailedValidation  # Failed Pod Config Validation     (warning)
      - OutOfDisk         # Out of Disk                      (information)*
      - HostPortConflict  # Host/Port Conflict               (warning)*
    replicationController:
      - SuccessfulCreate    # Pod Created        (information)*
      - FailedCreate        # Pod Create Failed  (warning)*
      - SuccessfulDelete    # Pod Deleted        (information)*
      - FailedDelete        # Pod Delete Failed  (warning)*
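These reasons appear in the cluster event stream; a quick way to filter for them (selector values are examples):
    kubectl get events --field-selector type=Warning
    kubectl get events --field-selector reason=FailedMount --all-namespaces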

Nodes

  • individual node shuts down
    pods on that node stop running
  • Node Problem Detector (DaemonSet)
    Collects node problems from the node's daemons and reports them to the apiserver as NodeConditions and Events
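    To see the conditions and events a node currently reports (node name is a placeholder):
    kubectl describe node <node-name> | grep -A10 Conditions
    kubectl get events --field-selector involvedObject.kind=Node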

Network

  • network partition
    partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down
  • IP Address Shortage
    refer
/var/log/aws-routed-eni/ipamd.log
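    A quick exhaustion check (AWS VPC CNI assumed; node name is a placeholder): compare allocatable pod slots with what is scheduled:
    kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'
    kubectl get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l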

Apiserver

  • server shutdown
    unable to stop, update, or start new pods, services, or replication controllers; existing pods and services continue to work normally unless they depend on the Kubernetes API
  • apiserver backing storage lost
    apiserver should fail to come up
    kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    manual recovery or recreation of apiserver state necessary before apiserver is restarted
  • kubectl get cs
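    'cs' is short for componentstatuses, which is deprecated in recent releases; the apiserver health endpoints are a more current check:
    kubectl get --raw='/readyz?verbose'
    kubectl get --raw='/livez?verbose'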

Supporting services
node controller, replication controller manager, scheduler, etc.; a failure here looks much like an apiserver outage

kubelet

  • software fault
    crashing kubelet cannot start new pods on the node
    the kubelet may or may not delete the pods
    node marked unhealthy
    replication controllers start new pods elsewhere
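    On the node itself, a systemd-managed kubelet can be checked with:
    systemctl status kubelet
    journalctl -u kubelet --since "10 min ago"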

kube-proxy
service -> pods (programs iptables/IPVS rules on each node)
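  If service VIPs break on one node, check kube-proxy and the rules it programs (iptables mode and kubeadm-style labels assumed):
    kubectl -n kube-system logs -l k8s-app=kube-proxy
    iptables-save | grep <service-name>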

volume

  • detach
    kubectl get volumeattachment
    kubectl get pvc
    kubectl describe pv <> # find volume ID from 'VolumeHandle' tag
    kubectl patch pvc data00-# -p '{"metadata":{"finalizers":null}}'
    
  • expand
    expand the PVC; the bound PV is resized if the StorageClass allows expansion (sketch below)
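    a sketch, assuming the StorageClass has allowVolumeExpansion: true (PVC name and size are placeholders):
    kubectl patch pvc data00-0 -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
    kubectl get pvc data00-0 -w    # wait for the resize condition to clear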

  • missing pvc, reuse pv by new pvc
    refer
    refer2
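    the usual sequence (PV name is a placeholder): keep the PV after the PVC is gone, clear its claimRef, then create a new PVC that matches it:
    kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
    kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'   # PV returns to Available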

  • volume snapshots
    volume snapshot
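    a minimal sketch, assuming a CSI driver plus the external-snapshotter CRDs are installed (class/PVC names are placeholders):
    # volumesnapshot.yaml
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: data-snap
    spec:
      volumeSnapshotClassName: csi-snapclass
      source:
        persistentVolumeClaimName: data00-0
    kubectl apply -f volumesnapshot.yaml
    kubectl get volumesnapshot data-snap -o jsonpath='{.status.readyToUse}'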

Logs

  • Master
    /var/log/kube-apiserver.log - API Server, responsible for serving the API
    /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
    /var/log/kube-controller-manager.log - Controller that manages replication controllers
  • Worker Nodes
    /var/log/kubelet.log - Kubelet, responsible for running containers on the node
    /var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing

Troubleshoot

Internet -- LB -- k8s ingress -- k8s service -- k8s pods
  • cluster
    kubectl top nodes
    kubectl top pods -A --sort-by=memory

  • k8s scheduling
    1. FailedScheduling
      0/200 nodes are available -> 200 nodes in total
      90 have taint XXX -> pod does not tolerate the taint
      50 did not match Pod's node affinity/selector -> affinity/selector mismatch
      40 unschedulable -> nodes cordoned (spec.unschedulable)
      10 had insufficient memory
      1 had insufficient cpu
      9 had volume node affinity conflict
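      to see which bucket applies, read the pod Events and check taints/cordons (names are placeholders):
      kubectl describe pod <pod>    # Events section carries the FailedScheduling breakdown
      kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints,UNSCHED:.spec.unschedulable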
      
  • ingress
    tracing through an ingress
    ingress and traffic flow

  • pods
    1. terminationMessagePath
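    the message written there lands in the pod status; one way to read it (go-template as in the upstream debug docs):
    kubectl get pod <pod> -o go-template='{{range .status.containerStatuses}}{{.lastState.terminated.message}}{{end}}'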

Pending: Generally this is because there are insufficient resources of one type or another that prevent scheduling. Look at the output of the kubectl describe …

Waiting: If a Pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can’t run on that machine. The most common cause of Waiting pods is a failure to pull the image. There are three things to check: the image name is correct, the image has been pushed to the registry, and the image can be pulled manually from the node (e.g. docker pull <image>).

  1. Crashing or unhealthy:
    kubectl logs ${POD_NAME} ${CONTAINER_NAME}
    kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
    kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} … ${ARGN}

  2. CrashLoopBackOff
    refer
    refer2
    Gather information
    Examine Events section in describe output
    Check the exit code (see the jsonpath sketch after this list)
    Check readiness/liveness probes
    Check common application issues
    
  3. Pod is running without doing what I told it to do:
    kubectl apply --validate -f mypod.yaml
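  A sketch for the exit-code check referenced above (pod name and container index are placeholders):
    kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
    kubectl describe pod <pod> | grep -A3 'Last State'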
1. Pending: object was created and stored in etcd, but failed to be scheduled. Resources not enough? cpu/mem/gpu, hostPort...
2. Waiting or ContainerCreating: image pull failed / CNI failure / failed to create pod sandbox / input/output error
3. ImagePullBackOff: pull failed
4. CrashLoopBackOff: started, then crashed/exited again --> kubectl logs --previous / kubelet and container logs on the node
5. Error: failed at startup, e.g. missing cm, secret, pv, or blocked by limitrange, securitypolicy, rbac
6. Terminating or Unknown: the node the pod is running on cannot be reached
7. InvalidImageName
8. ImageInspectError: image verification failed
9. ErrImageNeverPull
10. ErrImagePull
11. CreateContainerConfigError
12. CreateContainerError
13. m.internalLifecycle.PreStartContainer
14. RunContainerError
15. PostStartHookError
16. ContainersNotInitialized
17. ContainersNotReady
18. ContainerCreating
19. PodInitializing
20. DockerDaemonNotReady
21. NetworkPluginNotReady
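To see which of these states a pod is in and why (names are placeholders):
    kubectl get pods -A --field-selector=status.phase=Pending
    kubectl get pod <pod> -o jsonpath='{.status.phase}{"\n"}{.status.containerStatuses[0].state}'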
  • replication controllers
    kubectl describe rc ${CONTROLLER_NAME}

  • services
    Services provide load balancing across a set of pods.
    kubectl get endpoints ${SERVICE_NAME}

  • traffic is not forwarded
    If you can connect to the service, but the connection is immediately dropped, and there are endpoints in the endpoints list, it’s likely that the proxy can’t contact your pods.
    Are your pods working correctly? Look for restart count, and debug pods.
    Can you connect to your pods directly? Get the IP address for the Pod, and try to connect directly to that IP.
    Is your application serving on the port that you configured? Kubernetes doesn’t do port remapping, so if your application serves on 8080, the containerPort field needs to be 8080.
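    A throwaway client pod makes the direct-connect check concrete (image and port are examples):
    POD_IP=$(kubectl get pod <pod> -o jsonpath='{.status.podIP}')
    kubectl run -it --rm debug --image=busybox --restart=Never -- wget -qO- http://$POD_IP:8080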

  • containers
  • nsenter
  • container_id: hash
    docker container ls -f name=logger -aq --no-trunc
    ls /var/lib/docker/containers/[hash]
    Docker captures the stdout and stderr streams from detached containers and forwards them to the configured logging driver.

ctr containers - manage containers

  • docker events
    docker events --filter 'event=stop'

  • network connection
    kubectl get pods -o wide
    kubectl port-forward NAME 8080 &
    curl localhost:8080
    
    docker ps | grep pod-name
    PID=$(docker inspect --format '{{.State.Pid}}' ContainerID)
    nsenter -t ${PID} -n ip addr/ping/curl/nslookup/dig
    
    kubectl exec -it POD -- ping -c 2 $ip
    
    sudo conntrack -L | grep $ip
    
  • pod exit code
    0: successful exit, e.g. a completed Job
    1: general program failure
    137: SIGKILL, kill -9 or OOMKilled
    139: SIGSEGV, code problem
    143: SIGTERM, e.g. docker stop
    126: permission problem, or command cannot execute
    127: command not found, e.g. a typo in a shell script
    1 or 255: generic/unspecified application error
    
  • memory request, oom
    refer
    $ docker ps | grep busy | cut -d' ' -f1
    8fec6c7b6119
    $ docker inspect 8fec6c7b6119 --format '{{.HostConfig.Memory}}'
    104857600
    $ ps ax | grep /bin/sh
     29532 ?      Ss     0:00 /bin/sh -c while true; do sleep 2; done
    $ sudo cat /proc/29532/cgroup
    ...
    6:memory:/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11
    $ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.limit_in_bytes
    104857600
    
  • cpu request and limit, throttling
    refer
    100m = 0.1 CPU core
    limits are enforced by the CFS bandwidth controller: quota (cpu.cfs_quota_us) per period (cpu.cfs_period_us, 100ms by default)
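    whether throttling actually bites shows up in the node's cgroup stats (cgroup v1 layout assumed; ids are placeholders):
    cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<container-id>/cpu.stat
    # rising nr_throttled / throttled_time means the CFS quota is being hit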
    
  • others
    k8s failure stories
    faq