Hell broke loose on k8s DinD

Friday, November 23, 2018

At Dawn

A nice, warm morning. Everything was going as planned. Birds singing and all. Suddenly a CI/CD pipeline hung. We had to kill it!!
[Image: painting by John Martin, public domain]

Logs look weird.

Cannot contact jenkins-slave-6t97t: java.lang.InterruptedException
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/some/fake/location@tmp/durable-9eef23f3
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)

Manually restarted it… same thing. Then another, different, unrelated project… Something was really broken.

Then it was the Jenkins master’s turn: 503. The reverse proxy could not reach it. It had just died. A few seconds later it was back alive, but startup took ages. I went to my terminal, and there the logs were clear:

> kubectl describe pod/jenkins-666xxxxxxx-666xx -n jenkins
# Summary
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: nodefs.

We had run out of disk space. Monitoring and alerts were set up in dev and prod, but not in the CI/CD cluster.
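For the record, a quick way to gauge how widespread the evictions were is to look for failed pods and eviction events. A rough sketch, not the exact commands from that morning:

> kubectl get pods --all-namespaces | grep Evicted
> kubectl get events -n jenkins | grep -i evict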

So, what happened? It took me a while to find my way around the cluster. That pod had been running for 113 days without a single issue. Then, out of the blue: Evicted!!

Let’s see what Kubernetes knows.

> kubectl describe nodes
# Too much info
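Too much info indeed. Grepping the dump for the disk-related bits helps narrow things down; a rough filter (the exact wording of the warnings may vary by Kubernetes version):

> kubectl describe nodes | grep -iE 'pressure|evict|garbage|image'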

Lots of warnings, uh… Summary: the GC is not able to delete some images because they are being used by stopped containers. A quick look:

> kubectl get pods --all-namespaces
NAMESPACE     NAME              STATUS    AGE
ingress       default-http...   Running   115d
ingress       nginx-ingres...   Running   94d
jenkins       jenkins-6486...   Evicted   115d
jenkins       jenkins-6486...   Running   1d
jenkins       jenkins-slav...   Error     1d
kube-system   heapster-786...   Running   27d
# Some other results

There are no stopped pods. What!?

A moment of darkness… then the light bulb!! The stopped containers are not from k8s; they come straight from Docker. Our Jenkins spins up a DinD (Docker in Docker) pod as a build agent so developers can run their builds inside containers. This way no custom tool needs to be installed or updated when someone gets a new idea or a new framework comes to market… or something.

Line up the usual suspects

A DinD pipeline file might look like this:

stage('build') {
	sh(script: """
		docker build . \\
			--file pipelines/Build.Dockerfile \\
			--tag base:$BUILD_NUMBER
	""")
}

The real magic is inside that Dockerfile, for instance:

FROM node:8-alpine

WORKDIR /src

COPY ./package.json ./
COPY ./yarn.lock ./
RUN yarn install 

COPY . .
RUN yarn run build

All pipelines have several test stages. They look like this:

stage('test') {
	sh(script: """
		docker run \\
			--rm \\
			base:$BUILD_NUMBER \\
			ash -c pipelines/test.sh
	""")
}

Did you notice the --rm? That’s it!! Some pipelines do not include this flag, which means that when the test round is done, the container sticks around until manually deleted. Kubernetes does not know anything about this container; it is not represented in the cluster state. Perhaps there are some low-level tools to detect it, but I haven’t been able to find them.
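As a quick illustration (using a throwaway image as a stand-in), a run without --rm leaves an Exited entry behind on the Docker host:

> docker run alpine:3.8 true
> docker ps -a --filter status=exited

With --rm, that leftover never shows up.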

Fix it!!

You cannot just knock on the cluster’s front door, especially in AKS. The nodes are not really accessible from the outside: you need to set up SSH keys with the Azure CLI, then create a pod… etc. Simpler solution: I have some test pipelines in place, and the DinD agent already has access to the “outer” Docker.

Setup:

stage('sample') {
	sh(script: """
		sleep 5h
	""")
}

Manually run this pipeline and you get a DinD pod for 5h. Just get into that pod and voilà… we are in business.

> kubectl exec -it jenkins-slave-666xx -n jenkins -- bash

First things first: evaluate the damage.

> df
Filesystem  	Use% 	Mounted on
/dev/something	88%		/

Only 88% usage!? I thought it would have been worse. In any case, bad enough.

Check the list of containers:

> docker ps -a
CONTAINER ID        STATUS                  
3f000cecf70d        Exited (0) 5 days ago
94beaf1d791b        Exited (0) 2 months ago
# Many many many removed lines
d0984fc5a35e        Exited (0) 2 aeons ago

Lots and lots and lots of stopped containers, some pretty old.

If I try deleting all containers, the running ones will simply refuse and remain, since docker rm does not touch them unless -f is specified:

> for c in $(docker ps -qa); do docker rm $c; done

A couple of seconds later there were no more stopped containers. Time for the images:

> docker image prune

This one took a lil longer. Sanity check:

> df
Filesystem  	Use% 	Mounted on
/dev/something	41%		/
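In hindsight, on any Docker release new enough to have the prune subcommands, the rm loop and the image cleanup probably collapse into one line; a sketch:

> docker container prune -f && docker image prune -f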

Repeated the process and, luckily, the new agent pod landed on the second node. Cleaned it. Did it again but ran out of luck; tried a couple more times and never landed on the third node.

So I went through the SSH pain. This command was useful for finding the IPs to SSH to:

> kubectl get nodes -o custom-columns='Name:.metadata.name,Ip:.status.addresses[0].address'
Name				Ip
aks-nodepool1-xxx1	10.0.0.1
aks-nodepool1-xxx2	10.0.0.2
aks-nodepool1-xxx3	10.0.0.3
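From a machine inside the virtual network, and assuming the SSH key was already added with the Azure CLI and the default admin user is in place, the hop looks roughly like this (user name and key path are assumptions):

> ssh -i ~/.ssh/id_rsa azureuser@10.0.0.3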

SSHing into those IPs got me onto the actual hosts. The remaining hurt one was not the first, obviously. Once there, commands run a lil faster and you must sudo your way through. Everything else is the same:

> for c in $(sudo docker ps -qa); do sudo docker rm $c; done

Lessons learned

Always, always put monitoring and alerts on anything you depend on. Always! This cluster might not seem that important, at least not as much as prod. But it is, you see: if it goes down, we cannot do any release, or let’s say the release process becomes much more complicated and error prone.

Every docker run in every pipeline must include the --rm flag. Perhaps we can enforce this automatically.
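One crude way to enforce it could be a tiny wrapper script that pipelines call instead of docker run directly; a sketch (the script name and location are made up):

#!/bin/sh
# pipelines/docker-run.sh - thin wrapper that never leaves a stopped container behind
exec docker run --rm "$@"

A test stage would then call pipelines/docker-run.sh base:$BUILD_NUMBER ash -c pipelines/test.sh, and forgetting --rm stops being possible.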

Scaling the node pool down to 1 and back up to 3 would have replaced n-1 nodes with fresh ones. Something to try next time.
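With the Azure CLI that would presumably be something like this (resource group and cluster names are placeholders):

> az aks scale --resource-group my-rg --name my-aks --node-count 1
> az aks scale --resource-group my-rg --name my-aks --node-count 3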

We need some scheduled cleanup tasks. We are not using Kubernetes in its purest form, so we cannot expect it to deal with problems generated outside its realm.
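A low-effort version could be a scheduled Jenkins job on the same DinD agent, pruning whatever node it happens to land on; something along these lines (the one-week until filter is just a guess at a sensible value):

> docker container prune -f --filter "until=168h"
> docker image prune -f --filter "until=168h"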
