At Dawn
Nice and warm morning. Everything is going as planned. Birds singing and all. Suddenly a CI/CD pipeline hung. We had to kill it!!
Logs look weird.
Cannot contact jenkins-slave-6t97t: java.lang.InterruptedException
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/some/fake/location@tmp/durable-9eef23f3
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
Manually restarted it… same thing. Then another, different, unrelated project… Something is really broken.
Then it is the Jenkins master: 503. The reverse proxy cannot reach it. It just died. A few seconds later it is back alive. Startup takes ages. Went to my terminal; these logs are clear.
> kubectl describe pod/jenkins-666xxxxxxx-666xx -n jenkins
Status: Failed
Reason: Evicted
Message: The node was low on resource: nodefs.
We ran out of disk space. Monitoring and alerts were set up in dev and prod, but not in the CI/CD cluster.
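For the record, a couple of quick checks would have surfaced this much earlier: listing failed (evicted) pods and reading the DiskPressure condition straight off the nodes. A sketch, assuming a reasonably recent kubectl (field selectors and JSONPath filters are standard features):
> kubectl get pods --all-namespaces --field-selector=status.phase=Failed
> kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'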
So, what happened? It took me a while to find my way around the cluster. That pod had been running for 113 days without a single issue. Then, out of the blue: Evicted!!
Let's see what Kubernetes knows.
> kubectl describe nodes
Lots of warnings, uh… Summary: the GC is not able to delete some images because they are being used by stopped containers. A quick look:
> kubectl get pods --all-namespaces
NAMESPACE NAME STATUS AGE
ingress default-http... Running 115d
ingress nginx-ingres... Running 94d
jenkins jenkins-6486... Evicted 115d
jenkins jenkins-6486... Running 1d
jenkins jenkins-slav... Error 1d
kube-system heapster-786... Running 27d
There are no stopped pods. What!?
Moment of darkness… Then the light bulb!! The stopped containers are not from Kubernetes. They come straight from Docker. Our Jenkins spins up a DinD (Docker in Docker) pod as build agent so developers can run their builds inside containers. This way no custom tool needs to be installed or updated when someone gets a new idea or a new framework comes to market… or something.
Line up the usual suspects
A DinD pipeline file might look like this:
stage('build') {
    sh(script: """
        docker build . \\
            --file pipelines/Build.Dockerfile \\
            --tag base:$BUILD_NUMBER
    """)
}
The real magic is inside that Dockerfile, for instance:
FROM node:8-alpine
WORKDIR /src
COPY ./package.json ./
COPY ./yarn.lock ./
RUN yarn install
COPY . .
RUN yarn run build
All pipelines have several test stages. They look like this:
stage('test') {
    sh(script: """
        docker run \\
            --rm \\
            base:$BUILD_NUMBER \\
            ash -c pipelines/test.sh
    """)
}
Did you notice the --rm? That's it!! Some pipelines do not include this flag, which means that when the test round is done, the container sticks around until manually deleted. Kubernetes does not know anything about this container; it is not represented in the cluster state. Perhaps there are some low-level tools to detect it. I haven't been able to find them.
Fix it!!
You cannot just knock on the cluster's front door, especially in AKS. The nodes are not really accessible from the outside. You need to set up SSH keys with the Azure CLI, then create a pod… etc. Simpler solution: I have some test pipelines in place, and the DinD agent already has access to the “outer” Docker.
Setup:
stage('sample') {
    sh(script: """
        sleep 5h
    """)
}
Manually run this pipeline and you get a DinD pod for 5 hours. Just get into that pod and voilà… we are in business.
> kubectl exec -it jenkins-slave-666xx -n jenkins -- bash
First things first, evaluate the damage.
> df
Filesystem Use% Mounted on
/dev/something 88% /
Only 88% usage!? I thought it would have been worse. In any case, bad enough.
Check the list of containers:
> docker ps -a
CONTAINER ID STATUS
3f000cecf70d Exited (0) 5 days ago
94beaf1d791b Exited (0) 2 months ago
d0984fc5a35e Exited (0) 2 aeons ago
Lots and lots and lots of stopped containers, some pretty old.
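Which of these are the orphaned build containers? My assumption, with the Docker runtime, is that everything started by the kubelet carries io.kubernetes.* labels while the orphans do not, so a rough way to tell them apart is to print each container next to its pod label; an empty second column means it was started directly against the Docker socket:
> docker ps -a --format '{{.ID}}  {{.Label "io.kubernetes.pod.name"}}'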
If I try deleting all containers, the ones that are not stopped will remain unless -f is specified:
> for c in $(docker ps -qa); do docker rm $c; done
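(For the record, Docker 1.13+ bundles this into a single command; a sketch doing the same thing:)
> docker container prune -f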
A couple of seconds later, no more stopped containers. Time for the images:
> docker image prune
This one took a lil longer. Sanity check:
> df
Filesystem Use% Mounted on
/dev/something 41% /
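Worth noting: docker image prune on its own only removes dangling images. Adding -a would also drop images not referenced by any container, at the cost of losing cached layers for older tags; something to weigh before running it:
> docker image prune -a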
Repeated the process and, luckily enough, hit the second node. Cleaned it. Did it again but ran out of luck. Tried a couple more times, never landed on the third node.
Went through the SSH pain. This command was useful for finding the IPs to SSH to:
> kubectl get nodes -o custom-columns='Name:.metadata.name,Ip:.status.addresses[0].address'
Name Ip
aks-nodepool1-xxx1 10.0.0.1
aks-nodepool1-xxx2 10.0.0.2
aks-nodepool1-xxx3 10.0.0.3
SSH-ing to those IPs gets you to the actual hosts. The remaining hurt one was not the first, obviously. Once there, commands run a lil faster and you must sudo your way through. Everything else, the same:
> for c in $(sudo docker ps -qa); do sudo docker rm $c; done
Lessons learned
Always, always put monitoring and alerts on something you depend on. Always! This cluster might not seem that important, at least not as much as prod. But it is, you see: if it goes down, we cannot do any release, or let's say the release process becomes so much more complicated and error-prone.
All docker run commands in all pipelines must include the --rm flag. Perhaps we can enforce this automatically.
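A crude way to enforce it could be a lint step in the pipeline itself; a naive, line-based sketch (it would miss a --rm sitting on a continuation line), assuming the pipeline scripts live under a hypothetical pipelines/ folder:
if grep -rn 'docker run' pipelines/ | grep -v -- '--rm'; then
    echo "ERROR: docker run without --rm detected"
    exit 1
fi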
Scaling the node pool down to 1 and back up to 3 would have wiped n-1 nodes clean. Something to try next time.
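In AKS that would be something along these lines (a sketch with placeholder resource group and cluster names; note that which nodes get removed on scale-down is up to AKS, so it is not a precise tool):
> az aks scale --resource-group my-rg --name my-aks --node-count 1
> az aks scale --resource-group my-rg --name my-aks --node-count 3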
We need some scheduled cleanup tasks. We are not using Kubernetes in its purest form, so we cannot expect it to deal with problems generated outside of its realm.
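A minimal sketch of such a task, assuming it runs periodically on every node (via cron or a privileged DaemonSet) and that keeping roughly the last ten days of containers and images is acceptable:
> docker system prune --force --filter "until=240h"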