Hell got loose on k8s DinD

Friday, November 23, 2018

At Dawn

Nice and warm morning. Everything is going as planned. Birds singing and all. Suddenly, a CI/CD pipeline hung. We had to kill it!!
[Illustration: painting by John Martin, public domain]

Logs look weird.

Cannot contact jenkins-slave-6t97t: java.lang.InterruptedException
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/some/fake/location@tmp/durable-9eef23f3
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)

Manually restarted it…same. Then another, different, unrelated project…Something is really broken.

Then it's the Jenkins master: 503. The reverse proxy cannot reach it. It just died. A few seconds later it's back alive. Startup takes ages. Went to my terminal, and there the logs are clear:

> kubectl describe pod/jenkins-666xxxxxxx-666xx -n jenkins
# Summary
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: nodefs.

We ran out of disk space. Monitoring and alerts were set up in dev and prod, but not in the CI/CD cluster.

So, what happened? It took me a while to find my way around the cluster. That pod had been running for 113 days without a single issue. Then, out of the blue: Evicted!!

Let's see what Kubernetes knows.

> kubectl describe nodes
# Too much info

Lots of warnings, uh…Summary: the GC is not able to delete some images because they are being used by stopped containers. A quick look:

> kubectl get pods --all-namespaces
NAMESPACE 	NAME 			STATUS 	AGE
ingress 	default-http... Running 115d
ingress 	nginx-ingres... Running 94d
jenkins 	jenkins-6486... Evicted 115d
jenkins		jenkins-6486... Running 1d
jenkins 	jenkins-slav... Error 	1d
kube-system heapster-786... Running 27d
# Some other results

There are no stopped pods. What!?

Moment of darkness…Then the light bulb!! Stopped containers are not from k8s; they come straight from Docker. Our Jenkins spins up a DinD (Docker in Docker) pod as a build agent so developers can run their builds inside containers. This way no custom tool needs to be installed or updated when someone gets a new idea or a new framework hits the market…or something.

Line up the usual suspects

A DinD pipeline file might look like this:

stage('build') {
	sh(script: """
		docker build . \\
			--file pipelines/Build.Dockerfile \\
			--tag base:$BUILD_NUMBER
	""")
}

The real magic is inside that Dockerfile, for instance:

FROM node:8-alpine

WORKDIR /src

COPY ./package.json ./
COPY ./yarn.lock ./
RUN yarn install 

COPY . .
RUN yarn run build

All pipelines have several test stages. They look like this:

stage('test') {
	sh(script: """
		docker run \\
			--rm \\
			base:$BUILD_NUMBER \\
			ash -c pipelines/test.sh
	""")
}

Did you notice the --rm? That's it!! Some pipelines do not include this flag, which means that when the test round is done, the container sticks around until manually deleted. Kubernetes does not know anything about this container; it is not represented in the cluster state. Perhaps there are some low-level tools to detect it, but I haven't been able to find them.
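
Here is a minimal illustration of the difference (the tag 42 is just a made-up build number):

> docker run base:42 ash -c pipelines/test.sh
# No --rm: when the script finishes, the container stays behind as Exited
> docker ps -a --filter status=exited
# …and shows up here, on the node's Docker daemon, invisible to Kubernetes
> docker run --rm base:42 ash -c pipelines/test.sh
# With --rm: Docker deletes the container as soon as it exits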

Fix it!!

You cannot just knock at the cluster's front door. Especially in AKS, the nodes are not really accessible from the outside. You need to set up SSH keys with the Azure CLI, then create a pod…etc. Simpler solution: I have some test pipelines in place, and the DinD agent has access to the “outer” Docker already.

Setup:

stage('sample') {
	sh(script: """
		sleep 5h
	""")
}

Manually run this pipeline and you get a DinD pod for 5 hours. Just get into that pod and voilà…we are in business.

> kubectl exec -it jenkins-slave-666xx -n jenkins -- bash

First things first, evaluate the damage.

> df
Filesystem  	Use% 	Mounted on
/dev/something	88%		/

Only 88% usage!? I thought it would be worse. Bad enough in any case.

Check the list of containers:

> docker ps -a
CONTAINER ID        STATUS                  
3f000cecf70d        Exited (0) 5 days ago
94beaf1d791b        Exited (0) 2 months ago
# Many many many removed lines
d0984fc5a35e        Exited (0) 2 aeons ago

Lots and lots and lots of stopped containers, some pretty old.

If I try deleting all containers this way, running containers will remain untouched unless -f is specified:

> for c in $(docker ps -qa); do docker rm $c; done

A couple of seconds later, no more stopped containers. Time for the images:

> docker image prune

This one took a lil longer. Sanity check:

> df
Filesystem  	Use% 	Mounted on
/dev/something	41%		/

Repeated the process and, luckily enough, hit the second node. Cleaned it. Did it again but ran out of luck. Tried a couple more times, never got to the third node.

Went through the SSH pain. This command was useful for finding the IPs to SSH to:

> kubectl get nodes -o custom-columns='Name:.metadata.name,Ip:.status.addresses[0].address'
Name				Ip
aks-nodepool1-xxx1	10.0.0.1
aks-nodepool1-xxx2	10.0.0.2
aks-nodepool1-xxx3	10.0.0.3

SSH-ing into those IPs gets you to the actual hosts. The remaining hurt one was not the first, obviously. Once there, commands run a lil faster and you must sudo your way through. Everything else, the same:

> for c in $(sudo docker ps -qa); do sudo docker rm $c; done

Lessons learned

Always, always put monitoring and alerts on anything you depend on. Always! This cluster might not seem that important, at least not as much as prod. But it is, you see: if it goes down, we cannot do any release, or let's say the release process becomes so much more complicated and error prone.

Every docker run in every pipeline must include the --rm flag. Perhaps we can enforce this automatically; a rough check is sketched below.
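
A crude heuristic, assuming the pipeline files live under a pipelines/ folder (that path is just an assumption):

> grep -rl 'docker run' pipelines/ | xargs grep -L -e '--rm'
# Lists files that mention docker run but never pass --rm anywhere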

Scaling the node pool down to 1 and back up to 3 would have cleaned up n-1 nodes. Something to try next time.
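
On AKS that would be something along these lines (resource group and cluster name are placeholders):

> az aks scale --resource-group my-rg --name my-aks --node-count 1
> az aks scale --resource-group my-rg --name my-aks --node-count 3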

We need some scheduled cleanup tasks. We are not using Kubernetes in its purest form, so we cannot expect it to deal with problems generated outside of its realm.
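
As a sketch, a periodic job on each node (cron, or a privileged DaemonSet) could run something like this; the one-week retention is arbitrary:

> docker system prune --all --force --filter "until=168h"
# Removes stopped containers and unused images created more than a week ago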

QC your whale

Thursday, November 8, 2018

Motivation

It's a service world, and there are several replicas of the same code executing on several hosts. There might be production errors that look like a deploy did not go through on some of them. When this happens, someone needs to manually check the file versions on all hosts. This might be simple when there are 2 or 3, but there might be a lot more. This sounds like a task for a computer.
[whale illustration]
Now, with continuous integration and delivery, I see a way we can reduce the probability of inconsistencies and detect them way before they become a problem, so we can raise all the necessary alarms.

TL;DR

  1. Using your CI build tool, inject the build number into the Dockerfile and make it visible to the application via an environment variable.
  2. Build a service endpoint returning that environment variable.
  3. Deploy to a staging environment.
  4. Validate the build number on the staging environment.
  5. Deploy to prod.
  6. Validate the build number on prod.

My toys

Long story

I will show you how I solved this problem; these pieces are now part of the standard issue for all of my microservices.

The image…and likeness

Let's say you build your image:

FROM microsoft/dotnet

ARG BUILD_NUMBER

# Dotnet build steps

# Next line is the magic step
ENV Meta__BuildNumber ${BUILD_NUMBER}

ENTRYPOINT ["dotnet", "Service.dll"]

The environment variable naming convention is .NET Core configuration's way: the double underscore maps to the : section separator, so Meta__BuildNumber becomes Meta:BuildNumber. With this magic step we convert the build argument into an environment variable that will be available at runtime.
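
A quick way to see that mapping in action, with a hypothetical image name:

> docker run --rm --env Meta__BuildNumber=209 service:209
# Inside the container, configuration.GetValue<int>("Meta:BuildNumber") returns 209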

Jenkins: The build

stage('build') {
	sh(script: """
		docker build . \\
			--file pipelines/Build.Dockerfile \\
			--build-arg BUILD_NUMBER=$BUILD_NUMBER
	""")
}

Here we pass the argument from the build tool to Docker.

The Service…and longer part

Make sure environment variables are added to your configuration.

	configurationBuilder
		.SetBasePath(env.ContentRootPath)
		.AddJsonFile("appsettings.json")
		.AddJsonFile($"appsettings.{env.EnvironmentName}.json")
		.AddEnvironmentVariables()
		.AddCommandLine(args)

Build a Meta (or any other name) class. An instance of this class will hold the actual value.

public class Meta 
{
	public int BuildNumber { get; set; }
}

Wire it up to dependency injection.

	services
		.Configure<Meta>(configuration.GetSection("Meta"))
		.AddTransient(r => r.GetRequiredService<IOptions<Meta>>().Value);

Create the service endpoint.

[Route("diagnostics")]
public class DiagnosticsController: Controller 
{
	private readonly Meta meta;
	
	public DiagnosticsController(Meta meta) 
	{
		this.meta = meta;
	}

	public IActionResult Get() 
	{
		return Ok(meta);
	}
}

I wanna see it dad!!

|--> curl staging.mycompany.com/diagnostics
{
	"buildNumber": 209
}

The Validator

Build an integration test accessing the endpoint:

public class Integration 
{
	private readonly IConfiguration configuration;
	private readonly string serviceUrl;
	private readonly int expectedBuildNumber;
	private readonly HttpClient http;
	
	public Integration() 
	{
		var currentDirectory = Directory.GetCurrentDirectory();
		configuration = new ConfigurationBuilder()
			.SetBasePath(currentDirectory)
			.AddJsonFile("appsettings.json")
			.AddEnvironmentVariables()
			.Build();
			
		serviceUrl = configuration.GetValue<string>("Service:Url");
		expectedBuildNumber = configuration.GetValue<int>("Meta:BuildNumber");
		http = new HttpClient();
	}

	[Fact]
	[Trait("Category", "Integration")]
	public async Task Service_build_number_is_correct()
	{
		var response = await http.GetAsync($"{serviceUrl}/diagnostics");
		response.EnsureSuccessStatusCode();
		var meta = response.Content.FromJsonBytes<Meta>();

		Assert.Equal(expectedBuildNumber, meta.BuildNumber);
	}
}

This test requires 2 configuration values, Service:Url and Meta:BuildNumber. Both must be available when running the tests.
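
Locally you can supply both the same way the pipeline does later on, through environment variables (URL and build number are just examples):

> export Service__Url=http://staging.mycompany.com
> export Meta__BuildNumber=209
> dotnet test --filter=Category~Integration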

Deploy to staging

At some point the pipeline will deploy to a staging environment. Just sit tight and wait for it to finish. Give it some room just in case application data loading or rolling updates are needed.
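
If you want to be explicit about the waiting, a rough polling loop against the diagnostics endpoint does the trick (URL and sleep time are arbitrary):

> until curl -fs http://staging.mycompany.com/diagnostics > /dev/null; do sleep 10; done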

Jenkins II: The validation

Let's suppose we have built an image for running tests and tagged it Tests:$BUILD_NUMBER. With that image we can execute the validation. Time for the real thing. Run the funky tests, wild boy.

stage('validate staging') {
	sh(script: """
		docker run \\
			--env Service__Url=http://staging.mycompany.com \\
			--env Meta__BuildNumber=$BUILD_NUMBER \\
			Tests:$BUILD_NUMBER \\
			dotnet test --filter=Category~Integration
	""")
}

At this point, if this step comes out green, you are very likely to deploy to production without any issue. If it fails, you might have broken staging, but not production; you are still safe. With the build logs together with your environment's logs, you are very likely to find the issue, fix it and start the pipeline again from the top.

The only truth

If you have a good staging environment, one that is almost identical to prod (the only acceptable differences being sizes and identities), your chances of getting a successful deployment are extremely high.

In case the validation fails: man, you are in trouble. You might have broken prod, and fixing it must be the highest prio. At least, with this validation, you will be able to tell immediately. Make sure you find the root cause of the problem. Don't just fix it. Gather all the logs, reproduce the issue, automate it.

In the end, the only thing that matters is prod. It’s what pays for the party.