At Dawn
Nice and warm morning. Everything is going as planned. Birds singing and all. Suddenly a CI/CD pipeline hung. We had to kill it!!
Logs look weird.
Cannot contact jenkins-slave-6t97t: java.lang.InterruptedException
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/some/fake/location@tmp/durable-9eef23f3
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
Manually restarted it… same thing. Then another, different, unrelated project… Something is really broken.
Then it is the Jenkins master: 503. The reverse proxy cannot reach it. It just died. A few seconds later it is back alive. Startup takes ages. Went to my terminal; these logs are clear.
> kubectl describe pod/jenkins-666xxxxxxx-666xx -n jenkins
Status: Failed
Reason: Evicted
Message: The node was low on resource: nodefs.
We ran out of disk space. Monitoring and alerts were set up in dev and prod, but not in the CI/CD cluster.
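For the record, a couple of quick checks would have surfaced this much earlier: listing failed (evicted) pods and reading the DiskPressure condition straight off the nodes. A sketch, assuming a reasonably recent kubectl (field selectors and JSONPath filters are standard features):
> kubectl get pods --all-namespaces --field-selector=status.phase=Failed
> kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'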
So, what happened? It took me a while to find my way around the cluster. That pod had been running for 113 days without a single issue. Then, out of the blue: Evicted!!
Let's see what Kubernetes knows.
> kubectl describe nodes
Lots of warnings, uh… Summary: the GC is not able to delete some images because they are being used by stopped containers. A quick look:
> kubectl get pods --all-namespaces
NAMESPACE NAME STATUS AGE
ingress default-http... Running 115d
ingress nginx-ingres... Running 94d
jenkins jenkins-6486... Evicted 115d
jenkins jenkins-6486... Running 1d
jenkins jenkins-slav... Error 1d
kube-system heapster-786... Running 27d
There are no stopped pods. What!?
Moment of darkness… Then the light bulb!! The stopped containers are not from Kubernetes. They come straight from Docker. Our Jenkins spins up a DinD (Docker in Docker) pod as build agent so developers can run their builds inside containers. This way no custom tool needs to be installed or updated when someone gets a new idea or a new framework comes to market… or something.
Line up the usual suspects
A DinD pipeline file might look like this:
stage('build') {
    sh(script: """
        docker build . \\
            --file pipelines/Build.Dockerfile \\
            --tag base:$BUILD_NUMBER
    """)
}
The real magic is inside that Dockerfile, for instance:
FROM node:8-alpine
WORKDIR /src
COPY ./package.json ./
COPY ./yarn.lock ./
RUN yarn install
COPY . .
RUN yarn run build
All pipelines have several test stages. They look like this:
stage('test') {
    sh(script: """
        docker run \\
            --rm \\
            base:$BUILD_NUMBER \\
            ash -c pipelines/test.sh
    """)
}
Did you notice the --rm? That's it!! Some pipelines do not include this flag, which means that when the test round is done, the container sticks around until manually deleted. Kubernetes does not know anything about this container; it is not represented in the cluster state. Perhaps there are some low-level tools to detect it. I haven't been able to find them.
Fix it!!
You cannot just knock on the cluster's front door, especially in AKS. The nodes are not really accessible from the outside. You need to set up SSH keys with the Azure CLI, then create a pod… etc. Simpler solution: I have some test pipelines in place, and the DinD agent already has access to the “outer” Docker.
Setup:
stage('sample') {
    sh(script: """
        sleep 5h
    """)
}
Manually run this pipeline and you get a DinD pod for 5 hours. Just get into that pod and voilà… we are in business.
> kubectl exec -it jenkins-slave-666xx -n jenkins -- bash
First things first, evaluate the damage.
> df
Filesystem Use% Mounted on
/dev/something 88% /
Only 88% usage!? I thought it would have been worse. In any case, bad enough.
Check the list of containers:
> docker ps -a
CONTAINER ID STATUS
3f000cecf70d Exited (0) 5 days ago
94beaf1d791b Exited (0) 2 months ago
d0984fc5a35e Exited (0) 2 aeons ago
Lots and lots and lots of stopped containers, some pretty old.
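Which of these are the orphaned build containers? My assumption, with the Docker runtime, is that everything started by the kubelet carries io.kubernetes.* labels while the orphans do not, so a rough way to tell them apart is to print each container next to its pod label; an empty second column means it was started directly against the Docker socket:
> docker ps -a --format '{{.ID}}  {{.Label "io.kubernetes.pod.name"}}'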
If I try deleting all containers, the ones that are not stopped will remain unless -f is specified:
> for c in $(docker ps -qa); do docker rm $c; done
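(For the record, Docker 1.13+ bundles this into a single command; a sketch doing the same thing:)
> docker container prune -f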
A couple of seconds later, no more stopped containers. Time for the images:
> docker image prune
This one took a lil longer. Sanity check:
> df
Filesystem Use% Mounted on
/dev/something 41% /
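Worth noting: docker image prune on its own only removes dangling images. Adding -a would also drop images not referenced by any container, at the cost of losing cached layers for older tags; something to weigh before running it:
> docker image prune -a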
Repeated the process and, luckily enough, hit the second node. Cleaned it. Did it again but ran out of luck. Tried a couple more times, never landed on the third node.
Went through the SSH pain. This command was useful for finding the IPs to SSH to:
> kubectl get nodes -o custom-columns='Name:.metadata.name,Ip:.status.addresses[0].address'
Name Ip
aks-nodepool1-xxx1 10.0.0.1
aks-nodepool1-xxx2 10.0.0.2
aks-nodepool1-xxx3 10.0.0.3
SSH-ing to those IPs gets you to the actual hosts. The remaining hurt one was not the first, obviously. Once there, commands run a lil faster and you must sudo your way through. Everything else, the same:
> for c in $(sudo docker ps -qa); do sudo docker rm $c; done
Lessons learned
Always, always put monitoring and alerts on something you depend on. Always! This cluster might not seem that important, at least not as much as prod. But it is, you see: if it goes down, we cannot do any release, or let's say the release process becomes so much more complicated and error-prone.
All docker run commands in all pipelines must include the --rm flag. Perhaps we can enforce this automatically.
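A crude way to enforce it could be a lint step in the pipeline itself; a naive, line-based sketch (it would miss a --rm sitting on a continuation line), assuming the pipeline scripts live under a hypothetical pipelines/ folder:
if grep -rn 'docker run' pipelines/ | grep -v -- '--rm'; then
    echo "ERROR: docker run without --rm detected"
    exit 1
fi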
Scaling the node pool down to 1 and back up to 3 would have wiped n-1 nodes clean. Something to try next time.
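In AKS that would be something along these lines (a sketch with placeholder resource group and cluster names; note that which nodes get removed on scale-down is up to AKS, so it is not a precise tool):
> az aks scale --resource-group my-rg --name my-aks --node-count 1
> az aks scale --resource-group my-rg --name my-aks --node-count 3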
We need some scheduled cleanup tasks. We are not using Kubernetes in its purest form, so we cannot expect it to deal with problems generated outside of its realm.
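A minimal sketch of such a task, assuming it runs periodically on every node (via cron or a privileged DaemonSet) and that keeping roughly the last ten days of containers and images is acceptable:
> docker system prune --force --filter "until=240h"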