
Healthchecks in a Docker Swarm

This is a very geeky post for those who might be Googling for particular details of Linux containerisation technologies. Others, please feel free to ignore! We were searching for this information online today and couldn’t find it, so I thought I’d post it myself for the benefit of future travellers…

How happy are your containers?

In your Dockerfile, you can specify a HEALTHCHECK: a command that will be run periodically within the container to ascertain whether it seems to be basically happy.

A typical example for a container running a web server might try and retrieve the front page with curl, and exit with an error code if that fails. Something like this, perhaps:

HEALTHCHECK CMD /usr/bin/curl --fail http://localhost/ || exit 1

This will be called periodically by the Docker engine — every 30 seconds, by default — and if you look at your running containers, you can see whether the healthcheck is passing in the ‘STATUS’ field:

$ docker ps
CONTAINER ID   IMAGE           CREATED          STATUS                     NAMES
c9098f4d1933   website:latest  34 minutes ago   Up 33 minutes (healthy)    website_1
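
If you want more detail than that one-word summary, docker inspect can show the full healthcheck state, including the results of recent checks; CONTAINER_ID below is just a placeholder for one of your own container IDs:

docker inspect --format '{{json .State.Health}}' CONTAINER_ID

This prints the current status, the failing streak, and the output of the last few checks.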

Now, you can configure this healthcheck in various ways and examine its state through the command line and other Docker utilities and APIs, but I had always thought that it wasn’t actually used for anything by Docker. But I was wrong.

If you are using Docker Swarm (which, in my opinion, not enough people do), then the swarm ensures that an instance of your container keeps running in order to provide your ‘service’. Or it may run several instances, if you’ve told the swarm to create more than one replica. If a container dies, it will be restarted, to ensure that the required number of replicas exist.

But a container doesn’t have to die in order to undergo this reincarnation. If it has a healthcheck and the healthcheck fails repeatedly, a container will be killed off and restarted by the swarm. This is a good thing, and just how it ought to work. But it’s remarkably hard to find any documentation which specifies this, and you can find disagreement on the web as to whether this actually happens, partly, I expect, because it doesn’t happen if you’re just running docker-compose.

But my colleague Nicholas and I saw some of our containers dying unexpectedly, wondered if this might be the issue, and decided to test it, as follows…

First, we needed a minimal container where we could easily change the healthcheck status. Here’s our Dockerfile:

FROM bash
RUN echo hi > /tmp/t
HEALTHCHECK CMD test -f /tmp/t
CMD bash -c "sleep 5h"

and we built our container with

docker build -t swarmtest .

When you start up this exciting container, it just goes to sleep for five hours. But it contains a little file called /tmp/t, and as long as that file exists, the healthcheck will be happy. If you then use docker exec to go into the running container and delete that file, its state will eventually change to unhealthy.

If you’re trying this, you need to be a little bit patient. By default, the check runs every 30 seconds, starting 30s after the container is launched. Then you go in and delete the file, and after the healthcheck has failed three times, it will be marked as unhealthy. If you don’t want to wait that long, there are some extra options you can add to the HEALTHCHECK line to speed things up.
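
For example, something along these lines (the exact values are just for illustration) would run the check every five seconds and give up after two failures:

HEALTHCHECK --interval=5s --timeout=3s --retries=2 CMD test -f /tmp/t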

OK, so let’s create a docker-compose.yml file to make use of this. It’s about as small as you can get:

version: '3.8'

services:
  swarmtest:
    image: swarmtest

You can run this using docker-compose (or, now, without the hyphen):

docker compose up

or as a swarm stack using:

docker stack deploy -c docker-compose.yml swarmtest

(You don’t need some big infrastructure to use Docker Swarm; that’s one of its joys. It can manage large numbers of machines, but if you’re using Docker Desktop, for example, you can just run docker swarm init to enable Swarm on your local laptop.)

In either case, you can then use docker ps to find the container’s ID and start the healthcheck failing with

docker exec CONTAINER_ID rm /tmp/t

And so here’s a key difference between running something under docker compose and running it with docker stack deploy. With the former, after a couple of minutes, you’ll see the container change to ‘(unhealthy)’, but it will continue to run. The healthcheck is mostly just an extra bit of decoration; possibly useful, but it can be ignored.

With Docker Swarm, however, you’ll see the container marked as unhealthy, and shortly afterwards it will be killed off and restarted. So, yes, healthchecks matter if you’re running Docker Swarm. If your container has been built to include one and, for some reason, you don’t want to use it, you need to disable it explicitly in the YAML file; otherwise your containers may be restarted every couple of minutes.
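
If you do want to switch it off, the compose file format has an explicit flag for this; something like the following fragment (sketched for the swarmtest example above) should do it:

services:
  swarmtest:
    image: swarmtest
    healthcheck:
      disable: true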

Finally, if you have a service that takes a long time to start up (perhaps because it’s doing a data migration), you may want to configure the ‘start period’ of the healthcheck, so that it stays in ‘starting’ mode for longer and doesn’t drop into ‘unhealthy’, where it might be killed off before finishing.
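
On the Dockerfile side, that’s the --start-period option; for instance (the two minutes here is just a guess at a plausible migration time):

HEALTHCHECK --start-period=2m CMD /usr/bin/curl --fail http://localhost/ || exit 1

Checks that fail during the start period don’t count towards the retry limit.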

Flushing your DNS cache

OK – a really geeky little tutorial, this one. If you’ve never felt the urge to flush your DNS cache, then don’t worry, that’s quite normal, many people live long and happy lives without ever doing so, and you should feel free to ignore this post and go about your other business.

A little bit of background…

DNS lookups, as many of my readers will be aware, are cached. The whole DNS system would crumble and fall if, whenever your PC needed to look up statusq.org, say, it had to go back to the domain’s name server to discover that the name corresponded to the IP address 74.55.156.82. It would need to do this before it could even start to get anything from the server, so every connection would also be painfully slow. To avoid this, the information, once you’ve asked for it the first time, is probably cached by your browser, and your machine, and, if you’re at work, your company’s DNS server, and their ISP’s DNS server… and it’s only if none of those know the answer that it will go back to the statusq.org domain’s official name server – GoDaddy, in this case – to find out what’s what.

Of course, all machines need to do that from time to time, anyway, because the information may change and their copy may be out of date. Each entry in the DNS system therefore can be given a TTL – a ‘Time To Live’ – which is guidance on how frequently the cached information should be flushed away and re-fetched from the source.

On GoDaddy, this defaults to one hour – really rather a short period, and since they’re the largest DNS registrar, this probably causes a lot of unnecessary traffic on the net as a whole. If you’re confident that your server is going to stay on the same IP address for a good long time, you should set your TTLs to something more substantial – perhaps a day, or even a week. This will help to distribute the load on the network, reduce the likelihood of failed connections, and, on average, speed up interactions with your server. The reason people don’t regularly set their TTL to something long is that, when you do need to change the mapping, perhaps because your server has died and you’ve had to move to a new machine, the old values may hang around in everybody’s caches for quite a while, and that can be a nuisance.

It’s useful to think about this when making DNS changes, because you, at least, will want to check fairly swiftly that the new values work OK. There’s nothing worse than making a typo in the IP address of an entry with a long TTL, and having all of your customers going to somebody else’s site for a week.

So, if you know you’re going to be making changes in the near future, shorten the TTL on your entries a few days in advance. Machines picking up the old value will then know to treat it as more temporary. You can lengthen the TTLs again once you know everything is happy.
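
If you want to see what TTL is actually being handed out for a record (and how long a cached answer has left), dig will show you; for example, assuming you have dig installed:

dig statusq.org A

The second field of each line in the ANSWER section is the remaining TTL, in seconds.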

Secondly, just before you make the change, try to avoid using the old address, for example by pointing your browser at it. This goes for new domains, too – the domain provider will probably set the DNS entry to point at some temporary page initially – and if you try out your shiny new domain name immediately, you’ll then have to wait a couple of hours before you can access your real server that way. Make the DNS change immediately, before your machine has ever looked the name up and so put it in its own cache and any intervening ones.

Finally, once you’ve made a change, you may be able to encourage your machines to use the new value more quickly by flushing their local caches. This won’t help so much if they are retrieving it via an ISP’s caching proxy, for example, but it’s worth a try.

Here’s how you can use the command line to flush the cache on a few different platforms. Please feel free to add any others in the comments:

On recent versions of Mac OS X:

sudo dscacheutil -flushcache

On older versions of OS X:

sudo lookupd -flushcache

On Windows:

ipconfig /flushdns

On Linux, if your machine is running the nscd daemon:

sudo /etc/rc.d/init.d/nscd restart

If you’re actually running a DNS server, for example for your organisation’s local network:

On Linux running bind9:

rndc flush

On Linux running bind8:

ndc flush

On Ubuntu/Debian running named:

/etc/init.d/named restart
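
However you’ve flushed it, you can then check which address a name now resolves to, for example with dig (or nslookup, if dig isn’t available):

dig +short statusq.org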

© Copyright Quentin Stafford-Fraser