Caveats of running a Mesos cluster in Docker containers

 

by Aleksandr Guljajev

The Problem

In our efforts to simplify and speed up the MiniMesos Testing Framework, we decided to move away from a Docker-in-Docker implementation to one where each node runs in its own container. We thought this was the right way to run a local Mesos cluster, and assumed it would increase the transparency, simplicity and speed of the unit tests. In theory this should have been so… in practice, however, it proved to be quite difficult and made us scratch our heads more often than I’d like to admit.

Why run Mesos cluster nodes in containers?

When developing the ElasticSearch and Logstash frameworks, we were frustrated by the length of the development cycle and the time it took to spin up the cluster when we ran our automated tests.

We used the internal Docker registry to store executor and scheduler Docker images. We also used a proxy container to allow communication between the nodes.

We could have also used a VM to run the cluster, but for unit testing this was not feasible. (And running unit tests regularly is an aim of ours.)

In the end, this is the picture we were striving towards:
docker@dev:~$ docker ps

 

Mesos slave (now agent) and master Docker images

We are using the Docker Containerizer, which means that agent nodes have access to the host Docker client binary, Docker socket and cgroups hierarchy.
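As a rough illustration of what that access means, here is a minimal Python sketch that builds the volume arguments for a `docker run` call. The host paths are assumptions (they vary between distributions) and `agent_volume_args` is a hypothetical helper, not part of MiniMesos.

```python
# Assumed host locations; adjust for your distribution.
HOST_MOUNTS = {
    "/var/run/docker.sock": "/var/run/docker.sock",  # Docker socket
    "/usr/bin/docker": "/usr/bin/docker",            # Docker client binary
    "/sys/fs/cgroup": "/sys/fs/cgroup",              # cgroups hierarchy
}


def agent_volume_args(mounts=HOST_MOUNTS):
    """Turn a host-path -> container-path mapping into `-v` arguments."""
    args = []
    for host_path, container_path in mounts.items():
        args += ["-v", f"{host_path}:{container_path}"]
    return args


if __name__ == "__main__":
    # Prints the flags that would be appended to `docker run`.
    print(" ".join(agent_volume_args()))
```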

Shared PID namespace

The first problem arises from the fact that Mesos tracks executor containers by PID. When an agent starts a task, it asks Docker for the PID of the container. Because the agent itself runs inside a container with its own PID namespace, Mesos looks in its /proc folder, doesn’t see the process and assumes it’s dead, killing the container straightaway and marking the task as TASK_LOST. You can look at the relevant Mesos JIRA issue.

The issue confused us even more because most of the time the executor task logs were empty, but sometimes the application managed to emit a few log statements before inevitably being killed by Mesos.
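To picture the failing check, here is a minimal Python sketch of a /proc-based liveness test. It is only an illustration of the idea, not the code Mesos actually runs, and `pid_visible` is a hypothetical name.

```python
import os


def pid_visible(pid, proc_root="/proc"):
    """Return True if a directory for `pid` exists under `proc_root`.

    Inside a container with its own PID namespace, /proc only lists the
    processes of that namespace, so a PID reported from outside it will
    usually not be found here.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))


if __name__ == "__main__":
    # The current process can always see itself on a Linux host.
    print(pid_visible(os.getpid()))
```

The `proc_root` parameter exists only so the check can be exercised against a fake directory tree.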

We tried mounting /proc as a volume, but Docker does not allow it. Luckily, since Docker 1.5 one can pass an option (--pid=host) to share the PID namespace with the host, which resolved the immediate issue.
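On the command line, sharing the host PID namespace is done with the `--pid=host` flag of `docker run`. The Python sketch below assembles such a command; the image and container names are placeholders, and `agent_run_command` is a hypothetical helper rather than anything from MiniMesos.

```python
def agent_run_command(image, name, share_host_pid=True):
    """Build a `docker run` command line for an agent container."""
    cmd = ["docker", "run", "-d", "--name", name]
    if share_host_pid:
        # Share the host PID namespace so host PIDs show up in /proc.
        cmd.append("--pid=host")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Placeholder image and container names, for illustration only.
    print(" ".join(agent_run_command("example/mesos-agent", "test-agent")))
```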

Unfortunately, docker-java didn’t have a binding for the host PID mode, so we submitted a pull request, which was merged very quickly. Kudos to the docker-java maintainers!

Now Mesos can properly keep track of executor containers. Great.

Prefixed Mesos containers

By an unfortunate accident, we used the “mesos-” prefix when naming our containers. It turns out that Mesos itself uses this prefix to manage the containers it starts, so it wrongfully killed all of ours. The solution was simple: change the container prefix.
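A small guard makes this mistake hard to repeat. The following Python sketch rejects the reserved prefix when building container names; the `minimesos-` default and the name layout are assumptions made for illustration, not necessarily what we ended up using.

```python
RESERVED_PREFIX = "mesos-"  # prefix Mesos uses for the containers it manages


def cluster_container_name(role, cluster_id, prefix="minimesos-"):
    """Build a container name, refusing prefixes that collide with Mesos."""
    if prefix.startswith(RESERVED_PREFIX):
        raise ValueError(
            f"prefix {prefix!r} collides with the reserved {RESERVED_PREFIX!r} prefix"
        )
    return f"{prefix}{role}-{cluster_id}"


if __name__ == "__main__":
    print(cluster_container_name("agent", "abc123"))
```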

Now that the agents and master know about each other through the ZooKeeper node, the agents can create executor tasks and keep track of them. Awesome.

libprocess and communication between Mesos master and the executors

The next problem we faced was communication between the executor tasks and the Mesos master.
By default, the executor task is started in the host networking mode and libprocess binds itself to a randomly chosen port. However, since the task shares its network with the host, libprocess tries to bind to a loopback interface, which the master cannot reach.

There are two options to resolve this:
1) Detect the IP address of the mesos-agent container and bind libprocess on that address.
2) Ask Mesos to run the executor tasks in the bridge networking mode instead of host, thus giving the executor task its own IP. This is described in more detail on the Apache Mesos JIRA.

We went for the former by setting the LIBPROCESS_IP environment variable before starting an executor. Note that this is up to each framework, which sets the variable in the task configuration that starts the executor.
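The first option can be sketched in Python as follows. Only the LIBPROCESS_IP variable name comes from Mesos; `detect_container_ip` and `executor_env` are hypothetical helpers, and resolving the container's own hostname is just one assumed way of finding its address.

```python
import ipaddress
import os
import socket


def detect_container_ip():
    """Resolve this container's hostname to an IP address.

    Assumption: inside a Docker container the hostname resolves to the
    container's own address. This does not hold on every setup.
    """
    return socket.gethostbyname(socket.gethostname())


def executor_env(agent_ip, base_env=None):
    """Return a copy of the environment with LIBPROCESS_IP set to `agent_ip`."""
    address = ipaddress.ip_address(agent_ip)  # raises ValueError if malformed
    if address.is_loopback:
        raise ValueError(f"{agent_ip} is a loopback address")
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_IP"] = str(address)
    return env
```

A framework would then pass the result of something like `executor_env(detect_container_ip())` into the environment of the task that starts its executor.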

Conclusion

Researching and solving the aforementioned issues allowed us to run a full Mesos cluster in Docker containers. It gave us the performance boost we hoped for, removed a lot of complexity and gave us some invaluable insight into how Mesos Containerizer works.
