K20c with CoreOS

I have been working on deploying a CUDA application on my CoreOS server. As most of CoreOS is on a read-only file system, I have been trying to create an Ubuntu 16.04 container with the CUDA drivers. I didn’t have success with any of the Dockerfiles I found online, so I have been modifying emergingstack/es-dev-stack on GitHub (an on-premises, bare-metal solution for deploying GPU-powered applications in containers) to try to make it work. I changed it to use driver version 375.26 (included with CUDA 8). Because the issue with nvprocfs has been fixed, I commented out those lines and tried a compile. However, there are errors in NV_GET_USER_PAGES_REMOTE, which is defined in nv-mm.h and pulled in through nv-linux.h. How do I fix the errors and get the driver to run?

In file included from /opt/nvidia/nvidia_installers/NVIDIA-Linux-x86_64-375.26/kernel/common/inc/nv-linux.h:18:0,
from /opt/nvidia/nvidia_installers/NVIDIA-Linux-x86_64-375.26/kernel/nvidia/nv-p2p.c:15:
/opt/nvidia/nvidia_installers/NVIDIA-Linux-x86_64-375.26/kernel/common/inc/nv-mm.h: In function ‘NV_GET_USER_PAGES_REMOTE’:
/opt/nvidia/nvidia_installers/NVIDIA-Linux-x86_64-375.26/kernel/common/inc/nv-mm.h:86:20: error: too few arguments to function ‘get_user_pages_remote’
return get_user_pages_remote(tsk, mm, start, nr_pages, flags, pages, vmas);
^

This article has some content that might be handy:

It looks like the normal solution is to use nvidia-docker, but that’s not great (particularly if you use an orchestration solution like Kubernetes). The article suggests some flags you can pass in order to emulate the behavior on a normal Docker image.

Clarif.ai also has a good article on creating a custom build of CoreOS for GPU programming. They have it linked near the bottom of the piece, here: Clarifai Blog

Full disclosure: I’m a community manager at CoreOS.

Creating a container that includes the driver and CUDA toolkit is rather straightforward:

The real problem on CoreOS is that you need to compile the driver against the kernel sources that match the running kernel, and by default you don’t have any compiler toolchain that fits the host system.

Having said that, there is a CoreOS developer image that can be used to create the proper kernel modules. I converted it to a Docker image which I build automatically for each CoreOS release:

Using that I compile the driver like so:
https://github.com/BugRoger/dockerfiles/tree/master/coreos-nvidia-installer

Same for CUDA:

Using these two containers you can install the correct modules and toolkit on your CoreOS host. Because /usr is write-protected, you have two options. The first is to use an overlay to map the driver into /usr/lib; I found that a bit problematic because of the timing of when the depmod database is loaded. The second is to simply install to /opt/… With a smart systemd unit you can even do that automatically on boot, so you still have the correct kernel module when the host auto-updates to a newer kernel.
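As a rough sketch of the check such a boot-time unit could run, the script below tests whether a driver build already exists for the running kernel. The per-kernel directory layout under /opt/nvidia, the BASE_DIR variable and the needs_rebuild helper are assumptions made for illustration; they are not taken from the installers linked above.

```shell
#!/bin/sh
# Sketch only: assumes driver builds are kept in one directory per kernel
# release, e.g. /opt/nvidia/<kernel-release>. That layout is hypothetical.

# needs_rebuild BASE_DIR KERNEL_RELEASE
# Returns 0 (true) when no directory exists for the given kernel release.
needs_rebuild() {
    base_dir="$1"
    kernel="$2"
    [ ! -d "$base_dir/$kernel" ]
}

BASE_DIR="${BASE_DIR:-/opt/nvidia}"
KERNEL="$(uname -r)"

if needs_rebuild "$BASE_DIR" "$KERNEL"; then
    # This is the point where the installer container would be started.
    echo "no driver build for kernel $KERNEL under $BASE_DIR"
else
    echo "driver build for kernel $KERNEL already present"
fi
```

A systemd unit could call a script like this before the Docker workloads start, and only launch the installer container when the check reports a missing build.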

In my containers I mount both the driver and toolkit from the host. This saves you the trouble of baking a specific driver into the container that doesn’t match the loaded kernel. And you don’t need to shuffle 1.5GB container images around.

To answer your question: add the following to your container:

# Register the mount points of the driver and toolkit libraries with the dynamic linker
RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf
RUN echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

# Make the driver and toolkit binaries and libraries visible at runtime
ENV PATH $PATH:/usr/local/nvidia/bin:/usr/local/cuda/bin
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/nvidia/lib64/:/usr/local/cuda/lib64/

Then mount /opt/nvidia/current to /usr/local/nvidia and /opt/cuda/current to /usr/local/cuda in your container (both are put on the host by the installers above). With that, you should have a minimal container with a perfectly working NVIDIA installation.
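For illustration, the mounts described above could be passed to Docker roughly like this. The image name my-cuda-app and the gpu_run_cmd helper are hypothetical, and the device nodes (/dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia0) are an assumption about what the loaded driver exposes on a single-GPU host; adjust them to what actually exists under /dev. The script only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Sketch only: prints a docker run command line instead of executing it.

# Hypothetical image name; replace with your own CUDA application image.
IMAGE="${IMAGE:-my-cuda-app}"

# gpu_run_cmd IMAGE
# Builds a docker run command that exposes the GPU device nodes and mounts
# the host's driver and toolkit read-only into the container.
gpu_run_cmd() {
    cmd="docker run --rm"
    # Assumed device nodes for a single-GPU host.
    for dev in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0; do
        cmd="$cmd --device $dev"
    done
    # Host paths populated by the installers; container paths match the
    # PATH and LD_LIBRARY_PATH entries set in the Dockerfile.
    cmd="$cmd -v /opt/nvidia/current:/usr/local/nvidia:ro"
    cmd="$cmd -v /opt/cuda/current:/usr/local/cuda:ro"
    echo "$cmd $1"
}

gpu_run_cmd "$IMAGE"
```

Mounting read-only keeps the container from modifying the host's driver files, and the same volume and device settings can be translated into a Kubernetes pod spec or a systemd unit.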