Performance Issues on headless server

Hello. I am getting very poor performance when running CUDA code on a headless server.

When I was first learning CUDA I ran some basic examples on my personal laptop, which has a single GeForce GT 540M, and I got good, or at least reasonable, results.

But when I moved those same examples to the server, they took an awfully long time to execute, even though the server has three Tesla C2075 cards, the examples are really basic, and many of them use a single GPU.

For example, one of the programs simply compares how much faster it is to copy from/to pinned memory, by transferring two equal-sized arrays, one pinned and the other not, and measuring the time with events.

On the laptop it takes approx. 0.160 s to run, while on the server it takes between 4 and 5 seconds. I strongly believe that what’s taking so long on the server is context creation/destruction.
I reached this conclusion after creating the context explicitly and putting one print right before the context creation call and one right after it.
The time elapsed between those two prints is clearly noticeable on the server, while on the laptop it’s imperceptible.
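
One way to put a number on the gap between the two prints, without changing the CUDA code, is to timestamp each line as the program prints it. The following is only a sketch under a few assumptions: stamp_lines is a made-up helper name, it relies on GNU date for the %N (nanoseconds) field, and the program has to flush stdout after each print, because output piped into another process is otherwise block-buffered and the timestamps would be meaningless.

```shell
#!/bin/bash
# stamp_lines: hypothetical helper, not part of CUDA or the NVIDIA driver.
# Prefixes every line read from stdin with the wall-clock time at which
# it was read (seconds.nanoseconds; %N requires GNU date).
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s.%N)" "$line"
    done
}
```

It would be used as `./my_cuda_test | stamp_lines`, where ./my_cuda_test stands in for the actual binary. Subtracting the timestamp of the "before" line from that of the "after" line gives a rough figure for context creation on each machine.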

The server runs Ubuntu 10.10 Server and the laptop Ubuntu 11.10. Both have CUDA 4.2.9.
After some reading on the internet I’ve done the following things on the server, but the problem is still there:
1- Disable Nouveau by adding to /etc/modprobe.d/blacklist.conf the following:
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
2- Add the following script to /etc/init.d:

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  N3D=$(/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l)
  NVGA=$(/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l)

  # Create one device node per controller, numbered from 0.
  N=$(expr $N3D + $NVGA - 1)
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  # Create the control device node.
  mknod -m 666 /dev/nvidiactl c 195 255
else
  # Loading the nvidia module failed.
  exit 1
fi

3- Run sudo nvidia-smi -pm 1 each time I boot the server.
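
Since the persistence setting has to be reapplied after every boot, it may be worth confirming that it actually took effect on all three boards before timing anything. A minimal sketch, assuming the `nvidia-smi -q` report contains one "Persistence Mode : Enabled" or "Persistence Mode : Disabled" line per GPU (the exact layout may differ between driver versions); count_persistence_disabled is a made-up helper name:

```shell
#!/bin/bash
# count_persistence_disabled: hypothetical helper, not an nvidia-smi feature.
# Reads an `nvidia-smi -q` style report on stdin and prints how many
# "Persistence Mode" lines report Disabled (0 means enabled everywhere).
count_persistence_disabled() {
    awk 'tolower($0) ~ /persistence mode/ && tolower($0) ~ /disabled/ { n++ }
         END { print n + 0 }'
}
```

Run as `nvidia-smi -q | count_persistence_disabled`. Anything other than 0 would suggest that step 3 did not apply to every GPU.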

But, as I said, I keep getting the same poor performance. Any ideas?

This could be related to the fact that there are multiple GPUs with large memory in the system. I would suggest filing a bug with a self-contained repro app, and the precise specifications of the system on which you observe this, through the registered developer website. Thanks.

On a headless machine the NVIDIA driver is (usually) not loaded by default, but only on the first call to a CUDA function in your program (and it is unloaded again when your program finishes). To “improve” your timing measurements you can, for example, call cudaMalloc(&not_used, 0) first, or switch the GPU to persistence mode using the nvidia-smi tool (this keeps the driver loaded, if I remember correctly).

Gert-jan, every time I start the server I run sudo nvidia-smi -pm 1, which enables persistence mode if I’m not wrong. That doesn’t fix it.

sudo nvidia-smi -pm 1 should do the trick indeed.

If it did not help, then most likely njuffa is right and you have found a bug in the CUDA ecosystem. He is (as I recall) an NVIDIA employee, so he knows best anyway ;)

Thanks to both of you for your replies. I’ve already filed the bug report as njuffa suggested.