CUDA contexts and pthreads

I am trying to write a program that splits a task across several GPUs. I started with the multiGPU SDK sample and adapted it. On Windows it seems to work fine, but I only have 2 GPUs in the Windows box, so I ported it over to Linux and an S870, and now it doesn't seem to work.

I am attaching sample code that runs in parallel without any CUDA calls (build with TEST_PARALLEL defined). As soon as each thread creates a CUDA context (via cudaFree(0)), the CPU threads appear to run serially.
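For reference, this is roughly the structure I'm using. It is a minimal sketch rather than the attached file; NUM_GPUS, gpuThread, and wallTime are just illustrative names, and real code would query cudaGetDeviceCount() instead of hard-coding the count:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define NUM_GPUS 4  /* illustrative; query cudaGetDeviceCount() in real code */

/* Wall-clock time in milliseconds. */
static double wallTime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* One CPU thread per GPU: bind to a device, then force context creation. */
static void *gpuThread(void *arg)
{
    int dev = *(int *)arg;
    double t0 = wallTime();

    cudaSetDevice(dev);   /* select this thread's GPU */
    cudaFree(0);          /* first runtime call creates the context */

    printf("device %d: context created in %.1f ms\n", dev, wallTime() - t0);
    /* ... this GPU's share of the work would go here ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GPUS];
    int ids[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpuThread, &ids[i]);
    }
    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```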

Output from a run without CUDA: all the threads start at the same time and finish at the same time.

Output from a run with CUDA: all the threads start at the same time, but they appear to run serially. Note also that each thread does nothing but initialize the context, and that alone takes >180 msec.

The attachment is a .cu file. I had to change the extension to upload it. I compile with

This is 64-bit CentOS 5.
multiGPU.txt (1.95 KB)

It looks as if it is the creation of the CUDA context that is breaking the parallelism. If I initialize the CUDA context in each thread first, wait "a while", and only then start the timers, the times come out as expected.
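In other words, something like the following change to the gpuThread() sketch above. The barrier and the per-GPU work are hypothetical placeholders; wallTime is the helper from the earlier sketch, and main() would call pthread_barrier_init(&barrier, NULL, NUM_GPUS) before starting the threads:

```c
/* Shared by all worker threads; initialized in main(). */
static pthread_barrier_t barrier;

static void *gpuThread(void *arg)
{
    int dev = *(int *)arg;

    cudaSetDevice(dev);             /* select this thread's GPU          */
    cudaFree(0);                    /* context creation, NOT timed       */
    pthread_barrier_wait(&barrier); /* wait until every context exists   */

    double t0 = wallTime();         /* timers start only after init      */
    /* ... launch this GPU's share of the work here ... */
    cudaThreadSynchronize();        /* wait for the GPU work to finish   */
    printf("device %d: work took %.1f ms\n", dev, wallTime() - t0);
    return NULL;
}
```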

Has anyone else done anything with multiple GPUs in Linux?