CUDA contexts and pthreads

I am trying to write a program that splits a task across several GPUs. I started with the multiGPU SDK sample and adapted it. On Windows it seems to work fine, but I only have 2 GPUs in the Windows box, so I ported it over to Linux and an S870, and now it doesn't seem to work.

I am attaching sample code that runs in parallel without any CUDA calls (build with TEST_PARALLEL defined). As soon as each thread creates a CUDA context (via cudaFree(0)), the CPU threads appear to run serially.
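For reference, this is roughly the structure I'm using. It is a minimal sketch rather than the attached file; NUM_GPUS, gpuThread, and wallTime are just illustrative names, and real code would query cudaGetDeviceCount() instead of hard-coding the count:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define NUM_GPUS 4  /* illustrative; query cudaGetDeviceCount() in real code */

/* Wall-clock time in milliseconds. */
static double wallTime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* One CPU thread per GPU: bind to a device, then force context creation. */
static void *gpuThread(void *arg)
{
    int dev = *(int *)arg;
    double t0 = wallTime();

    cudaSetDevice(dev);   /* select this thread's GPU */
    cudaFree(0);          /* first runtime call creates the context */

    printf("device %d: context created in %.1f ms\n", dev, wallTime() - t0);
    /* ... this GPU's share of the work would go here ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GPUS];
    int ids[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpuThread, &ids[i]);
    }
    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```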

Output from a run without CUDA: all the threads start at the same time and finish at the same time.

Output from a run with CUDA: all the threads start at the same time, but they appear to run serially. Note also that each thread does nothing but initialize the context, and that alone takes >180 msec.

The attachment is a .cu file. I had to change the extension to upload it. I compile with

This is 64-bit CentOS 5.
multiGPU.txt (1.95 KB)

It looks as if it is the creation of the CUDA context that is breaking the parallelism. If I initialize the CUDA context in each thread first, wait "a while", and only then start the timers, the times come out as expected.
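In other words, something like the following change to the gpuThread() sketch above. The barrier and the per-GPU work are hypothetical placeholders; wallTime is the helper from the earlier sketch, and main() would call pthread_barrier_init(&barrier, NULL, NUM_GPUS) before starting the threads:

```c
/* Shared by all worker threads; initialized in main(). */
static pthread_barrier_t barrier;

static void *gpuThread(void *arg)
{
    int dev = *(int *)arg;

    cudaSetDevice(dev);             /* select this thread's GPU          */
    cudaFree(0);                    /* context creation, NOT timed       */
    pthread_barrier_wait(&barrier); /* wait until every context exists   */

    double t0 = wallTime();         /* timers start only after init      */
    /* ... launch this GPU's share of the work here ... */
    cudaThreadSynchronize();        /* wait for the GPU work to finish   */
    printf("device %d: work took %.1f ms\n", dev, wallTime() - t0);
    return NULL;
}
```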

Has anyone else done anything with multiple GPUs in Linux?