Multi-GPU computation, context creation overhead?

Hi everyone,
my application works as follows:
→ initialize data on the CPU
→ create 4 threads (one for each GPU)

Each thread then does the following:
→ loop over groups of data
→ launch GPU computations over the current group of data
→ create a new thread to finish computations over the current group of data on an idle CPU core (rough sketch below)
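
A stripped-down sketch of that structure (placeholder names; I'm assuming the CUDA runtime API with one POSIX thread per device, and the actual per-group work and the extra CPU threads are omitted):

#include <cuda_runtime.h>
#include <pthread.h>

#define NUM_GPUS   4
#define NUM_GROUPS 16   /* placeholder number of data groups */

/* Placeholder for the real per-group GPU work (kernel launches, copies, ...). */
static void process_group_on_gpu(int device, int group)
{
    (void)device; (void)group;
}

static void *gpu_worker(void *arg)
{
    int device = (int)(size_t)arg;

    /* Bind this host thread to its GPU; in practice this is also where the
       context-creation cost shows up. */
    cudaSetDevice(device);

    for (int group = device; group < NUM_GROUPS; group += NUM_GPUS) {
        process_group_on_gpu(device, group);
        /* The real application spawns a further CPU thread here to finish
           the group on an idle core. */
    }

    cudaThreadSynchronize();   /* cudaDeviceSynchronize() on CUDA >= 4.0 */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_create(&threads[i], NULL, gpu_worker, (void *)(size_t)i);
    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_join(threads[i], NULL);

    return 0;
}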

The whole thing takes about 7 seconds when I time it via a bash script.
I’m trying to optimize my application, so I ran it under the CUDA profiler and noticed that it only took about 4 seconds. What’s even more surprising is that when I keep the CUDA profiler window open and relaunch my bash script, I get a similar timing of about 4 seconds, with correct results.

I would very much appreciate it if someone could explain why there is such a discrepancy. My (humble) guess is that opening the CUDA profiler already sets up some sort of connection with all the GPUs, but that does not really explain a 3-second gap.

Thank you,

Guillaume

Which CUDA version are you using? I vaguely remember there were some issues with this in pre-4.0 versions. The NVIDIA employees on this forum can probably tell you more.

I’m using CUDA 3.2. We do not currently have a more recent installation, so I cannot test with CUDA 4.0. I’ll see if I can have it installed soon.

Thank you,

Guillaume

I tested it under CUDA 4.0 and the problem still occurs. It was impossible to pin down where the time was lost with normal CPU-side profiling (gprof, callgrind…), so I went ahead with unreliable printf-based profiling.

It turns out that a call to cudaGetDeviceCount at the beginning of my program takes up to three seconds. A little googling gave me this result: The Official NVIDIA Forums | NVIDIA
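
For reference, here is essentially what that printf-based profiling boils down to (stripped-down sketch; I'm just wrapping the call in gettimeofday-based wall-clock timing):

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    int count = 0;

    double t0 = wall_seconds();
    cudaError_t err = cudaGetDeviceCount(&count);   /* first CUDA call in the program */
    double t1 = wall_seconds();

    printf("cudaGetDeviceCount -> %d devices (%s), took %.3f s\n",
           count, cudaGetErrorString(err), t1 - t0);
    return 0;
}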

Does anyone know if there is a more elegant solution to this problem than the one mentioned in the above thread?

Thanks in advance,

Guillaume

(On my system it’s fast.)

Anyway…

It sounds like your application is called many times.

You could try removing cudaGetDeviceCount from your application and instead storing the result in a file (text or binary), then reading it back from there.

That way cudaGetDeviceCount only needs to run once, until the hardware changes or a bug is found/fixed :)
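
Something along these lines, maybe (just a sketch; the cache file name and format are made up, and of course the cached value goes stale when the hardware changes):

#include <cuda_runtime.h>
#include <stdio.h>

/* Read the cached device count if present, otherwise query once and cache it. */
static int get_device_count_cached(const char *cache_path)
{
    int count = -1;

    FILE *f = fopen(cache_path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &count) != 1)
            count = -1;
        fclose(f);
    }

    if (count < 0) {
        /* Cache miss: pay the (possibly slow) query once and store the result. */
        if (cudaGetDeviceCount(&count) != cudaSuccess)
            return -1;
        f = fopen(cache_path, "w");
        if (f != NULL) {
            fprintf(f, "%d\n", count);
            fclose(f);
        }
    }
    return count;
}

int main(void)
{
    printf("devices: %d\n", get_device_count_cached("device_count.txt"));
    return 0;
}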

My application is not called many times since it’s still in development, but I’m trying to get the best speedup possible, and going from 7 to 4 seconds would make a major difference. Removing cudaGetDeviceCount is not an option because the application runs on a shared cluster, so I cannot assume that the number of available devices will stay the same.

Note that if your problem is caused by the slow initialization of the CUDA driver, as was the case in the thread you linked to, then removing cudaGetDeviceCount() just moves the delay to the next CUDA function you call. It isn’t an actual fix. I’m not aware of any other workaround besides the nvidia-smi trick.
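
To illustrate: whichever CUDA call comes first pays the initialization cost, so the most you can do on the application side is choose where that cost lands, for example by forcing initialization up front with the usual cudaFree(0) idiom (sketch below, not a fix):

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double t0 = wall_seconds();
    cudaFree(0);                 /* first CUDA call: driver/context initialization happens here */
    double t1 = wall_seconds();

    int count = 0;
    cudaGetDeviceCount(&count);  /* already initialized, so this is now cheap */
    double t2 = wall_seconds();

    printf("first CUDA call:    %.3f s\n", t1 - t0);
    printf("cudaGetDeviceCount: %.3f s (%d devices)\n", t2 - t1, count);
    return 0;
}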

OK, thank you for your answer. I’ll ask an admin to run nvidia-smi, then.

It makes sense to run nvidia-smi in the background anyway. Then you can have the convenience of keeping your devices set to compute-exclusive.

You don’t have to run nvidia-smi in the background for that anymore.

Also, we improved context creation overhead by a lot for 4.0 final, IIRC (maybe RC2, but definitely for final).

Thank you all very much for understanding and answering my (poorly written) question.