cudaDeviceScheduleBlockingSync & multi-GPU: How to use BlockingSync w/ multiple devices?

It’s not entirely clear to me how and when I should call cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) so that my CUDA-using thread blocks while waiting on the GPU instead of spinning. Do I call this before I call cudaSetDevice()? I ask because calling cudaSetDeviceFlags() after cudaSetDevice() seemed to work in CUDA 3.2, but now it causes crashes in CUDA 4.0, which suggests that cudaSetDeviceFlags() should come first. However, when a single CUDA-using process uses several GPUs, the OS reports CPU utilizations that suggest polling. How do I specify cudaDeviceScheduleBlockingSync for every GPU? Is it possible to use a different sync method with different GPUs (not that I want to do this, but it is an interesting question)?

Thanks!

You need to call cudaSetDeviceFlags() for a device before doing any operation that would create a context on that device.

Can you post the crashing repro case?
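
A rough sketch of that ordering for multiple devices (assuming the CUDA 4.0 runtime, where cudaSetDevice() just selects the device for the calling thread and the flag takes effect as long as no context has been created on that device yet; error handling is abbreviated):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Set the scheduling flag for each device before anything creates a
    // context on it (i.e. before the first launch, allocation, event, etc.).
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        if (err != cudaSuccess)
            printf("device %d: %s\n", dev, cudaGetErrorString(err));
    }

    // ...per-device work follows: cudaSetDevice(dev), allocations, launches...
    return 0;
}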

I believe I was too quick to say “crash”-- I had forgotten that I had placed the call within cutilSafeCall(). Nevertheless, it appears that cudaDeviceScheduleBlockingSync has no effect on CPU utilization on Linux. I’ve tried this with both CUDA 3.2 and CUDA 4.0 on Linux, and 100% CPU utilization occurs in both cases. This can be tested by observing the CPU consumption of a host thread during a kernel invocation and comparing it against the elapsed wall-clock time. If the two numbers are roughly equal, then the host thread was clearly consuming CPU time (spinning) while the kernel was executing (this can also be observed using ‘top’). I’ve written a test tool to examine this behavior. It runs on both CUDA 3.2 and the latest CUDA 4.0.

It’s strange that this behavior would exist in both CUDA 3.2 and 4.0. Am I using BlockingSync incorrectly? Has it just been broken for a long time without being a priority to fix (e.g. because BlockingSync on Linux is an uncommon use case)? Is it possible that my use of clock() in the kernel is causing things to break? That seems unlikely, though.
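
For reference, the core of the measurement looks roughly like this (a simplified sketch of the idea rather than the attached tool; the spin kernel, cycle count, and clock_gettime() calls are illustrative, and it targets the CUDA 4.0 runtime -- under 3.2 the wait would be cudaThreadSynchronize()):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Busy-wait on the GPU for roughly the requested number of clock cycles.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

static double seconds(const timespec &a, const timespec &b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main()
{
    // Request blocking sync before the context is created on device 0.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    timespec cpu0, cpu1, wall0, wall1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_MONOTONIC, &wall0);

    spinKernel<<<1, 1>>>(1500000000LL);   // long enough to see the difference
    cudaDeviceSynchronize();              // the wait whose behavior is in question

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);

    // If CPU consumption is roughly equal to wall time, the host thread spun.
    printf("CPU Consumption: %f\tKernel Duration: %f\n",
           seconds(cpu0, cpu1), seconds(wall0, wall1));
    return 0;
}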

Sample Output w/ Blocking Sync:

> ./synctest -t 1000 -l 10 -b

Using BlockingSync schedule.

Setting up CUDA Device... picked 0.

...testing iteration 1...

CPU Consumption: 1.065857		Kernel Duration: 1.068687

...testing iteration 2...

CPU Consumption: 0.997434		Kernel Duration: 1.000036

...testing iteration 3...

CPU Consumption: 0.997435		Kernel Duration: 1.000034

....

Sample Output w/o Blocking Sync:

> ./synctest -t 1000 -l 10

Setting up CUDA Device... picked 0.

...testing iteration 1...

CPU Consumption: 1.053045		Kernel Duration: 1.055947

...testing iteration 2...

CPU Consumption: 0.997439		Kernel Duration: 1.000035

...testing iteration 3...

CPU Consumption: 0.997435		Kernel Duration: 1.000035

....

The source code for the test tool is attached. Please note that it makes use of the cutil functions and associated makefiles.

[UPDATE: It appears that I wasn’t setting the proper flags with cudaEventCreateWithFlags(). I needed to specify cudaEventBlockingSync on the cudaEvent_t too. Tricky API.]
synctest.zip (2.59 KB)

Agreed. I recommend against using cudaDeviceScheduleBlockingSync in general: it is hard to get right (you may end up with an unexpected spin-wait, depending on what you’re waiting for).

Instead, I recommend creating a cudaEvent_t with the cudaEventBlockingSync flag to get a blocking wait. Calling cudaEventSynchronize() on that sort of event will always block rather than spin. Also note that this works even if cudaDeviceScheduleBlockingSync is not specified (I wouldn’t bother setting it; it just complicates life).
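
A minimal sketch of that pattern (the kernel and launch configuration are placeholders):

#include <cuda_runtime.h>

__global__ void work() { /* ... some real kernel ... */ }

int main()
{
    cudaSetDevice(0);

    // The cudaEventBlockingSync flag is what makes the wait below sleep
    // instead of spin; no device-level scheduling flags are required.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    work<<<1, 1>>>();
    cudaEventRecord(done, 0);     // record on the default stream, after the kernel
    cudaEventSynchronize(done);   // host thread blocks until the event completes

    cudaEventDestroy(done);
    return 0;
}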