cudaDeviceScheduleBlockingSync & multi-GPU: How to use BlockingSync w/ multiple devices?

It’s not entirely clear to me how and when I should call cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) so that my CUDA-using thread blocks while waiting on the GPU instead of spinning. Do I call this before I call cudaSetDevice()? I ask because calling cudaSetDeviceFlags() after cudaSetDevice() seemed to work in CUDA 3.2, but now it causes crashes in CUDA 4.0, which suggests that cudaSetDeviceFlags() should come first. However, when a single CUDA-using process uses several GPUs, the OS reports CPU utilizations that suggest polling. How do I specify cudaDeviceScheduleBlockingSync for every GPU? Is it possible to use a different sync method with different GPUs (not that I want to do this, but it is an interesting question)?

Thanks!

You need to call cudaSetDeviceFlags() for a device before doing any operation that would create a context on that device.

Can you post the crashing repro case?
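
A rough sketch of that ordering for multiple devices (assuming the CUDA 4.0 runtime, where cudaSetDevice() just selects the device for the calling thread and the flag takes effect as long as no context has been created on that device yet; error handling is abbreviated):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Set the scheduling flag for each device before anything creates a
    // context on it (i.e. before the first launch, allocation, event, etc.).
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        if (err != cudaSuccess)
            printf("device %d: %s\n", dev, cudaGetErrorString(err));
    }

    // ...per-device work follows: cudaSetDevice(dev), allocations, launches...
    return 0;
}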

I believe I was too quick to say “crash”-- I had forgotten that I had placed the call within cutilSafeCall(). Nevertheless, it appears that cudaDeviceScheduleBlockingSync has no effect on CPU utilization on Linux. I’ve tried this with both CUDA 3.2 and CUDA 4.0 on Linux, and 100% CPU utilization occurs in both cases. This can be tested by observing the CPU consumption of a host thread during a kernel invocation and comparing it against the elapsed wall-clock time. If the two numbers are roughly equal, then the host thread was clearly consuming CPU time (spinning) while the kernel was executing (this can also be observed using ‘top’). I’ve written a test tool to examine this behavior. It runs on both CUDA 3.2 and the latest CUDA 4.0.

It’s strange that this behavior would exist in both CUDA 3.2 and 4.0. Am I using BlockingSync incorrectly? Has it just been broken for a long time without being a priority to fix (e.g. because BlockingSync on Linux is an uncommon use case)? Is it possible that my use of clock() in the kernel is causing things to break? That seems unlikely, though.
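
For reference, the core of the measurement looks roughly like this (a simplified sketch of the idea rather than the attached tool; the spin kernel, cycle count, and clock_gettime() calls are illustrative, and it targets the CUDA 4.0 runtime -- under 3.2 the wait would be cudaThreadSynchronize()):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Busy-wait on the GPU for roughly the requested number of clock cycles.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

static double seconds(const timespec &a, const timespec &b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main()
{
    // Request blocking sync before the context is created on device 0.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    timespec cpu0, cpu1, wall0, wall1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_MONOTONIC, &wall0);

    spinKernel<<<1, 1>>>(1500000000LL);   // long enough to see the difference
    cudaDeviceSynchronize();              // the wait whose behavior is in question

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);

    // If CPU consumption is roughly equal to wall time, the host thread spun.
    printf("CPU Consumption: %f\tKernel Duration: %f\n",
           seconds(cpu0, cpu1), seconds(wall0, wall1));
    return 0;
}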

Sample Output w/ Blocking Sync:

> ./synctest -t 1000 -l 10 -b

Using BlockingSync schedule.

Setting up CUDA Device... picked 0.

...testing iteration 1...

CPU Consumption: 1.065857		Kernel Duration: 1.068687

...testing iteration 2...

CPU Consumption: 0.997434		Kernel Duration: 1.000036

...testing iteration 3...

CPU Consumption: 0.997435		Kernel Duration: 1.000034

....

Sample Output w/o Blocking Sync:

> ./synctest -t 1000 -l 10

Setting up CUDA Device... picked 0.

...testing iteration 1...

CPU Consumption: 1.053045		Kernel Duration: 1.055947

...testing iteration 2...

CPU Consumption: 0.997439		Kernel Duration: 1.000035

...testing iteration 3...

CPU Consumption: 0.997435		Kernel Duration: 1.000035

....

The source code for the test tool is attached. Please note that it makes use of the cutil functions and associated makefiles.

[UPDATE: It appears that I wasn’t setting the proper flags with cudaEventCreateWithFlags(). I needed to specify cudaEventBlockingSync on the cudaEvent_t too. Tricky API.]
synctest.zip (2.59 KB)

Agreed. I recommend against using cudaDeviceScheduleBlockingSync in general: it is hard to get right (you may end up with an unexpected spin-wait, depending on what you’re waiting for).

Instead, I recommend creating a cudaEvent_t with the cudaEventBlockingSync flag to get a blocking wait. Calling cudaEventSynchronize() on that sort of event will always block rather than spin. Also note that this works even if cudaDeviceScheduleBlockingSync is not specified (I wouldn’t bother setting it; it just complicates life).
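
A minimal sketch of that pattern (the kernel and launch configuration are placeholders):

#include <cuda_runtime.h>

__global__ void work() { /* ... some real kernel ... */ }

int main()
{
    cudaSetDevice(0);

    // The cudaEventBlockingSync flag is what makes the wait below sleep
    // instead of spin; no device-level scheduling flags are required.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    work<<<1, 1>>>();
    cudaEventRecord(done, 0);     // record on the default stream, after the kernel
    cudaEventSynchronize(done);   // host thread blocks until the event completes

    cudaEventDestroy(done);
    return 0;
}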