Best practices for cudaDeviceScheduleBlockingSync usage pattern on Linux

I’m not aware of demo programs or best practices documents for this.

There is the documentation, of course.

When I compile the following test program with -DUSE_BLOCK, run it, and observe its behavior using top, top reports approximately 0% CPU utilization (%CPU) for the process.

const unsigned long long my_duration = 20000000000ULL;

// kernel that busy-waits for approximately `duration` GPU clock cycles
__global__ void kwait(unsigned long long duration){
        unsigned long long start = clock64();
        while (clock64() < start+duration);
}

int main(){

  // the device flag must be set before the CUDA context is created,
  // i.e. before the first runtime call that touches the device
#ifdef USE_BLOCK
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
#endif
#ifdef USE_YIELD
  cudaSetDeviceFlags(cudaDeviceScheduleYield);
#endif
#ifdef USE_SPIN
  cudaSetDeviceFlags(cudaDeviceScheduleSpin);
#endif
  kwait<<<1,1>>>(my_duration);
  cudaDeviceSynchronize();  // the CPU thread waits here; behavior depends on the flag
}

(CUDA 11.3, GTX 960, 465.19.01)

If I compile with either -DUSE_YIELD or -DUSE_SPIN I observe approximately 100% CPU utilization.
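For reference, the three variants can be built and observed like this (a sketch; it assumes the source above is saved as wait.cu, and the file/executable names are my own choices):

```shell
# build one executable per scheduling flag
nvcc -o wait_block wait.cu -DUSE_BLOCK
nvcc -o wait_yield wait.cu -DUSE_YIELD
nvcc -o wait_spin  wait.cu -DUSE_SPIN

# run one in the background and watch its %CPU while the ~20s kernel runs
./wait_block &
top -p $!
```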

My rationale is as follows:

  • yield allows the CPU thread (waiting at cudaDeviceSynchronize()) to yield the processor to other runnable threads, if any, but does not put the waiting thread to sleep. If no other threads are runnable, the CPU thread effectively spins at the sync point.
  • spin does what you'd expect: there is no scheduling negotiation, and the waiting CPU thread simply spins at the sync point.
  • blocking sync somehow puts the CPU thread to sleep, waking it occasionally to check for completion of the device activity.

I don’t have anything to share beyond that. I’m generally not able to disclose things that are not documented (excepting, of course, things that are discoverable experimentally). If you feel the CUDA documentation is lacking, you are encouraged to file a bug.