Questions about CUDA

Hello.

If I call cudaStreamSynchronize in an OpenMP region, does it block all OpenMP threads? Or does it only block the thread that called it?

If I launch more blocks than the GPU can actually run at once, how does the block indexing work?

For example, if I launch 10 blocks and the GPU can only run 5 at a time due to register limitations, do blocks 6 to 10 still appear if I print blockIdx%x?

Is it safe to use ACC_SET_DEVICE_NUM and cudaSetDevice together? If OpenACC directives and CUDA APIs are combined, which one do I have to use?

If I call cudaStreamSynchronize in an OpenMP region, does it block all OpenMP threads? Or does it only block the thread that called it?

cudaStreamSynchronize only blocks the host thread that calls it, and only until the work queued in that stream has completed. The other host-side OpenMP threads would not necessarily block, especially if they are each using separate streams. If they are all using the same stream, then a synchronize by one thread also waits on work the other threads queued, so there could be some blocking.
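A minimal sketch (CUDA C++, with a hypothetical kernel "work") of the per-thread-stream pattern: each OpenMP thread creates its own stream, so its cudaStreamSynchronize call waits only on that stream and only stalls that host thread.

#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

__global__ void work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

int main()
{
    const int n = 1 << 20;

    #pragma omp parallel num_threads(4)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);           // one stream per OpenMP thread

        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemsetAsync(d_x, 0, n * sizeof(float), stream);

        work<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);

        // Blocks only this OpenMP thread, until its own stream drains;
        // the other host threads keep running independently.
        cudaStreamSynchronize(stream);

        printf("thread %d finished its stream\n", omp_get_thread_num());

        cudaFree(d_x);
        cudaStreamDestroy(stream);
    }
    return 0;
}

Compile with something like nvcc -Xcompiler -fopenmp; if all threads passed the same stream instead, each synchronize would also wait on the work queued by the other threads.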

If I launch more blocks than the GPU can actually run at once, how does the block indexing work?

The kernel would still launch 10 blocks, but with only 5 resident at a time. Once the first 5 are finished, the second 5 would run, so blocks 6 to 10 do show up when you print blockIdx%x; the occupancy limit only affects when each block is scheduled, not how many blocks exist or how they are numbered.
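A quick way to see this (sketch in CUDA C++; blockIdx%x in the question is the CUDA Fortran spelling of the same value, which is 1-based there, while CUDA C++ is 0-based): launch 10 blocks and print the block index from each one. Every index appears, no matter how many blocks the hardware can keep resident at once.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void show_block()
{
    if (threadIdx.x == 0)
        printf("blockIdx.x = %d\n", blockIdx.x);   // prints 0 through 9
}

int main()
{
    show_block<<<10, 32>>>();   // 10 blocks requested; occupancy only affects scheduling
    cudaDeviceSynchronize();    // wait so the device-side printf output is flushed
    return 0;
}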

Is it safe to use ACC_SET_DEVICE_NUM and cudaSetDevice together? If OpenACC directives and CUDA APIs are combined, which one do I have to use?

These are two different but related things. ACC_SET_DEVICE_NUM sets the default device to use when the program loads, while cudaSetDevice, or OpenACC's equivalent acc_set_device_num, changes the device during execution of the program. With a few exceptions, it's fine to change the device number during execution, keeping in mind that data created on one device is not directly accessible on another.

Changing the device is only a problem when using static global data that gets created at load time (i.e. global or module variables that use the OpenACC “declare create” directive). In this case, the data is created on the default device but not implicitly copied to the second device (if different).
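Below is a minimal sketch (CUDA C++ using the OpenACC runtime header, assuming the NVIDIA implementation, zero-based device numbering, and at least two GPUs) of changing the device during execution through both runtimes. The specific device numbers and allocations are illustrative; the point is that an allocation made while one device is current must not be used after switching to another.

#include <cuda_runtime.h>
#include <openacc.h>
#include <cstdio>

int main()
{
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("visible NVIDIA devices: %d\n", ndev);
    if (ndev < 2) return 0;                  // this sketch assumes at least two GPUs

    // Select device 0 through both runtimes so the OpenACC and CUDA sides agree.
    acc_set_device_num(0, acc_device_nvidia);
    cudaSetDevice(0);

    float *d0;
    cudaMalloc(&d0, 1024 * sizeof(float));   // this allocation lives on device 0

    // Change the device during execution.
    acc_set_device_num(1, acc_device_nvidia);
    cudaSetDevice(1);

    // d0 must not be touched by kernels or memsets issued to device 1:
    // data created on one device is not directly accessible on another.
    float *d1;
    cudaMalloc(&d1, 1024 * sizeof(float));   // separate allocation on device 1
    cudaFree(d1);

    // Switch back before releasing the device-0 allocation.
    acc_set_device_num(0, acc_device_nvidia);
    cudaSetDevice(0);
    cudaFree(d0);

    return 0;
}

The "declare create" caveat above is the exception: such data is only instantiated on the device that was current at load time, so a switch like the one sketched here would leave it missing on the second device.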

  • Mat