How to synchronize across blocks in CUDA Fortran

In CUDA C, I use cudaThreadSynchronize() to sync across different blocks within a device kernel function.
But the PGI compiler always gives the following error when I use it:

PGF90-S-0155-Calls from device code to a host function are allowed only in emulation mode - cudathreadsynchronize (fcuda.F: 402)

I do include "use cudafor", but I do not put the kernel in a module with a "contains" section, as the example in the PGI programming guide does.

Hi jack_laconic,

“cudaThreadSynchronize” is a host side routine, so it can’t be used in a device kernel. Typically it’s called in the host code just after a kernel launch to ensure that all threads in the kernel have finished before the host code continues. I believe this is the same behavior as in CUDA C.

Within a kernel, you can use ‘syncthreads’ to synchronize threads within the same block. Also, a call to ‘threadfence’ will ensure that a write to global memory is visible to all threads on the device.
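To make the split between the two calls concrete, here is a minimal sketch. The module, kernel, and variable names (sync_demo, reverse_chunks, nthreads) are made up for illustration, and it assumes the array length is a multiple of the block size. The kernel sits in a module with a contains section, as the programming guide example does.

```fortran
module sync_demo
  use cudafor
  implicit none
  integer, parameter :: nthreads = 128        ! hypothetical block size
contains
  ! Reverses each block-sized chunk of a in place, staging it in shared memory.
  ! Assumes n is a multiple of nthreads so every thread reaches the barrier.
  attributes(global) subroutine reverse_chunks(a, n)
    integer, value :: n
    real :: a(n)
    real, shared :: tile(nthreads)
    integer :: t, base
    t = threadIdx%x                           ! CUDA Fortran indices start at 1
    base = (blockIdx%x - 1) * blockDim%x
    tile(t) = a(base + t)
    call syncthreads()                        ! barrier for the threads of THIS block only
    a(base + t) = tile(blockDim%x - t + 1)
  end subroutine reverse_chunks
end module sync_demo

program main
  use cudafor
  use sync_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  real, device :: a_d(n)
  integer :: i, istat
  do i = 1, n
    a(i) = real(i)
  end do
  a_d = a                                     ! host to device copy
  call reverse_chunks<<<n / nthreads, nthreads>>>(a_d, n)
  ! Host-side call: waits until the whole kernel (all blocks) has finished.
  ! Newer CUDA toolkits name this routine cudaDeviceSynchronize.
  istat = cudaThreadSynchronize()
  a = a_d                                     ! device to host copy
  print *, a(1), a(nthreads)
end program main
```

Note that syncthreads is called from device code and cudaThreadSynchronize from host code; swapping them is what triggers the PGF90-S-0155 error quoted above.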

- Mat

Hi Mat,

I guess the short answer is no.
Threads across blocks can’t be synchronized within a kernel.
Alternatively, I can use another kernel to continue the work on global memory.
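A rough sketch of that two-kernel approach, assuming both kernels are launched on the default stream so that the second one starts only after the first has completed. The names (two_pass, block_sums, final_sum, bs) are hypothetical, and the input length is assumed to be a multiple of the block size.

```fortran
module two_pass
  use cudafor
  implicit none
  integer, parameter :: bs = 128              ! hypothetical block size (power of two)
contains
  ! Pass 1: each block reduces its own chunk and writes one partial sum.
  attributes(global) subroutine block_sums(x, n, partial, nblocks)
    integer, value :: n, nblocks
    real :: x(n), partial(nblocks)
    real, shared :: s(bs)
    integer :: t, stride
    t = threadIdx%x
    s(t) = x((blockIdx%x - 1) * bs + t)
    call syncthreads()
    stride = bs / 2
    do while (stride >= 1)
      if (t <= stride) s(t) = s(t) + s(t + stride)
      call syncthreads()                      ! reached by every thread in the block
      stride = stride / 2
    end do
    if (t == 1) partial(blockIdx%x) = s(1)
  end subroutine block_sums

  ! Pass 2: runs after pass 1 has finished, so every partial sum is in place.
  attributes(global) subroutine final_sum(partial, nblocks, total)
    integer, value :: nblocks
    real :: partial(nblocks), total(1)
    integer :: i
    real :: acc
    if (threadIdx%x == 1 .and. blockIdx%x == 1) then
      acc = 0.0
      do i = 1, nblocks
        acc = acc + partial(i)
      end do
      total(1) = acc
    end if
  end subroutine final_sum
end module two_pass

program main
  use cudafor
  use two_pass
  implicit none
  integer, parameter :: nblocks = 8, n = nblocks * bs
  real :: x(n), total(1)
  real, device :: x_d(n), partial_d(nblocks), total_d(1)
  x = 1.0
  x_d = x
  call block_sums<<<nblocks, bs>>>(x_d, n, partial_d, nblocks)
  ! The kernel boundary acts as the synchronization point across blocks.
  call final_sum<<<1, 1>>>(partial_d, nblocks, total_d)
  total = total_d                             ! blocking copy back to the host
  print *, total(1)
end program main
```

The data stays in device global memory between the two launches, so the only cost of splitting the work is the second kernel launch.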

I was wrong about CUDA C cudaThreadSynchronize() function.
It is a host function.

BTW, I did not find the description of threadfence in the PGI CUDA Fortran programming guide.
Could you tell me where I can find it?

Jack

Hi Jack,

BTW, I did not find the description of threadfence on PGI CUDA Fortran programming guide. Could you tell me where I can find it?

Sorry about that. The docs haven’t caught up with the implementation. threadfence was just added in PGI version 10.6 (see the PGI Release Notes: http://www.pgroup.com/doc/pgiwsrn.pdf). We follow CUDA C, though, so here’s the description from the CUDA C 3.1 programming guide.

B.5 Memory Fence Functions

void __threadfence_block();
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.

void __threadfence();
waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:

  • All threads in the thread block for shared memory accesses,
  • All threads in the device for global memory accesses.

void __threadfence_system();
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to:

  • All threads in the thread block for shared memory accesses,
  • All threads in the device for global memory accesses,
  • Host threads for page-locked host memory accesses (see Section 3.2.6.3).

__threadfence_system() is only supported by devices of compute capability 2.0.

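What threadfence allows you to do is make sure that any changes you’ve made to global memory are visible to all threads across all blocks. Note that it only orders the calling thread’s memory accesses; it is not a barrier, so it does not make other threads or blocks wait.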
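As an illustration only, here is a sketch of one common way a fence gets used: each block publishes a result, then takes a ticket from an atomic counter, and whichever block arrives last combines the results. The names (fence_demo, sum_partials, counter) are invented for this example. It assumes PGI 10.6 or later for threadfence, a device that supports global atomics, and a counter zeroed by the host before the launch. Depending on the device and compiler, the final reads may also need to bypass caching (for example via a volatile declaration); that is not shown here.

```fortran
module fence_demo
  use cudafor
  implicit none
contains
  attributes(global) subroutine sum_partials(partial, nblocks, counter, total)
    integer, value :: nblocks
    real :: partial(nblocks), total(1)
    integer :: counter(1)                     ! must be set to 0 before the launch
    integer :: i, ticket
    real :: acc
    if (threadIdx%x == 1) then
      partial(blockIdx%x) = real(blockIdx%x)  ! stand-in for real per-block work
      call threadfence()                      ! make the write visible before taking a ticket
      ticket = atomicadd(counter(1), 1)       ! returns the value before the add
      if (ticket == nblocks - 1) then         ! this block arrived last
        acc = 0.0
        do i = 1, nblocks
          acc = acc + partial(i)
        end do
        total(1) = acc
      end if
    end if
  end subroutine sum_partials
end module fence_demo

program main
  use cudafor
  use fence_demo
  implicit none
  integer, parameter :: nblocks = 16
  real, device :: partial_d(nblocks), total_d(1)
  integer, device :: counter_d(1)
  real :: total(1)
  counter_d = 0
  call sum_partials<<<nblocks, 32>>>(partial_d, nblocks, counter_d, total_d)
  total = total_d                             ! blocking copy back to the host
  print *, total(1)
end program main
```

No block waits on any other block here. The fence only guarantees that a block's partial result is visible before its ticket is, which is what lets the last block read the other results safely.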

Hope this helps,
Mat