Is kernel launch really asynchronous in Cuda Fortran?

According to Cuda specification, the kernel launch is asynchronous. It allows host-device concurrent execution conveniently. I try to exploit the mechanism but unsuccessful. The code segment reads like:

do ns = 1, STEPS
Xdev = X
Ydev = Y

call cpu_time(begin)
call code_dev<<<NB, NBT>>>(N, N0, Xdev, Ydev,…)
call cpu_time(end)

! Host code
call code_host(N, X, Y, …)

istat = cudaDeviceSynchronize()
print*, end - begin
end do

The problem is that: the kernel seems to be launched synchronously, and returns to the host only after completing its execution. As a result, the executions of the host and device codes are not concurrent.

I am really puzzled by the behavior. Do I mis-understand the Cuda specification?

More information about the code: both the host and device codes takes ~1 second to complete. They are not sharing any data except a few parameters, which are declared as value attribute.

It turns out that when the device code is executing, the host is blocked by a call to cudafree(). It seems the compiler automatically insert a call to cudafree() when launching the kernel, for some reasons.

The compiler does compile the example code async.cuf well, and the resulting executable does launch the kernel asynchronously as expected.

The difference between my code and the example code is in complexitiy. The ptxas info looks like:

ptxas info : 56 bytes gmem, 4132 bytes cmem[2], 8 bytes cmem[14]

It seems the compiler will insert a call to cudafree() for a kernel subroutine using too much resource. The launch of the kernel will become effectively synchronous.

Is this a bug or a feature?

Hi Junren,

The call is launched asynchronously though given the example it’s unclear why a cudaFree would be added. Can you post or send to PGI Customer Service (trs@pgroup.com) a reproducing example?

Thanks,
Mat

Hi Mat,

Thanks!

I think I have figured out the cause. It is due to assumed-shape array declaration in the kernel subroutine. The behavior can be reproduced by changing the example async.cuf that comes with the compiler.

The header of the original kernel subroutine reads:

attributes(global) subroutine kernel(a, offset)
implicit none
real :: a(*)

By changing it to:

attributes(global) subroutine kernel(a, offset)
implicit none
real :: a(:)

I find that the kernel launch will be accompanied by a call to cudaFree(), and blocks the host execution.

Junren

Ok, we’re most likely having to create a temporary device side descriptor for the assumed-shape array.