According to the CUDA specification, kernel launches are asynchronous: control returns to the host immediately, which conveniently allows concurrent host-device execution. I tried to exploit this mechanism, but without success. The code segment reads:
do ns = 1, STEPS
   call cpu_time(t_begin)
   Xdev = X                    ! synchronous host-to-device copies
   Ydev = Y
   call code_dev<<<NB, NBT>>>(N, N0, Xdev, Ydev, …)  ! asynchronous kernel launch
   ! Host code, intended to run while the kernel executes
   call code_host(N, X, Y, …)
   istat = cudaDeviceSynchronize()
   call cpu_time(t_end)
   print *, t_end - t_begin
end do
The problem is that the kernel appears to be launched synchronously: control returns to the host only after the kernel has finished executing, so the host and device codes do not run concurrently.
I am really puzzled by this behavior. Do I misunderstand the CUDA specification?
More information about the code: the host and device codes each take ~1 second to complete. They share no data except a few scalar parameters, which are declared with the value attribute.
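One way to narrow this down (a diagnostic sketch only, reusing code_dev, NB, NBT, and the array names from the fragment above; t0, t1, t2 are assumed real timing variables) is to time the launch call and the synchronization separately, rather than the whole iteration:

   ! Diagnostic sketch: if the launch is truly asynchronous,
   ! t1 - t0 should be on the order of microseconds, while
   ! t2 - t1 should be close to the kernel's ~1 s run time.
   call cpu_time(t0)
   call code_dev<<<NB, NBT>>>(N, N0, Xdev, Ydev, …)  ! should return immediately
   call cpu_time(t1)
   istat = cudaDeviceSynchronize()                   ! blocks until the kernel finishes
   call cpu_time(t2)
   print *, 'launch overhead:', t1 - t0, ' kernel wait:', t2 - t1

If t1 - t0 already accounts for the full kernel time, the launch itself is blocking; if it is tiny, the launch is asynchronous and the serialization is happening elsewhere in the loop.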