Limited concurrency

I’m checking the async execution of a simple SAXPY using CUDA Fortran. I got the profiling results shown in the picture, but I was expecting more concurrency between the GPU and the CPU. Can you guys see if I’m missing something?

module mathOps
contains
! SAXPY kernel: y = y + a*x, one element per thread
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
!
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps

program testSaxpy
use mathOps
use cudafor
USE nvtx ! CUDA profiling tools
implicit none
integer, parameter :: N = 400000
real, DIMENSION(:), ALLOCATABLE, pinned :: x, y ! pinned host arrays for async transfers
real :: a
real, device :: x_d(N), y_d(N)
integer :: istat, i
type(dim3) :: grid, tBlock
INTEGER(kind=cuda_stream_kind) :: stream(3), str ! Stream ID
!
allocate(x(N),y(N))
DO i = 1, 3
istat = cudaStreamCreate(stream(i))
IF (istat /= 0) print *, 'Error in stream creation', i
END DO
str = stream(1)
!
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
!
x = 1.0; y = 2.0; a = 2.0
!
CALL nvtxStartRange("cpy",1)
istat = cudaMemcpyAsync(x_d, x, N, str) ! async H2D copies on stream str
istat = cudaMemcpyAsync(y_d, y, N, str)
CALL nvtxEndRange
CALL nvtxStartRange("krn",2)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! async kernel launch on the same stream
CALL nvtxEndRange
CALL nvtxStartRange("cpyb",3)
istat = cudaMemcpyAsync(y, y_d, N, str) ! async D2H copy of the result
CALL nvtxEndRange
CALL nvtxStartRange("cpu",4)
write(*,*) 'Max error: ', maxval(abs(x-1)) ! independent CPU work while the GPU runs
CALL nvtxEndRange
end program testSaxpy

Please provide more details on what type of concurrency you were expecting.

The CPU launches the cudaMemcpyAsync calls, the kernel, and the final cudaMemcpyAsync. The GPU may be slightly stalled on the kernel launch, as you can see a gap of ~40 µs from the CPU launch to the start of the kernel.

The kernel computation itself takes 7 µs, but the call that launches the kernel takes much longer. The yellow marker indicates the async kernel launch took 53 µs.

The markers show the time spent on the CPU, and Context 1 (CUDA) shows the time on the GPU. Here are the values:

CPU NVTX markers   |  GPU Context 1
cpy  -> 42 us      |  H2D    -> 43 us
krn  -> 53 us      |  kernel ->  7 us
cpyb -> 15 us      |  D2H    -> 44 us

I was expecting to spend only a few dozen nanoseconds on the CPU, since I was using async calls. But I spend as much time on the CPU as the GPU spends computing.
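To make the expectation concrete, this is roughly the pattern I had in mind. It is only a sketch, reusing x, y, x_d, y_d, a, N, str, grid and tBlock from the program above; the synchronization point and the y check are additions for illustration:

istat = cudaMemcpyAsync(x_d, x, N, str)           ! returns after enqueueing the H2D copy
istat = cudaMemcpyAsync(y_d, y, N, str)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! async kernel launch
istat = cudaMemcpyAsync(y, y_d, N, str)           ! async D2H copy of the result
!
! independent CPU work runs here while the GPU drains the stream
write(*,*) 'CPU check on x: ', maxval(abs(x - 1.0))
!
! the CPU only blocks here, until the copies and the kernel have finished;
! after this point y is safe to read on the host
istat = cudaStreamSynchronize(str)
write(*,*) 'Max error in y: ', maxval(abs(y - 4.0))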

At first glance I thought the bad async times were related to a GPU context change, but if I double the workload (N=800000) I get even worse results.

CPU NVTX markers   |  GPU Context 1
cpy  -> 2450 us    |  H2D    -> 110 us
krn  ->   52 us    |  kernel ->  11 us
cpyb ->   15 us    |  D2H    ->  82 us

Just an update: for N=800000, the huge time increase is related to the initialization of x. The new numbers are:

CPU NVTX markers   |  GPU Context 1
cpy  -> 41 us      |  H2D    -> 110 us
krn  -> 52 us      |  kernel ->  11 us
cpyb -> 15 us      |  D2H    ->  82 us

There is more overlap in this case.

In that case, I can reformulate the question: what determines how long the CPU is blocked in an async CUDA/GPU call?
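One way to see that number without the profiler is to time each call on the host and check how long it takes to return. A rough sketch, reusing the names from the program above; t0, t1 and rate are new helper variables:

integer(kind=8) :: t0, t1, rate
!
call system_clock(count_rate=rate)
!
call system_clock(t0)
istat = cudaMemcpyAsync(x_d, x, N, str)           ! time until the async call returns
call system_clock(t1)
write(*,*) 'H2D call, CPU side (us): ', real(t1 - t0) / real(rate) * 1.0e6
!
call system_clock(t0)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! time until the launch returns
call system_clock(t1)
write(*,*) 'kernel launch, CPU side (us): ', real(t1 - t0) / real(rate) * 1.0e6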

CUPTI is likely adding several µs to the NVTX ranges and CUDA API calls. The driver has a number of operations to perform on a memory copy that make it take much longer than “dozens of nanoseconds”. The driver has to:

  • verify the parameters
  • determine properties of the source address (context, type of memory)
  • determine properties of the destination address (context, type of memory)
  • determine optimal transfer path for (src, dst, size vs. device cfg)
  • issue commands (via a HAL)

I would have expected the overhead to be closer to 10-20 µs for the memory copy. I do not know how much of the difference between the NVTX range and the first copy is due to the trace tool.
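One way to separate the trace-tool overhead from the actual transfer time is to time the GPU side with CUDA events recorded in the same stream, independent of the profiler. A minimal sketch, assuming the same x, x_d, N and str as in the original program; startEv, stopEv and ms are new helpers:

type(cudaEvent) :: startEv, stopEv
real :: ms
!
istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)
!
istat = cudaEventRecord(startEv, str)    ! marker in the stream before the copy
istat = cudaMemcpyAsync(x_d, x, N, str)
istat = cudaEventRecord(stopEv, str)     ! marker in the stream after the copy
!
istat = cudaEventSynchronize(stopEv)     ! CPU waits until the copy has finished
istat = cudaEventElapsedTime(ms, startEv, stopEv)
write(*,*) 'GPU-side H2D time (ms): ', ms
!
istat = cudaEventDestroy(startEv)
istat = cudaEventDestroy(stopEv)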

Thank you, Greg, for the information. I have larger workloads to check, where the async approach should offer more of an advantage.
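For those larger runs, the idea is to split the arrays into chunks and cycle them over the streams that are already created, so the transfers of one chunk can overlap the kernel and copies of another. A rough sketch only, assuming N is a multiple of the number of streams and that the compiler accepts contiguous device array sections as kernel arguments; nStreams, chunk, offset and j are new helper variables:

integer, parameter :: nStreams = 3
integer :: chunk, offset, j
!
chunk = N / nStreams                      ! assumes N is divisible by nStreams
!
do j = 1, nStreams
   offset = (j - 1) * chunk
   ! enqueue this chunk's H2D copies, kernel and D2H copy on its own stream
   istat = cudaMemcpyAsync(x_d(offset+1), x(offset+1), chunk, stream(j))
   istat = cudaMemcpyAsync(y_d(offset+1), y(offset+1), chunk, stream(j))
   call saxpy<<<ceiling(real(chunk)/tBlock%x), tBlock, 0, stream(j)>>> &
        (x_d(offset+1:offset+chunk), y_d(offset+1:offset+chunk), a)
   istat = cudaMemcpyAsync(y(offset+1), y_d(offset+1), chunk, stream(j))
end do
!
! wait for all chunks before using y on the host
do j = 1, nStreams
   istat = cudaStreamSynchronize(stream(j))
end do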