Limited concurrency

I’m checking the async execution of a simple SAXPY using CUDA Fortran. I got the profiling results shown in the picture, but I was expecting more concurrency between the GPU and the CPU. Can you guys see if I’m missing something?

module mathOps
contains
! SAXPY kernel: y = y + a*x, one element per thread
attributes(global) subroutine saxpy(x, y, a)
implicit none
real :: x(:), y(:)
real, value :: a
integer :: i, n
!
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
end module mathOps

program testSaxpy
use mathOps
use cudafor
USE nvtx ! CUDA profiling tools
implicit none
integer, parameter :: N = 400000
real, DIMENSION(:), ALLOCATABLE, pinned :: x, y ! pinned host arrays for async transfers
real :: a
real, device :: x_d(N), y_d(N)
integer :: istat, i
type(dim3) :: grid, tBlock
INTEGER(kind=cuda_stream_kind) :: stream(3), str ! Stream ID
!
allocate(x(N),y(N))
DO i = 1, 3
istat = cudaStreamCreate(stream(i))
IF (istat /= 0) print *, 'Error in stream creation', i
END DO
str = stream(1)
!
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(N)/tBlock%x),1,1)
!
x = 1.0; y = 2.0; a = 2.0
!
CALL nvtxStartRange("cpy",1)
istat = cudaMemcpyAsync(x_d, x, N, str) ! async H2D copies on stream str
istat = cudaMemcpyAsync(y_d, y, N, str)
CALL nvtxEndRange
CALL nvtxStartRange("krn",2)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! async kernel launch on the same stream
CALL nvtxEndRange
CALL nvtxStartRange("cpyb",3)
istat = cudaMemcpyAsync(y, y_d, N, str) ! async D2H copy of the result
CALL nvtxEndRange
CALL nvtxStartRange("cpu",4)
write(*,*) 'Max error: ', maxval(abs(x-1)) ! independent CPU work while the GPU runs
CALL nvtxEndRange
end program testSaxpy

Please provide more details on what type of concurrency you were expecting.

The CPU launches the cudaMemcpyAsync calls, the kernel, and the final cudaMemcpyAsync. The GPU may be slightly stalled on the kernel launch, as you can see a gap of ~40 µs from the CPU launch to the start of the kernel.

The kernel computation itself takes 7 µs, but the call that launches the kernel takes much longer. The yellow marker indicates the async kernel launch took 53 µs.

The markers show the time spent on the CPU, and Context 1 (CUDA) shows the time on the GPU. Here are the values:

CPU NVTX markers   |  GPU Context 1
cpy  -> 42 us      |  H2D    -> 43 us
krn  -> 53 us      |  kernel ->  7 us
cpyb -> 15 us      |  D2H    -> 44 us

I was expecting to spend only a few dozen nanoseconds on the CPU, since I was using async calls. But I spend as much time on the CPU as the GPU spends computing.
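To make the expectation concrete, this is roughly the pattern I had in mind. It is only a sketch, reusing x, y, x_d, y_d, a, N, str, grid and tBlock from the program above; the synchronization point and the y check are additions for illustration:

istat = cudaMemcpyAsync(x_d, x, N, str)           ! returns after enqueueing the H2D copy
istat = cudaMemcpyAsync(y_d, y, N, str)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! async kernel launch
istat = cudaMemcpyAsync(y, y_d, N, str)           ! async D2H copy of the result
!
! independent CPU work runs here while the GPU drains the stream
write(*,*) 'CPU check on x: ', maxval(abs(x - 1.0))
!
! the CPU only blocks here, until the copies and the kernel have finished;
! after this point y is safe to read on the host
istat = cudaStreamSynchronize(str)
write(*,*) 'Max error in y: ', maxval(abs(y - 4.0))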

At first glance I thought the bad async times were related to a GPU context change, but if I double the workload (N=800000) I get even worse results.

CPU NVTX markers   |  GPU Context 1
cpy  -> 2450 us    |  H2D    -> 110 us
krn  ->   52 us    |  kernel ->  11 us
cpyb ->   15 us    |  D2H    ->  82 us

Just an update: for N=800000, the huge time increase is related to the initialization of x. The new numbers are:

CPU NVTX markers   |  GPU Context 1
cpy  -> 41 us      |  H2D    -> 110 us
krn  -> 52 us      |  kernel ->  11 us
cpyb -> 15 us      |  D2H    ->  82 us

There is more overlap in this case.

In that case, I can reformulate the question: what determines how long the CPU is blocked in an async CUDA/GPU call?
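One way to see that number without the profiler is to time each call on the host and check how long it takes to return. A rough sketch, reusing the names from the program above; t0, t1 and rate are new helper variables:

integer(kind=8) :: t0, t1, rate
!
call system_clock(count_rate=rate)
!
call system_clock(t0)
istat = cudaMemcpyAsync(x_d, x, N, str)           ! time until the async call returns
call system_clock(t1)
write(*,*) 'H2D call, CPU side (us): ', real(t1 - t0) / real(rate) * 1.0e6
!
call system_clock(t0)
call saxpy<<<grid, tBlock, 0, str>>>(x_d, y_d, a) ! time until the launch returns
call system_clock(t1)
write(*,*) 'kernel launch, CPU side (us): ', real(t1 - t0) / real(rate) * 1.0e6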

CUPTI is likely adding several µs to the NVTX ranges and CUDA API calls. The driver has a number of operations to perform on a memory copy that make it take much longer than “dozens of nanoseconds”. The driver has to:

  • verify the parameters
  • determine properties of the source address (context, type of memory)
  • determine properties of the destination address (context, type of memory)
  • determine optimal transfer path for (src, dst, size vs. device cfg)
  • issue commands (via a HAL)

I would have expected the overhead to be closer to 10-20 µs for the memory copy. I do not know how much of the difference between the NVTX range and the first copy is due to the trace tool.
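One way to separate the trace-tool overhead from the actual transfer time is to time the GPU side with CUDA events recorded in the same stream, independent of the profiler. A minimal sketch, assuming the same x, x_d, N and str as in the original program; startEv, stopEv and ms are new helpers:

type(cudaEvent) :: startEv, stopEv
real :: ms
!
istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)
!
istat = cudaEventRecord(startEv, str)    ! marker in the stream before the copy
istat = cudaMemcpyAsync(x_d, x, N, str)
istat = cudaEventRecord(stopEv, str)     ! marker in the stream after the copy
!
istat = cudaEventSynchronize(stopEv)     ! CPU waits until the copy has finished
istat = cudaEventElapsedTime(ms, startEv, stopEv)
write(*,*) 'GPU-side H2D time (ms): ', ms
!
istat = cudaEventDestroy(startEv)
istat = cudaEventDestroy(stopEv)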

Thank you, Greg, for the information. I have larger workloads to check, where the async approach should offer more of an advantage.
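For those larger runs, the idea is to split the arrays into chunks and cycle them over the streams that are already created, so the transfers of one chunk can overlap the kernel and copies of another. A rough sketch only, assuming N is a multiple of the number of streams and that the compiler accepts contiguous device array sections as kernel arguments; nStreams, chunk, offset and j are new helper variables:

integer, parameter :: nStreams = 3
integer :: chunk, offset, j
!
chunk = N / nStreams                      ! assumes N is divisible by nStreams
!
do j = 1, nStreams
   offset = (j - 1) * chunk
   ! enqueue this chunk's H2D copies, kernel and D2H copy on its own stream
   istat = cudaMemcpyAsync(x_d(offset+1), x(offset+1), chunk, stream(j))
   istat = cudaMemcpyAsync(y_d(offset+1), y(offset+1), chunk, stream(j))
   call saxpy<<<ceiling(real(chunk)/tBlock%x), tBlock, 0, stream(j)>>> &
        (x_d(offset+1:offset+chunk), y_d(offset+1:offset+chunk), a)
   istat = cudaMemcpyAsync(y(offset+1), y_d(offset+1), chunk, stream(j))
end do
!
! wait for all chunks before using y on the host
do j = 1, nStreams
   istat = cudaStreamSynchronize(stream(j))
end do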