Imagine you have Narr=8 device arrays (source_1 to source_8) that you want to copy into Narr=8 other arrays (target_1 to target_8), each of length 1024. I have written a simple kernel (below) that uses shared memory and ILP (instruction-level parallelism). I launch the kernel like this:
call copy_shared_mem<<<1, Threads_per_Block, shmem>>> (source_d, target_d, Num_Blocks)
Here, Num_Blocks=4 and Threads_per_Block=256, so a single thread block of 256 threads covers all 1024 elements, with each thread handling Num_Blocks=4 of them.
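For reference, a minimal sketch of a matching host-side setup (illustrative only; it assumes 4-byte reals and reuses the names from the snippets above, and since the kernel stages the whole array in shared memory, shmem must cover Threads_per_Block * Num_Blocks reals):

use cudafor
real, device, allocatable :: source_d(:), target_d(:)
type(cudaEvent) :: start, finish
integer :: Num_Blocks, Threads_per_Block, shmem, ierr

Num_Blocks = 4
Threads_per_Block = 256
allocate(source_d(1024), target_d(1024))
shmem = Threads_per_Block * Num_Blocks * 4   ! bytes for the 1024-real staging buffer
ierr = cudaEventCreate(start)
ierr = cudaEventCreate(finish)
call copy_shared_mem<<<1, Threads_per_Block, shmem>>> (source_d, target_d, Num_Blocks)
ierr = cudaDeviceSynchronize()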
The copy bandwidth for a single launch of this kernel on a Kepler K40c device is about 6.1 GB/sec (quite low, since the single thread block runs on just one SM, so occupancy is poor).
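To put that number in perspective: a single launch reads and writes 1024 reals, i.e. it moves 2 * 1024 * 4 B = 8 KiB, so 6.1 GB/sec corresponds to an elapsed time of roughly 8 KiB / 6.1 GB/s, about 1.3 microseconds per launch.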
Then I create Nstr=Narr streams and launch the kernel once per stream, so that all Narr arrays are copied “simultaneously”, like the following:
ierr = cudaEventRecord(start, str1)
call copy_shared_mem<<<1, Threads_per_Block, shmem, str1>>> (source_1, target_1, Num_Blocks)
...
call copy_shared_mem<<<1, Threads_per_Block, shmem, str8>>> (source_8, target_8, Num_Blocks)
ierr = cudaEventRecord(finish, str8)
ierr = cudaEventSynchronize(finish)
ierr = cudaEventElapsedTime(delta_t, start, finish)
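One caveat with this measurement: recording finish in str8 only guarantees that the work queued in str8 has completed when cudaEventSynchronize(finish) returns; kernels still running in str1 through str7 are not covered by that event. A more conservative variant might look like this (a sketch only; the streams(:) array and the 2D source_d/target_d layout, with the eight vectors as columns, are illustrative stand-ins for str1..str8 and the individual arrays):

integer(kind=cuda_stream_kind) :: streams(8)
integer :: s

do s = 1, Nstr
   ierr = cudaStreamCreate(streams(s))
enddo
ierr = cudaEventRecord(start, 0)
do s = 1, Nstr
   call copy_shared_mem<<<1, Threads_per_Block, shmem, streams(s)>>> &
        (source_d(:, s), target_d(:, s), Num_Blocks)
enddo
ierr = cudaDeviceSynchronize()   ! wait for all streams, not just the last one
ierr = cudaEventRecord(finish, 0)
ierr = cudaEventSynchronize(finish)
ierr = cudaEventElapsedTime(delta_t, start, finish)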
When I measure the bandwidth, I find it is still about 6.1 GB/sec, whereas I was expecting it to be roughly Nstr times higher (about 48 GB/sec). Apparently, launching the kernels asynchronously does not distribute them across all available SMs; instead, they seem to queue up for execution on the same SM, leaving the other multiprocessors idle.

Is my understanding of this behavior correct?

Is there a way to distribute/tie asynchronous kernel launches to multiple SMs?
I use PGI/18.4 for CUDA Fortran.
attributes(global) subroutine copy_shared_mem(src, targ, nb)
   real :: src(:), targ(:)    ! source and target arrays
   integer, value :: nb       ! number of array sub-blocks per thread (the ILP factor)
   real, shared :: temp(*)    ! staging buffer; its size in bytes is the shmem launch argument
   integer :: i, k, bs

   i  = threadIdx%x
   bs = size(src) / nb        ! sub-block size (equals blockDim%x here)

   ! stage nb elements per thread into shared memory, each in its own slot
   do k = 0, nb - 1
      temp(i + k * bs) = src(i + k * bs)
   enddo
   ! then copy the staged elements out to the target array
   do k = 0, nb - 1
      targ(i + k * bs) = temp(i + k * bs)
   enddo
end subroutine copy_shared_mem
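For concreteness: with Threads_per_Block = 256 and nb = Num_Blocks = 4, we get bs = 1024 / 4 = 256, so thread i touches elements i, i + 256, i + 512 and i + 768 of its array. The staging buffer therefore has to hold all 1024 reals, which is exactly what the shmem launch argument (Threads_per_Block * Num_Blocks * 4 bytes) provides.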