Distribute copy kernels across multiple SMs

Dear all,

Imagine you have Narr=8 device arrays (e.g. src_1 to src_8) and you want to copy them to Narr=8 other arrays (e.g. targ_1 to targ_8). Assume each array has length 1024. I have written a simple kernel (below) that uses shared memory and ILP (instruction-level parallelism). I launch the kernel like

call copy_shared_mem<<<1, Threads_per_Block, shmem>>> (source_d, target_d, Num_Blocks)

Intuitively, Num_Blocks=4 and Threads_per_Block=256.

The copy bandwidth for a single launch of this kernel on a Kepler K40c device is about 6.1 GB/sec (quite small, because occupancy is low).

Then, I create Nstr=Narr streams to launch each kernel through one stream; hence, I copy Narr arrays “simultaneously”, like the following:

ierr = cudaEventRecord(start, str1)
call copy_shared_mem<<<1, Threads_per_Block, shmem, str1>>> (source_1, target_1, Num_Blocks)
! ... and similarly for streams str2 through str7 ...
call copy_shared_mem<<<1, Threads_per_Block, shmem, str8>>> (source_8, target_8, Num_Blocks)
ierr = cudaEventRecord(finish, str8)
ierr = cudaEventSynchronize(finish)
ierr = cudaEventElapsedTime(delta_t, start, finish)

When I measure the bandwidth, I realize that it is still about 6.1 GB/sec, while I was expecting a factor of Nstr times faster (e.g. 48 GB/sec). Apparently, these asynchronous kernel launches do not distribute across all available SMs; instead, they seem to queue for execution on the same SM, leaving the other available multiprocessors idle.

Is my understanding above correct?

Is there a possibility to distribute/tie asynchronous kernel launches to multiple SMs?

I use PGI 18.4 for CUDA Fortran.

Kind regards,

attributes(global) subroutine copy_shared_mem(src, targ, nb)
    real :: src(:), targ(:)  ! source and target arrays
    integer, value :: nb     ! number of chunks per thread (ILP factor)

    real, shared :: temp(*)  ! temporary container (dynamic shared memory, sized by shmem at launch)
    integer :: i, k, bs

    i = threadIdx%x
    bs = size(src) / nb      ! chunk size
    ! stage each element through shared memory; each thread handles nb elements
    do k = 0, nb - 1
       temp(i + k * bs) = src(i + k * bs)
    end do
    do k = 0, nb - 1
       targ(i + k * bs) = temp(i + k * bs)
    end do
end subroutine copy_shared_mem

Using shared memory isn’t going to help here.

When you launch multiple kernels, they may queue, but not for the purpose of running on the same SM.

Rather than going through the complexity of trying to launch multiple kernels in streams to copy arrays, I would simply write my kernel code so that it could copy multiple arrays, using many blocks, so as to fill the device with a single kernel call.
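One way to follow that suggestion is a grid-stride copy kernel. A minimal sketch, assuming all Narr arrays have been packed into one contiguous device buffer (the kernel name, the packing, and the launch configuration below are my assumptions, not code from this thread):

```fortran
! Copies n elements from src to targ with a grid-stride loop, so a single
! launch with many blocks fills the whole device regardless of how many
! logical arrays the buffer holds.
attributes(global) subroutine copy_many(src, targ, n)
    real :: src(n), targ(n)   ! all arrays packed contiguously
    integer, value :: n       ! total element count across all arrays
    integer :: i

    ! each thread strides through the flattened data
    do i = threadIdx%x + (blockIdx%x - 1) * blockDim%x, n, blockDim%x * gridDim%x
       targ(i) = src(i)
    end do
end subroutine copy_many
```

A launch such as `call copy_many<<<ceiling(real(n)/256), 256>>>(src_d, targ_d, n)` would then work for 5 arrays or 17 arrays alike, with no shared memory and no per-array kernel calls.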

Thanks Bob for your swift reply.

So, as I understand it, it is not possible to distribute one or multiple asynchronous kernel calls among multiple SMs. Can you confirm?

The point is, sometimes you want to copy e.g. 5 arrays, sometimes 17, and so on. Thus it is not best practice to keep writing a new kernel for each case: one that copies 5 arrays, another that copies 17, and so on. Instead, I would rather call my standalone copy kernel as many times as the application needs. That is why I started experimenting with the test above.

Yet, I am a bit disappointed that I cannot have better control over how my kernel launches are distributed on the device, so as to fill up all its resources the way I like, rather than the way the compiler/runtime API decides for me ;-)

The PTX ISA exposes the multiprocessor ID via the special register

.sreg .u32 %smid;

CUDA-C code to access it would look like this:

static __device__ uint get_smid(void) {
    uint ret;
    asm("mov.u32 %0, %%smid;" : "=r"(ret));  // '%' must be escaped as '%%' in inline PTX
    return ret;
}
You could attempt to use PTX inline assembly (if supported by Fortran) to grab the multiprocessor ID and validate your hypothesis.

You could also attempt to launch grids that fill the GPU entirely, but where each grid has only one active block that performs work on a specific multiprocessor (the other blocks exit early). This way you might be able to measure a higher bandwidth and get your blocks to run at the same time.
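The early-exit idea might be sketched like this in CUDA Fortran (the kernel name and the `active` parameter are my inventions for illustration; note that nothing guarantees which SM the surviving block lands on, so this only probes scheduling rather than controlling it):

```fortran
! Launch with many blocks so the grid spans the device, but only the block
! whose index matches 'active' performs the copy; all others return at once.
attributes(global) subroutine copy_one_block(src, targ, active)
    real :: src(:), targ(:)
    integer, value :: active          ! index of the one block that does work

    integer :: i

    if (blockIdx%x /= active) return  ! early exit for every other block
    do i = threadIdx%x, size(src), blockDim%x
       targ(i) = src(i)
    end do
end subroutine copy_one_block
```

Launching this once per array, with a different `active` value in each stream, would let you test whether the working blocks really end up on distinct multiprocessors.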

No, I don’t confirm. I don’t know exactly what is happening in your case; however, I suspect that the kernels are serializing because of kernel launch latency. Unless your arrays are large, it’s possible that a kernel launch completes in the 10-20 microseconds it takes to get the next kernel launched. (With 1024-element single-precision arrays, each kernel reads and writes only about 8 KB in total, which takes well under a microsecond, so launch overhead would dominate.) If you really want to understand what is happening, you should learn to use one of the available GPU profilers.

It is possible for one kernel call to distribute blocks across multiple SMs (that is the main point of the CUDA model), and it is also possible for multiple kernel calls to distribute their blocks. However, actually witnessing kernel concurrency in a real-world application is more difficult than it may seem.

I can write a single kernel that can copy either 5 arrays, or 17 arrays, efficiently. If it is your preference not to do so, you’re welcome to do whatever you wish. Not every possible approach will yield the best performance on a GPU, however.

Spending a lot of time thinking about SMs is a broken way to think about GPU programming, especially for basic tasks. Yes, the GPU has SMs. Don’t worry about that. Write kernels that inherently make efficient use of the GPU, for the work assigned to them.


That would be interesting if I could read your mind more vividly. Would you mind putting a simple code snippet here that copies any number of arrays through a single kernel launch and also fills up the whole device? I would appreciate it.

I checked the generated .ptx file, and there is no mention of a function similar to the one above. I am not sure the PGI CUDA Fortran compiler really supports it.