Kernel uses extra 16 bytes shared mem, no kernel arguments

I have a kernel using 2 shared arrays that have a total size of 4096 bytes, which should allow 4 blocks per multiprocessor (I’m using a Tesla C1060, 16384 shared mem, max 1024 threads per sm). However, the profiler tells me I’m using 4112 bytes of shared memory per block, and thus, I’m only getting 3 blocks per multiprocessor. My blocks have 256 threads so I’m only getting 75% occupancy.

Originally, the profiler said I was using 4144 bytes of shared memory per block, but once I learned that kernel arguments are passed using shared memory, I changed the code so that no arguments are passed – I declared my device arrays at the top of the module that contains both the kernel and the routine that calls it, instead of passing them as arguments.

Is there anything else that uses shared memory besides variables I declare as “shared”? Where is the extra 16 bytes of shared memory coming from?

Are the built-in variables like blockID and blockdim stored in shared memory? Where is threadID stored?

Thanks!

Hi JDS7,

Most likely the extra bytes are holding the F90 Array descriptors. Are you using allocatable arrays?

  • Mat

Hi Mat,

I’m not using allocatable device arrays. The host arrays are allocatable.
My code looks something like this:

module block_and_array_sizes
! (contains parameters for block size and array size)
end module


module mod
  use block_and_array_sizes

  ! declare constants
  ! declare device arrays (sizes set by parameter values)

  contains


  attributes(global) subroutine kernel()

    ! declare two shared arrays (sizes set by parameter values)
    
    ! copy subset of data from device arrays to shared arrays
    ! copy one piece of data from device array to local variable (should be in a register now, right?)

    ! perform calculation

    ! copy result back to device arrays

  end subroutine kernel


  subroutine I_call_the_kernel(host arrays and other parameters)

    ! copy data from host arrays to device arrays
    ! copy constant variables to device constants

    call kernel<<<dimGrid>>>()


    ! copy data from device arrays to host arrays

  end subroutine I_call_the_kernel(
end module


Program MAIN
use mod

  ! create and initialize host arrays

  call I_cal_the_kernel(host arrays and other parameters)

Is there any way to stop using the extra 16 bytes of shared memory per block, without changing the block size? I’d really like to get 4 blocks on each SM.

Thanks

Jim

Hi Jim,

I asked our engineers about this. It this case, it’s not us adding the extra few bytes but rather the system. They said that you should assume that there will always be a small amount of overhead.

Sorry,
Mat

Thanks for looking into it Mat.

Do you know what the system is actually doing with that overhead?

Hi Jim,

Do you know what the system is actually doing with that overhead?

I do not, but I sent a note to one of my contacts at NVIDIA to see if he knows. I’ll post once I hear back from him.

  • Mat

Hi Jim,

Apparently on the C1060, the predefined variables, such as blockidx, blockdim, etc., along with the dummy arguments, are also put into shared memory. This most likely accounts for the extra bytes.

  • Mat

That’s good to know. Thanks Mat.