Recover shared memory used for parameter passing

I’m developing an application that would optimally use all 16K of shared memory for data, but since CUDA passes global function arguments via the shared memory, the full 16K isn’t available.

Is there any way to recover the shared memory used for argument passing after copying the arguments to thread registers?

How about if you simply passed all parameters via constant memory? Then your kernel will have no arguments.

That sounds like a good option.

Is there a performance penalty in writing the values to constant memory and accessing them vs. CUDA putting the parameter values in shared memory before the kernel starts? If so, what would the penalty be?

You can’t use 16k shared mem even if you don’t have parameter. Some is reserved for blockIdx, threadIdx and stuff
There shouldn’t be much penalty of passing via const unless there’re divergently-read look-up tables.

asadafag: I’m not entirely sure yet if these count towards the 16k limit or are in their own private memory space

CUDA adds a 16-byte shared memory overhead to all kernels. But I’m not sure what it contains. My guess is that threadIdx, blockIdx and other such variables are stored in registers, not shared memory.

In disassembled cubin it appears those parameters (except for threadIdx, which is passed in register r0) are in a separately addressed piece of shared memory, which is read-only. It is completely separate from the rw memory we call ‘shared memory’

0x0: "%gridflags", # lower u16 is gridid

0x1: "%ntid.x",    # checked

0x2: "%ntid.y",

0x3: "%ntid.z",

0x4: "%nctaid.x",

0x5: "%nctaid.y",

0x6: "%ctaid.x",

0x7: "%ctaid.y",

0x8: "%ctaid.z",  # extrapolated

0x9: "%nctaid.y",  # ptx ISA

Parameters start at offset 0x10 (4*0x4) of the writable shared memory area. Other declared shared memory variables are immediatly after that. As to what is at offset 0x00 I don’t know. These are indeed 16 bytes overhead. I should try reading them out some time and see if they contain the same as the %blockIdx registers.

I once tried to use 16k for scan, and it indeed failed. A simple test confirmed it:

#include <stdio.h>

__global__ void aaa(int *a){

	a[0]=7777;

}

int main(){

	int *a,b;

	cudaMalloc((void**)&a,4);

	cudaMemset(a,0,4);

	aaa<<<1,1,16384>>>(a);

	cudaMemcpy(&b,a,4,cudaMemcpyDeviceToHost);

	printf("%d\n",b);

	return 0;

}

The kernel launch failed.

Fascinating. What variables are in 0xa, 0xb, …, 0xf?

Seems I was completely wrong about my separate registers, the block parameters are in shared memory just like anything else, and thereby, writable too:

#include <cuda_runtime_api.h>

#include <cuda.h>

#include <algorithm>

__global__ void my_kernel(uint16_t *data)

{

extern __shared__ uint16_t x[];

    x[-16] = 0x1234;

    __syncthreads();

    data[0] = x[-16]; // %gridflags

    data[1] = x[-15]; // %ntid.x

    data[2] = x[-14]; // %ntid.y

    data[3] = x[-13]; // %ntid.z

    data[4] = x[-12]; // %nctaid.x

    data[5] = x[-11]; // %nctaid.y

    data[6] = x[-10]; // %ctaid.x

    data[7] = x[-9];  // %ctaid.y

}

int main()

{

    int width = 8;

    int size = width*4;

   uint16_t *data, *gdata;

   cudaMalloc((void**)&gdata, size);

    data = (uint16_t*)malloc(size);  

   for(int x=0; x<8; ++x)

    {

        dim3 block_size(1,2,1);

        dim3 grid_size(8,8,1); 

        int shared_size = 0;  

       my_kernel<<<grid_size, block_size, shared_size>>>(gdata);

       cudaMemcpy((void*)data, (void*)gdata, size, cudaMemcpyDeviceToHost);

       for(int x=0; x<width; ++x)

            printf("%04x ", data[x]);

        printf("\n");

    }

   return 0;

}

So… yes, the first 16 (8*2) bytes of shared memory are the block parameters, and you can never get rid of that overhead. You can overwrite them though if you feel really lucky :P (by indexing into negative shared memory)

Parameter 0x9, 0xA, 0xB etc don’t exist, those would be the normal (user parameters).

Edit: I updated decude to reflect accordingly

Hihi, nice one :w00t:

Peter

I played around with this some and I don’t think these are missing parameters. I think that the compiler puts the dynamic shared memory after the kernel parameters and then aligns it to the next 16-byte boundary. Since you have one parameter, it skips three words and then starts the dynamic shared segment.