Querying amount of shared memory allocated

Hello,

Background:

One can dynamically allocate shared memory on the device from host code at kernel launch time, for example by providing the third argument to the kernel launch statement:

myKernel<<<grid, threads, sizeOfSharedMemMyKernelExpects>>>(param1, param2, param3, param4);

where sizeOfSharedMemMyKernelExpects is the number I’m writing about.

This shared memory is available in the device code as an array, declared along the lines of

extern __shared__ char sharedMem[];
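
For concreteness, here is the whole pattern in a minimal sketch (the kernel body and names are just illustrative):

__global__ void myKernel(float *out, int n)
{
    // The unsized extern declaration maps onto the dynamically
    // allocated region supplied at launch time.
    extern __shared__ float workspace[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    workspace[threadIdx.x] = (i < n) ? out[i] : 0.0f;   // stage through shared memory
    __syncthreads();
    if (i < n) out[i] = workspace[threadIdx.x] * 2.0f;
}

// Host side: the third launch parameter sizes the region per block, e.g.
// myKernel<<<grid, threads, threads.x * sizeof(float)>>>(d_out, n);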

Question:

Is there a way to QUERY for the sizeOfSharedMemMyKernelExpects parameter in the device code? I would expect something similar to how one can query the number of threads (the blockDim built-in), or some kind of CUDA function.

Specifically, my kernel requires a certain amount of dynamic shared memory (workspace) to run, depending on the parameters it has been invoked with (param1, param2, … in the above example). This amount depends non-trivially on the parameters. I would like to incorporate assert-like code into the kernel that complains when the amount of dynamically allocated shared memory is insufficient.

Thank you!

Unfortunately there isn’t (see Section 4.2.4).

You can get the same behavior (that is, without simply using a kernel parameter) by creating a global constant variable and calling cudaMemcpyToSymbol() right before your kernel call. See the end of Section 4.5.2.3.
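
A rough sketch of what I mean (the variable, kernel, and function names are just placeholders):

__constant__ size_t dynSmemBytes;   // dynamic shared-memory size, readable on the device

__global__ void myKernel(int param1, int param2)
{
    extern __shared__ char sharedMem[];
    // The kernel can now compare dynSmemBytes against the amount it
    // actually needs for these parameters.
}

void launchMyKernel(dim3 grid, dim3 threads, int param1, int param2)
{
    size_t smemSize = 1024;  // in reality, computed from param1, param2, ...
    cudaMemcpyToSymbol(dynSmemBytes, &smemSize, sizeof(smemSize));
    myKernel<<<grid, threads, smemSize>>>(param1, param2);
}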

This is a very, very valid question… However, I don’t think there is any API to do this.

NVIDIA guys could look into it.

Why? The only way I can think of for us to implement it would be to just pass a parameter as well and then hide it from the programmer, so it probably makes more sense to do it yourself.
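
For example (a sketch; requiredSmem() and the error flag are hypothetical names):

// Sizing function: the required workspace depends non-trivially on n.
__host__ __device__ size_t requiredSmem(int n)
{
    return (size_t)n * n * sizeof(float);
}

__global__ void myKernel(int n, size_t smemBytes, int *errFlag)
{
    extern __shared__ char sharedMem[];
    if (smemBytes < requiredSmem(n)) {
        if (threadIdx.x == 0)
            *errFlag = 1;   // complain via a flag the host can check
        return;             // bail out rather than read past the allocation
    }
    // ... normal kernel work using sharedMem ...
}

// Launch, passing the size both as the launch parameter and as an argument:
// size_t smemSize = requiredSmem(n);
// myKernel<<<grid, threads, smemSize>>>(n, smemSize, d_errFlag);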

Oh yeah… That’s the simplest way… Didn’t strike me at all…

You could make that exact argument for blockDim, gridDim, and warpSize.

I could, but blockDim/gridDim are fundamental to any algorithm. Dynamic shmem allocation isn’t, and in a lot of cases such a built-in wouldn’t be doing anything.

Alex, thank you for a very detailed answer!

Frankly, I don’t quite understand tmurray’s argument. This is probably because I’m not very familiar with CUDA’s internals.

I would imagine that the host code needs to communicate to the device the amount of shared memory to be dynamically allocated at kernel launch. This amount is zero in most cases (according to tmurray), but still…

If the above statement is true, then I don’t see why CUDA can’t expose this value through a device-side function. I don’t think this function needs to be extremely efficient: if it retrieves the relevant variable from constant memory (as Alex suggested earlier in this thread), that would probably be sufficiently fast.

Thanks again for your input!

Again, why? If it’s 0 in 95% of cases (in my experience, it is), there would be additional latency on every kernel launch from copying that one parameter into constant memory. In other words, there would be a performance hit even though the feature does nothing most of the time. Why is this a good thing?