One can dynamically allocate shared memory on the device from host code at kernel launch time, by means of, e.g., providing the third argument to the kernel launch statement, as in:
where sizeOfSharedMemMyKernelExpects is the number I’m writing about.
This shared memory is available as an array, similar to
extern char sharedMem;
in the device code.
Question:
Is there a way to QUERY for sizeOfSharedMemMyKernelExpects parameter in the device code? I would expect smth. similar to how one can query for the number of threads (blockDim built-in), or some kind of a cuda function.
Specifically, my kernel requires a certain amount of dynamic shared memory (workspace) to run, depending on parameters it has been invoked with (param1, param2, … in the above example). This amount depends non-trivially on the parameters. I would like to incorporate an assert kind of code into the kernel, that complains when the amount of shared memory allocated dynamically is insufficient.
You can get the same behavior (that is, without simply using a parameter) by creating a global constant variable, and calling cudaMemcpyToSymbol() right before your kernel call. See the end of Section 4.5.2.3
Why? The only way I can think for us to implement it would to just pass a parameter as well and then hide it from the programmer, so it probably makes more sense to do it yourself.
Frankly, I don’t quite understand tmurray’s argument. Probably, this is because I’m not very familiar with Cuda’s internals.
I would imagine, that the host code needs to communicate to the device the amount of the shared memory to be dynamically allocated at kernel launch. This amount is zero in most cases (according to tmurray), but still…
If the above statement is true, then I don’t see why Cuda can’t expose this value through a device-space function. I don’t think this function needs to be extremely efficient: if the function retrieves the relevant variable from constant memory (as Alex suggested earlier in this thread), it would probably be sufficiently fast.
Again, why? If it’s 0 in 95% of cases (in my experience, this is the case), there would be additional latency on kernel launches from copying that one parameter into constant memory. In other words, there would be a performance hit from this when it does nothing most of the time. Why is this a good thing?