Copy data into shared memory

Hi,
I was thinking it would be really useful to be able to copy something into shared memory before a kernel executes, for example from the CPU or from another kernel,
to avoid performing a global store/load (gst/gld) when I know I will need those values right afterwards.

Is there a way to allocate and copy data into a “static” portion of shared memory, so that it doesn’t get deleted when I launch a new kernel?

Thanks!

To me it sounds like what you actually want is constant memory. Constant data is preserved across kernel invocations and is as fast as shared memory if your threads read it in a coherent way (ideally all threads of a warp reading the same address).
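For reference, here is a minimal sketch of the constant memory suggestion. The names `coeffs`, `applyPoly` and `evalOnDevice` are made up for illustration, the table size is arbitrary, and error checking is left out to keep it short. It only shows the general pattern: the host fills a `__constant__` array with `cudaMemcpyToSymbol`, and every thread then reads the same entries.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical coefficient table; name and size are illustrative only.
__constant__ float coeffs[4];

// All threads read the same coeffs[k] at the same time, which is the
// access pattern the constant cache is designed for.
__global__ void applyPoly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner form of c0 + c1*v + c2*v^2 + c3*v^3
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Host helper (hypothetical): uploads the coefficients and runs the kernel.
void evalOnDevice(const float *hx, float *hy, int n, const float *hc) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // Constant memory is written from the host through the runtime API.
    cudaMemcpyToSymbol(coeffs, hc, 4 * sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    applyPoly<<<blocks, threads>>>(dx, dy, n);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}

int main() {
    const int n = 8;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) hx[i] = (float)i;
    const float hc[4] = {1.0f, 2.0f, 0.0f, 0.0f};  // y = 1 + 2x

    evalOnDevice(hx, hy, n, hc);
    for (int i = 0; i < n; ++i) printf("%g -> %g\n", hx[i], hy[i]);
    return 0;
}
```

The values written with `cudaMemcpyToSymbol` stay in place for later kernel launches in the same program until the host overwrites them.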

Hmm, yes, but only partially:

"constant variables cannot be assigned to from the device, only from the host through host runtime functions"

So I can’t use it to store the results of one kernel for another, which could be really interesting performance-wise.
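One possible workaround, sketched here under assumptions: a kernel cannot assign to a `__constant__` variable, but the host can ask the runtime to copy a device buffer into it with `cudaMemcpyToSymbol` and `cudaMemcpyDeviceToDevice`, so the first kernel's results never travel through host memory. The names `params`, `produce`, `consume` and `chainThroughConstant` are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // illustrative number of values passed between the kernels

// Hypothetical read-only input of the second kernel.
__constant__ float params[N];

// First kernel: writes its results to ordinary global memory.
__global__ void produce(float *out) {
    int i = threadIdx.x;
    if (i < N) out[i] = 10.0f * (i + 1);  // 10, 20, 30, 40
}

// Second kernel: reads the first kernel's results from constant memory.
__global__ void consume(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = 0.0f;
        for (int k = 0; k < N; ++k) s += params[k];  // same address for all threads
        y[i] = s + (float)i;
    }
}

// Host helper (hypothetical): runs both kernels back to back.
void chainThroughConstant(float *hy, int n) {
    float *dTmp, *dy;
    cudaMalloc(&dTmp, N * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    produce<<<1, N>>>(dTmp);

    // Device-to-device copy into the constant symbol, issued from the host.
    cudaMemcpyToSymbol(params, dTmp, N * sizeof(float), 0,
                       cudaMemcpyDeviceToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    consume<<<blocks, threads>>>(dy, n);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTmp);
    cudaFree(dy);
}

int main() {
    const int n = 8;
    float hy[n];
    chainThroughConstant(hy, n);
    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, hy[i]);
    return 0;
}
```

This still costs one extra copy between the two launches, so whether it beats simply reading the results from global memory would have to be measured.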

Also, what is the size of constant memory? I can’t find it anywhere.

What you could try is reading uninitialized shared memory inside a kernel, after running another kernel that wrote into that shared memory area.
I doubt the whole shared memory block is cleared before each kernel invocation, so this might actually work if you keep the shared memory size per thread block the same (and the same set of input arguments). Also, you have no guarantee that all of shared memory will be touched by your ‘writing’ kernel, so you’d probably want to take that into account.

If you do experiment, write about your findings here.

The size of constant memory is 64 KB.
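If you would rather check the figure on your own card than take it from a post, the runtime reports it. A small sketch, assuming device 0 is the one you care about (`constantMemoryBytes` is just a helper name made up here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the constant memory size of a device in bytes,
// or 0 if the query fails.
size_t constantMemoryBytes(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.totalConstMem;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }
    printf("constant memory:         %zu bytes\n", prop.totalConstMem);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```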

Also, you need exclusive access to the device to make that work, since your algorithm most probably won’t like it if someone else kicks in and uses the device, for instance to render a GUI.

Yes, I think this would really be a hack. Even if I could get it to work, it surely wouldn’t work reliably, so I don’t think I will even try.

BTW, it would be useful if CUDA allowed for a “kernel chain” where each kernel keeps the shared memory of the preceding one… but maybe it would have too many limitations (e.g. the number of blocks).
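The closest thing available today is probably to merge the two kernels into one, so that the second stage finds the first stage's results still sitting in shared memory. A rough sketch, with invented names (`fused`, `runFused`, `BLOCK`) and no error checking. It assumes the second stage only needs values produced by its own block, which is exactly the kind of limitation mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128  // threads per block; must match the launch configuration

__global__ void fused(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Stage 1 (what the first kernel would have done): square the input
    // and keep the result in shared memory instead of storing it globally.
    tile[t] = (i < n) ? in[i] * in[i] : 0.0f;

    // Every thread of the block reaches this barrier (no early return above).
    __syncthreads();

    // Stage 2 (what the second kernel would have done): combine this
    // thread's value with its neighbour's, wrapping around inside the block.
    int next = (t + 1) % blockDim.x;
    if (i < n) out[i] = tile[t] + tile[next];
}

// Host helper (hypothetical): copies data in, runs the fused kernel, copies out.
void runFused(const float *hin, float *hout, int n) {
    float *din, *dout;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemcpy(din, hin, n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    fused<<<blocks, BLOCK>>>(din, dout, n);

    cudaMemcpy(hout, dout, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
}

int main() {
    const int n = 256;
    float hin[n], hout[n];
    for (int i = 0; i < n; ++i) hin[i] = (float)i;
    runFused(hin, hout, n);
    printf("out[0] = %g, out[128] = %g\n", hout[0], hout[128]);
    return 0;
}
```

If stage 2 needed values from other blocks, this would not work as written, and the data would have to go through global memory between two separate launches.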

Be warned about constant memory: it is not exactly as fast as shared memory!

Quote from the Programming Guide: