Getting access to shared memory from different kernels: is there a way to share it?

Hi guys,

I’ll try to explain what I’m asking. Imagine I run kernelFoo, and each thread of this kernel writes one value to shared memory. The docs state that shared memory is only accessible to the threads within one block. So is it possible to access results in shared memory that were written by a different kernel? I.e., is it guaranteed that results written by one kernel stay in shared memory between kernel executions?

Code would look like this:

launch_kernelFoo();

...

launch_kernelNextKernel(); // it will be reading values written by kernelFoo into the shared memory

Thanks.

You can read uninitialized shared memory, but you will not get any meaningful result. Your second kernel will not get data from the first via shared memory.
What if the first kernel has more than one thread block? And what if the driver schedules a kernel from some other application in between your kernel calls? Answer those questions and you’ll understand why shared memory cannot be used to pass data between kernels (or even between thread blocks of the same kernel).
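To illustrate the point above, here is a minimal sketch of the usual alternative: the first kernel copies its shared-memory results out to a global buffer, and the second kernel loads them back into its own shared memory. The kernel bodies, the buffer names (dMid, dOut), the sizes, and the host function runTwoKernelDemo are all made up for illustration; only the name kernelFoo comes from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernelFoo: each thread stages one value in shared
// memory, then copies it to global memory, because shared memory contents are
// not preserved once the block finishes.
__global__ void kernelFoo(float *mid, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = 2.0f * i;   // the per-thread result
        mid[i] = tile[threadIdx.x];     // hand-off through global memory
    }
}

// Hypothetical second kernel: reloads the values into its own shared memory,
// then combines each value with its neighbour inside the same block.
__global__ void kernelNext(const float *mid, float *out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? mid[i] : 0.0f;
    __syncthreads();                    // whole block has finished loading
    if (i < n) {
        int neighbour = (threadIdx.x + 1) % blockDim.x;
        out[i] = tile[threadIdx.x] + tile[neighbour];
    }
}

// Returns 0 if the results match the host-side expectation.
int runTwoKernelDemo()
{
    const int n = 1024, threads = 256, blocks = n / threads;
    const size_t bytes = n * sizeof(float);
    const size_t smem = threads * sizeof(float);

    float *dMid = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dMid, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dMid); return 1; }

    // Launches on the same (default) stream run in order, so kernelNext
    // sees everything kernelFoo wrote to dMid.
    kernelFoo<<<blocks, threads, smem>>>(dMid, n);
    kernelNext<<<blocks, threads, smem>>>(dMid, dOut, n);

    std::vector<float> hOut(n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dMid);
    cudaFree(dOut);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads;
        int neighbour = base + (i % threads + 1) % threads;
        if (hOut[i] != 2.0f * i + 2.0f * neighbour) return 2;
    }
    return 0;
}

int main()
{
    int rc = runTwoKernelDemo();
    printf("%s\n", rc == 0 ? "results match" : "mismatch or CUDA error");
    return rc;
}
```

The extra global write and read is the price of the hand-off; nothing in this sketch relies on shared memory surviving between the two launches.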

Thanks, Andrei,

I was asking myself these questions. That’s exactly why I asked WHETHER there is a common way to overcome these problems. I guess your answer means no. As far as I understand, the only option is to write to global memory, which I was hoping to avoid.

Thanks.

OK!

Say only one application is running, with full control over which CUDA events are issued. What if there is only one block per SM, and this kernel is the same as the previous kernel, only with the next data as its argument? I currently save the state of 15 registers per thread as well as the state of shared memory for continuity between runs. It would be really nice to skip some of that, perhaps even dump some of the registers to unused parts of shared memory rather than to global memory.

(This on the surface looks like a clean, well defined case, no?)

EDIT: Never mind, the performance advantage is hardly measurable, if at all …

What about using a “fat” kernel with __syncthreads()? That way you could hide several different kernels within one big (fat) kernel and separate them with __syncthreads() calls.
Having different loop structures in each “sub” kernel could allow you to break up the problem differently…
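A rough sketch of the “fat” kernel idea follows, assuming both phases only need data produced within the same block, since __syncthreads() is a barrier for one block only and does not synchronize across blocks. The kernel name fatKernel, the two phase bodies, and the host function runFatKernelDemo are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Two former "kernels" fused into one. Shared memory stays valid across the
// barrier because the block is still resident.
__global__ void fatKernel(float *out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: what the first kernel would have done.
    tile[threadIdx.x] = (i < n) ? 2.0f * i : 0.0f;

    // Barrier for this block only; every thread in the block reaches it.
    __syncthreads();

    // Phase 2: what the second kernel would have done, here reading a
    // value written by a different thread of the same block.
    int partner = blockDim.x - 1 - threadIdx.x;
    if (i < n) out[i] = tile[partner];
}

// Returns 0 if the results match the host-side expectation.
// Assumes n is a multiple of the block size, as chosen below.
int runFatKernelDemo()
{
    const int n = 1024, threads = 256, blocks = n / threads;
    const size_t bytes = n * sizeof(float);

    float *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return 1;

    fatKernel<<<blocks, threads, threads * sizeof(float)>>>(dOut, n);

    std::vector<float> hOut(n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads;
        int partner = base + (threads - 1 - i % threads);
        if (hOut[i] != 2.0f * partner) return 2;
    }
    return 0;
}

int main()
{
    int rc = runFatKernelDemo();
    printf("%s\n", rc == 0 ? "results match" : "mismatch or CUDA error");
    return rc;
}
```

If phase 2 needed values produced by other blocks, this fusion would not be enough, and the data would still have to go through global memory.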