CUDA: Using shared memory between different kernels..

Is this possible? I am trying to define a shared memory array in one kernel… and then I need to use those values in a different kernel.

I tried declaring the

extern __shared__ float sharedMem[];

outside all functions and then wrote to it in one kernel and tried to access it in a different kernel. The sharedMem is array is written to properly in the first kernel, but when I try to access it in the second kernel, the values are all 0. So I am guessing this won’t work or I am doing something wrong. Can someone please help me out on this?


Shared memory has the lifetime of a block. There’s no way around this.

If you really need to save per-block information from dynamic shared memory between kernel launchess, you could allocate global memory, equal to the block count times the dynamic shared size.
Kernel #1 would then copy its dynamic shared memory to that block’s global memory at the end of the kernel, and kernel #2 would load that block’s memory from global to dynamic shared at the start of its kernel.

Is this a good idea? Maybe, depending on your algorithm, but not too likely. It’s almost certainly better to analyze your algorithm to see why you need saved state and try to minimize that or reorder computations to reduce or eliminate the large per-block state restoration.

This only works (easily) for dynamic shared memory, since static shared variables can’t be copied with a blind block copy… you could do them manually one by one, but that’s really unlikely to be efficient in both code complexity and time.

Well, it’s not exactly undoable. I managed to make it work. But do this with caution. It is not guaranteed to work on all drivers/devices/multiple streams and contexts. I only tested it using a single thread. Also, my success rely on the fact that the first block launched when my GPU is completely idle always goes to the fixed MP. If my block number becomes greater than 6 (I’m on a GTX 460 which has a problematic allocation of blocks to MPs, otherwise this number should be 7), more complication will arise. Don’t be silly to try this if storing your data through global mem doesn’t cost much more.

In the following code, kernel1 declares shared memory size of 128 bytes and sets the first 32 bytes to 8 values of integers

kernel2 then declares 128 bytes of shared memory and transfers the first 32 bytes to global memory as output

the parameters, output and input, are extra stuffs I used in the process of my experimentation. They proved to be unnecessary as the shared memory allocation seems to always have a base address of 0x0, according to my result from cuobjdump.

.entry kernel1 (.param .u32 output) //kernel 1 stores 8 int to shared mem


	.reg .u32 %r<2>;

	.shared .align 4 .b8 localmem[128];

	st.shared.s32 	[localmem+0], 1;

	st.shared.s32	[localmem+4], 2;

	st.shared.s32	[localmem+8], 3;

	st.shared.s32	[localmem+12], 4;

	st.shared.s32 	[localmem+16], 5;

	st.shared.s32	[localmem+20], 6;

	st.shared.s32	[localmem+24], 7;

	st.shared.s32	[localmem+28], 8;

	ld.param.u32 	%r0, [output];

	mov.u32		%r1, localmem;//ignore 	[%r0+0], %r1; //ignore


} // kernel1

.entry kernel2 (.param .u32 input, .param .u32 gpu_output)


	.reg .u32 %r<3>;

	.shared .align 4 .b8 localmem[128];

	ld.param.u32 	%r0, [input];//ignore 	%r0, [%r0]; //ignore

	ld.param.u32 	%r1, [gpu_output]; 

	ld.shared.s32	%r2, [localmem+0]; 	[%r1+0], %r2;

	ld.shared.s32	%r2, [localmem+4]; 	[%r1+4], %r2;

	ld.shared.s32	%r2, [localmem+8]; 	[%r1+8], %r2;

	ld.shared.s32	%r2, [localmem+12]; 	[%r1+12], %r2;

	ld.shared.s32	%r2, [localmem+16]; 	[%r1+16], %r2;

	ld.shared.s32	%r2, [localmem+20]; 	[%r1+20], %r2;

	ld.shared.s32	%r2, [localmem+24]; 	[%r1+24], %r2;

	ld.shared.s32	%r2, [localmem+28]; 	[%r1+28], %r2;


} // kernel2

EDIT:I have attached all my code for testing in case you would need.

EDIT:I’m surprised that when I remove the unnecessary code related to input and output, ptxas still does not optimize away with all the operations. Good work ptxas! Finally not doing something silly :clap: (65.7 KB)

wouldn’t using global memory defeat the whole purpose of accelerating data access?