CUDA: Using shared memory between different kernels..

akslok · April 15, 2011, 5:30pm

Is this possible? I am trying to define a shared memory array in one kernel… and then I need to use those values in a different kernel.

I tried declaring the

extern __shared__ float sharedMem[];

outside all functions and then wrote to it in one kernel and tried to access it in a different kernel. The sharedMem is array is written to properly in the first kernel, but when I try to access it in the second kernel, the values are all 0. So I am guessing this won’t work or I am doing something wrong. Can someone please help me out on this?

thanks!

YDD · April 15, 2011, 5:49pm

Shared memory has the lifetime of a block. There’s no way around this.

SPWorley · April 15, 2011, 6:13pm

If you really need to save per-block information from dynamic shared memory between kernel launchess, you could allocate global memory, equal to the block count times the dynamic shared size.
Kernel #1 would then copy its dynamic shared memory to that block’s global memory at the end of the kernel, and kernel #2 would load that block’s memory from global to dynamic shared at the start of its kernel.

Is this a good idea? Maybe, depending on your algorithm, but not too likely. It’s almost certainly better to analyze your algorithm to see why you need saved state and try to minimize that or reorder computations to reduce or eliminate the large per-block state restoration.

This only works (easily) for dynamic shared memory, since static shared variables can’t be copied with a blind block copy… you could do them manually one by one, but that’s really unlikely to be efficient in both code complexity and time.

hyqneuron · April 16, 2011, 3:18am

Well, it’s not exactly undoable. I managed to make it work. But do this with caution. It is not guaranteed to work on all drivers/devices/multiple streams and contexts. I only tested it using a single thread. Also, my success rely on the fact that the first block launched when my GPU is completely idle always goes to the fixed MP. If my block number becomes greater than 6 (I’m on a GTX 460 which has a problematic allocation of blocks to MPs, otherwise this number should be 7), more complication will arise. Don’t be silly to try this if storing your data through global mem doesn’t cost much more.

In the following code, kernel1 declares shared memory size of 128 bytes and sets the first 32 bytes to 8 values of integers

kernel2 then declares 128 bytes of shared memory and transfers the first 32 bytes to global memory as output

the parameters, output and input, are extra stuffs I used in the process of my experimentation. They proved to be unnecessary as the shared memory allocation seems to always have a base address of 0x0, according to my result from cuobjdump.

.entry kernel1 (.param .u32 output) //kernel 1 stores 8 int to shared mem

{

	.reg .u32 %r<2>;

	.shared .align 4 .b8 localmem[128];

	st.shared.s32 	[localmem+0], 1;

	st.shared.s32	[localmem+4], 2;

	st.shared.s32	[localmem+8], 3;

	st.shared.s32	[localmem+12], 4;

	st.shared.s32 	[localmem+16], 5;

	st.shared.s32	[localmem+20], 6;

	st.shared.s32	[localmem+24], 7;

	st.shared.s32	[localmem+28], 8;

	ld.param.u32 	%r0, [output];

	mov.u32		%r1, localmem;//ignore

	st.global.u32 	[%r0+0], %r1; //ignore

	exit;

} // kernel1

.entry kernel2 (.param .u32 input, .param .u32 gpu_output)

{

	.reg .u32 %r<3>;

	.shared .align 4 .b8 localmem[128];

	ld.param.u32 	%r0, [input];//ignore

	ld.global.u32 	%r0, [%r0]; //ignore

	ld.param.u32 	%r1, [gpu_output]; 

	ld.shared.s32	%r2, [localmem+0];

	st.global.s32 	[%r1+0], %r2;

	ld.shared.s32	%r2, [localmem+4];

	st.global.s32 	[%r1+4], %r2;

	ld.shared.s32	%r2, [localmem+8];

	st.global.s32 	[%r1+8], %r2;

	ld.shared.s32	%r2, [localmem+12];

	st.global.s32 	[%r1+12], %r2;

	ld.shared.s32	%r2, [localmem+16];

	st.global.s32 	[%r1+16], %r2;

	ld.shared.s32	%r2, [localmem+20];

	st.global.s32 	[%r1+20], %r2;

	ld.shared.s32	%r2, [localmem+24];

	st.global.s32 	[%r1+24], %r2;

	ld.shared.s32	%r2, [localmem+28];

	st.global.s32 	[%r1+28], %r2;

	exit;

} // kernel2

EDIT:I have attached all my code for testing in case you would need.

EDIT:I’m surprised that when I remove the unnecessary code related to input and output, ptxas still does not optimize away with all the operations. Good work ptxas! Finally not doing something silly External Image
sharedtest.zip (65.7 KB)

yanbellavance · July 21, 2017, 3:49am

wouldn’t using global memory defeat the whole purpose of accelerating data access?

Topic		Replies	Views
life span of shared memory CUDA Programming and Performance	15	6932	April 27, 2011
How to efficiently use shared memory? CUDA Programming and Performance	2	1163	September 29, 2015
Shared memory between kernels? CUDA Programming and Performance	1	2583	September 23, 2008
Shared Memory Persistence Conditions. CUDA Programming and Performance	3	2316	May 2, 2009
__shared__ memory: Just a question what happens if CUDA Programming and Performance	3	837	March 15, 2016
Shared memory between several kernels CUDA Programming and Performance	6	1776	April 6, 2010
Copy data into shared memory CUDA Programming and Performance	6	1326	May 28, 2009
Use shared Memory CUDA Programming and Performance	3	430	December 26, 2019
Correct Use of Shared Memory? CUDA Programming and Performance	1	711	January 6, 2010
Shared Memory variables ? In multiple kernel invocations CUDA Programming and Performance	2	1878	July 11, 2008

CUDA: Using shared memory between different kernels..

Related topics