Assigning from shared to global memory
Question about global memory and assigning complex statements

This is a quick (and perhaps dumb) question. Say I have a kernel function:

__global__ void foo_inline(unsigned int n, float* lhs, const float* rhs, const float alpha) {
	// Dynamically sized shared memory has to be declared extern; its size is
	// supplied as the third kernel launch parameter.
	extern __shared__ float s_data[];
	float* s_lhs = s_data;
	float* s_rhs = &s_data[n];

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// Stage both operands in shared memory, then compute and write the
		// result straight back to global memory in one statement.
		s_lhs[threadIdx.x] = lhs[i];
		s_rhs[threadIdx.x] = rhs[i];
		lhs[i] = s_lhs[threadIdx.x] * alpha * s_rhs[threadIdx.x];
	}
}

I am copying the arrays into shared memory so as to reduce the access time, but I have inlined the computation and the assignment back out to global memory. My question is, is this the same as storing the result in shared memory and then assigning to global memory? i.e.

__global__ void foo(unsigned int n, float* lhs, const float* rhs, const float alpha) {
	extern __shared__ float s_data[];
	float* s_lhs = s_data;
	float* s_rhs = &s_data[n];

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// Stage both operands, store the result in shared memory first,
		// and only then copy it back out to global memory.
		s_lhs[threadIdx.x] = lhs[i];
		s_rhs[threadIdx.x] = rhs[i];
		s_lhs[threadIdx.x] *= alpha * s_rhs[threadIdx.x];
		lhs[i] = s_lhs[threadIdx.x];
	}
}

It would seem that storing the result back to shared memory is unnecessary, and that both versions will perform about the same (the second perhaps slightly worse, because of the extra write to shared memory), but I just wanted to make sure.

Thanks!

Since you’re not reusing any of the data from shared memory, there’s really no point in loading it into shared memory first.
Also, whenever you’re not sure about a performance benefit, run a timing experiment to find out which approach is faster.
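For example, here is a rough sketch of such a timing experiment using CUDA events. The block size, grid size, shared memory size, and the time_kernel wrapper are placeholders, not something taken from your code; the shared memory size just has to cover the s_lhs/s_rhs layout the kernel assumes.

#include <cstdio>

// Time one launch of foo_inline with CUDA events (placeholder launch config).
float time_kernel(unsigned int n, float* d_lhs, const float* d_rhs, float alpha) {
	unsigned int threads = 256;
	unsigned int blocks  = (n + threads - 1) / threads;
	// Enough for the s_lhs block plus the s_rhs offset of n used in the kernel.
	size_t sharedBytes   = (n + threads) * sizeof(float);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	foo_inline<<<blocks, threads, sharedBytes>>>(n, d_lhs, d_rhs, alpha);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("foo_inline: %.3f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;
}

Run the same wrapper against both kernels and compare the reported times.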

I see what you’re saying, and I was of the same opinion. But everything I’ve read online and in the CUDA examples says that shared memory should be used for anything more than a simple assignment. I mean, if the examples advise using shared memory for something as trivial as a matrix transpose – which is really just assigning from one index to another – then there must be a reason to load into shared memory for all operations. No?

EDIT: You’re absolutely right, though. Switching back to plain global memory makes little difference according to my timing results. I guess shared memory might make more sense if one of the variables were a scalar that only needed to be read once and then used by all threads.
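For that scalar case, here is a minimal sketch of what I mean; the kernel name scale and the pointer d_alpha are made up for illustration.

// One thread pulls the scalar from global memory into shared memory and
// every thread in the block reuses it after the barrier.
__global__ void scale(unsigned int n, float* data, const float* d_alpha) {
	__shared__ float s_alpha;

	if (threadIdx.x == 0)
		s_alpha = *d_alpha;	// single global read per block
	__syncthreads();

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		data[i] *= s_alpha;	// everyone reads the shared copy
}

(A scalar passed as a kernel argument, like alpha above, is already broadcast to every thread, so this mainly pays off when the value has to come from device memory.)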

Memory coalescing is the prime reason to use shared memory in “trivial” cases. On a compute 1.1 device, this:

i = threadIdx.x;
results[i] = a[i] + b[i] + c[i];

allows coalesced loads and stores, whereas this:

i = threadIdx.x;
results[i-2] = a[i] + b[i-1] + c[i-2];

does not.
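To make the connection to shared memory concrete, here is a rough sketch of the staging trick. BLOCK, HALO, and the kernel name shifted_add are placeholders, and the indexing is rearranged slightly from the snippet above so that the store is aligned too: each thread loads its own element with a straight index, and the shifted reads are then served from shared memory.

#define BLOCK 256	// threads per block (placeholder)
#define HALO  2		// largest backward shift used below

__global__ void shifted_add(unsigned int n, const float* a, const float* b,
                            const float* c, float* results) {
	__shared__ float s_b[BLOCK + HALO];
	__shared__ float s_c[BLOCK + HALO];

	unsigned int i  = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int tx = threadIdx.x + HALO;	// leave room for the halo at the front

	if (i < n) {
		// Coalesced loads: consecutive threads read consecutive addresses.
		s_b[tx] = b[i];
		s_c[tx] = c[i];
	}
	if (threadIdx.x < HALO && i >= HALO) {
		// The first few threads also fetch the elements just before this block.
		s_b[threadIdx.x] = b[i - HALO];
		s_c[threadIdx.x] = c[i - HALO];
	}
	__syncthreads();

	// a[i] is already coalesced; the shifted b and c reads now come from
	// shared memory, and the store stays coalesced as well.
	if (i >= HALO && i < n)
		results[i] = a[i] + s_b[tx - 1] + s_c[tx - 2];
}

This is essentially the same trick the matrix transpose example uses with its tile: pay for one extra trip through shared memory so that every global access can be coalesced.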