Assigning from shared to global memory
Question about global memory and assigning complex statements

This is a quick (and perhaps dumb) question. Say I have a kernel function:

__global__ void foo_inline(unsigned int n, float* lhs, const float* rhs, const float alpha) {
	// Dynamically sized shared memory has to be declared extern; its size is
	// supplied as the third kernel launch parameter.
	extern __shared__ float s_data[];
	float* s_lhs = s_data;
	float* s_rhs = &s_data[n];

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// Stage both operands in shared memory, then compute and write the
		// result straight back to global memory in one statement.
		s_lhs[threadIdx.x] = lhs[i];
		s_rhs[threadIdx.x] = rhs[i];
		lhs[i] = s_lhs[threadIdx.x] * alpha * s_rhs[threadIdx.x];
	}
}

I am copying the arrays into shared memory so as to reduce the access time, but I have inlined the computation and the assignment back out to global memory. My question is, is this the same as storing the result in shared memory and then assigning to global memory? i.e.

__global__ void foo(unsigned int n, float* lhs, const float* rhs, const float alpha) {
	extern __shared__ float s_data[];
	float* s_lhs = s_data;
	float* s_rhs = &s_data[n];

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// Stage both operands, store the result in shared memory first,
		// and only then copy it back out to global memory.
		s_lhs[threadIdx.x] = lhs[i];
		s_rhs[threadIdx.x] = rhs[i];
		s_lhs[threadIdx.x] *= alpha * s_rhs[threadIdx.x];
		lhs[i] = s_lhs[threadIdx.x];
	}
}

It would seem that storing the result back to shared memory is unnecessary, and that both versions will perform about the same (the second perhaps slightly worse, because of the extra write to shared memory), but I just wanted to make sure.

Thanks!

Since you’re not reusing any of the data from shared memory, there’s really no point in loading it into shared memory first.
Also, whenever you’re not sure about a performance benefit, run a timing experiment to find out which approach is faster.
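For example, here is a rough sketch of such a timing experiment using CUDA events. The block size, grid size, shared memory size, and the time_kernel wrapper are placeholders, not something taken from your code; the shared memory size just has to cover the s_lhs/s_rhs layout the kernel assumes.

#include <cstdio>

// Time one launch of foo_inline with CUDA events (placeholder launch config).
float time_kernel(unsigned int n, float* d_lhs, const float* d_rhs, float alpha) {
	unsigned int threads = 256;
	unsigned int blocks  = (n + threads - 1) / threads;
	// Enough for the s_lhs block plus the s_rhs offset of n used in the kernel.
	size_t sharedBytes   = (n + threads) * sizeof(float);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	foo_inline<<<blocks, threads, sharedBytes>>>(n, d_lhs, d_rhs, alpha);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("foo_inline: %.3f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;
}

Run the same wrapper against both kernels and compare the reported times.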

I see what you’re saying, and I was of the same opinion. But everything I’ve read online and in the CUDA examples says that shared memory should be used for anything more than a simple assignment. I mean, if the examples advise using shared memory for something as trivial as a matrix transpose – which is really just assigning from one index to another – then there must be a reason to load into shared memory for all operations. No?

EDIT: You’re absolutely right, though. Switching back to plain global memory makes little difference according to my timing results. I guess shared memory might make more sense if one of the variables were a scalar that only needed to be read once and then used by all threads.
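For that scalar case, here is a minimal sketch of what I mean; the kernel name scale and the pointer d_alpha are made up for illustration.

// One thread pulls the scalar from global memory into shared memory and
// every thread in the block reuses it after the barrier.
__global__ void scale(unsigned int n, float* data, const float* d_alpha) {
	__shared__ float s_alpha;

	if (threadIdx.x == 0)
		s_alpha = *d_alpha;	// single global read per block
	__syncthreads();

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		data[i] *= s_alpha;	// everyone reads the shared copy
}

(A scalar passed as a kernel argument, like alpha above, is already broadcast to every thread, so this mainly pays off when the value has to come from device memory.)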

Memory coalescing is the prime reason to use shared memory in “trivial” cases. On a compute 1.1 device, this:

i = threadIdx.x;
results[i] = a[i] + b[i] + c[i];

allows coalesced loads and stores, whereas this:

i = threadIdx.x;
results[i-2] = a[i] + b[i-1] + c[i-2];

does not.
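To make the connection to shared memory concrete, here is a rough sketch of the staging trick. BLOCK, HALO, and the kernel name shifted_add are placeholders, and the indexing is rearranged slightly from the snippet above so that the store is aligned too: each thread loads its own element with a straight index, and the shifted reads are then served from shared memory.

#define BLOCK 256	// threads per block (placeholder)
#define HALO  2		// largest backward shift used below

__global__ void shifted_add(unsigned int n, const float* a, const float* b,
                            const float* c, float* results) {
	__shared__ float s_b[BLOCK + HALO];
	__shared__ float s_c[BLOCK + HALO];

	unsigned int i  = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int tx = threadIdx.x + HALO;	// leave room for the halo at the front

	if (i < n) {
		// Coalesced loads: consecutive threads read consecutive addresses.
		s_b[tx] = b[i];
		s_c[tx] = c[i];
	}
	if (threadIdx.x < HALO && i >= HALO) {
		// The first few threads also fetch the elements just before this block.
		s_b[threadIdx.x] = b[i - HALO];
		s_c[threadIdx.x] = c[i - HALO];
	}
	__syncthreads();

	// a[i] is already coalesced; the shifted b and c reads now come from
	// shared memory, and the store stays coalesced as well.
	if (i >= HALO && i < n)
		results[i] = a[i] + s_b[tx - 1] + s_c[tx - 2];
}

This is essentially the same trick the matrix transpose example uses with its tile: pay for one extra trip through shared memory so that every global access can be coalesced.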