shared memory

AlexIanevski · March 11, 2015, 8:09pm

Good evening.

I’m sorry to ask a similar question again, but all my attempts to rewrite this function using shared memory have failed. The kernel reads the input array many times, and if the whole input fits in a single block there is no problem using shared memory. However, both the input array and the output array need to hold more than 2^20 elements.
I tried dividing the input array into smaller parts and storing each part in a different block’s shared memory, but how can these parts communicate with each other when they live in different blocks’ shared memory?

Thank you in advance.

Here is the code without using shared memory:

__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	int k = threadIdx.x + blockDim.x * blockIdx.x;
	if (k < n)
	{
		float sumreal = 0.0f;
		//#pragma unroll
		for (int t = 0; t < n; t++) { 
			float angle = 2 * 3.14159265359 * t * k / n;
			sumreal += inreal[t] * cos(angle);
		}
		outreal[k] = sumreal;
	}
	__syncthreads();
}

njuffa · March 11, 2015, 8:17pm

Side remark: Consider using cospi() instead of cos() whenever the factor PI is baked into the argument to cos(). This has advantages with regard to performance and accuracy. E.g.

float tmp = 2.0f * t * k / n;   // 2.0f forces floating-point math; with int operands 2 * t * k / n truncates
sumreal += inreal[t] * cospi(tmp);
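
A minimal sketch of this suggestion for the single-precision case, under stated assumptions. `twiddle_arg` and `MYFT_cospi` are hypothetical names, not CUDA APIs. As far as I know, `cospif()` is the float counterpart of `cospi()` in the CUDA math API and needs no extra header in a `.cu` file compiled with nvcc. The helper assumes non-negative `t` and `k` and reduces `t * k` modulo `n` in 64-bit integer arithmetic first, since for n above 2^20 the product no longer fits in a 32-bit int and a large float argument loses most of its precision.

```cuda
// Lets the helper below also build with a plain C++ compiler for host-side checks.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper (not a CUDA API): returns x in [0, 2) such that
// cos(pi * x) equals cos(2 * pi * t * k / n). Assumes t >= 0, k >= 0, n > 0.
__host__ __device__ inline float twiddle_arg(int t, int k, int n)
{
    // cos has period 2*pi, so only (t * k) mod n matters; the 64-bit product
    // cannot overflow for 32-bit t and k.
    const long long r = (static_cast<long long>(t) * k) % n;
    return 2.0f * static_cast<float>(r) / static_cast<float>(n);
}

#ifdef __CUDACC__
// Same computation as the original MYFT kernel, using cospif() on the reduced argument.
__global__ void MYFT_cospi(const float *inreal, float *outreal, const int n)
{
    const int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k < n) {
        float sumreal = 0.0f;
        for (int t = 0; t < n; t++) {
            sumreal += inreal[t] * cospif(twiddle_arg(t, k, n));
        }
        outreal[k] = sumreal;
    }
}
#endif
```

The 64-bit modulo in the inner loop is not free on a GPU; whether the accuracy gain is worth it depends on n, so treat this as a starting point to profile rather than a tuned kernel.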

Robert_Crovella · March 12, 2015, 3:59am

You cannot communicate values held in one block's shared memory to another block without making a trip through global memory.

You will not be able to spread a large input array across the shared memory of several blocks and then use all of it from every thread.
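
One way to work within that limit, sketched here under stated assumptions: every block streams the entire input through its own shared-memory tile, one tile at a time, reading each tile from global memory, so no block ever needs another block's shared memory. `MYFT_tiled` and `myft_reference` are illustrative names. The sketch assumes n < 2^30, a 1D grid that covers n, and `blockDim.x * sizeof(float)` bytes of dynamic shared memory at launch. The `__CUDACC__` guard is only there so the CPU reference can also be built with a plain C++ compiler.

```cuda
#include <cmath>
#include <vector>

#ifdef __CUDACC__
// Hypothetical tiled variant of MYFT: each block reloads every input tile itself.
__global__ void MYFT_tiled(const float *inreal, float *outreal, const int n)
{
    extern __shared__ float tile[];   // blockDim.x floats, sized at launch
    const int k = threadIdx.x + blockDim.x * blockIdx.x;
    float sumreal = 0.0f;

    // The loop bounds depend only on n and blockDim.x, so every thread of the
    // block reaches both barriers, including threads with k >= n.
    for (int base = 0; base < n; base += blockDim.x) {
        const int src = base + threadIdx.x;
        tile[threadIdx.x] = (src < n) ? inreal[src] : 0.0f;   // zero-pad the last tile
        __syncthreads();   // tile fully loaded

        if (k < n) {
            // r tracks ((base + t) * k) mod n: one 64-bit modulo per tile,
            // then a cheap add-and-wrap per element (valid because k < n).
            int r = static_cast<int>((static_cast<long long>(base) * k) % n);
            for (int t = 0; t < blockDim.x; t++) {
                sumreal += tile[t] * cospif(2.0f * r / n);
                r += k;
                if (r >= n) r -= n;
            }
        }
        __syncthreads();   // all reads finished before the tile is overwritten
    }
    if (k < n) outreal[k] = sumreal;
}
#endif

// CPU reference for checking kernel output on small inputs.
std::vector<float> myft_reference(const std::vector<float> &in)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    for (int k = 0; k < n; k++) {
        double sum = 0.0;
        for (int t = 0; t < n; t++) {
            const long long r = (static_cast<long long>(t) * k) % n;
            sum += in[t] * std::cos(2.0 * 3.14159265358979323846 * r / n);
        }
        out[k] = static_cast<float>(sum);
    }
    return out;
}
```

A launch could look like `MYFT_tiled<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);`, where `d_in` and `d_out` are assumed device pointers. The total work is still proportional to n squared; tiling only changes where the input is read from.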

AlexIanevski · March 12, 2015, 1:03pm

First off, thanks for your replies.

njuffa, which library should I include? Neither cospi nor cospif works…

txbob, I did it that way (please see below), but I see no advantage from shared memory.

Could you recommend any improvements to increase performance?
Would loop unrolling be useful here?
Thank you in advance.

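__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	extern __shared__ float in_cpy[];   // blockDim.x floats of dynamic shared memory

	int k = threadIdx.x + blockDim.x * blockIdx.x;
	float sumreal = 0.0f;               // accumulated across all tiles

	// The barriers sit outside the k < n test so every thread of the block reaches them.
	for (int step = 0; step < gridDim.x; step++) {
		int src = threadIdx.x + blockDim.x * step;
		in_cpy[threadIdx.x] = (src < n) ? inreal[src] : 0.0f;   // zero-pad the last tile
		__syncthreads();

		if (k < n) {
			for (int t = 0; t < blockDim.x; t++) {
				float angle = 2 * 3.14159265359 * (t + step * blockDim.x) * k / n;
				sumreal += in_cpy[t] * cos(angle);
			}
		}
		__syncthreads();
	}
	if (k < n)
		outreal[k] = sumreal;   // single global write; outreal need not be zeroed first
}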

I also tried the implementation below, which has about the same execution time as the version without shared memory.

extern __shared__ float array[];   // 2 * blockDim.x floats of dynamic shared memory

__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	int k = threadIdx.x + blockDim.x * blockIdx.x;

	float * in_cpy = (float*)array;                  // shared memory array (input tile)
	float * out_cpy = (float*)&in_cpy[blockDim.x];   // shared memory array (output)

	out_cpy[threadIdx.x] = 0.0f;

	// The barriers sit outside the k < n test so every thread of the block reaches them.
	for (int step = 0; step < gridDim.x; step++) {
		int src = threadIdx.x + blockDim.x * step;
		in_cpy[threadIdx.x] = (src < n) ? inreal[src] : 0.0f;   // zero-pad the last tile
		__syncthreads();

		if (k < n) {
			float sumreal = 0.0f;
			for (int t = 0; t < blockDim.x; t++) {
				float angle = 2 * 3.14159265359 * (t + step * blockDim.x) * k / n;
				sumreal += in_cpy[t] * cos(angle);
			}
			out_cpy[threadIdx.x] += sumreal;   // each thread touches only its own slot
		}
		__syncthreads();
	}
	if (k < n)
		outreal[k] = out_cpy[threadIdx.x];
}
