shared memory

Good evening.

I’m really sorry to ask a similar question again, but all my attempts to rewrite this function using shared memory have been futile. The kernel makes a large number of reads from the input array, and if the input fits within a single block there is no problem using shared memory. However, both the input array and the output array need to hold more than 2^20 elements.
I tried to divide the input array into smaller pieces and store each piece in a different block’s shared memory, but how can these pieces communicate with each other when they live in different blocks’ shared memories?

Thank you in advance.

Here is the code without using shared memory:

__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	int k = threadIdx.x + blockDim.x * blockIdx.x;
	if (k < n)
	{
		float sumreal = 0.0f;
		//#pragma unroll
		for (int t = 0; t < n; t++) {
			float angle = 2 * 3.14159265359 * t * k / n;
			sumreal += inreal[t] * cos(angle);
		}
		outreal[k] = sumreal;
	}
	__syncthreads();
}

Side remark: Consider using cospi() instead of cos() whenever the factor PI is baked into the argument to cos(). This has advantages with regard to performance and accuracy. E.g.

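float tmp = 2.0f * t * k / n;   // float arithmetic; with int operands, 2 * t * k / n is an integer division
sumreal += inreal[t] * cospif(tmp);   // cospif() is the single-precision variant of cospi()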
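To make the remark concrete, here is a minimal sketch of the whole kernel using cospif(). Two things in it are my own assumptions and not from the posts above: the product t * k is first reduced modulo n in 64-bit integers (the cosine is periodic, so this keeps the float argument in [0, 2) for large n, at the cost of a fairly slow 64-bit modulo), and the host helper run_myft_cospi is a hypothetical wrapper with error checking left out. As far as I know, cospif() belongs to the CUDA math API and needs no extra header in a .cu file compiled with nvcc.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Cosine-only transform: outreal[k] = sum over t of inreal[t] * cos(2*pi*t*k/n).
__global__ void myft_cospi(const float *inreal, float *outreal, int n)
{
    int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k >= n) return;

    float sumreal = 0.0f;
    for (int t = 0; t < n; t++) {
        // Assumption: reduce t*k modulo n in 64-bit integers so the
        // product cannot overflow and the float argument stays in [0, 2).
        long long r = ((long long)t * k) % n;
        float tmp = 2.0f * (float)r / (float)n;
        sumreal += inreal[t] * cospif(tmp);   // computes cos(pi * tmp)
    }
    outreal[k] = sumreal;
}

// Hypothetical host helper (not from the thread); error checks omitted.
std::vector<float> run_myft_cospi(const std::vector<float> &in)
{
    int n = (int)in.size();
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // one thread per output element
    myft_cospi<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling run_myft_cospi on a constant input is a quick sanity check: the k = 0 output should equal the sum of the inputs and the remaining outputs should be close to zero.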

You cannot pass values held in one block’s shared memory to another block without making a trip through global memory.

You will not be able to spread a large input array across the shared memory of several blocks and then read it from every thread that way.

First off, thanks for your replies.

njuffa, which library should I include? Neither cospi nor cospif works…

txbob, I did it like this (please see below), but shared memory gives no advantage.

Could you recommend any improvements to increase performance?
Would loop unrolling be useful here?
Thank you in advance.

__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	int k = threadIdx.x + blockDim.x * blockIdx.x;
	if (k < n)
	{
		extern __shared__ float in_cpy[];

		for (int step = 0; step < gridDim.x; step++){ 

			in_cpy[threadIdx.x] = inreal[threadIdx.x + blockDim.x * step];
			__syncthreads();

			float sumreal = 0.0f;

			for (int t = 0; t < blockDim.x; t++) { 
				float angle = 2 * 3.14159265359 * (t + step * blockDim.x) * k / n;
				sumreal += in_cpy[t] * cos(angle);
			}
			__syncthreads();
			outreal[k] += sumreal;
		}
	}
}

I also tried the implementation below, which has about the same execution time as the version without shared memory.

extern __shared__ float array[];

__global__
void MYFT(float * inreal, float * outreal, const int n) 
{
	int k = threadIdx.x + blockDim.x * blockIdx.x;
	if (k < n)
	{
		float * in_cpy = (float*)array;    //shared memory array (input)
		float * out_cpy = (float*)&in_cpy[blockDim.x];  //output

		out_cpy[threadIdx.x] = 0.0f;

		for (int step = 0; step < gridDim.x; step++){ 

			in_cpy[threadIdx.x] = inreal[threadIdx.x + blockDim.x * step];
			__syncthreads();

			float sumreal = 0.0f;

			for (int t = 0; t < blockDim.x; t++) { 
				float angle = 2 * 3.14159265359 * (t + step * blockDim.x) * k / n;
				sumreal += in_cpy[t] * cos(angle);
			}
			__syncthreads();
			out_cpy[threadIdx.x] += sumreal;
		}
		}
		outreal[k] = out_cpy[threadIdx.x];
	}
}
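
One thing worth noting about both tiled kernels above: they call __syncthreads() inside if (k < n), which is only safe when n is an exact multiple of the block size, and they read inreal past element n - 1 otherwise. The first one also accumulates into outreal[k] with +=, so it relies on outreal being zeroed beforehand. Below is a minimal sketch of the same tiling idea with the barriers hoisted out of the bounds check, the last tile zero-padded, and the running sum kept in a register. The names myft_tiled and run_myft_tiled are hypothetical, the cospif() argument reduction is my own assumption borrowed from the cospi remark earlier in the thread, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void myft_tiled(const float *inreal, float *outreal, int n)
{
    extern __shared__ float in_cpy[];        // blockDim.x floats, sized at launch
    int tile = blockDim.x;
    int k = threadIdx.x + blockDim.x * blockIdx.x;
    float sumreal = 0.0f;                    // per-thread accumulator (register)

    // Every block walks over ALL tiles of the input, reloading each one
    // from global memory; blocks never see each other's shared memory.
    for (int base = 0; base < n; base += tile) {
        int src = base + threadIdx.x;
        in_cpy[threadIdx.x] = (src < n) ? inreal[src] : 0.0f;  // zero-pad last tile
        __syncthreads();                     // reached by every thread in the block

        if (k < n) {
            for (int t = 0; t < tile; t++) {
                // Assumption: reduce the index product modulo n in 64 bits.
                long long r = ((long long)(base + t) * k) % n;
                sumreal += in_cpy[t] * cospif(2.0f * (float)r / (float)n);
            }
        }
        __syncthreads();                     // finish reading before the next load
    }
    if (k < n) outreal[k] = sumreal;
}

// Hypothetical host helper (not from the thread); error checks omitted.
std::vector<float> run_myft_tiled(const std::vector<float> &in)
{
    int n = (int)in.size();
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Third launch parameter: dynamic shared memory size in bytes.
    myft_tiled<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

This is meant to fix correctness, not to promise a speedup. In the original kernel all threads of a warp read the same inreal[t] in each iteration, so those loads are probably already served efficiently, and the run time is likely dominated by the cosine evaluations, which tiling does not reduce.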