I’m really sorry for asking a similar question again, but all my attempts to rewrite this function using shared memory have been futile. The kernel makes a large number of reads from the input array. If the input fits in a single block, using shared memory is no problem, but both the input and the output array need to hold more than 2^20 elements.
I tried dividing the input array into smaller parts and storing each part in a different block’s shared memory, but how can these parts communicate with each other when they live in the shared memories of different blocks?
Thank you in advance.
Here is the code without shared memory:
__global__
void MYFT(float * inreal, float * outreal, const int n)
{
    int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k < n)
    {
        float sumreal = 0.0f;
        //#pragma unroll
        for (int t = 0; t < n; t++) {
            float angle = 2 * 3.14159265359 * t * k / n;
            sumreal += inreal[t] * cos(angle);
        }
        outreal[k] = sumreal;
    }
    __syncthreads();
}
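One possible way around the inter-block communication problem is to avoid it altogether: each block can walk over the whole input in tiles, staging one tile at a time in its own shared memory, so no block ever needs another block's data. The following is only a minimal sketch under assumptions that are not in the original post: the kernel name MYFT_tiled, the constant TILE (which must equal the launch block size), and the host helper max_abs_error are hypothetical. It uses cospif() with the angle expressed in units of π. Whether this is actually faster than the plain kernel would have to be measured.

```cuda
#include <cmath>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

// Assumption: the kernel is always launched with blockDim.x == TILE.
constexpr int TILE = 256;

// Hypothetical tiled variant of MYFT. Every block reads the full input,
// one TILE-sized chunk at a time, through its own shared memory.
__global__ void MYFT_tiled(const float *inreal, float *outreal, const int n)
{
    __shared__ float tile[TILE];
    const int k = threadIdx.x + blockDim.x * blockIdx.x;
    float sumreal = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        // Each thread loads one element of the current tile (zero past the end).
        const int t = base + threadIdx.x;
        tile[threadIdx.x] = (t < n) ? inreal[t] : 0.0f;
        __syncthreads();  // tile is fully loaded before anyone reads it

        if (k < n) {
            const int count = min(TILE, n - base);
            for (int j = 0; j < count; j++) {
                // Angle in units of pi, computed in float arithmetic.
                const float tmp = 2.0f * (base + j) * k / n;
                sumreal += tile[j] * cospif(tmp);
            }
        }
        __syncthreads();  // all reads done before the tile is overwritten
    }
    if (k < n) outreal[k] = sumreal;
}

// Hypothetical host helper: runs the kernel on a test signal and returns
// the largest absolute deviation from a double-precision CPU reference.
float max_abs_error(int n)
{
    std::vector<float> in(n), out(n);
    for (int t = 0; t < n; t++) in[t] = std::sin(0.37f * t);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + TILE - 1) / TILE;
    MYFT_tiled<<<blocks, TILE>>>(d_in, d_out, n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    const double pi = 3.14159265358979323846;
    float worst = 0.0f;
    for (int k = 0; k < n; k++) {
        double ref = 0.0;
        for (int t = 0; t < n; t++)
            ref += in[t] * std::cos(2.0 * pi * t * k / n);
        worst = std::max(worst, (float)std::fabs(ref - out[k]));
    }
    return worst;
}
```

Note that the k < n test sits inside the tile loop rather than around it, so that every thread of a block reaches both __syncthreads() calls. Calling max_abs_error() with a small n that is not a multiple of TILE (for example 1000) from main() is a quick way to check the boundary handling.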
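Side remark: Consider using cospi() (cospif() for float arguments) instead of cos() whenever a factor of π is baked into the argument of cos(). This has advantages in both performance and accuracy. Note that the scaled argument must be computed in floating point, since 2 * t * k / n with int operands would be an integer division. E.g.
    float tmp = 2.0f * t * k / n;   // angle in units of pi
    sumreal += inreal[t] * cospif(tmp);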
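A further accuracy point for the sizes mentioned in the question: with n above 2^20, the scaled argument 2·t·k/n can reach roughly 2^21, where neighbouring float values are 0.25 apart, so most of the phase information is lost before the cosine is evaluated. Because the cosine only depends on t·k modulo n, one option is to reduce the product exactly in 64-bit integer arithmetic first. The sketch below is an illustration under assumptions: the helper name reduced_phase and the kernel name MYFT_exact_phase are hypothetical, and the 64-bit modulo is slow, so this trades speed for accuracy.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 2 * ((t * k) mod n) / n, i.e. the angle in
// units of pi, reduced to the range [0, 2). The product is formed in
// 64-bit integers so it cannot overflow for int inputs.
__host__ __device__ inline float reduced_phase(int t, int k, int n)
{
    const long long r = ((long long)t * (long long)k) % n;
    return 2.0f * (float)r / (float)n;
}

// Hypothetical variant of the original kernel using the reduced phase.
__global__ void MYFT_exact_phase(const float *inreal, float *outreal, const int n)
{
    const int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k < n) {
        float sumreal = 0.0f;
        for (int t = 0; t < n; t++) {
            sumreal += inreal[t] * cospif(reduced_phase(t, k, n));
        }
        outreal[k] = sumreal;
    }
}
```

For example, reduced_phase(3, 5, 4) reduces 15 modulo 4 to 3 and returns 1.5, which gives the same cosine as the unreduced argument 7.5.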