Write to shared memory

Recently I started to learn CUDA programming, and be attracted at once. Now I want to program a FFT code with CUDA. Before this, I tested a 2-dimension using CUFFT lib, it aquire a good performance in about 25000000 clock ticks( one third for memory copy). But my code do 1-dimension FFT will cost the same time. I checked it, found that it’s very time-consuming while writing shared memory.

To avoid bank conflict, I set stride to be a big odd number, where’s my fault? Can you help me and give me some advice? thank you!

[codebox]#define MATRIX_SIZE (1<<9)

#define BLOCK_SIZE (1<<9)

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

shared DATA_TYPE s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

pos = tid * stride % MATRIX_SIZE;

s_DataR[pos] = dg_buffer[blockIdx.x * BLOCK_SIZE + rev]; //------------------------------------time-consuming

s_DataI[pos] = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev]; //-----------------------------------time-consuming

float theta = GV_2PI / N;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

__syncthreads();

int step, w;

for(step = 1;step<N;step=step*2)

{

if(tid & step)

{

w = ( tid & ( step - 1 ) ) * stride % MATRIX_SIZE;

DATA_TYPE tempR = s_DataR[pos] * s_CosTable[w] + s_DataI[pos] * s_SinTable[w];

DATA_TYPE tempI = s_DataI[pos] * s_CosTable[w] - s_DataR[pos] * s_SinTable[w];

pre = ( tid - step ) * stride % MATRIX_SIZE;

s_DataR[pos] = s_DataR[pre] - tempR; //-----------------------------------------------time-consuming

s_DataI[pos] = s_DataI[pre] - tempI; //-----------------------------------------------time-consuming

s_DataR[pre] += tempR; //-----------------------------------------------time-consuming

s_DataI[pre] += tempI; //-----------------------------------------------time-consuming

}

__syncthreads();

}

}[/codebox]

Quote from the cuda programming guide:

N.

Thank you, Nico! I’ll remember these skills.

But the bottleneck is store data to shared memory. As tested, transfer data from/to host will cost 7,500,000 clock ticks.

And CUFFT execution costs about 17,000,000 ticks.

In each kernel, My code read Global Memory 2 times, read Shared Memory many times(this seems not time-consuming),
and write Shared Memory many times(about 40 times, this costs a lot of time).

Read/Write Shared memory cost about 11,000,000 ticks.

Is that Writing will cost order of magnitude times of Reading? Or maybe my code has bank conflict?

What DATA_TYPE are you using?

N.

float

Those global memory accesses probably aren’t coalesced.

Here are a number of bank conflicts, depending on the value of step.

If you commented out the writes to shared memory in the loop for ‘timing purposes’, then the compiler probably optimized the loop and threw away the calculation of tempR and templ entirely.

But if you uncomment them, then these calculations need to be performed (which are IMO the slow parts of the loop because the pos and pre indices do not generate any bank conflicts)

N.

Now I comment the other code, just compare the followings, and the results were also strange:

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_DataR[pos] = tempR;

s_DataI[pos] = tempI;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 11,500,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_DataR[pos] = tid;

s_DataI[pos] = tid;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,160,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,140,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

__threadfence_block();

s_DataR[pos] = tempR;

s_DataI[pos] = tempI;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,220,000

OK, all are because of optimization.

Thank you, Nico!

Reading the global memory in bit-reverse pattern is not coalesced at all. This task would be dominant in your kernel.

You can read the input data from global memory in linear pattern first and then reorganize them in bit-reverse pattern. Otherwise, you can use the reverse butterfly algorithm.

For step < 32, there is only 50% warp occupancy inside the above IF.

I think your ‘pre’ and ‘pos’ will be also free of bank conflict without the stride.

Your algorithm is:

The reverse is:

ok, thanks!

Sorry, another question!

what is the benifit of the Reverse Algorithm?

It still need to re-arrange the order of the arrays after butterfly transforming.

Well, both of them are similar, but I found that some people prefer one to the other.

You can make your global reads coalesced as follows:

tmp = global_data[tid];

shared_data[bit_reverse(tid)] = tmp;