Write to shared memory

GoGdizzY · June 30, 2009, 2:41am

Recently I started to learn CUDA programming, and be attracted at once. Now I want to program a FFT code with CUDA. Before this, I tested a 2-dimension using CUFFT lib, it aquire a good performance in about 25000000 clock ticks( one third for memory copy). But my code do 1-dimension FFT will cost the same time. I checked it, found that it’s very time-consuming while writing shared memory.

To avoid bank conflict, I set stride to be a big odd number, where’s my fault? Can you help me and give me some advice? thank you!

[codebox]#define MATRIX_SIZE (1<<9)

#define BLOCK_SIZE (1<<9)

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

shared DATA_TYPE s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

shared DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

pos = tid * stride % MATRIX_SIZE;

s_DataR[pos] = dg_buffer[blockIdx.x * BLOCK_SIZE + rev]; //------------------------------------time-consuming

s_DataI[pos] = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev]; //-----------------------------------time-consuming

float theta = GV_2PI / N;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

__syncthreads();

int step, w;

for(step = 1;step<N;step=step*2)

{

if(tid & step)

{

w = ( tid & ( step - 1 ) ) * stride % MATRIX_SIZE;

DATA_TYPE tempR = s_DataR[pos] * s_CosTable[w] + s_DataI[pos] * s_SinTable[w];

DATA_TYPE tempI = s_DataI[pos] * s_CosTable[w] - s_DataR[pos] * s_SinTable[w];

pre = ( tid - step ) * stride % MATRIX_SIZE;

s_DataR[pos] = s_DataR[pre] - tempR; //-----------------------------------------------time-consuming

s_DataI[pos] = s_DataI[pre] - tempI; //-----------------------------------------------time-consuming

s_DataR[pre] += tempR; //-----------------------------------------------time-consuming

s_DataI[pre] += tempI; //-----------------------------------------------time-consuming

}

__syncthreads();

}

}[/codebox]

Nico · June 30, 2009, 8:46am

Quote from the cuda programming guide:

N.

GoGdizzY · June 30, 2009, 9:57am

Thank you, Nico! I’ll remember these skills.

But the bottleneck is store data to shared memory. As tested, transfer data from/to host will cost 7,500,000 clock ticks.

And CUFFT execution costs about 17,000,000 ticks.

In each kernel, My code read Global Memory 2 times, read Shared Memory many times(this seems not time-consuming),
and write Shared Memory many times(about 40 times, this costs a lot of time).

Read/Write Shared memory cost about 11,000,000 ticks.

Is that Writing will cost order of magnitude times of Reading? Or maybe my code has bank conflict?

Nico · June 30, 2009, 10:18am

What DATA_TYPE are you using?

N.

GoGdizzY · July 2, 2009, 5:01am

float

Nico · July 2, 2009, 7:25am

Those global memory accesses probably aren’t coalesced.

Here are a number of bank conflicts, depending on the value of step.

If you commented out the writes to shared memory in the loop for ‘timing purposes’, then the compiler probably optimized the loop and threw away the calculation of tempR and templ entirely.

But if you uncomment them, then these calculations need to be performed (which are IMO the slow parts of the loop because the pos and pre indices do not generate any bank conflicts)

N.

GoGdizzY · July 2, 2009, 8:19am

Now I comment the other code, just compare the followings, and the results were also strange:

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_DataR[pos] = tempR;

s_DataI[pos] = tempI;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 11,500,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_DataR[pos] = tid;

s_DataI[pos] = tid;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,160,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

tempR = dg_buffer[blockIdx.x * BLOCK_SIZE + rev];

tempI = dg_buffer[N*N + blockIdx.x * BLOCK_SIZE + rev];

__threadfence_block();

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,140,000 ticks

[codebox]

global

void FFT_2D_Radix2(DATA_TYPE* dg_buffer, int N )

{

int tid, rev, pos, pre, stride = 33;

tid = threadIdx.x;

rev = bit_reverse3(tid, tail_zero_nums(N));

__shared__ DATA_TYPE  s_DataR[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_DataI[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_CosTable[MATRIX_SIZE]; // 512*4 = 2kB

__shared__ DATA_TYPE s_SinTable[MATRIX_SIZE]; // 512*4 = 2kB

float theta = GV_2PI / N;

pos = (tid * stride) & (MATRIX_SIZE - 1);

DATA_TYPE tempR, tempI;

__threadfence_block();

s_DataR[pos] = tempR;

s_DataI[pos] = tempI;

s_SinTable[pos] = __sinf( theta * tid );

s_CosTable[pos] = __cosf( theta * tid );

}

[/codebox]

timing: 8,220,000

GoGdizzY · July 2, 2009, 8:59am

OK, all are because of optimization.

Thank you, Nico!

cvnguyen · July 2, 2009, 9:25am

Reading the global memory in bit-reverse pattern is not coalesced at all. This task would be dominant in your kernel.

You can read the input data from global memory in linear pattern first and then reorganize them in bit-reverse pattern. Otherwise, you can use the reverse butterfly algorithm.

cvnguyen · July 2, 2009, 9:48am

for(step = 1;step<N;step=step*2)

 {

  if(tid & step)

  {

   ...............

   pre = ( tid - step ) * stride % MATRIX_SIZE;

s_DataR[pos] = s_DataR[pre] - tempR;	//-----------------------------------------------time-consuming 

   s_DataI[pos] = s_DataI[pre] - tempI;		 //-----------------------------------------------time-consuming 

   s_DataR[pre] += tempR;						//-----------------------------------------------time-consuming 

   s_DataI[pre] += tempI;						   //-----------------------------------------------time-consuming 

  }

  __syncthreads();

 }

}

For step < 32, there is only 50% warp occupancy inside the above IF.

I think your ‘pre’ and ‘pos’ will be also free of bank conflict without the stride.

cvnguyen · July 2, 2009, 9:50am

Your algorithm is:

External Image

The reverse is:

External Image

GoGdizzY · July 2, 2009, 9:57am

ok, thanks!

GoGdizzY · July 2, 2009, 10:05am

Sorry, another question!

what is the benifit of the Reverse Algorithm?

It still need to re-arrange the order of the arrays after butterfly transforming.

cvnguyen · July 2, 2009, 10:19am

Well, both of them are similar, but I found that some people prefer one to the other.

You can make your global reads coalesced as follows:

tmp = global_data[tid];

shared_data[bit_reverse(tid)] = tmp;