Clever data rearrangement

Hello,

My application involves rearranging a four-dimensional data structure while weighting the values along one dimension, which must be the last one in order to satisfy the batched layout that cuFFT expects. I wonder whether the following code can be made faster, perhaps by using shared memory in some form, while still allowing a subsequent FFT:

[codebox]#define SAMPLES 512
#define RANGES 60
#define ROWS 8
#define COLS 8

// create a 4-dimensional structure with COLS*ROWS*SAMPLES horizontal bars for FFT
__global__ void
rowsKernel(FloatComplex* spectrum, FloatComplex* hbar, float* w)
{
    int k = blockIdx.x;   // sample time
    int h = blockIdx.y;   // range index
    int i = threadIdx.x;  // row
    int j = threadIdx.y;  // column

    // source layout: [range][row][column][sample]
    unsigned int source = h*SAMPLES*COLS*ROWS + i*SAMPLES*COLS + j*SAMPLES + k;
    // target layout: [range][sample][row][column], so the column index is contiguous
    unsigned int target = h*COLS*ROWS*SAMPLES + k*COLS*ROWS + i*COLS + j;

    // weight along the column dimension while copying
    hbar[target].x = spectrum[source].x * w[j];
    hbar[target].y = spectrum[source].y * w[j];
}

void runTest(){
    dim3 threads(ROWS, COLS);     // one thread per (row, column) pair
    dim3 grid(SAMPLES, RANGES);   // one block per (sample, range) pair

    rowsKernel<<< grid, threads >>>(d_spectrum, d_hbar, d_wh);
}
[/codebox]
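To pin down what the kernel is supposed to compute, the same mapping can be written as a plain host loop. This is only a sketch for checking results: `Complex` and `rearrange_reference` are placeholder names made up for illustration (they are not part of cuFFT or of the code above), and the sizes are passed as arguments instead of using the `#define`s.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder complex type standing in for the FloatComplex used by the kernel. */
typedef struct { float x, y; } Complex;

/* Host-side reference (hypothetical helper): copies spectrum, laid out as
 * [range][row][col][sample], into hbar, laid out as [range][sample][row][col],
 * scaling each value by the weight of its column. */
void rearrange_reference(const Complex *spectrum, Complex *hbar, const float *w,
                         size_t ranges, size_t rows, size_t cols, size_t samples)
{
    assert(spectrum != hbar);  /* out-of-place only; in-place would overwrite unread input */

    for (size_t h = 0; h < ranges; ++h)
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j)
                for (size_t k = 0; k < samples; ++k) {
                    /* Same index arithmetic as the kernel, in nested form. */
                    size_t source = ((h * rows + i) * cols + j) * samples + k;
                    size_t target = ((h * samples + k) * rows + i) * cols + j;
                    hbar[target].x = spectrum[source].x * w[j];
                    hbar[target].y = spectrum[source].y * w[j];
                }
}
```

Calling it with `RANGES, ROWS, COLS, SAMPLES` and comparing against `d_hbar` copied back to the host would be one way to check that a faster kernel still produces the layout the batched FFT needs.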

Any hints are very much appreciated,

peter