Hello,
My application rearranges a four-dimensional data structure while weighting the values along one dimension. That dimension must be the last one in order to satisfy the batch nature of cuFFT. I wonder how much faster the following code could be made, perhaps by using shared memory in some form, while still allowing a subsequent FFT:
[codebox]#define SAMPLES 512
#define RANGES 60
#define ROWS 8
#define COLS 8
// create a 4-dimensional structure with COLS*ROWS*SAMPLES horizontal bars for FFT
__global__ void
rowsKernel(FloatComplex* spectrum, FloatComplex* hbar, float* w)
{
    int k = blockIdx.x;   // sample time
    int h = blockIdx.y;   // range index
    int i = threadIdx.x;  // row
    int j = threadIdx.y;  // column
    // source layout: [range][row][column][sample]; target layout: [range][sample][row][column]
    unsigned int source = h*SAMPLES*COLS*ROWS + i*SAMPLES*COLS + j*SAMPLES + k;
    unsigned int target = h*COLS*ROWS*SAMPLES + k*COLS*ROWS + i*COLS + j;
    // weight along the column dimension while copying
    hbar[target].x = spectrum[source].x * w[j];
    hbar[target].y = spectrum[source].y * w[j];
}
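void runTest()
{
    // d_spectrum, d_hbar and d_wh are device buffers allocated elsewhere
    dim3 threads(ROWS, COLS);     // one thread per (row, column) pair
    dim3 grid(SAMPLES, RANGES);   // one block per (sample, range) pair
    rowsKernel<<< grid, threads >>>(d_spectrum, d_hbar, d_wh);
}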
[/codebox]
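To check any faster variant against the current behaviour, a host-side reference with the same index mapping could look roughly like the sketch below. It is not part of my actual code: `Complex`, `sourceIndex`, `targetIndex`, `rowsReference` and the `k`-prefixed constants are hypothetical names chosen for illustration, and `Complex` is only a stand-in for the device complex type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side copies of the dimensions used by the kernel.
constexpr std::size_t kSamples = 512;
constexpr std::size_t kRanges  = 60;
constexpr std::size_t kRows    = 8;
constexpr std::size_t kCols    = 8;

// Stand-in for the device complex type (assumption: two floats, x = real, y = imaginary).
struct Complex { float x; float y; };

// Source layout: [range h][row i][column j][sample k].
inline std::size_t sourceIndex(std::size_t h, std::size_t i, std::size_t j, std::size_t k)
{
    return h*kSamples*kCols*kRows + i*kSamples*kCols + j*kSamples + k;
}

// Target layout: [range h][sample k][row i][column j], so columns are contiguous.
inline std::size_t targetIndex(std::size_t h, std::size_t i, std::size_t j, std::size_t k)
{
    return h*kCols*kRows*kSamples + k*kCols*kRows + i*kCols + j;
}

// Same rearrangement and weighting as rowsKernel, done with plain loops on the host.
void rowsReference(const std::vector<Complex>& spectrum,
                   std::vector<Complex>& hbar,
                   const std::vector<float>& w)
{
    for (std::size_t h = 0; h < kRanges; ++h)
        for (std::size_t k = 0; k < kSamples; ++k)
            for (std::size_t i = 0; i < kRows; ++i)
                for (std::size_t j = 0; j < kCols; ++j) {
                    const Complex in = spectrum[sourceIndex(h, i, j, k)];
                    hbar[targetIndex(h, i, j, k)] = Complex{ in.x * w[j], in.y * w[j] };
                }
}
```

The output of any candidate kernel can be copied back to the host and compared element by element with `hbar` from `rowsReference`. The index functions also make the access pattern visible: for a fixed sample `k`, consecutive columns `j` are adjacent in the target but `kSamples` elements apart in the source.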
Any hints are very much appreciated,
peter