I’m trying to design an efficient kernel, and right now everything looks like it will fall into place except the prefetch into shared memory, which I’m not sure how to handle.
The block size I want to use is 16 x 16 threads, but the input data per block is 22 x 22 floats. I really want this to be in shared memory, as it gets 49 hits per element (the next kernel that the data goes to after this one is similar, but with 123 hits per element).
The method I’m thinking of right now is to do something like this:
#define BLOCK_WIDTH 16
#define ARRAY_WIDTH 22
#define IPT_WIDTH 1024
__global__ void myKernel( float* input, float* output ) {
    __shared__ float myInputs[ ARRAY_WIDTH * ARRAY_WIDTH ];
    // One thread per block does the whole prefetch.
    if( threadIdx.x == 0 && threadIdx.y == 0 ) {
        int curIndex = 0;
        // Scale the block offset by BLOCK_WIDTH so neighboring blocks
        // start 16 pixels apart, not 1.
        int frameIndex = BLOCK_WIDTH * ( blockIdx.x + IPT_WIDTH * blockIdx.y );
        // After reading a 22-wide tile row, retrace by IPT_WIDTH - ARRAY_WIDTH
        // (not IPT_WIDTH - BLOCK_WIDTH) to reach the start of the next image row.
        int frameRetrace = IPT_WIDTH - ARRAY_WIDTH;
        for( int i = 0; i < ARRAY_WIDTH; i++ ) {
            for( int j = 0; j < ARRAY_WIDTH; j++ ) {
                myInputs[ curIndex++ ] = input[ frameIndex++ ];
            }
            frameIndex += frameRetrace;
        }
    }
    __syncthreads();
    /* rest of kernel code */
}
Can anyone tell me whether the above looks like a reasonable way to do this?