Prefetching an array sized differently from the block? Efficient 'staging'

I’m trying to design an efficient kernel, and right now everything looks like it will fall perfectly into place except the prefetch into shared memory, which I’m not sure how to deal with.

The block size I want to use is 16 x 16 threads, but the input data per block is 22 x 22 floats. I really want this to be in shared memory, as it gets 49 hits per element (the next kernel that the data goes to after this one is similar, but with 123 hits per element).

The method I’m thinking of right now is to do something like this:

#define BLOCK_WIDTH 16
#define ARRAY_WIDTH 22
#define IPT_WIDTH   1024

__global__ void myKernel( float* input, float* output )
{
    __shared__ float myInputs[ ARRAY_WIDTH * ARRAY_WIDTH ];

    if( threadIdx.x == 0 && threadIdx.y == 0 ) {
        int curIndex   = 0;
        int frameIndex = BLOCK_WIDTH * blockIdx.x
                       + IPT_WIDTH * BLOCK_WIDTH * blockIdx.y;
        int frameRetrace = IPT_WIDTH - ARRAY_WIDTH;

        for( int i = 0; i < ARRAY_WIDTH; i++ ) {
            for( int j = 0; j < ARRAY_WIDTH; j++ ) {
                myInputs[ curIndex++ ] = input[ frameIndex++ ];
            }
            frameIndex += frameRetrace;  // retrace to the start of the next row
        }
    }
    __syncthreads();

    /* rest of kernel code */
}

Can anyone tell me whether the above looks like a reasonable way to do this?


As I understand it, the problem is efficiently putting the 22x22 block into shared memory. It will be efficient if every thread reads its own piece of memory.

Try to do something like this:

union __align__(16) un16
{
    unsigned char c[16];
};

typedef un16 union128bit;

__global__ void myKernel( float* input )
{
    extern __shared__ float myInputs[];

    int threadnum = threadIdx.x + 16 * threadIdx.y;

    // 22*22*4 bytes = 16*121 bytes, so 121 threads copy 16 bytes each
    if( threadnum < 121 )
    {
        int index;  // here calculating index

        // one 128-bit (16-byte) read per thread
        *(union128bit*)&myInputs[ 4 * threadnum ] = *(union128bit*)&input[ index ];
    }
    __syncthreads();

    // process shared
}


For efficient execution of this code, use CUDA 0.9.
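Since the kernel above declares the buffer as extern __shared__, the shared-memory size has to be supplied as the third parameter of the execution configuration at launch. A minimal sketch of the host-side call (the grid dimensions are assumptions based on the 1024-wide input from the first post):

```cuda
// 1024x1024 input tiled by 16x16 blocks (dimensions assumed, not from this post)
dim3 grid( IPT_WIDTH / BLOCK_WIDTH, IPT_WIDTH / BLOCK_WIDTH );
dim3 block( BLOCK_WIDTH, BLOCK_WIDTH );

// third argument = bytes of dynamic shared memory backing extern __shared__ myInputs[]
myKernel<<< grid, block, ARRAY_WIDTH * ARRAY_WIDTH * sizeof(float) >>>( d_input );
```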

I think I can see what you’re saying, but I don’t really understand why it would help to have more than 16 threads at a time calculating indices / making requests (assuming there’s one control unit per multiprocessor to resolve opcodes with). The only thing I can think of is that these accesses aren’t really turning up in the compiled kernel at all, but instead get preprocessed and diverted to some kind of programmable memory controller?

Looking back, though, it does appear that what I had would be slow, as the GPU will treat those all as separate accesses, and I would only be getting 32 bits out of 384 (plus, the row / column strobes will probably get triggered each time too). This leads me to think that I do need to use something like what the above post describes so the chip’s batching feature will get used.

What would be really nice, though, would be some kind of device memcpy instruction that would let me directly request an n-byte continuous read / write, like this:

prefetch_to_shared( global address, shared address, bytes )
dump_from_shared( global address, shared address, bytes )


There’s no way the G80 doesn’t have a feature like this, as it’s the only way to get decent performance from DRAM. Maybe there is (or should be) a way to access it by itself, rather than indirectly through per-thread access requests?


I don’t want to keep bumping this thread needlessly, so I’ll just put an update here:

Using a for loop in which 22 threads each read one contiguous section at a time makes it go fast enough that I no longer care about speeding it up.
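For anyone finding this later, here's a minimal sketch of one reading of that staging loop: the first 22 threads of the block each copy one contiguous 22-float row of the tile. The exact global offset depends on how the 22x22 tile is aligned with the 16x16 block (halo placement isn't specified above), so the rowStart arithmetic here is an assumption:

```cuda
#define BLOCK_WIDTH 16
#define ARRAY_WIDTH 22
#define IPT_WIDTH   1024

__global__ void myKernel( float* input, float* output )
{
    __shared__ float myInputs[ ARRAY_WIDTH * ARRAY_WIDTH ];

    // flatten the 16x16 thread IDs into 0..255
    int tid = threadIdx.x + BLOCK_WIDTH * threadIdx.y;

    // the first 22 threads each read one contiguous row of the tile,
    // so every global read is a single contiguous stretch of 22 floats
    if( tid < ARRAY_WIDTH ) {
        // assumed tile origin: top-left of this block's region (no halo offset shown)
        int rowStart = BLOCK_WIDTH * blockIdx.x
                     + IPT_WIDTH * ( BLOCK_WIDTH * blockIdx.y + tid );

        for( int j = 0; j < ARRAY_WIDTH; j++ )
            myInputs[ ARRAY_WIDTH * tid + j ] = input[ rowStart + j ];
    }
    __syncthreads();  // all 256 threads wait until the tile is staged

    /* rest of kernel code */
}
```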

The operation I was porting was about a 1-second process on the CPU, while the G80 version is completing in about 7-8 milliseconds (from the time the kernel is called until the thread unblocks). By comparison, it took a LOT of effort to get the original version as fast as it was, while the GPU variant was just sort-of thrown together…