I have a kernel whose job is to form the element-by-element product of a large number of matrices. In slightly abbreviated code:
[codebox]__global__ void MakeFG( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
    // Kernel to form the FG products
    // Assumes that one block is launched for each entry in the G array
    // Gs is actually G[nGs][ARRAY_SIZE], flattened out for the GPU
    // Also assumes that the number of threads per block divides ARRAY_SIZE
    const unsigned int tx = threadIdx.x;
    const unsigned int bx = blockIdx.x;
    const unsigned int by = blockIdx.y;

    // Location of the first element of the array worked on by this block
    const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

    for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
        // cuCmulf from cuComplex.h, since cuFloatComplex has no built-in operator*
        FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
    }
}
…
threads.x = kFGProductKernelSize;
threads.y = threads.z = 1;
grid.x = BLOCKS_PER_ROW;
grid.y = nGs / BLOCKS_PER_ROW;
grid.z = 1;
MakeFG<<<grid,threads>>>( d_Gs, d_F, d_FGs );
[/codebox]
where [font="Courier New"]ARRAY_SIZE[/font] is 1024, [font="Courier New"]kFGProductKernelSize[/font] is 256, [font="Courier New"]BLOCKS_PER_ROW[/font] is 1024 and [font="Courier New"]nGs[/font] is 131072. Note that with this setup, [font="Courier New"]ARRAY_SIZE/blockDim.x[/font] is always an integer. This kernel runs in about 40 ms.
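(So with these values the grid is 1024 × 128 = 131072 blocks, i.e. one block per G array, and each thread's loop runs [font="Courier New"]ARRAY_SIZE/blockDim.x[/font] = 1024/256 = 4 times.)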
Noting that the same [font="Courier New"]F[ARRAY_SIZE][/font] is used by all the blocks, I thought I'd move it into constant memory on the device, declared as [font="Courier New"]__constant__ cuFloatComplex dc_F[ARRAY_SIZE][/font]. The kernel then became
[codebox]__global__ void MakeFGWithConstant( const cuFloatComplex *Gs, cuFloatComplex *FGs ) {
    // As MakeFG, but reads F from the __constant__ array dc_F
    const unsigned int tx = threadIdx.x;
    const unsigned int bx = blockIdx.x;
    const unsigned int by = blockIdx.y;

    // Location of the first element of the array worked on by this block
    const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

    for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
        FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( dc_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
    }
}[/codebox]
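For completeness, [font="Courier New"]dc_F[/font] is declared at file scope and filled from the host before the launch, along these lines ([font="Courier New"]h_F[/font] is just my name here for the host-side copy of F):
[codebox]// File-scope declaration, visible to the kernel
__constant__ cuFloatComplex dc_F[ARRAY_SIZE];

// Host side: copy F into constant memory, then launch
cudaMemcpyToSymbol( dc_F, h_F, ARRAY_SIZE*sizeof(cuFloatComplex) );
MakeFGWithConstant<<<grid,threads>>>( d_Gs, d_FGs );[/codebox]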
To my surprise, this kernel takes about 80 ms to run, and according to [font="Courier New"]cudaprof[/font], a lot of warps are getting serialised.
What's going on here? Is access to constant memory always serialised? I had really expected constant memory to make things faster, since the entire [font="Courier New"]dc_F[/font] array should be cached on each multiprocessor once the first block on that multiprocessor has completed.