question in the sample code (

__global__ void init_array(int *g_data, int *factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = *factor; // uncoalesced on purpose to burn some time
}

The above code is quoted from

The comment says this is an uncoalesced memory access pattern, but it seems to me that it is coalesced.

tid 0 accesses g_data[0], tid 1 accesses g_data[1], and so on, and each element of g_data is a 4-byte int (aligned). I think this meets the definition of coalesced access unless the base address of g_data is unaligned.

Can anyone help identify where I am wrong?


You are correct: that write is coalesced unless blockDim.x is not a multiple of 32. I glanced at the simpleStreams code and it does seem that blockDim.x is a multiple of 32, so it is coalesced.

The reads and writes of data pointed to by g_data are coalesced. It’s the read of factor that’s not coalesced: all 16 threads in a half-warp read the same address from global memory. The sample was designed to illustrate the benefit and use of streams, so everything (including the kernel) was kept really simple. Reading factor is uncoalesced so that the kernel takes more time and the benefit of overlap is observable (if you change factor to be passed as an int, rather than a pointer, you’ll see that the kernel time goes down significantly).
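
To try the comparison described above, here is a minimal sketch that times both forms of the kernel with CUDA events. The kernel names init_array_ptr and init_array_val, the array size, and the launch configuration are assumptions made for illustration and are not taken from the simpleStreams sample. How large the gap is depends on the GPU: hardware newer than what this thread discusses can broadcast and cache a same-address read, so the difference may be much smaller there.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Form quoted in the question: every thread dereferences the same
// global-memory address to fetch factor.
__global__ void init_array_ptr(int *g_data, const int *factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = *factor;
}

// Hypothetical variant: factor is a kernel argument passed by value,
// so the kernel does not read it from global memory.
__global__ void init_array_val(int *g_data, int factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = factor;
}

int main()
{
    const int n = 1 << 24;           // assumed size, a multiple of the block size
    const int threads = 512;         // a multiple of 32, so g_data writes coalesce
    const int blocks = n / threads;
    const int h_factor = 5;

    int *d_data = nullptr;
    int *d_factor = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_factor, sizeof(int));
    cudaMemcpy(d_factor, &h_factor, sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_ptr = 0.0f, ms_val = 0.0f;

    // Time the pointer version.
    cudaEventRecord(start);
    init_array_ptr<<<blocks, threads>>>(d_data, d_factor);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_ptr, start, stop);

    // Time the by-value version.
    cudaEventRecord(start);
    init_array_val<<<blocks, threads>>>(d_data, h_factor);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_val, start, stop);

    printf("factor by pointer: %.3f ms\n", ms_ptr);
    printf("factor by value:   %.3f ms\n", ms_val);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_factor);
    cudaFree(d_data);
    return 0;
}
```

A single launch of each kernel gives only a rough number; repeating each launch several times and averaging would give a steadier comparison.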



I had assumed “factor” was a register variable, since by the naming convention a pointer into global memory would be named g_factor.