Why are the timings of these two ways similar? One writes to global memory randomly, the other writes contiguously.

Way 1:

#define IR(x)  ((udword&)(x))
#define AIR(x) (IR(x) & 0x7fffffff)

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid   = __umul24(ty, bw) + tx;
udword bid   = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;

__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0  = d_Sorted[objID];        // index into the 512-entry shared arrays
float3 cen0    = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index   = index0 + 1;
udword offset  = 0;
udword i       = 0;
while (i < 32)
{
    float3 cen    = pairsCen[index];
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))               // fabs(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))           // fabs(t2) <= e2
        {
            // the store address depends on the per-thread hit count 'offset',
            // so threads of the same warp can write to scattered addresses
            d_mulpairs[th_id + offset * totalThreadNum] = *(start_index + index) + 1;
            offset++;
        }
    }
    index++;
    i++;
}
Way 2:

#define IR(x)  ((udword&)(x))
#define AIR(x) (IR(x) & 0x7fffffff)

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid   = __umul24(ty, bw) + tx;
udword bid   = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;

__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0  = d_Sorted[objID];        // index into the 512-entry shared arrays
float3 cen0    = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index   = index0 + 1;
udword offset  = 0;
udword indexBits = 0;
udword mask      = 1;
udword i         = 0;
while (i < 32)
{
    float3 cen    = pairsCen[index];     // same shared-memory reads as in way 1
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))               // fabs(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))           // fabs(t2) <= e2
        {
            // record the hit in a per-thread bitmask instead of storing it out
            indexBits = indexBits | (mask << offset);
            offset++;
        }
    }
    index++;
    i++;
}
__syncthreads();
// one store per thread; consecutive th_id write consecutive addresses
d_mulpairs[th_id] = indexBits;

In way 1 the writes to global memory are scattered, while in way 2 they are contiguous, so way 2 should be faster than way 1.
But the timing results are similar, and I don't understand why. I also checked the PTX code and found that all the local variables are kept in registers,
and register access is fast. Can anybody give me some advice? Thanks very much!
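
To make the two store patterns concrete, here is a toy sketch (not my real kernel; d_scattered, d_packed and the fake hit count are only for illustration):

typedef unsigned int udword;

__global__ void storePatterns(udword* d_scattered, udword* d_packed,
                              udword totalThreadNum)
{
    udword th_id  = blockIdx.x * blockDim.x + threadIdx.x;
    udword offset = th_id & 3;   // stand-in for the data-dependent hit count

    // way-1 style: the row depends on the per-thread offset, so threads of the
    // same half-warp can write to addresses totalThreadNum apart and the
    // stores cannot all be combined
    // (assumes d_scattered has room for 4 * totalThreadNum entries)
    d_scattered[th_id + offset * totalThreadNum] = offset;

    // way-2 style: one word per thread at consecutive addresses, so the
    // half-warp's stores coalesce into a few transactions
    d_packed[th_id] = offset;
}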

In my opinion I would check two main points:

  1. The size of the processed data; it may be too small to show the advantage of coalesced reads and writes.
  2. The counters from the CUDA Profiler; there are many useful counters for evaluating a CUDA program. In this case you should pay attention to incoherent loads and stores and to warp serialization (bank conflicts).

Good luck. :)

Thanks. I ran the CUDA Profiler; both versions show warp serialization.

But if I remove the if statements,

//if(AIR(t1) <= IR(e1)) //(fabs(t1)<=e1)
{
    //if(AIR(t2) <= IR(e2)) //(fabs(t2)<=e2)
    {
    }
}

then the warp serialization disappears and the timing drops a lot.

But if I modify the if parts like this,

float t1 = cen0.y - cen.y;
float e1 = extent0.y + extent.y;
//if(AIR(t1) <= IR(e1)) //(fabs(t1)<=e1)
{
    float t2 = cen0.z - cen.z;
    float e2 = extent0.z + extent.z;
    //if(AIR(t2) <= IR(e2)) //(fabs(t2)<=e2)
    {
        mask = AIR(t1) <= IR(e1) && AIR(t2) <= IR(e2);
        indexBits = indexBits | (mask << offset);
        //d_mulpairs[th_id + offset * totalThreadNum] = *(start_index + index) + 1;
        offset++;
    }
}

the warp serialization is still there and the timing does not change.

I wonder what causes the warp serialization. Is it register read-after-write latency?

Can anyone help me? Thanks very much!

I use 16 registers per thread and 12,352 bytes of shared memory per block.
Occupancy is 50%, because the shared memory usage is more than 8 KB.
My GPU is a GTX 280.
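
That 50% figure is consistent with the GTX 280 limits (compute capability 1.3: 16 KB shared memory and at most 1024 resident threads per SM):

16,384 B / 12,352 B per block      -> only 1 resident block fits per SM
1 block x 512 threads / 1024 max   -> 50% occupancy
512 threads x 16 registers = 8,192 -> well under the 16,384-register file, so shared memory is the limiter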

(Every operation has at least a 12-clock latency due to potential register read-after-write stalls.
That is why we should schedule at least 192 active threads per SM: since 16 threads operate in one clock,
12 x 16 = 192 threads cover the 12-clock latency.)
But in my case I am already using 512 threads, with 256 active threads per SM. Why do I still see a long register latency?

So, for way 1, is the problem caused by the while loop? It makes each thread access global memory many times.
If each thread wrote to global memory randomly only once, I think it should be faster.
Can anyone give me some advice about this? Thanks a lot!

Sorry, I made a mistake. The expression for the variable index is actually:

index = th_id == 0 ? index0 + 1 : index0 + th_id * 32;

So I find that it causes shared memory bank conflicts in

float3 cen = pairsCen[index];
float3 extent = pairsExt[index];

but I have no idea how to solve it. Can anyone give me any advice?

Thanks very much!

If I change it to index = th_id == 0 ? index0 + 1 : index0 + th_id, the timing is reduced significantly.
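
For reference, my understanding of why the stride-32 version conflicts and the stride-1 version does not: on a GTX 280 shared memory has 16 banks of 32-bit words and a float3 is 3 words, so with index = th_id * 32 the starting word offsets of the threads in a half-warp differ by multiples of 96, which is a multiple of 16, so every thread hits the same bank (a 16-way conflict, which the profiler reports as warp serialization). With a stride of 1 float3 (3 words, an odd count) the half-warp spreads over all 16 banks. If the per-thread groups of 32 entries are really needed, padding each group with one dummy entry also removes the conflict; a rough sketch, with array sizes and names made up for illustration:

typedef unsigned int udword;

#define GROUP     32
#define GROUP_PAD (GROUP + 1)        // pad each group of 32 with one dummy entry
#define NGROUPS   15                 // 15 x 33 = 495 float3, still below 512

__global__ void scanGroupsPadded(const float3* d_in, float* d_out)
{
    __shared__ float3 cenPad[NGROUPS * GROUP_PAD];

    udword th_id = threadIdx.x;

    // cooperative load of the padded layout (details not important here)
    for (udword k = th_id; k < NGROUPS * GROUP_PAD; k += blockDim.x)
        cenPad[k] = d_in[k];
    __syncthreads();

    if (th_id >= NGROUPS)
        return;

    // word offset per thread is 99 * th_id; since gcd(99, 16) = 1, the active
    // threads of a half-warp now hit different banks instead of the same one
    udword index = th_id * GROUP_PAD;

    float acc = 0.0f;
    for (udword i = 0; i < GROUP; ++i, ++index)
        acc += cenPad[index].y;      // stand-in for the overlap tests

    d_out[th_id] = acc;
}

The other option, which is what the index0 + th_id change effectively does, is to lay the data out so that consecutive threads read consecutive entries.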