Why are the timings of these two ways similar? One writes to global memory randomly, the other writes contiguously.

Way 1:

#define IR(x)  ((udword&)(x))
#define AIR(x) (IR(x) & 0x7fffffff)

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid   = __umul24(ty, bw) + tx;
udword bid   = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;

__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0  = d_Sorted[objID];        // index into the 512-entry shared arrays
float3 cen0    = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index   = index0 + 1;
udword offset  = 0;
udword i       = 0;
while (i < 32)
{
    float3 cen    = pairsCen[index];
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))               // fabs(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))           // fabs(t2) <= e2
        {
            // the store address depends on the per-thread hit count 'offset',
            // so threads of the same warp can write to scattered addresses
            d_mulpairs[th_id + offset * totalThreadNum] = *(start_index + index) + 1;
            offset++;
        }
    }
    index++;
    i++;
}
Way 2:

#define IR(x)  ((udword&)(x))
#define AIR(x) (IR(x) & 0x7fffffff)

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid   = __umul24(ty, bw) + tx;
udword bid   = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;

__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0  = d_Sorted[objID];        // index into the 512-entry shared arrays
float3 cen0    = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index   = index0 + 1;
udword offset  = 0;
udword indexBits = 0;
udword mask      = 1;
udword i         = 0;
while (i < 32)
{
    float3 cen    = pairsCen[index];     // same shared-memory reads as in way 1
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))               // fabs(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))           // fabs(t2) <= e2
        {
            // record the hit in a per-thread bitmask instead of storing it out
            indexBits = indexBits | (mask << offset);
            offset++;
        }
    }
    index++;
    i++;
}
__syncthreads();
// one store per thread; consecutive th_id write consecutive addresses
d_mulpairs[th_id] = indexBits;

In way 1 the writes to global memory are scattered, while in way 2 they are contiguous, so way 2 should be faster than way 1.
But the timing results are similar, and I don't understand why. I also checked the PTX code and found that all the local variables are kept in registers,
and register access is fast. Can anybody give me some advice? Thanks very much!
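
To make the two store patterns concrete, here is a toy sketch (not my real kernel; d_scattered, d_packed and the fake hit count are only for illustration):

typedef unsigned int udword;

__global__ void storePatterns(udword* d_scattered, udword* d_packed,
                              udword totalThreadNum)
{
    udword th_id  = blockIdx.x * blockDim.x + threadIdx.x;
    udword offset = th_id & 3;   // stand-in for the data-dependent hit count

    // way-1 style: the row depends on the per-thread offset, so threads of the
    // same half-warp can write to addresses totalThreadNum apart and the
    // stores cannot all be combined
    // (assumes d_scattered has room for 4 * totalThreadNum entries)
    d_scattered[th_id + offset * totalThreadNum] = offset;

    // way-2 style: one word per thread at consecutive addresses, so the
    // half-warp's stores coalesce into a few transactions
    d_packed[th_id] = offset;
}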

In my opinion I would check two main points:

  1. The size of the processed data; it may be too small to show the advantage of coalesced reads and writes.
  2. The counters from the CUDA Profiler; there are many useful counters for evaluating a CUDA program. In this case you should pay attention to incoherent loads and stores and to warp serialization (bank conflicts).

Good luck. :)

Thanks. I ran the CUDA Profiler; both versions show warp serialization.

But if I remove the if statements,

//if(AIR(t1) <= IR(e1)) //(fabs(t1)<=e1)
{
    //if(AIR(t2) <= IR(e2)) //(fabs(t2)<=e2)
    {
    }
}

then the warp serialization disappears and the timing drops a lot.

But if I modify the if parts like this,

float t1 = cen0.y - cen.y;
float e1 = extent0.y + extent.y;
//if(AIR(t1) <= IR(e1)) //(fabs(t1)<=e1)
{
    float t2 = cen0.z - cen.z;
    float e2 = extent0.z + extent.z;
    //if(AIR(t2) <= IR(e2)) //(fabs(t2)<=e2)
    {
        mask = AIR(t1) <= IR(e1) && AIR(t2) <= IR(e2);
        indexBits = indexBits | (mask << offset);
        //d_mulpairs[th_id + offset * totalThreadNum] = *(start_index + index) + 1;
        offset++;
    }
}

the warp serialization is still there and the timing does not change.

I wonder what causes the warp serialization. Is it register read-after-write latency?

Can anyone help me? Thanks very much!

I use 16 registers per thread and 12,352 bytes of shared memory per block.
Occupancy is 50%, because the shared memory usage is more than 8 KB.
My GPU is a GTX 280.
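
That 50% figure is consistent with the GTX 280 limits (compute capability 1.3: 16 KB shared memory and at most 1024 resident threads per SM):

16,384 B / 12,352 B per block      -> only 1 resident block fits per SM
1 block x 512 threads / 1024 max   -> 50% occupancy
512 threads x 16 registers = 8,192 -> well under the 16,384-register file, so shared memory is the limiter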

(Every operation has at least a 12-clock latency due to potential register read-after-write stalls.
That is why we should schedule at least 192 active threads per SM: since 16 threads operate in one clock,
12 x 16 = 192 threads cover the 12-clock latency.)
But in my case I am already using 512 threads, with 256 active threads per SM. Why do I still see a long register latency?

So, for way 1, is the problem caused by the while loop? It makes each thread access global memory many times.
If each thread wrote to global memory randomly only once, I think it should be faster.
Can anyone give me some advice about this? Thanks a lot!

Sorry, I made a mistake. The expression for the variable index is actually:

index = th_id == 0 ? index0 + 1 : index0 + th_id * 32;

So I find that it causes shared memory bank conflicts in

float3 cen = pairsCen[index];
float3 extent = pairsExt[index];

but I have no idea how to solve it. Can anyone give me any advice?

Thanks very much!

If I change it to index = th_id == 0 ? index0 + 1 : index0 + th_id, the timing is reduced significantly.
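
For reference, my understanding of why the stride-32 version conflicts and the stride-1 version does not: on a GTX 280 shared memory has 16 banks of 32-bit words and a float3 is 3 words, so with index = th_id * 32 the starting word offsets of the threads in a half-warp differ by multiples of 96, which is a multiple of 16, so every thread hits the same bank (a 16-way conflict, which the profiler reports as warp serialization). With a stride of 1 float3 (3 words, an odd count) the half-warp spreads over all 16 banks. If the per-thread groups of 32 entries are really needed, padding each group with one dummy entry also removes the conflict; a rough sketch, with array sizes and names made up for illustration:

typedef unsigned int udword;

#define GROUP     32
#define GROUP_PAD (GROUP + 1)        // pad each group of 32 with one dummy entry
#define NGROUPS   15                 // 15 x 33 = 495 float3, still below 512

__global__ void scanGroupsPadded(const float3* d_in, float* d_out)
{
    __shared__ float3 cenPad[NGROUPS * GROUP_PAD];

    udword th_id = threadIdx.x;

    // cooperative load of the padded layout (details not important here)
    for (udword k = th_id; k < NGROUPS * GROUP_PAD; k += blockDim.x)
        cenPad[k] = d_in[k];
    __syncthreads();

    if (th_id >= NGROUPS)
        return;

    // word offset per thread is 99 * th_id; since gcd(99, 16) = 1, the active
    // threads of a half-warp now hit different banks instead of the same one
    udword index = th_id * GROUP_PAD;

    float acc = 0.0f;
    for (udword i = 0; i < GROUP; ++i, ++index)
        acc += cenPad[index].y;      // stand-in for the overlap tests

    d_out[th_id] = acc;
}

The other option, which is what the index0 + th_id change effectively does, is to lay the data out so that consecutive threads read consecutive entries.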