Why is there no obvious timing difference between these two ways of writing to global memory?

Way 1:

#define AIR(x) (IR(x) & 0x7fffffff)   // clear the sign bit -> bit pattern of |x|
#define IR(x)  ((udword&)(x))         // reinterpret the float's bits as a udword

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid = __umul24(ty, bw) + tx;
udword bid = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;
__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0 = d_Sorted[objID];      // sorted position of this thread's object
float3 cen0 = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index = index0 + 1;
udword offset = 0;
udword i = 0;
while (i < 32)
{
    float3 cen = pairsCen[index];
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))            // fabsf(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))        // fabsf(t2) <= e2
        {
            // one scattered store per hit, landing in a different row each time
            d_mulpairs[th_id + offset * totalThreadNum] = start_index[index] + 1;
            offset++;
        }
    }
    index++;
    i++;
}
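(For readers unfamiliar with the AIR/IR idiom above: clearing the sign bit of an IEEE-754 float leaves the bit pattern of its absolute value, and for non-negative floats the unsigned bit patterns sort in the same order as the values, so the integer compare matches fabsf as long as e1/e2 are non-negative, which extents are. A minimal host-side check of my own, separate from the kernel:

// Host-side sanity check of the AIR/IR comparison (my own illustration).
// The cast through a reference technically breaks strict aliasing; it is
// written this way only to mirror the kernel's macros.
#include <cstdio>
#include <cmath>

typedef unsigned int udword;
#define AIR(x) (IR(x) & 0x7fffffff)
#define IR(x)  ((udword&)(x))

int main()
{
    float t = -3.5f, e = 4.0f;                                 // e >= 0, like an extent
    printf("AIR(t) <= IR(e): %d\n", (int)(AIR(t) <= IR(e)));   // prints 1
    printf("fabsf(t) <= e:   %d\n", (int)(fabsf(t) <= e));     // prints 1
    return 0;
}
)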
Way 2:

#define AIR(x) (IR(x) & 0x7fffffff)   // clear the sign bit -> bit pattern of |x|
#define IR(x)  ((udword&)(x))         // reinterpret the float's bits as a udword

kernel code:

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid = __umul24(ty, bw) + tx;
udword bid = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;
__shared__ float3 pairsCen[512];
__shared__ float3 pairsExt[512];

udword index0 = d_Sorted[objID];      // sorted position of this thread's object
float3 cen0 = pairsCen[index0];
float3 extent0 = pairsExt[index0];
udword index = index0 + 1;
udword offset = 0;
udword indexBits = 0;
udword mask = 1;
udword i = 0;
while (i < 32)
{
    float3 cen = pairsCen[index];
    float3 extent = pairsExt[index];
    float t1 = cen0.y - cen.y;
    float e1 = extent0.y + extent.y;

    if (AIR(t1) <= IR(e1))            // fabsf(t1) <= e1
    {
        float t2 = cen0.z - cen.z;
        float e2 = extent0.z + extent.z;
        if (AIR(t2) <= IR(e2))        // fabsf(t2) <= e2
        {
            // record the hit in a register-resident bitmask instead of storing it
            indexBits |= (mask << offset);
            offset++;
        }
    }
    index++;
    i++;
}
__syncthreads();
d_mulpairs[th_id] = indexBits;        // one coalesced store per thread
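(A side note on consuming the packed result: each thread's overlap count can later be recovered from its bitmask with a population count. A minimal sketch of such a follow-up kernel; countOverlaps and d_counts are names I made up:

// Hypothetical follow-up kernel: __popc returns the number of set bits,
// i.e. how many of the 32 candidates overlapped for this thread.
__global__ void countOverlaps(const unsigned int *d_mulpairs, unsigned int *d_counts)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    d_counts[t] = __popc(d_mulpairs[t]);
}
)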

In Way 1 the writes to global memory are scattered, while in Way 2 they are contiguous (one coalesced store per thread), so Way 2 should be faster than Way 1.
But the timing results are similar, and I don't understand why. I also checked the PTX code and found that all the local variables are kept in registers,
and register access is supposed to be fast. Can anybody give me some advice? Thanks very much!
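One way to check whether the store pattern really dominates is to time the two write patterns in isolation, without the overlap test. A rough sketch; the kernel names and the (k ^ tid) predicate are my own stand-ins, not the real test:

// Way 1 pattern: a divergent, strided store per "hit".
// out must hold at least 16 * totalThreads elements here.
__global__ void scatteredStores(unsigned int *out, unsigned int totalThreads)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int offset = 0;
    for (unsigned int k = 0; k < 32; ++k)
        if ((k ^ t) & 1)                           // arbitrary stand-in for the overlap test
            out[t + offset++ * totalThreads] = k;  // scattered store
}

// Way 2 pattern: accumulate hits in a register, then one coalesced store.
// out must hold at least totalThreads elements here.
__global__ void packedStore(unsigned int *out)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int bits = 0;
    for (unsigned int k = 0; k < 32; ++k)
        if ((k ^ t) & 1)
            bits |= 1u << k;                       // register-resident bitmask
    out[t] = bits;                                 // single coalesced store
}

Timing each kernel with cudaEvent records around the launch should show how much difference the store pattern alone makes.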

Is the reason the read-after-write register latency?

(It is best to schedule at least 192 active threads per SM: since 16 threads issue in one clock, 12 * 16 = 192 threads will cover the 12-clock read-after-write register latency.)
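Spelled out, the arithmetic the rule implies, with my own occupancy plugged in (a trivial check, not part of the kernel):

#include <cstdio>
int main()
{
    const unsigned int needed = 12 * 16;  // 12-clock RAW latency x 16 threads per clock
    const unsigned int active = 256;      // active threads per SM in my kernel
    printf("needed=%u, active=%u -> latency %s be hidden\n",
           needed, active, active >= needed ? "should" : "cannot");  // needed=192
    return 0;
}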

But in my case I am already using 512 threads, with 256 active threads per SM.

Why do I still see a long register latency?