The NVIDIA CUDA Programming Guide says (Section 3.2) that global, local, and texture memory have the greatest access latency, followed by constant memory, registers, and shared memory.
But check this out.

1. The first kernel:

__global__ void test()
{
    __shared__ int data[4];   // shared memory
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

It took 0.037952 milliseconds.
2. The second kernel:

__global__ void test()
{
    int data[4];   // not shared memory any more
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

This time it took only 0.007008 milliseconds. Does shared memory really have the lowest latency? What happened? Any ideas? Thanks!
Open64 has a very aggressive dead code removal algorithm. I am willing to bet that your second kernel is compiled to an empty stub which does absolutely nothing.
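One way to check this is to give the kernel an observable side effect, so the compiler cannot prove the loop is dead. A minimal sketch, assuming a hypothetical device pointer `out` allocated and read back by the host (not part of the original test):

```cuda
// Sketch: writing the result to global memory makes the loop's work
// observable, so dead-code removal cannot delete it. "out" is a
// hypothetical device pointer supplied by the host via cudaMalloc.
__global__ void test(int *out)
{
    int data[4];                 // per-thread (local) array
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
    out[0] = data[0];            // observable side effect defeats elimination
}
```

With the write to `out` in place, the two variants should be timed on equal footing, since neither loop can be compiled away.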
Also, data[4] is only ever accessed with a constant index (data[0]). In that case nvcc may promote the array into scalar-processor registers, so the test is irrelevant, it is not measuring local memory at all. To actually test local memory, you could do it like this:
int cx = 0;
…
data[cx] = …;
cx ^= 1; // alternate between 0 and 1 as cheaply as possible (one GPU cycle), forcing use of local memory
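Putting both suggestions together, a sketch of what such a local-memory test might look like (the `out` pointer and the `sum` accumulator are my additions for illustration, not from the thread):

```cuda
// Sketch: a dynamic index (cx) discourages nvcc from promoting the array
// into registers, and writing the sum to global memory prevents dead-code
// removal. "out" is a hypothetical device pointer supplied by the host.
__global__ void test_local(int *out)
{
    int data[2];
    int cx  = 0;
    int sum = 0;
    for (int count = 0; count < 512; count++)
    {
        data[cx] = count;        // dynamically indexed -> kept in local memory
        sum += data[cx];
        cx ^= 1;                 // alternate between 0 and 1 (one XOR)
    }
    out[0] = sum;                // observable result defeats dead-code removal
}
```

Note this is only a sketch; a sufficiently aggressive compiler could still unroll the fixed-trip-count loop and constant-fold the indices, so it is worth inspecting the generated PTX to confirm local memory is actually used.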
After I passed an argument to the kernel to store the sum, and checked the result on the CPU side, the compiler finally stopped treating the job as meaningless.
At the same time, using a dynamic index into the data array is also important for this test. Thanks, iAPX.