shared memory access time?

I made a kernel to access shared memory.
I executed it 5 times.

Access time

(1) 0.180000 ms
(2) 0.021000 ms
(3) 0.014000 ms
(4) 0.014000 ms
(5) 0.014000 ms

I expected the shared memory access time to be the same on every run.
I don’t know why it differs.

[source code]
__global__ void speed_check(…)
{
    __shared__ int s_m[1024];

    for ( i = 0; i < 1024; i++ )
    {
        tmp += s_m[i];
    }
    …
}

int main()
{
    …
    for ( i = 0; i < 5; i++ )
    {
        cutStartTimer(timer[i]);
        speed_check<<< 1, 1 >>>(…);
        cudaThreadSynchronize();
        cutStopTimer(timer[i]);
    }
    …
}

The timer’s resolution is probably too coarse to be trusted down to microseconds. Try tests that are about 100X more intensive; going by your numbers, that is still only a millisecond or two per run.
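One way to act on this is to give each launch more work and to time it with CUDA events rather than a host timer. What follows is only a minimal sketch under assumptions: `busy_kernel`, its `repeat` parameter, and the value 100 are made up for illustration and are not from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under test. It sums a shared array
// `repeat` times and stores the result so the work has a visible effect.
__global__ void busy_kernel(int *out, int repeat)
{
    __shared__ int s_m[1024];

    // Launched with a single thread, so no __syncthreads() is needed.
    for (int i = 0; i < 1024; i++)
        s_m[i] = i;

    int tmp = 0;
    for (int r = 0; r < repeat; r++)
        for (int i = 0; i < 1024; i++)
            tmp += s_m[i] ^ r;   // mix in r so each pass differs

    *out = tmp;
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int run = 0; run < 5; run++) {
        cudaEventRecord(start);
        busy_kernel<<<1, 1>>>(d_out, 100);   // assumed: 100X the original work
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);          // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("(%d) %f ms\n", run + 1, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

The five-run loop mirrors the one in the question, so the per-run numbers can be compared directly with the original list.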

Also, you didn’t post your full source code, but if it is essentially what your fragment shows, the optimizer might realize that you’re not doing any net work and reduce your whole kernel to an empty no-op.
A fake final dependence such as if (tmp==0x1234567) *somedevicemem=1; would be enough. The tmp compare won’t succeed, but the compiler can’t know that, so it won’t optimize out your compute.
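Here is a rough sketch of that fake dependence applied to the kernel from the question. The `seed` and `flag` parameters are assumptions added for illustration: `seed` keeps the array contents unknown at compile time, and `flag` plays the role of somedevicemem.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void speed_check(int seed, int *flag)
{
    __shared__ int s_m[1024];

    // Fill from a runtime value so the sum cannot be computed at compile time.
    // Launched with a single thread, so no __syncthreads() is needed.
    for (int i = 0; i < 1024; i++)
        s_m[i] = seed + i;

    int tmp = 0;
    for (int i = 0; i < 1024; i++)
        tmp += s_m[i];

    // Fake final dependence: with seed = 0 the sum is 523776, so this branch
    // is not taken, but the compiler has to keep the loop to evaluate it.
    if (tmp == 0x1234567)
        *flag = 1;
}

int main()
{
    int *d_flag = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    speed_check<<<1, 1>>>(0, d_flag);
    cudaDeviceSynchronize();

    int h_flag = -1;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("flag = %d\n", h_flag);   // stays 0 because the compare fails

    cudaFree(d_flag);
    return 0;
}
```

The write to `flag` never happens at run time, so it costs nothing to measure, yet it gives the summed value a use the compiler has to respect.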