__shared__ memory confused me.

The NVIDIA Programming Guide says in section 3.2 that global, local, and texture memory have the greatest access latency, followed by constant memory, registers, and shared memory.

But check this out:

1. The first time:

__global__ void test()
{
    __shared__ int data[4]; // shared memory
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

It took 0.037952 milliseconds.

2. The second:

__global__ void test()
{
    int data[4]; // not shared memory any more
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

And this time it took only 0.007008 milliseconds. Isn't shared memory supposed to have lower latency than this? What happened? Any ideas? Thanks!

The full code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>
#include <template_kernel.cu>
#define SIZE 1048577

__global__ void test()
{
    __shared__ int data[4];
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

int main()
{
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    test<<<1, 1>>>(); // one thread; is there any chance this causes a bank conflict?

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("%f\n", time);
    return 0;
}

Your comparison method is not sound.
The compiler will eliminate variables and operations whose results are never used.

Open64 has a very aggressive dead code removal algorithm. I am willing to bet that your second kernel is compiled to an empty stub which does absolutely nothing.
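To see the point concretely, here is a minimal sketch (hypothetical names, not the poster's code) of how to give the kernel an observable side effect so the compiler cannot discard the loop:

```cuda
// Sketch: write the loop's result to global memory so that
// nvcc/Open64 cannot eliminate the whole body as dead code.
// "out" is a hypothetical device pointer allocated by the host.
__global__ void test_kept(int *out)
{
    int data[4];
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
    *out = data[0]; // result escapes the kernel, so the loop must run
}
```

Without the final store, nothing the kernel computes is observable, and the compiler is free to reduce it to an empty stub.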

Thanks, you guys. I will try to find a more reasonable way to prove it to myself.

Anyway, data[4] is only ever accessed with a constant index: data[0].

In this case nvcc may optimize it into Scalar Processor registers, so the test is meaningless. To actually exercise local memory, you could do something like this:

int cx = 0;

data[cx] = …
cx ^= 1; // alternate between 0 and 1 as simply as possible (1 GPU cycle), forcing use of local memory

Problem solved.

After I passed an argument to the kernel to store the sum and checked it on the CPU, the compiler finally decided the job was no longer meaningless.

At the same time, using a dynamic index into the data array is also important for this test. Thanks, iAPX.

shared:

shared.gif

local:

local.gif

test.txt (878 Bytes)
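A minimal version of the fix described above might look like this (a sketch under the thread's assumptions, not the attached test.txt; error checking omitted):

```cuda
#include <cstdio>

__global__ void test_fixed(int *sum)
{
    int data[2];          // local array
    int cx = 0;
    int acc = 0;
    for (int count = 0; count < 512; count++)
    {
        data[cx] = count; // dynamic index, so the array cannot be
                          // promoted to registers
        acc += data[cx];
        cx ^= 1;          // alternate between 0 and 1
    }
    *sum = acc;           // checked on the CPU, so nothing is dead code
}

int main()
{
    int *d_sum, h_sum = 0;
    cudaMalloc(&d_sum, sizeof(int));
    test_fixed<<<1, 1>>>(d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_sum); // a sensible sum proves the loop really ran
    cudaFree(d_sum);
    return 0;
}
```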

Your results for local memory are surprisingly fast. I wonder if you are:

  • launching enough threads to saturate the memory bus
    (because one thread writing to the same exact place is not 768 threads writing to many places!)
  • using a larger data array, i.e. data[4096] with cx = (cx + 1) & 4095;
    to force the writes to go to different “local” memory locations

I think you should try it, because local memory is really, really slow!
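Combining both suggestions might look like the following sketch (hypothetical launch configuration and output buffer; each thread walks a 4096-entry local array so successive writes land in distinct local-memory locations):

```cuda
__global__ void test_local(int *out)
{
    int data[4096];           // too large for registers: lives in local memory
    int cx = 0;
    for (int count = 0; count < 512; count++)
    {
        data[cx] = count;
        cx = (cx + 1) & 4095; // wrap the index over the whole array
    }
    // one observable store per thread keeps the kernel alive
    out[blockIdx.x * blockDim.x + threadIdx.x] = data[0];
}

// Launch with many threads to actually load the memory bus, e.g.:
//   test_local<<<6, 128>>>(d_out); // 768 threads, as mentioned above
```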

Thanks, I am a newbie. Thanks for the advice.