__shared__ memory confused me.

The NVIDIA Programming Guide says in section 3.2 that global, local, and texture memory have the greatest access latency, followed by constant memory, registers, and shared memory.

But check this out:

1. The first time:

__global__ void test()
{
    __shared__ int data[4]; // shared memory
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

It took 0.037952 milliseconds.

2. The second:

__global__ void test()
{
    int data[4]; // not shared memory any more
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

And this time it took only 0.007008 milliseconds. Isn't shared memory supposed to have lower latency than this? What happened? Any ideas? Thanks!

The full code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>
#include <template_kernel.cu>
#define SIZE 1048577

__global__ void test()
{
    __shared__ int data[4];
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
}

int main()
{
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    test<<<1, 1>>>(); // one thread; is there any chance this causes a bank conflict?

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("%f\n", time);
    return 0;
}

Your comparison method is not sound.
The compiler will eliminate variables and operations whose results are never used.

Open64 has a very aggressive dead code removal algorithm. I am willing to bet that your second kernel is compiled to an empty stub which does absolutely nothing.
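To see the point concretely, here is a minimal sketch (hypothetical names, not the poster's code) of how to give the kernel an observable side effect so the compiler cannot discard the loop:

```cuda
// Sketch: write the loop's result to global memory so that
// nvcc/Open64 cannot eliminate the whole body as dead code.
// "out" is a hypothetical device pointer allocated by the host.
__global__ void test_kept(int *out)
{
    int data[4];
    int count = 0;
    for (; count < 512; count++)
    {
        data[0] = count;
    }
    *out = data[0]; // result escapes the kernel, so the loop must run
}
```

Without the final store, nothing the kernel computes is observable, and the compiler is free to reduce it to an empty stub.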

Thanks, you guys. I will try to find a more reasonable way to prove it to myself.

Anyway, data[4] is only ever accessed with a constant index: data[0].

In this case nvcc may optimize it into Scalar Processor registers, so the test is meaningless. To actually exercise local memory, you could do something like this:

int cx = 0;

data[cx] = …
cx ^= 1; // alternate between 0 and 1 as simply as possible (1 GPU cycle), forcing use of local memory

Problem solved.

After I passed an argument to the kernel to store the sum and checked it on the CPU, the compiler finally decided the job was no longer meaningless.

At the same time, using a dynamic index into the data array is also important for this test. Thanks, iAPX.

shared:

shared.gif

local:

local.gif

test.txt (878 Bytes)
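A minimal version of the fix described above might look like this (a sketch under the thread's assumptions, not the attached test.txt; error checking omitted):

```cuda
#include <cstdio>

__global__ void test_fixed(int *sum)
{
    int data[2];          // local array
    int cx = 0;
    int acc = 0;
    for (int count = 0; count < 512; count++)
    {
        data[cx] = count; // dynamic index, so the array cannot be
                          // promoted to registers
        acc += data[cx];
        cx ^= 1;          // alternate between 0 and 1
    }
    *sum = acc;           // checked on the CPU, so nothing is dead code
}

int main()
{
    int *d_sum, h_sum = 0;
    cudaMalloc(&d_sum, sizeof(int));
    test_fixed<<<1, 1>>>(d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_sum); // a sensible sum proves the loop really ran
    cudaFree(d_sum);
    return 0;
}
```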

Your results for local memory are surprisingly fast. I wonder if you are:

  • launching enough threads to saturate the memory bus
    (because one thread writing to the same exact place is not 768 threads writing to many places!)
  • using a larger data array, i.e. data[4096] with cx = (cx + 1) & 4095;
    to force the writes to go to different “local” memory locations

I think you should try it, because local memory is really, really slow!
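Combining both suggestions might look like the following sketch (hypothetical launch configuration and output buffer; each thread walks a 4096-entry local array so successive writes land in distinct local-memory locations):

```cuda
__global__ void test_local(int *out)
{
    int data[4096];           // too large for registers: lives in local memory
    int cx = 0;
    for (int count = 0; count < 512; count++)
    {
        data[cx] = count;
        cx = (cx + 1) & 4095; // wrap the index over the whole array
    }
    // one observable store per thread keeps the kernel alive
    out[blockIdx.x * blockDim.x + threadIdx.x] = data[0];
}

// Launch with many threads to actually load the memory bus, e.g.:
//   test_local<<<6, 128>>>(d_out); // 768 threads, as mentioned above
```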

Thanks, I am a newbie. Thanks for the advice.