Hi Everyone!
I have read in the CUDA Programming Guide and on many forums that CUDA local memory resides in global memory, and that reads and writes are equally fast for both.
However, I wrote some code that takes 8 seconds longer when it uses global memory than when it uses local memory (which runs in 2 seconds) on my Tesla.
Take a look at my code. I'm using 256 threads per block and 120 blocks.
My question is why the program runs faster with local memory than with global memory. From what I have read, the run time should be the same.
Any ideas?
#define SIZE (30000*2)
#define ARRAY_SIZE 500
typedef struct {
    int array[ARRAY_SIZE];
} array_t;
//Comment out the next line to use local memory
__device__ array_t data[SIZE];
__global__ void lazy_supervised_classification( )
{
    //Keep the next line commented out to use global memory; uncomment it to use local memory
    //array_t data[SIZE];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int max = 10;
    int indice = 0;
    int ind = idx*2;
    int temp = 0;
    //These nested loops are only here to waste time
    while (max > indice)
    {
        indice++;
        for (int rep = 0; rep < 30; rep++)
        {
            for (int i = 0; i < ARRAY_SIZE; i++)
            {
                temp = (temp/2)/5;
                data[ind+1].array[i] = 1;
                data[ind].array[i] = data[ind+1].array[i] * 1;
            }
        }
    }
}
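
For anyone who wants to reproduce the comparison, here is a minimal sketch of how a launch with 120 blocks of 256 threads might be timed using CUDA events. It is an assumption about a reasonable measurement setup, not the harness used for the numbers above. The kernel `busy_kernel` and the buffer `d_out` are hypothetical stand-ins and should be replaced with the kernel under test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread writes one element of a global buffer.
__global__ void busy_kernel(int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = idx;
}

int main()
{
    const int threads = 256;          // threads per block, as in the post
    const int blocks  = 120;          // number of blocks, as in the post
    const int n = threads * blocks;   // one output element per thread

    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the launch; the launch itself is asynchronous.
    cudaEventRecord(start);
    busy_kernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);

    // Wait until the kernel and the stop event have completed.
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Built with `nvcc`, the same harness can be kept unchanged while switching between the two variants of the kernel, so that only the memory placement differs between runs.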