Local Memory and Global Memory: about the access speed of local vs. global memory

Hi Everyone!

I have read in the CUDA Programming Guide and in many forums that CUDA local memory actually resides in global memory, and that the speed of reading/writing both of them is the same.

However, I wrote a code where, when I use global memory, the program takes 8 seconds more than when I use local memory (which runs in 2 seconds) on my Tesla.

Take a look at my code. I'm using 256 threads per block, and 120 blocks.

My doubt is why the program runs faster with local memory than with global memory. From what I have read, it should take the same time.

Any ideas?

#define SIZE 30000*2
#define ARRAY_SIZE 500

typedef struct {
    int array[ARRAY_SIZE];
} array_t;

// Comment out the next line to use local memory
__device__ array_t data[SIZE];

__global__ void lazy_supervised_classification()
{
    // Uncomment the next line to use local memory
    //array_t data[SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int max = 10;
    int indice = 0;
    int ind = idx * 2;
    int temp = 0;

    // These nested loops just run to waste time
    for (int i = 0; i < 30; i++) {
        for (int j = 0; j < ARRAY_SIZE; j++) {
            temp = (temp / 2) / 5;
            data[ind].array[j] = data[ind + 1].array[j] * 1;
        }
    }
}

Your example with local memory does not run at all. It would require about 115 MB of memory per thread (60,000 structs of 500 4-byte ints each), or roughly 29 GB to even run a single block of 256 threads.

Either CUDA returns errors which you do not see (because you do not check return codes), or the compiler has optimized away the whole calculation including the local memory, because it has no globally visible effect. I’d put my bets on the latter.
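To rule out the first possibility, every launch should be followed by an error check. A minimal sketch, assuming your kernel and launch configuration from above (the `CUDA_CHECK` macro name is my own, not from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-checking helper: prints any CUDA error with its location
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
        }                                                             \
    } while (0)

__global__ void lazy_supervised_classification();

int main()
{
    lazy_supervised_classification<<<120, 256>>>();
    CUDA_CHECK(cudaGetLastError());        // catches launch-time failures
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution
    return 0;
}
```

With the oversized local array, the launch or the synchronize should report an error instead of silently appearing to "run in 2 seconds".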

I’m a bit surprised the program still takes two seconds to run, but that might just be your general startup time.
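One way to test the dead-code-elimination theory is to give the kernel a globally visible side effect. A sketch under that assumption (the `out` parameter is my addition, not part of the original code):

```cuda
#define ARRAY_SIZE 500

// Same busy-work loops as the original kernel, but the result is written
// to global memory, so the compiler cannot optimize the work away.
__global__ void lazy_supervised_classification(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int temp = 0;

    for (int i = 0; i < 30; i++)
        for (int j = 0; j < ARRAY_SIZE; j++)
            temp = (temp / 2) / 5;

    out[idx] = temp;  // globally visible effect
}
```

If the local-memory version suddenly takes as long as (or fails like) the global-memory version once it has a visible output, the 2-second timing was measuring an empty kernel.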