Local memory size

Hi everyone.
Does anyone know the size of local memory?
The Nvidia_CUDA_Programming_Guide doesn’t say anything about local memory size.

Thank you very much.

Local memory is located in global memory space, so you’re limited by your on-card RAM, likely 256, 512, or 1024 megabytes.
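To see how much on-card RAM a particular card has, the CUDA runtime can be queried. The following is a minimal sketch, assuming the device of interest is passed in as an index; `cudaGetDeviceProperties` and `cudaMemGetInfo` are runtime API calls, while the helper name `printDeviceMemory` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): prints the total and currently
// free global memory of one device.
// Returns 0 on success, 1 if any CUDA call fails (for example, no GPU present).
int printDeviceMemory(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;
    if (cudaSetDevice(device) != cudaSuccess) return 1;

    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return 1;

    printf("%s: %.0f MB total global memory, %.0f MB free\n",
           prop.name,
           prop.totalGlobalMem / (1024.0 * 1024.0),
           freeBytes / (1024.0 * 1024.0));
    return 0;
}
```

Calling `printDeviceMemory(0)` from `main` reports the first device. The free figure is what is left after the driver and any existing allocations have taken their share, so it is usually smaller than the nominal card size.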

Thanks, SPWorley.

On my GeForce 8800 GT, global memory is 512 MB.

I use 100 MB of global memory for my data. My program has 2560 threads in total. Does that mean the maximum local memory for one thread is (512 MB - 100 MB) / 2560? Is that right?
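Taking the figures in the question at face value, the proposed bound works out as below. This is only the arithmetic of the question's own formula, assuming 1 MB = 1024 KB; it does not account for any other per-thread limit the hardware or driver may impose.

```latex
\[
\frac{512\,\text{MB} - 100\,\text{MB}}{2560\ \text{threads}}
= \frac{412 \times 1024\,\text{KB}}{2560}
= 164.8\,\text{KB per thread}
\]
```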

This is my sample program. Please help me.

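[codebox]// The kernel name was missing from the original post; localMemTest is a placeholder.
__global__ void localMemTest(short *globalData)
{
	unsigned int tx = blockDim.x * blockIdx.x + threadIdx.x;

	// Define an array in local memory, 10 KB per thread
	short localData[1024 * 5];

	// Every thread copies the same 5120 elements from global memory
	for (int i = 0; i < 1024 * 5; i++) {
		localData[i] = globalData[i];
	}

	// ... process something ...
}[/codebox]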

Each thread defines a 10 KB array in local memory.

The total number of threads is 2560.

The total local memory needed is 2560 * 10 KB = 25 MB.

I use a GeForce 8800 GT, so global memory is 512 MB.

In global memory, I use 2 MB for global data, copied from host to device.

But my program does not work: the compiler cannot generate the object file [*.obj].

It works fine now.

I had a mistake in my processing code.


I don’t know what you are trying to do, but it looks very wrong :blink: . “Local memory” is slow, since it is really global memory, so the code you wrote would be very slow, and I don’t know if the compiler even tries to coalesce local memory reads and writes…

Thanks, erdooom.

I know that accessing data in “local memory” and “global memory” is very slow, and that the accesses may be uncoalesced.

But sometimes we must use it, because we have no other way.

This is just a tutorial exercise of mine.

The answer is yes and no.

Most of the time you don’t need that much local memory. If you are in a situation that uses a lot of local memory, you should redesign your algorithm. Otherwise, you normally won’t see any improvement over the CPU version, only more complication.

In your example, the gather is not coalesced: all threads read exactly the same input, and in my experience that is slower than a coalesced read. Why don’t you use shared memory as a temporary staging area and later write to your local memory, so that you have coalesced reads and writes? That would be much faster.
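One way to read the suggestion above is sketched here. It is an illustration under stated assumptions, not the poster's code: the kernel name `stagedCopy`, the host helper `runStagedCopy`, the tile size `TILE`, the launch shape, and the checksum used as stand-in processing are all made up. Each block loads a tile of the input cooperatively, so consecutive threads read consecutive elements, and then every thread copies the staged tile into its private array. Whether the hardware actually coalesces these 16-bit loads depends on the device's compute capability, and the compiler decides where the private array finally lives.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define N_ELEMS (1024 * 5)  // elements per thread, as in the post
#define TILE 256            // assumed staging tile size; N_ELEMS is a multiple of it

__global__ void stagedCopy(const short *globalData, int *out)
{
    __shared__ short tile[TILE];
    short localData[N_ELEMS];   // large per-thread array, as in the post

    for (int base = 0; base < N_ELEMS; base += TILE) {
        // Cooperative load: consecutive threads read consecutive elements.
        for (int j = threadIdx.x; j < TILE; j += blockDim.x)
            tile[j] = globalData[base + j];
        __syncthreads();

        // Each thread copies the staged tile into its own private array.
        for (int j = 0; j < TILE; j++)
            localData[base + j] = tile[j];
        __syncthreads();        // the tile is overwritten in the next iteration
    }

    // Stand-in for the real processing: sum the private copy.
    int sum = 0;
    for (int i = 0; i < N_ELEMS; i++)
        sum += localData[i];
    out[blockDim.x * blockIdx.x + threadIdx.x] = sum;
}

// Hypothetical host driver: returns 0 if every thread produced the expected
// sum, 1 on a mismatch, -1 if a CUDA call fails (for example, no GPU present).
int runStagedCopy()
{
    const int blocks = 2, threads = 64, total = blocks * threads;
    std::vector<short> h_in(N_ELEMS);
    int expected = 0;
    for (int i = 0; i < N_ELEMS; i++) {
        h_in[i] = (short)(i % 7);
        expected += h_in[i];
    }

    short *d_in = NULL;
    int *d_out = NULL;
    if (cudaMalloc(&d_in, N_ELEMS * sizeof(short)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, total * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), N_ELEMS * sizeof(short), cudaMemcpyHostToDevice);

    stagedCopy<<<blocks, threads>>>(d_in, d_out);

    std::vector<int> h_out(total);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < total; i++)
        if (h_out[i] != expected) return 1;
    return 0;
}
```

The shared tile costs only 512 bytes per block here, so it stays well inside a small shared memory budget while the bulk of the data still ends up in the per-thread array.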

Thanks, Linh Ha.

Most of the time we don’t need to use local memory, because the program will run too slowly and the data accesses (reads and writes) may be uncoalesced.

But sometimes we must use it because we need more than 16 KB per block. If the data is read-only, we can use texture or constant memory.

This is my tutorial to confirm that “local memory” can be as big as “global memory”. “Local memory” is physically located in “global memory”, but its scope is logically local to each thread.
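A possible way to check how much local memory a kernel really uses per thread is to ask the runtime. The sketch below is an assumption-laden illustration: `localArrayKernel` and `localBytesPerThread` are made-up names, and the kernel only mimics the 10 KB array from the sample program. `cudaFuncGetAttributes` and the `localSizeBytes` field of `cudaFuncAttributes` are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

#define N_ELEMS (1024 * 5)  // 5120 shorts = 10 KB per thread, as in the post

// Placeholder kernel with a per-thread array that is indexed at run time,
// which the compiler typically places in local memory.
__global__ void localArrayKernel(const short *in, int *out)
{
    short localData[N_ELEMS];
    for (int i = 0; i < N_ELEMS; i++)
        localData[i] = in[i];

    // Data-dependent index so the array is not trivially optimized away.
    unsigned int tx = blockDim.x * blockIdx.x + threadIdx.x;
    out[tx] = localData[(tx + (unsigned int)in[0]) % N_ELEMS];
}

// Hypothetical helper: returns the per-thread local memory size in bytes
// reported for localArrayKernel, or -1 if the query fails.
long localBytesPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, localArrayKernel) != cudaSuccess)
        return -1;
    return (long)attr.localSizeBytes;
}
```

The exact number depends on the compiler version and optimization settings. Compiling with `nvcc --ptxas-options=-v` reports per-thread memory usage at build time as well.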

So in this situation, my answer is yes. I need to use local memory despite the uncoalesced accesses (for my tutorial).

Yes, shared memory is the best way to read/write data if we don’t get bank conflicts. But the size of shared memory is 16 KB per block, so sometimes it is not enough for our kernel.

Thank you very much. :)