Local memory size

Hi everyone.
Does anyone know the size of local memory?
The NVIDIA CUDA Programming Guide doesn't say anything about the local memory size.

Thank you very much.

Local memory is located in global memory space, so you’re limited by your on-card RAM, likely 256, 512, or 1024 megabytes.
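Since local memory lives in the same on-card DRAM as global memory, one way to see the bound in practice is to query the device. A minimal host-side sketch, assuming a CUDA-capable device and the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t freeBytes = 0, totalBytes = 0;

    // cudaMemGetInfo reports free and total memory on the current device;
    // local memory is carved out of this same on-card DRAM.
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device memory: %zu MB total, %zu MB free\n",
           totalBytes >> 20, freeBytes >> 20);
    return 0;
}
```

On a 512 MB card such as the 8800 GT, the total reported here is the hard ceiling on global plus local memory combined.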

Thanks, SPWorley.

On the GeForce 8800 GT, global memory is 512 MB.

I use 100 MB of global memory to allocate my data. My program has 2560 threads in total, which means the maximum local memory for one thread

is (512 MB − 100 MB)/2560. Is that right?

This is my sample program; please help me.

[codebox]// kernel name added for illustration; the original post omitted one
__global__ void copyKernel(short *globalData)
{
    unsigned int tx = blockDim.x * blockIdx.x + threadIdx.x;

    // define an array in local memory: 10 KB per thread
    short localData[1024 * 5];

    for (int i = 0; i < 1024 * 5; i++) {
        localData[i] = globalData[i];
    }

    // ... process something ...
}[/codebox]

Each thread defines a 10 KB array in local memory.

With 2560 threads in total,

the total local memory needed is 2560 * 10 KB = 25 MB.

I used the GeForce 8800 GT, so global memory is 512 MB.

In global memory, I used 2 MB to allocate the global data and copy it from host to device.

But my program does not work: the compiler cannot generate the object file [*.obj].

It works fine now.

I had something wrong in the processing code.

:rolleyes:

I don't know what you are trying to do, but it looks very wrong :blink: . "Local memory" is slow, since it is really global memory, so the code you wrote would be very, very slow, and I don't know whether the compiler even tries to coalesce local memory reads and writes…

Thanks, erdooom.

I know that accessing data in "local memory" and "global memory" is very slow, and the accesses may be uncoalesced.

But sometimes we must use it, because we have no other way.

This is just my tutorial.

The answer is yes and no.

Most of the time you don't need that much local memory; if you are in a situation that uses a lot of local memory, you should redesign your algorithm. Otherwise you normally won't see any improvement over the CPU version, just more complication.

In your example, the gather is not coalesced: all threads read exactly the same input, and from my own experience that is slower than a coalesced read. Why don't you use shared memory as a staging area and later write to your local memory, so that the reads and writes are coalesced? That would be much faster.
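A sketch of that suggestion (the kernel name and tile size are illustrative, not from the original post): each block first stages a tile of the input into shared memory with coalesced reads, then each thread copies from shared memory into its own local array. It assumes the block size equals the tile size.

```cuda
#define ELEMS_PER_THREAD (1024 * 5)  // 10 KB of shorts per thread, as in the post
#define TILE 256                     // threads per block; also the staging tile size

// Hypothetical kernel name; the original post did not give one.
__global__ void stagedCopy(const short *globalData)
{
    __shared__ short tile[TILE];
    short localData[ELEMS_PER_THREAD];  // spilled to local (off-chip) memory

    for (int base = 0; base < ELEMS_PER_THREAD; base += TILE) {
        // Coalesced read: consecutive threads load consecutive elements.
        tile[threadIdx.x] = globalData[base + threadIdx.x];
        __syncthreads();

        // Each thread then copies the staged tile into its own local array.
        for (int j = 0; j < TILE; j++)
            localData[base + j] = tile[j];
        __syncthreads();
    }

    // ... process localData ...
}
```

The coalescing benefit here is only on the global-memory read; the writes into local memory are still per-thread and off-chip, which is why redesigning to avoid the big local array is usually the better fix.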

Thanks, Linh Ha.

Most of the time we don't need to use local memory, because the program will run too slowly and the data accesses (reads and writes) may be uncoalesced.

But sometimes we must use it because we need more than 16 KB per block; if the data is read-only we can use texture or constant memory instead.

This is my tutorial to confirm that the size of "local memory" is as big as the size of "global memory": "local memory" is physically located in "global memory", but its scope is logically local to each thread.

So in this situation my answer is yes: I need to use local memory despite the uncoalesced accesses (for my tutorial).

Yes, shared memory is the best way to read/write data if we don't get bank conflicts. But shared memory is only 16 KB per block, so sometimes it is not enough for our kernel.

Thank you very much. :)