I have searched for information about the different memories, but I found nothing in the documentation (maybe I didn’t look in the right place).
My question is: what are the sizes of the different memories (shared, local and global), and what are their access times?
Shared memory is 16 kilobytes per multiprocessor. It’s fast as long as there are no bank conflicts. See the Programming Manual on this.
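For illustration, here is a minimal sketch of a kernel staging data in shared memory (the kernel name and the 256-thread block size are just assumptions for the example):

    __global__ void reverseBlock(float *d_out, const float *d_in)
    {
        // One tile per block, staged in fast on-chip shared memory
        // (assumes the kernel is launched with 256 threads per block).
        __shared__ float tile[256];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d_in[idx];

        // Wait until every thread of the block has filled its slot.
        __syncthreads();

        // Each thread reads a different slot than it wrote; without the
        // barrier above this would be a race.
        d_out[idx] = tile[blockDim.x - 1 - threadIdx.x];
    }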
The amount of global memory depends on the card used; it may be 256, 512 or 768 MB, or 1.5 GB (Tesla). But keep in mind that you cannot utilize all of that memory. The display driver requires some memory for storing the display buffer, and the CUDA runtime probably has some memory footprint too.
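You can query both numbers at runtime. A small sketch (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("Total global memory:     %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);

        // How much is actually still available once the display driver
        // and the runtime have taken their share:
        size_t freeMem, totalMem;
        cudaMemGetInfo(&freeMem, &totalMem);
        printf("Free global memory:      %lu bytes\n", (unsigned long)freeMem);
        return 0;
    }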
Global memory is slow: it has a latency of 400-600 clock cycles. However, running many threads effectively hides this latency.
Local memory is the same as global memory (slow…), but it’s allocated per thread.
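You don’t allocate local memory explicitly; the compiler puts per-thread data there when it cannot keep it in registers. A sketch of the typical case (exact placement is up to the compiler, so treat this as an assumption):

    __global__ void localMemExample(float *d_out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // A large per-thread array, indexed with a value that is not
        // known at compile time, typically ends up in local memory,
        // which is as slow as global memory.
        float scratch[64];
        for (int i = 0; i < 64; ++i)
            scratch[i] = (float)(idx + i);

        if (idx < n)
            d_out[idx] = scratch[idx & 63];   // dynamic index forces local memory
    }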
It’s all in the Programming Manual. Read it carefully. This may save you from rewriting poorly performing code later.
I have trouble telling the difference between global and device memory. Does device memory depend on global memory?
And what about the other memories (shared, texture, constant): are they created (allocated) from global memory, or are they independent? (Or virtual?)
And in the programming guide I have read something about the constant cache: what is the difference between the cache and constant memory? (Size? Access time? Location?)
All these notions are a bit abstract for me, and a friend and I are not sure we understand the programming guide, so I prefer to ask you.
Global memory and device memory are just different names for the same memory.
Shared memory is on-chip (not part of global memory). Textures and constant memory reside in the global memory space.
The constant cache is exactly what it claims to be: a cache for constant memory =) It allows fast access to commonly used constant values. Its size is 8 KB per multiprocessor, if I remember correctly. From the programmer’s point of view there is no difference between the constant cache and constant memory: you just use constant memory, and accessed values are cached automatically. The size of constant memory is 64 KB.
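A usage sketch (the array name and size are made up):

    // Lives in the 64 KB constant memory space; reads that hit the
    // per-multiprocessor constant cache are fast.
    __constant__ float c_coeffs[16];

    __global__ void applyCoeffs(float *d_data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_data[idx] *= c_coeffs[idx & 15];   // cached constant read
    }

    // On the host, before launching:
    //   float h_coeffs[16] = { /* ... */ };
    //   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));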
It’s fast even if there are bank conflicts. Even with 16-way bank conflicts, shared memory is dozens of times faster than global memory.
I say this because many people get too worried about bank conflicts. Optimize for bank conflicts last, especially if they are only 2- or 4-way conflicts, which may take more instructions to optimize away than they cost anyway.
The most important optimization is usually getting your global memory accesses coalesced.
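Roughly, coalescing means consecutive threads touching consecutive addresses. A contrived sketch of the two access patterns:

    __global__ void coalescedCopy(float *d_out, const float *d_in)
    {
        // Thread k reads address k: neighbouring threads hit neighbouring
        // words, so the hardware combines them into wide transactions.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_out[idx] = d_in[idx];
    }

    __global__ void stridedCopy(float *d_out, const float *d_in, int stride)
    {
        // Thread k reads address k*stride: accesses are scattered, the
        // hardware issues many separate transactions, and bandwidth drops.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_out[idx] = d_in[idx * stride];
    }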
My last question is about the size of registers. I have read some information about the maximum number per multiprocessor, and about bank conflicts, but I don’t see their size anywhere!
So what is the size of the registers? And when do we use them? Do we use them to store variables?
The maximum number of registers per thread is 128, but usually you want to use as few as possible, preferably around 10-20. Otherwise, the occupancy becomes very low.
Okay, I had already understood that point. But I want to know their size, and to understand in which cases it is more interesting to use these registers!
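For what it’s worth: each register on these GPUs is 32 bits wide, and you don’t choose to use them explicitly; the compiler places scalar automatic variables in registers for you. A sketch:

    __global__ void saxpy(float *y, const float *x, float a, int n)
    {
        // idx and tmp are ordinary automatic variables; the compiler
        // normally keeps them in registers, with no action from you.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
        {
            float tmp = a * x[idx];   // held in a register
            y[idx] = tmp + y[idx];
        }
    }

    // Compiling with "nvcc --ptxas-options=-v" reports how many
    // registers per thread the compiler actually used.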
Any memory that resides on the device is “device memory”. This memory is inside your PCI-X/E NVIDIA card. The PTX ISA document that I downloaded from the UIUC site differentiates memory into two types: “host memory” and “device memory”. “Host memory” is the RAM available to the CPU of your system. “Device memory” encompasses “global memory”, “shared memory”, “local memory”, “texture memory”, “constant memory” and so on.
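In code, the split looks like this (a minimal sketch, error checking omitted):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1024;

        // Host memory: ordinary system RAM, visible to the CPU.
        float *h_data = (float *)malloc(n * sizeof(float));

        // Device memory: allocated in the card's global memory space.
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        // Explicit copies move data between the two across the bus.
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }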
Nope. Global memory is a big chunk of memory which mainly serves as the “frame buffer” for the graphics device. The graphics card has other sections of memory like “texture” and “constant” memory and so on. Note that “texture” memory is writable by the host CPU, but the GPU cannot write into it: it is read-only to the GPU.
Now, the memories are not equidistant from the GPU cores (or the ALUs), which are the main compute engine of the GPU. The “global memory”, or frame buffer, is very slow from the perspective of the GPU. A good programmer must schedule enough threads so that when one block is stalled on a global memory access, the multiprocessor can switch to another block which does computation. That is how you achieve performance.
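In practice that just means launching far more blocks than there are multiprocessors, e.g. (myKernel, d_data and n are placeholders):

    // With hundreds of blocks in flight, whenever one block stalls on a
    // 400-600 cycle global memory read, the multiprocessor can switch
    // to another block that is ready to compute.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);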
Also, Mark was referring to “coalescing” global memory accesses. I am not sure what it is. Maybe if you look into the manual, you will know better.
The “texture” and “constant” memories appear faster because the ALUs have caches for both of them. Note that the ALUs do NOT have a cache for “global memory” or the “frame buffer”.
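A sketch of reading global memory through the texture cache, using the texture-reference API (the names here are made up; check the manual for the exact incantation):

    // File-scope texture reference; d_in gets bound to it on the host.
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void readThroughTexture(float *d_out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_out[idx] = tex1Dfetch(texRef, idx);   // goes through the texture cache
    }

    // On the host, before launching:
    //   cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));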
I have no idea what you mean by “virtual”. I think you are having a CPU MMU hangover. All these memories are within the GPU and are seen by the GPU’s computation units. There are no page tables or anything for these memories from the GPU’s standpoint.
Just the same difference as between your L1 cache and main memory (RAM). I have no data on the size and access time of the caches.
Please note that I am relatively new to CUDA, so I might have erred somewhere. I hope that the knowledgeable people on this forum will correct this if there are any errors.