I have searched for information about the different memories, but I found nothing in the documentation (maybe I didn’t look in the right place).
My question is: what are the sizes of the different memories (shared, local and global), and what are their access times?
Shared memory is 16 kilobytes per multiprocessor. It’s fast as long as there are no bank conflicts. See the Programming Manual on this.
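For illustration, here is a minimal sketch of a kernel staging data in shared memory (the kernel name and the 256-thread block size are just assumptions for the example):

    __global__ void reverseBlock(float *d_out, const float *d_in)
    {
        // One tile per block, staged in fast on-chip shared memory
        // (assumes the kernel is launched with 256 threads per block).
        __shared__ float tile[256];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d_in[idx];

        // Wait until every thread of the block has filled its slot.
        __syncthreads();

        // Each thread reads a different slot than it wrote; without the
        // barrier above this would be a race.
        d_out[idx] = tile[blockDim.x - 1 - threadIdx.x];
    }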
The amount of global memory depends on the card used; it may be 256, 512 or 768 MB, or 1.5 GB (Tesla). But keep in mind that you cannot utilize all of that memory. The display driver requires some memory for storing the display buffer, and the CUDA runtime probably has some memory footprint too.
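You can query both numbers at runtime. A small sketch (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("Total global memory:     %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);

        // How much is actually still available once the display driver
        // and the runtime have taken their share:
        size_t freeMem, totalMem;
        cudaMemGetInfo(&freeMem, &totalMem);
        printf("Free global memory:      %lu bytes\n", (unsigned long)freeMem);
        return 0;
    }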
Global memory is slow: it has a latency of 400-600 clock cycles. However, running many threads effectively hides this latency.
Local memory is the same as global memory (slow…), but it’s allocated per thread.
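You don’t allocate local memory explicitly; the compiler puts per-thread data there when it cannot keep it in registers. A sketch of the typical case (exact placement is up to the compiler, so treat this as an assumption):

    __global__ void localMemExample(float *d_out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // A large per-thread array, indexed with a value that is not
        // known at compile time, typically ends up in local memory,
        // which is as slow as global memory.
        float scratch[64];
        for (int i = 0; i < 64; ++i)
            scratch[i] = (float)(idx + i);

        if (idx < n)
            d_out[idx] = scratch[idx & 63];   // dynamic index forces local memory
    }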
It’s all in the Programming Manual. Read it carefully. This may save you from rewriting poorly performing code later.
I have trouble telling the difference between global and device memory. Does device memory depend on global memory?
And what about the other memories (shared, texture, constant): are they created (allocated) from global memory, or are they independent? (Or virtual?)
And in the programming guide I have read something about the constant cache: what is the difference between the cache and constant memory? (Size? Access time? Location?)
All these notions are a bit abstract for me, and a friend and I are not sure we understand the programming guide, so I prefer to ask you.
Global memory and device memory are just different names for the same memory.
Shared memory is on-chip (not part of global memory). Textures and constant memory reside in the global memory space.
The constant cache is exactly what it claims to be: a cache for constant memory =) It allows fast access to commonly used constant values. Its size is 8 KB per multiprocessor, if I remember correctly. From the programmer’s point of view there is no difference between the constant cache and constant memory: you just use constant memory, and accessed values are cached automatically. The size of constant memory is 64 KB.
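A usage sketch (the array name and size are made up):

    // Lives in the 64 KB constant memory space; reads that hit the
    // per-multiprocessor constant cache are fast.
    __constant__ float c_coeffs[16];

    __global__ void applyCoeffs(float *d_data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_data[idx] *= c_coeffs[idx & 15];   // cached constant read
    }

    // On the host, before launching:
    //   float h_coeffs[16] = { /* ... */ };
    //   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));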
It’s fast even if there are bank conflicts. Even with 16-way bank conflicts, shared memory is dozens of times faster than global memory.
I say this because many people get too worried about bank conflicts. Optimize for bank conflicts last, especially if they are only 2- or 4-way conflicts, which may take more instructions to optimize away than they cost anyway.
The most important optimization is usually getting your global memory accesses coalesced.
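Roughly, coalescing means consecutive threads touching consecutive addresses. A contrived sketch of the two access patterns:

    __global__ void coalescedCopy(float *d_out, const float *d_in)
    {
        // Thread k reads address k: neighbouring threads hit neighbouring
        // words, so the hardware combines them into wide transactions.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_out[idx] = d_in[idx];
    }

    __global__ void stridedCopy(float *d_out, const float *d_in, int stride)
    {
        // Thread k reads address k*stride: accesses are scattered, the
        // hardware issues many separate transactions, and bandwidth drops.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_out[idx] = d_in[idx * stride];
    }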
My last question is about the size of registers. I have read some information about the maximum number per multiprocessor, and about bank conflicts, but I don’t see their size anywhere!
So what is the size of the registers? And when do we use them? Do we use them to store variables?
The maximum number of registers per thread is 128, but usually you want to use as few as possible, preferably around 10-20. Otherwise, the occupancy becomes very low.
Okay, I had already understood that point. But I want to know their size, and to understand in which cases it is more interesting to use these registers!
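For what it’s worth: each register on these GPUs is 32 bits wide, and you don’t choose to use them explicitly; the compiler places scalar automatic variables in registers for you. A sketch:

    __global__ void saxpy(float *y, const float *x, float a, int n)
    {
        // idx and tmp are ordinary automatic variables; the compiler
        // normally keeps them in registers, with no action from you.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
        {
            float tmp = a * x[idx];   // held in a register
            y[idx] = tmp + y[idx];
        }
    }

    // Compiling with "nvcc --ptxas-options=-v" reports how many
    // registers per thread the compiler actually used.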
Any memory that resides on the device is “device memory”. This memory is inside your PCI-X/E NVIDIA card. The PTX ISA document that I downloaded from the UIUC site differentiates memory into two types: “host memory” and “device memory”. “Host memory” is the RAM available to the CPU of your system. “Device memory” encompasses “global memory”, “shared memory”, “local memory”, “texture memory”, “constant memory” and so on.
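In code, the split looks like this (a minimal sketch, error checking omitted):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1024;

        // Host memory: ordinary system RAM, visible to the CPU.
        float *h_data = (float *)malloc(n * sizeof(float));

        // Device memory: allocated in the card's global memory space.
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        // Explicit copies move data between the two across the bus.
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }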
Nope. Global memory is a big chunk of memory which mainly serves as the “frame buffer” for the graphics device. The graphics card has other sections of memory like “texture” and “constant” memory and so on. Note that “texture” memory is writable by the host CPU, but the GPU cannot write into it: it is read-only to the GPU.
Now, the memories are not equidistant from the GPU cores (or the ALUs), which are the main compute engine of the GPU. The “global memory”, or frame buffer, is very slow from the perspective of the GPU. A good programmer must schedule enough threads so that when one block is stalled on a global memory access, the multiprocessor can switch to another block which does computation. That is how you achieve performance.
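In practice that just means launching far more blocks than there are multiprocessors, e.g. (myKernel, d_data and n are placeholders):

    // With hundreds of blocks in flight, whenever one block stalls on a
    // 400-600 cycle global memory read, the multiprocessor can switch
    // to another block that is ready to compute.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);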
Also, Mark was referring to “coalescing” global memory accesses. I am not sure what it is. Maybe if you look into the manual, you will know better.
The “texture” and “constant” memories appear faster because the ALUs have caches for both of them. Note that the ALUs do NOT have a cache for “global memory” or the “frame buffer”.
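A sketch of reading global memory through the texture cache, using the texture-reference API (the names here are made up; check the manual for the exact incantation):

    // File-scope texture reference; d_in gets bound to it on the host.
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void readThroughTexture(float *d_out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_out[idx] = tex1Dfetch(texRef, idx);   // goes through the texture cache
    }

    // On the host, before launching:
    //   cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));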
I have no idea what you mean by “virtual”. I think you are having a CPU MMU hangover. All these memories are within the GPU and are seen by the GPU’s computation units. There are no page tables or anything for these memories from the GPU’s standpoint.
Just the same difference as between your L1 cache and main memory (RAM). I have no data on the size and access time of the caches.
Please note that I am relatively new to CUDA, so I might have erred somewhere. I hope that the knowledgeable people on this forum will correct this if there are any errors.