Memory confusion: how big is local/shared/global memory?

First, thanks to everyone who answered my last questions.

Now a new question:

Have a look at the attached screenshot.

Now let’s consider I have a computer with the following specifications:

-RAM: 640 MB

-Graphics card: GeForce 8400 GS 512 MB

OK, that’s not the best hardware, but let’s use it for now.

Now my question:

How big is the global memory (in MB), how big is the shared memory, and how big is the local memory?

As far as I have understood the CUDA documentation, you should work in shared memory, as it is the fastest. But how much shared memory do I have?

Can I load data from the RAM directly into the shared memory?

In the matrix example (NVIDIA_CUDA_Programming_Guide_2.0.pdf, page 81 according to Adobe Viewer’s page index), the CUDA authors do the following:

[codebox]
// Load A and B to the device
float* Ad;
size = hA * wA * sizeof(float);
cudaMalloc((void**)&Ad, size);
cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&Bd, size);
cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

[…]

// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Shared memory for the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Shared memory for the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from global memory to shared memory;
    // each thread loads one element of each matrix
    As[ty][tx] = A[a + wA * ty + tx];
    Bs[ty][tx] = B[b + wB * ty + tx];
[/codebox]

So do I have to do two memory copies to get my data into shared memory? Doesn’t it work from global to shared memory directly?

Thanks!!!

No, you need to load data into shared memory yourself. Please read the programming guide; it is quite good and explains all of this. The fact that you have 16 kB of shared memory per multiprocessor is also in the guide.
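You can also query these sizes at runtime instead of looking them up. A minimal host-side sketch using the CUDA runtime API (device 0 is an assumption, and error checking is omitted):

[codebox]
// Sketch: query memory sizes of device 0 via the runtime API.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Global memory:        %lu bytes\n", (unsigned long)prop.totalGlobalMem);
    printf("Shared mem per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("Constant memory:      %lu bytes\n", (unsigned long)prop.totalConstMem);
    return 0;
}
[/codebox]

On your 8400 GS this should report the card’s 512 MB as global memory and 16 kB of shared memory per block.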

OK, 16 KB.
But can’t it be that the copying plus the work in shared memory is slower than accessing the device memory without copying (when you don’t do too many operations)? How can I know which is faster: no copying and accessing device memory, or copying and then accessing shared memory?
Sorry, I have read the programming guide, but I prefer asking other human beings, as they often know more than the combination of me and the programming guide ;)

Sorry, I didn’t look into the programming guide again, but how big is the global memory?
Is it the 640 MB or the 512 MB?
If it is the card memory, do I have to subtract the shared memory from the total 512 MB to get the global memory?
Or is the shared memory inside the GPU itself?

Really, the programming guide explains quite well how the hardware is organized. It will genuinely help you understand how it works. You can ask lots of questions, but you will still not cover everything in the programming guide. If things in the guide are unclear, you can still ask here.

As to when to use shared memory: good examples are when you need a fast cache (like the matrixMul example, where each element is accessed multiple times), or when you need threads to cooperate to achieve a result (like in the reduction example).
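As a sketch of the cooperative case, here is a minimal block-wide sum reduction in shared memory (the kernel and names are illustrative, not taken from the SDK example):

[codebox]
// Sketch: block-wide sum reduction using shared memory (illustrative).
#define BLOCK_SIZE 256

__global__ void blockSum(const float* in, float* out)
{
    __shared__ float s[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    // Each thread copies its element from global to shared memory itself;
    // there is no implicit global-to-shared transfer.
    s[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction inside the block; shared memory makes the
    // repeated accesses cheap.
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];
}
[/codebox]

Each element of `s` is read several times during the loop, which is exactly the access pattern where the copy into shared memory pays off; if each thread touched its element only once, you could just read global memory directly.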

So what is the maximum size of the data that can be loaded onto the device? If the size of the constant memory is 64 KB, then the maximum that can be loaded there is 64 KB, and if I have to load more data, I will have to load it in chunks, is that right? Also, is there a way to put the texture memory to general use if my program does not involve any textures?
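Yes, for constant memory specifically you would refill it in chunks between kernel launches. A rough sketch under that assumption (the kernel itself and all names are illustrative):

[codebox]
// Sketch: processing an input larger than the 64 KB constant memory
// by refilling it chunk by chunk. Kernel launch is assumed/omitted.
#include <cuda_runtime.h>

#define CHUNK_FLOATS (64 * 1024 / sizeof(float))  // one 64 KB chunk

__constant__ float cChunk[CHUNK_FLOATS];

void processLargeInput(const float* host, size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t count = (n - off < CHUNK_FLOATS) ? (n - off) : CHUNK_FLOATS;

        // Refill constant memory with the next chunk of host data...
        cudaMemcpyToSymbol(cChunk, host + off, count * sizeof(float));

        // ...then launch a kernel that reads cChunk (not shown):
        // myKernel<<<grid, block>>>(count);
        cudaThreadSynchronize();
    }
}
[/codebox]

Note that global memory has no such limit; the 64 KB cap applies only to the `__constant__` space.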

Texture memory is ordinary device memory; there is just a cache in between (which can only be used via texture fetches).
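So you can read plain linear device memory through the texture cache. A sketch with the 2.x-era texture reference API (names are illustrative):

[codebox]
// Sketch: reading an ordinary global-memory buffer through the
// texture cache (CUDA 2.x-era texture reference API).
texture<float, 1, cudaReadModeElementType> tex;

__global__ void addOne(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i) + 1.0f;  // cached read
}

void run(float* dIn, float* dOut, int n)
{
    // Bind the existing global-memory buffer to the texture reference.
    cudaBindTexture(NULL, tex, dIn, n * sizeof(float));
    addOne<<<(n + 255) / 256, 256>>>(dOut, n);
    cudaUnbindTexture(tex);
}
[/codebox]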

Figure 3-1 of the programming guide shows how it is organized.