Memory confusion: how big is local/shared/global memory?

First, thanks to everyone who answered my last questions.

Now a new question:

Have a look at the attached screenshot.

Now let’s consider I have a computer with the following specifications:

-RAM: 640 MB

-Graphics card: GeForce 8400 GS 512 MB

OK, that’s not the best hardware, but let’s use it for now.

Now my question:

How big is the global memory (in MB), how big is the shared memory, and how big is the local memory?

As far as I have understood the CUDA documentation, you should work in shared memory, as it is the fastest. But how much shared memory do I have?

Can I load data from the RAM directly into the shared memory?

In the matrix example (NVIDIA_CUDA_Programming_Guide_2.0.pdf, page 81 according to Adobe Viewer’s page index), the CUDA authors do the following:

[codebox]
// Load A and B to the device
float* Ad;
size = hA * wA * sizeof(float);
cudaMalloc((void**)&Ad, size);
cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&Bd, size);
cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

[…]

// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Shared memory for the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Shared memory for the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from global memory to shared memory;
    // each thread loads one element of each matrix
    As[ty][tx] = A[a + wA * ty + tx];
    Bs[ty][tx] = B[b + wB * ty + tx];
[/codebox]

So do I have to do two memory copies to get my data into shared memory? Doesn’t it work from global to shared memory directly?

Thanks!!!

No, you need to load data into shared memory yourself. Please read the programming guide; it is quite good and explains all of this. The fact that you have 16 kB of shared memory per multiprocessor is also in the guide.
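You can also query these sizes at runtime instead of looking them up. A minimal host-side sketch using the CUDA runtime API (device 0 is an assumption, and error checking is omitted):

[codebox]
// Sketch: query memory sizes of device 0 via the runtime API.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Global memory:        %lu bytes\n", (unsigned long)prop.totalGlobalMem);
    printf("Shared mem per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("Constant memory:      %lu bytes\n", (unsigned long)prop.totalConstMem);
    return 0;
}
[/codebox]

On your 8400 GS this should report the card’s 512 MB as global memory and 16 kB of shared memory per block.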

OK, 16 KB.
But can’t it be that the copying plus the work in shared memory is slower than accessing the device memory without copying (when you don’t do too many operations)? How can I know which is faster: no copying and accessing device memory, or copying and then accessing shared memory?
Sorry, I have read the programming guide, but I prefer asking other human beings, as they often know more than the combination of me and the programming guide ;)

Sorry, I didn’t look into the programming guide again, but how big is the global memory?
Is it the 640 MB or the 512 MB?
If it is the card memory, do I have to subtract the shared memory from the total 512 MB to get the global memory?
Or is the shared memory inside the GPU itself?

Really, the programming guide explains quite well how the hardware is organized. It will genuinely help you understand how it works. You can ask lots of questions, but you will still not cover everything in the programming guide. If things in the guide are unclear, you can still ask here.

As to when to use shared memory: good examples are when you need a fast cache (like the matrixMul example, where each element is accessed multiple times), or when you need threads to cooperate to achieve a result (like in the reduction example).
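As a sketch of the cooperative case, here is a minimal block-wide sum reduction in shared memory (the kernel and names are illustrative, not taken from the SDK example):

[codebox]
// Sketch: block-wide sum reduction using shared memory (illustrative).
#define BLOCK_SIZE 256

__global__ void blockSum(const float* in, float* out)
{
    __shared__ float s[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    // Each thread copies its element from global to shared memory itself;
    // there is no implicit global-to-shared transfer.
    s[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction inside the block; shared memory makes the
    // repeated accesses cheap.
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];
}
[/codebox]

Each element of `s` is read several times during the loop, which is exactly the access pattern where the copy into shared memory pays off; if each thread touched its element only once, you could just read global memory directly.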

So what is the maximum size of the data that can be loaded onto the device? If the size of the constant memory is 64 KB, then the maximum that can be loaded there is 64 KB, and if I have to load more data, I will have to load it in chunks, is that right? Also, is there a way to put the texture memory to general use if my program does not involve any textures?
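Yes, for constant memory specifically you would refill it in chunks between kernel launches. A rough sketch under that assumption (the kernel itself and all names are illustrative):

[codebox]
// Sketch: processing an input larger than the 64 KB constant memory
// by refilling it chunk by chunk. Kernel launch is assumed/omitted.
#include <cuda_runtime.h>

#define CHUNK_FLOATS (64 * 1024 / sizeof(float))  // one 64 KB chunk

__constant__ float cChunk[CHUNK_FLOATS];

void processLargeInput(const float* host, size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t count = (n - off < CHUNK_FLOATS) ? (n - off) : CHUNK_FLOATS;

        // Refill constant memory with the next chunk of host data...
        cudaMemcpyToSymbol(cChunk, host + off, count * sizeof(float));

        // ...then launch a kernel that reads cChunk (not shown):
        // myKernel<<<grid, block>>>(count);
        cudaThreadSynchronize();
    }
}
[/codebox]

Note that global memory has no such limit; the 64 KB cap applies only to the `__constant__` space.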

Texture memory is ordinary device memory; there is just a cache in between (which can only be used via texture fetches).
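So you can read plain linear device memory through the texture cache. A sketch with the 2.x-era texture reference API (names are illustrative):

[codebox]
// Sketch: reading an ordinary global-memory buffer through the
// texture cache (CUDA 2.x-era texture reference API).
texture<float, 1, cudaReadModeElementType> tex;

__global__ void addOne(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i) + 1.0f;  // cached read
}

void run(float* dIn, float* dOut, int n)
{
    // Bind the existing global-memory buffer to the texture reference.
    cudaBindTexture(NULL, tex, dIn, n * sizeof(float));
    addOne<<<(n + 255) / 256, 256>>>(dOut, n);
    cudaUnbindTexture(tex);
}
[/codebox]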

Figure 3-1 of the programming guide shows how it is organized.