First, thanks to everyone who answered my last questions.
Now for a new question:
Have a look at the attached screenshot.
Now let’s say I have a computer with the following specifications:
- RAM: 640 MB
- Graphics card: GeForce 8400 GS 512 MB
OK, that’s not the best hardware, but let’s go with it for now.
Now my question:
How big is the global memory (in MB), how big is the shared memory, and how big is the local memory?
As far as I understand the CUDA documentation, you should work in shared memory, since it is the fastest. But how much shared memory do I have? And can I load data from RAM directly into shared memory?
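For what it’s worth, it seems these sizes can be queried at runtime with cudaGetDeviceProperties. Here is a minimal sketch, assuming the standard CUDA runtime API (the per-thread local memory size does not appear to be reported there):
[codebox]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the properties of device 0 (the 8400 GS in my case)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device:               %s\n", prop.name);
    printf("Global memory:        %lu MB\n",
           (unsigned long)(prop.totalGlobalMem >> 20));
    printf("Shared memory/block:  %lu KB\n",
           (unsigned long)(prop.sharedMemPerBlock >> 10));
    printf("Registers/block:      %d\n", prop.regsPerBlock);
    return 0;
}
[/codebox]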
In the matrix multiplication example (NVIDIA_CUDA_Programming_Guide_2.0.pdf, page 81 according to Adobe Viewer’s page index), the CUDA authors do the following:
[codebox]
// Load A and B to the device
float* Ad;
size = hA * wA * sizeof(float);
cudaMalloc((void**)&Ad, size);
cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&Bd, size);
cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
[…]
// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Shared memory for the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Shared memory for the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from global memory to shared memory;
    // each thread loads one element of each matrix
    As[ty][tx] = A[a + wA * ty + tx];
    Bs[ty][tx] = B[b + wB * ty + tx];
[/codebox]
So do I have to do two memory copies to get my data into shared memory? Or does the copy from global to shared memory work directly?
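Just to check my understanding, here is a stripped-down sketch of what I think is going on. The kernel name loadTile and the variable names are mine, purely for illustration:
[codebox]
// Hypothetical kernel: the global-to-shared copy happens here,
// as an ordinary assignment executed by each thread.
__global__ void loadTile(const float* gIn, float* gOut, int w)
{
    __shared__ float tile[16][16];

    // Global -> shared: one element per thread
    tile[threadIdx.y][threadIdx.x] = gIn[threadIdx.y * w + threadIdx.x];
    __syncthreads();

    // Write back so the result is observable on the host
    gOut[threadIdx.y * w + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}

int main()
{
    const int w = 16;
    const size_t size = w * w * sizeof(float);

    float hIn[w * w];
    for (int i = 0; i < w * w; ++i) hIn[i] = (float)i;

    float *dIn, *dOut;
    cudaMalloc((void**)&dIn, size);
    cudaMalloc((void**)&dOut, size);

    // Step 1: host RAM -> device global memory
    cudaMemcpy(dIn, hIn, size, cudaMemcpyHostToDevice);

    // Step 2 happens inside the kernel: global -> shared
    loadTile<<<1, dim3(w, w)>>>(dIn, dOut, w);

    float hOut[w * w];
    cudaMemcpy(hOut, dOut, size, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
[/codebox]
If that is right, the global-to-shared step would simply be an assignment inside the kernel, not a separate cudaMemcpy. Please correct me if I got that wrong.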
Thanks!!!