Matrix example: questions from the Programming Guide 2.3

If you go to page 20 of CudaProgramming_guide_2.3.pdf you will see the matrix multiplication example. I understand the overall idea of that code: they allocate device memory for the matrix elements with cudaMalloc and copy the host data over, which is fine and understandable, since that way the kernel function can read the data from video memory. But...

What about the width and height values of the Matrix struct? There is no cudaMalloc for them; they just travel inside the structs passed as arguments to the kernel. How does CUDA know that these two variables can be read when cudaMalloc was never called for them? How can the kernel read them? They don't even do a cudaFree for them, so it looks like they stayed in host memory the whole time. A little confusing, since kernels don't read host memory, do they?

Any help clearing up these questions would be appreciated. It's confusing me, lol.

thanks

[codebox]Matrix d_A;
d_A.width = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc((void**)&d_A.elements, size);

Matrix d_B;
d_B.width = B.width; d_B.height = B.height;
size = B.width * B.height * sizeof(float);
cudaMalloc((void**)&d_B.elements, size);

Matrix d_C;
d_C.width = C.width; d_C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc((void**)&d_C.elements, size);

// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);[/codebox]
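For reference, the Matrix type in that example is an ordinary C struct (this is the guide's definition as I recall it, so double-check against the PDF):

[codebox]// Plain C struct: width and height are ordinary ints living wherever
// the struct lives; elements holds a device address after cudaMalloc.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;[/codebox]

The snippet above also leaves out the cudaMemcpy calls from the full example, which copy A.elements and B.elements into the freshly allocated device buffers before the launch (and copy C back afterwards).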

As far as I know, nvcc would copy

(1) the execution configuration <<<dimGrid, dimBlock>>> into shared memory, and

(2) the function parameters (d_A, d_B, d_C) into shared memory (call by value).

So you need to allocate device memory explicitly via

[codebox]cudaMalloc((void**)&d_A.elements, size);
cudaMalloc((void**)&d_B.elements, size);
cudaMalloc((void**)&d_C.elements, size);[/codebox]

and the contents of the pointers d_A.elements, d_B.elements, d_C.elements are put into shared memory (the content is an integer, interpreted as an address).
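In other words, the kernel receives its own copy of each struct, so d_A.width and d_A.height are directly readable, and the pointer inside the copy still holds the device address from cudaMalloc. A minimal sketch of the kernel side (not the guide's real MatMulKernel, which does the full dot-product loop; this stand-in just shows the by-value members being used):

[codebox]// Sketch only: the structs arrive by value with the launch,
// so width/height are plain copies every thread can read,
// while elements dereferences device (global) memory.
__global__ void MatMulSketch(Matrix d_A, Matrix d_B, Matrix d_C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < d_C.height && col < d_C.width)          // by-value ints
        d_C.elements[row * d_C.width + col] = 0.0f;   // device pointer
}[/codebox]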

Second, the lifetime of shared memory is the execution time of one thread block, so you don't need to care about the function parameters. However,

third, since you allocated d_A.elements, d_B.elements and d_C.elements explicitly from host code, you need to release them explicitly.

Hence on page 21, the example code does free this memory explicitly via

[codebox]// Free device memory
cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);[/codebox]
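The rule of thumb: free exactly what you cudaMalloc'd, nothing more. The structs d_A, d_B, d_C themselves are ordinary automatic variables on the host stack and vanish on their own. A sketch of the matching pattern for one matrix (the error check is my addition, not from the guide):

[codebox]Matrix d_A;                              // host stack variable, no cudaFree needed
d_A.width = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaError_t err = cudaMalloc((void**)&d_A.elements, size);
if (err != cudaSuccess) { /* handle allocation failure */ }

// ... cudaMemcpy data in, launch kernel ...

cudaFree(d_A.elements);                  // releases only the device buffer
// d_A itself simply goes out of scope like any local variable[/codebox]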

So if I pass some integers and floats as arguments, by value, for example

[codebox]float a = 45.0f;
int i = 100;

callKernel( a, i );[/codebox]

it is not necessary to move them to device memory manually?

Yes, float a and int i would be put into shared memory, shared by the threads in a thread block.
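A quick sketch of that case (callKernel here is a hypothetical kernel I made up to match the call above; the __device__ variable is just a way to observe the result without any extra allocations):

[codebox]// Hypothetical kernel: a and i travel by value with the launch itself,
// so there is no cudaMalloc/cudaMemcpy/cudaFree for them at all.
__device__ float result;

__global__ void callKernel(float a, int i)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        result = a + (float)i;   // reads the by-value copies directly
}

// Host side:
float a = 45.0f;
int i = 100;
callKernel<<<1, 32>>>(a, i);

float h_result;
cudaMemcpyFromSymbol(&h_result, result, sizeof(float));  // h_result == 145.0f[/codebox]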

thanks LSChien

you rock!