device function arguments question

Hello, I’m new to CUDA.

I’m reading the CUDA Programming Guide 3.0 and noticed that the code on PDF pages 34-35 leaves out some information:

[codebox]typedef struct {
   int width;
   int height;
   int stride;
   float* elements;
} Matrix;[/codebox]

[codebox]__device__ float GetElement(const Matrix A, int row, int col) {
   return A.elements[row * A.stride + col];
}[/codebox]

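[codebox]
int main() {
   // Host matrix; USERVALUE and the initial contents are supplied by the user.
   Matrix A;
   A.width = A.stride = USERVALUE;
   A.height = USERVALUE;
   A.elements = (float*)malloc(A.width * A.height * sizeof(float));
   // ... user initialization of A.elements ...

   Matrix d_A;
   d_A.width = d_A.stride = A.width;
   d_A.height = A.height;
   size_t size = d_A.width * d_A.height * sizeof(float);
   cudaMalloc((void**)&d_A.elements, size);
   cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
}
[/codebox]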

I noticed that in the host main() function a Matrix structure d_A is created, but cudaMalloc is called only for its elements field, so only that array resides on the device (GPU). A device function is then defined and called from a kernel, and this function uses the stride field of the structure d_A, which is passed as a const argument.

So my question is:

  1. How can this program work if not all fields of the structure are present in device memory?

I may have found an answer on the internet:

“Function arguments are automatically copied from host to device.”

But if that is so:

  2. Why do I have to cudaMalloc the array? Is it to increase speed?

And if it is for speed:

  3. Can I cudaMalloc the complete structure and then cudaMalloc the elements array, in order to avoid the automatic host-to-device copy?

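Like you said, kernel arguments are passed by value and copied from host to device automatically at launch. That covers a plain value such as stride (a single int, not an array), and it covers the Matrix structure as a whole, including the pointer value stored in elements.

Arrays are not automagically copied from host to device; the user must copy them explicitly. Only the pointer travels with the structure, not the data it points to, so the cudaMalloc is not a speed optimization: elements has to point to device memory that you allocated with cudaMalloc and filled with cudaMemcpy.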
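Here is a minimal sketch of that distinction. The kernel ReadElement and the helper RunDemo are hypothetical names made up for this post, not code from the programming guide, and error checking is left out for brevity. The Matrix structure is an ordinary host variable passed by value at launch, while the float data is placed in device memory by hand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    int width;
    int height;
    int stride;
    float* elements;  // must hold a device address when used inside a kernel
} Matrix;

__device__ float GetElement(const Matrix A, int row, int col) {
    return A.elements[row * A.stride + col];
}

// Hypothetical kernel: A arrives by value as a kernel argument,
// so width, height, stride and the pointer value are all available.
__global__ void ReadElement(const Matrix A, int row, int col, float* out) {
    *out = GetElement(A, row, col);
}

// Hypothetical helper: reads element (1, 2) of a 3x4 matrix on the GPU.
float RunDemo() {
    const int rows = 3, cols = 4;
    float h_elements[rows * cols];
    for (int i = 0; i < rows * cols; ++i) h_elements[i] = (float)i;

    Matrix d_A;  // the structure itself lives in host memory
    d_A.width = d_A.stride = cols;
    d_A.height = rows;
    size_t size = rows * cols * sizeof(float);
    // Only the array needs an explicit device allocation and copy.
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, h_elements, size, cudaMemcpyHostToDevice);

    float* d_out;
    cudaMalloc((void**)&d_out, sizeof(float));
    ReadElement<<<1, 1>>>(d_A, 1, 2, d_out);  // d_A is copied at launch

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_A.elements);
    return result;
}

int main() {
    printf("element (1, 2) = %f\n", RunDemo());
    return 0;
}
```

With the array filled as 0, 1, 2, ... the element at row 1, column 2 is stored at index 1 * 4 + 2, so on a machine with a working CUDA device the program should report 6.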

Why would you want to avoid the automatic host->device copy? The code you are showing cudaMallocs only the memory the kernel needs, and the kernel launch copies the small by-value arguments for you. Basically, why cudaMalloc more than needed, which takes more time, and then cudaMemcpy more than needed, which also takes more time?
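For comparison, here is a rough sketch of the alternative raised in the question, where the complete structure is also placed in device memory and the kernel receives a pointer to it. ReadElementPtr and RunPointerDemo are hypothetical names, and error checking is again omitted. It works, but it needs one extra cudaMalloc and one extra cudaMemcpy, and the kernel has to read the fields through a pointer into device memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Hypothetical kernel: A points to a Matrix stored in device memory.
__global__ void ReadElementPtr(const Matrix* A, int row, int col, float* out) {
    *out = A->elements[row * A->stride + col];
}

// Hypothetical helper: same 3x4 matrix, but the struct is on the device too.
float RunPointerDemo() {
    const int rows = 3, cols = 4;
    float h_elements[rows * cols];
    for (int i = 0; i < rows * cols; ++i) h_elements[i] = (float)i;

    // Stage the struct on the host; elements already holds a device address.
    Matrix h_A;
    h_A.width = h_A.stride = cols;
    h_A.height = rows;
    size_t size = rows * cols * sizeof(float);
    cudaMalloc((void**)&h_A.elements, size);
    cudaMemcpy(h_A.elements, h_elements, size, cudaMemcpyHostToDevice);

    // Extra allocation and extra copy, just for the struct itself.
    Matrix* d_A;
    cudaMalloc((void**)&d_A, sizeof(Matrix));
    cudaMemcpy(d_A, &h_A, sizeof(Matrix), cudaMemcpyHostToDevice);

    float* d_out;
    cudaMalloc((void**)&d_out, sizeof(float));
    ReadElementPtr<<<1, 1>>>(d_A, 1, 2, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_A);
    cudaFree(h_A.elements);
    return result;
}

int main() {
    printf("element (1, 2) = %f\n", RunPointerDemo());
    return 0;
}
```

The array still has to be allocated and copied separately in this version, so nothing is saved; the by-value argument is simply replaced by manual work.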