Question about device function arguments

Hello, I'm new to CUDA.

I'm reading the CUDA Programming Guide 3.0 and I noticed that some information is missing from the code on PDF pages 34-35:

[codebox]
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;
[/codebox]

[codebox]
// Reads one element of A; elements must point to device memory.
__device__ float GetElement(const Matrix A, int row, int col) {
    return A.elements[row * A.stride + col];
}
[/codebox]

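[codebox]
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    // Host matrix. USERVALUE and the user initialization are placeholders.
    Matrix A;
    A.width = A.stride = USERVALUE;
    A.height = USERVALUE;
    A.elements = (float*)malloc(A.width * A.height * sizeof(float));
    // ... user initialization of A.elements ...

    // d_A itself is a host variable; only d_A.elements points to device memory.
    Matrix d_A;
    d_A.width = d_A.stride = A.width;
    d_A.height = A.height;
    size_t size = d_A.width * d_A.height * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    cudaFree(d_A.elements);
    free(A.elements);
    return 0;
}
[/codebox]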

I noticed that in the host main() function a Matrix structure d_A is created, and cudaMalloc is called only for its elements field. So only that part of the d_A structure resides in device (GPU) memory. A device function is then defined and called from a kernel, but this function uses the stride field of the d_A structure, which is passed as a const argument.
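To make the situation concrete, here is a minimal sketch of the usage I have in mind. The kernel `ReadElement` and the host helper `ReadElementOnDevice` are names I made up for illustration and are not taken from the guide. The Matrix is passed to the kernel by value, so as far as I can tell its width, height, stride and the pointer value travel with the launch, while the array behind `elements` has to be allocated and copied separately.

```cuda
#include <cuda_runtime.h>

typedef struct {
    int width;
    int height;
    int stride;
    float* elements;   // must hold a device address when used inside a kernel
} Matrix;

__device__ float GetElement(const Matrix A, int row, int col) {
    return A.elements[row * A.stride + col];
}

// Hypothetical kernel: A is received by value, so A.stride is readable on the
// device even though no cudaMalloc was ever done for the struct itself.
__global__ void ReadElement(Matrix A, int row, int col, float* out) {
    *out = GetElement(A, row, col);
}

// Hypothetical host helper: uploads an n x n host array, then reads one
// element back through the kernel.
float ReadElementOnDevice(const float* host, int n, int row, int col) {
    Matrix d_A;                          // the struct is an ordinary host variable
    d_A.width = d_A.stride = n;
    d_A.height = n;
    size_t size = (size_t)n * n * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, host, size, cudaMemcpyHostToDevice);

    float* d_out;
    cudaMalloc((void**)&d_out, sizeof(float));
    ReadElement<<<1, 1>>>(d_A, row, col, d_out);   // struct copied as an argument

    float result = 0.0f;
    // Blocking copy: it completes only after the kernel has finished.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_A.elements);
    return result;
}
```

For example, with `float h[4] = {1, 2, 3, 4}`, calling `ReadElementOnDevice(h, 2, 1, 0)` reads index 1 * 2 + 0 = 2 of the uploaded array. Error checking on the CUDA calls is left out to keep the sketch short.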

So my question is:

  1. "How can this program work if not all fields of the structure are present in the device memory space?"

I may have found an answer on the internet:

"Function arguments are automatically copied from host to device."

But if that is so:

  2. Why do I have to cudaMalloc the array? Is it to increase speed?

And if it is to increase speed:

  3. Can I cudaMalloc the complete structure and then cudaMalloc the elements array, in order to avoid the automatic host-to-device copy?
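The alternative I am asking about in the last question would look roughly like the sketch below. The names `ReadElementPtr` and `ReadElementViaDeviceStruct` are hypothetical, and this is only my guess at how it would be written, not something from the guide. Note that the kernel still receives an argument, the pointer `d_A`, so it seems the launch-time copy is reduced to the size of one pointer rather than removed.

```cuda
#include <cuda_runtime.h>

typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Hypothetical kernel: A points to a Matrix that itself lives in device memory.
__global__ void ReadElementPtr(const Matrix* A, int row, int col, float* out) {
    *out = A->elements[row * A->stride + col];
}

// Hypothetical host helper: places both the array and the struct on the device.
float ReadElementViaDeviceStruct(const float* host, int n, int row, int col) {
    // Stage the struct on the host; its elements field gets a device address.
    Matrix staged;
    staged.width = staged.stride = n;
    staged.height = n;
    size_t size = (size_t)n * n * sizeof(float);
    cudaMalloc((void**)&staged.elements, size);
    cudaMemcpy(staged.elements, host, size, cudaMemcpyHostToDevice);

    // Second allocation: the struct itself, copied to the device as raw bytes.
    Matrix* d_A;
    cudaMalloc((void**)&d_A, sizeof(Matrix));
    cudaMemcpy(d_A, &staged, sizeof(Matrix), cudaMemcpyHostToDevice);

    float* d_out;
    cudaMalloc((void**)&d_out, sizeof(float));
    ReadElementPtr<<<1, 1>>>(d_A, row, col, d_out);   // only pointers are passed

    float result = 0.0f;
    // Blocking copy: it completes only after the kernel has finished.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_A);
    cudaFree(staged.elements);
    return result;
}
```

This version needs an extra cudaMalloc and cudaMemcpy for the struct, and every access to `A->stride` in the kernel becomes a read from global memory, so I am not sure it would actually be faster.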