device function arguments question

Hello, I'm new to CUDA.

I'm reading the CUDA Programming Guide 3.0 and I noticed some information is missing from the code on PDF pages 34-35:

[codebox]typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;[/codebox]

[codebox]__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}[/codebox]


[codebox]int main()
{
    Matrix A;                       // host matrix, elements filled by the user
    A.width = A.stride = USERVALUE;
    A.height = USERVALUE;

    Matrix d_A;
    d_A.width = d_A.stride = USERVALUE;
    d_A.height = USERVALUE;
    size_t size = d_A.width * d_A.height * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    ...
}[/codebox]



I noticed that in the host main() function, a Matrix structure d_A is created, and cudaMalloc is called only on the ELEMENTS field of the structure. So only that part of d_A resides on the device (GPU). Then a device function is defined and called from a kernel, and this function uses the STRIDE field of the structure d_A, which is passed as a const argument (by value).

so my question is:

  1. “how can this program work if not all fields of the structure are present in device memory?”

I may have found an answer on the internet:

“Function arguments are automatically copied from host to device.”

But if it is so:

  2. why do I have to cudaMalloc the array? Is it for a speed increase?

and if it is for a speed increase,

  3. can I cudaMalloc the complete structure and then cudaMalloc the elements array, in order to avoid the automatic host-to-device copy?

Like you said, the value is automatically transferred/copied from host to device for a single data type (in this case stride is a scalar, not an array).

Arrays are not automagically copied from host to device, they must be explicitly copied by the user.

Why would you want to avoid the automatic host->device copy? The code you are showing cudaMalloc's only the memory the kernel needs, and lets the compiler/GPU handle the rest that is copied automatically. Basically, why cudaMalloc more than needed, which takes more time, and then cudaMemcpy more than needed, which also takes more time?