Hi,
I am new to CUDA programming (Quadro FX 5800, 4 GB, Windows 7 64-bit).
I am trying matrix multiplication. It works well, but
when analysing performance with Nsight, I noticed that copying from device to host takes almost 200 times longer than copying from host to device!
Is this normal?
Vincent
Nsight report (2000 x 2000 matrix):
cuMemcpyDtoH_v2: 1,738,671 µs
cuMemcpyHtoD_v2: 9,317 µs
Code:
unsigned int M_DIM_LIG=2000;
unsigned int M_DIM_COL=2000;
unsigned int P_DIM_LIG=M_DIM_LIG;
unsigned int P_DIM_COL=M_DIM_COL;
…
float *matM = new float[M_DIM_LIG * M_DIM_COL];
float *matPGPU = new float[P_DIM_LIG * P_DIM_COL];
…
cudaMalloc((void **) &devM, M_DIM_COL * M_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devN, N_DIM_COL * N_DIM_LIG * sizeof(float));
cudaMalloc((void **) &devP, P_DIM_COL * P_DIM_LIG * sizeof(float));
…
cudaMemcpy(devM, matM, M_DIM_COL * M_DIM_LIG * sizeof(float), cudaMemcpyHostToDevice); // 9,317 µs
…
cudaMemcpy(matPGPU, devP, P_DIM_COL * P_DIM_LIG * sizeof(float), cudaMemcpyDeviceToHost); // 1,738,671 µs
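
For reference, a 2000 x 2000 float matrix is 16,000,000 bytes, so the reported times work out to roughly 1.7 GB/s host to device but only about 9 MB/s device to host. One possible reason for such a gap is that kernel launches are asynchronous: a cudaMemcpy that follows a kernel waits for the kernel to finish, so the wait may be counted as part of the copy. The sketch below is one way to check this by timing the kernel and the copy separately with CUDA events. The kernel matMulKernel, the helper matrixBytes, and the fill values are hypothetical stand-ins, since the post does not show the real kernel or its launch.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical naive kernel; stands in for the one elided in the post.
__global__ void matMulKernel(const float *M, const float *N, float *P, unsigned int dim)
{
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < dim && col < dim) {
        float sum = 0.0f;
        for (unsigned int k = 0; k < dim; ++k)
            sum += M[row * dim + k] * N[k * dim + col];
        P[row * dim + col] = sum;
    }
}

// Size in bytes of a rows x cols float matrix (hypothetical helper).
size_t matrixBytes(unsigned int rows, unsigned int cols)
{
    return static_cast<size_t>(rows) * cols * sizeof(float);
}

int main()
{
    const unsigned int dim = 2000;
    const size_t count = static_cast<size_t>(dim) * dim;
    const size_t bytes = matrixBytes(dim, dim);

    // Host buffers with arbitrary fill values (assumption, for illustration).
    float *matM = new float[count];
    float *matN = new float[count];
    float *matPGPU = new float[count];
    for (size_t i = 0; i < count; ++i) { matM[i] = 1.0f; matN[i] = 2.0f; }

    float *devM = nullptr, *devN = nullptr, *devP = nullptr;
    cudaMalloc((void **)&devM, bytes);
    cudaMalloc((void **)&devN, bytes);
    cudaMalloc((void **)&devP, bytes);
    cudaMemcpy(devM, matM, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devN, matN, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    dim3 block(16, 16);
    dim3 grid((dim + block.x - 1) / block.x, (dim + block.y - 1) / block.y);

    cudaEventRecord(start);
    matMulKernel<<<grid, block>>>(devM, devN, devP, dim); // returns to the host immediately
    cudaEventRecord(afterKernel);                         // completes when the kernel finishes
    cudaMemcpy(matPGPU, devP, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    std::printf("kernel: %.3f ms, device-to-host copy: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(devM);
    cudaFree(devN);
    cudaFree(devP);
    delete[] matM;
    delete[] matN;
    delete[] matPGPU;
    return 0;
}
```

If the kernel time turns out to account for most of the 1.7 s, the copy itself is not slow. Calling cudaDeviceSynchronize() right after the kernel launch in the original code should have a similar effect in the Nsight timeline, because the wait is then no longer attributed to the memcpy.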