I am stuck on this problem. Could you help me out? Thanks.
I tried to record the data transfer times (CPU->GPU and GPU->CPU) in the matrixMul project.
There are two memcpy calls for host->device (for matrices A and B) and one memcpy for device->host (for matrix C).
I set the data size to be the same for A, B, and C.
1. The first host->device memcpy takes much less time than the second one (the second takes 3 times as long or more). I have taken care of cudaThreadSynchronize.
2. The host->device time is much larger than the device->host time. I checked the bandwidthTest result, and both directions have similar bandwidth.
Problems 1 and 2 happen for both single precision and double precision.
3. My kernel takes 7 times as long with double precision as with single precision. The two versions have the same functionality, except that one uses float and the other uses double.
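To compare the measured copy times with the bandwidthTest figures, each time can be converted into an effective bandwidth. Below is a minimal sketch of that conversion. The function name bandwidth_mb_per_s is hypothetical, and it assumes 1 MB = 2^20 bytes, which may not match the convention of whatever tool the numbers are compared against.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the matrixMul sample): convert a copy of
// `bytes` bytes that took `elapsed_ms` milliseconds into MB/s.
// Assumption: 1 MB = 2^20 bytes.
double bandwidth_mb_per_s(std::size_t bytes, double elapsed_ms)
{
    if (elapsed_ms <= 0.0) {
        return 0.0;  // no meaningful bandwidth for a zero or negative time
    }
    const double megabytes = static_cast<double>(bytes) / (1024.0 * 1024.0);
    const double seconds = elapsed_ms / 1000.0;
    return megabytes / seconds;
}
```

For example, bandwidth_mb_per_s(mem_size_A, t_ms) would give the effective rate of the copy of A, where t_ms is the time measured for that copy alone.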
I inserted the timing code like this to record the host->device time. I did the same thing for device->host.
unsigned int timer = 0;
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));
// copy host memory to device
cutilSafeCall(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );
cutilSafeCall(cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError(cutStopTimer(timer));
printf("GPU time: %f (ms) \n", cutGetTimerValue(timer));
cutilCheckError(cutDeleteTimer(timer));
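The snippet above times both host->device copies together. To look at each copy on its own, one option is to bracket every cudaMemcpy with CUDA events. This is only a sketch under stated assumptions: it uses the runtime's cudaEvent calls instead of the cutil timers, the CHECK macro and timedMemcpy helper are hypothetical names, and the 1024x1024 float buffers in pageable host memory stand in for the real matrices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: time a single cudaMemcpy with CUDA events and
// return the elapsed time in milliseconds.
static float timedMemcpy(void *dst, const void *src, size_t bytes,
                         cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));   // wait until the stop event has completed
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main()
{
    // Assumed size: one 1024x1024 float matrix, the same for A, B, and C.
    const size_t bytes = 1024 * 1024 * sizeof(float);

    float *h_A = (float *)calloc(bytes, 1);
    float *h_B = (float *)calloc(bytes, 1);
    float *h_C = (float *)calloc(bytes, 1);
    if (!h_A || !h_B || !h_C) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }

    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc((void **)&d_A, bytes));
    CHECK(cudaMalloc((void **)&d_B, bytes));
    CHECK(cudaMalloc((void **)&d_C, bytes));
    CHECK(cudaMemset(d_C, 0, bytes));    // give d_C defined contents before copying back

    // Each copy is timed separately so the two host->device copies
    // can be compared with each other and with the device->host copy.
    float msA = timedMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    float msB = timedMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    float msC = timedMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("H2D A: %f ms\n", msA);
    printf("H2D B: %f ms\n", msB);
    printf("D2H C: %f ms\n", msC);

    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));
    CHECK(cudaFree(d_C));
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
```

Swapping the order of the A and B copies, or repeating each copy several times and averaging, would show whether the difference follows the buffer or the call order.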