I have some code that I have optimized using CUDA with quite good results. The problem is that, after the optimization, the most time-consuming section is the one that copies data from device memory to the host after the kernel call. These are the relevant lines of code:
```cuda
clock_t tmpTime = clock();
calcMatches <<< blocksPerGrid, threadsPerBlock >>> (d_bestCorr1, d_bestCorr2, d_matches, points1.size());
cout << "calcMatches = " << clock() - tmpTime << endl;

tmpTime = clock();
int * h_matches = (int *)malloc(points1.size() * sizeof(int));
cout << "malloc = " << clock() - tmpTime << endl;

tmpTime = clock();
cutilSafeCall(cudaMemcpy(h_matches, d_matches, points1.size() * sizeof(int), cudaMemcpyDeviceToHost));
cout << "memCpy = " << clock() - tmpTime << endl;
```
An example of the output for this section of code:
```
calcMatches = 10000
malloc = 0
memCpy = 30000
```
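I am not sure these numbers are entirely trustworthy: since kernel launches are asynchronous, the `clock()` call right after the launch may only measure the launch overhead, and the blocking `cudaMemcpy` that follows would then absorb part of the kernel's actual execution time. A sketch of how I could time the two phases separately with CUDA events (the kernel's parameter types are an assumption on my part, since only the call site is shown above):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel from the question, defined elsewhere; int* parameters are assumed.
__global__ void calcMatches(int *bestCorr1, int *bestCorr2, int *matches, int n);

// Sketch: time the kernel and the device-to-host copy separately with
// CUDA events, which respect the asynchronous kernel launch.
void timedSection(int *d_bestCorr1, int *d_bestCorr2, int *d_matches,
                  int n, int blocksPerGrid, int threadsPerBlock)
{
    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start, 0);
    calcMatches<<<blocksPerGrid, threadsPerBlock>>>(d_bestCorr1, d_bestCorr2,
                                                    d_matches, n);
    cudaEventRecord(afterKernel, 0);   // marks the end of the kernel in stream 0

    int *h_matches = (int *)malloc(n * sizeof(int));
    cudaMemcpy(h_matches, d_matches, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy, 0);
    cudaEventSynchronize(afterCopy);   // wait until all recorded work is done

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel = %f ms, memcpy = %f ms\n", kernelMs, copyMs);

    free(h_matches);
    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
}
```

With this kind of timing, the copy cost I actually want to reduce would no longer be mixed with the kernel's execution time.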
The execution time is acceptable for my application, but it would be better if the memory-copy time were smaller. The value of points1.size() is 746.
Is there a way to optimize these times? Thanks in advance.
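One thing I have read about, but have not tried yet, is allocating the host buffer as page-locked (pinned) memory with `cudaMallocHost` instead of `malloc`, which is supposed to speed up device-to-host transfers. Would something along these lines help? (A minimal sketch, with the buffer size taken from my snippet above; `allocAndCopyMatches` is just a name I made up for illustration.)

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Sketch: replace the pageable malloc with page-locked (pinned) host
// memory, which usually makes cudaMemcpy device-to-host transfers faster.
int *allocAndCopyMatches(const int *d_matches, size_t n)
{
    int *h_matches = NULL;
    cudaMallocHost((void **)&h_matches, n * sizeof(int));  // pinned allocation
    cudaMemcpy(h_matches, d_matches, n * sizeof(int), cudaMemcpyDeviceToHost);
    return h_matches;  // must be released with cudaFreeHost, not free
}
```

I am also unsure whether, for only 746 ints, the transfer is dominated by per-call latency rather than bandwidth, in which case pinning might not change much.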