I modified the SDK example simpleMultiGPU by replacing:
h_Data = (float *)malloc(DATA_N * sizeof(float));
with:
CUDA_SAFE_CALL( cudaHostAlloc((void **)&h_Data, DATA_N * sizeof(float), cudaHostAllocPortable | cudaHostAllocWriteCombined) );
(I also replaced free(h_Data) with cudaFreeHost(h_Data).)
The system has a GeForce GTX 590, which is a dual-GPU card.
The processing time was 222 ms with malloc() and 322 ms with cudaHostAlloc(); including or omitting cudaHostAllocWriteCombined made no difference.
What’s the explanation for the difference in performance?
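One way to narrow this down might be to time the host allocation separately from the host-to-device copy, since the sample's reported time may cover both. The following is a minimal sketch of that idea, not the SDK code: it uses a single device, the buffer size DATA_N is a hypothetical value chosen for illustration, and the helper nowMs() is made up for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical buffer size for illustration; not the value used by simpleMultiGPU.
static const size_t DATA_N = 1 << 24;

// Hypothetical helper: wall-clock time in milliseconds.
static double nowMs() {
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}

int main() {
    const size_t bytes = DATA_N * sizeof(float);
    float *h_pageable = NULL, *h_pinned = NULL, *d_data = NULL;

    // Create the CUDA context up front so its cost is not charged to either path.
    if (cudaFree(0) != cudaSuccess) { fprintf(stderr, "No usable CUDA device\n"); return 1; }
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    // Time the two host allocations on their own.
    double t0 = nowMs();
    h_pageable = (float *)malloc(bytes);
    double t1 = nowMs();
    cudaError_t err = cudaHostAlloc((void **)&h_pinned, bytes,
                                    cudaHostAllocPortable | cudaHostAllocWriteCombined);
    double t2 = nowMs();
    if (h_pageable == NULL || err != cudaSuccess) { fprintf(stderr, "Host allocation failed\n"); return 1; }

    // Write to both buffers so the pageable one is actually backed by physical pages.
    for (size_t i = 0; i < DATA_N; i++) { h_pageable[i] = 1.0f; h_pinned[i] = 1.0f; }

    // Time the host-to-device copies; synchronize so the transfer has finished
    // before the clock is read.
    double t3 = nowMs();
    cudaMemcpy(d_data, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    double t4 = nowMs();
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    double t5 = nowMs();

    printf("alloc: malloc %.3f ms, cudaHostAlloc %.3f ms\n", t1 - t0, t2 - t1);
    printf("H2D copy: pageable %.3f ms, pinned %.3f ms\n", t4 - t3, t5 - t4);

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_data);
    return 0;
}
```

If the allocation line accounts for most of the gap while the pinned copy is no slower than the pageable one, that would suggest the extra time comes from allocating page-locked memory rather than from the transfers.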