simpleMultiGPU: slowdown with portable memory, with or without write-combining (wc)

Hello,

I modified the SDK example simpleMultiGPU by replacing:

h_Data = (float *)malloc(DATA_N * sizeof(float));

with

CUDA_SAFE_CALL( cudaHostAlloc((void **)&h_Data, DATA_N * sizeof(float),
                              cudaHostAllocPortable | cudaHostAllocWriteCombined) );

(I also replaced free(h_Data) with cudaFreeHost(h_Data).)
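For reference, the change amounts to roughly the following stand-alone sketch. It is not the SDK source: CHECK is a hypothetical stand-in for CUDA_SAFE_CALL, the DATA_N value is an assumption, and the per-GPU work is omitted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the SDK's CUDA_SAFE_CALL macro:
// print the failing call and exit on any CUDA runtime error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

int main(void)
{
    // Assumed problem size; the SDK sample defines its own DATA_N.
    const size_t DATA_N = 1048576 * 32;
    float *h_Data = NULL;

    // Page-locked host buffer, visible to all CUDA contexts (portable)
    // and write-combined. Replaces the original malloc() call.
    CHECK(cudaHostAlloc((void **)&h_Data, DATA_N * sizeof(float),
                        cudaHostAllocPortable | cudaHostAllocWriteCombined));

    // Fill the input on the host. The CPU only writes to this buffer here,
    // which is the access pattern write-combined memory is intended for.
    for (size_t i = 0; i < DATA_N; i++)
        h_Data[i] = (float)rand() / (float)RAND_MAX;

    // ... per-GPU copies and kernel launches would go here ...

    // Replaces the original free(h_Data).
    CHECK(cudaFreeHost(h_Data));
    return 0;
}
```

This should build with nvcc (for example, nvcc sketch.cu -o sketch, where the file name is arbitrary).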

The system has a GeForce GTX 590, which is a dual-GPU card.

The processing time with malloc() was 222 ms; with cudaHostAlloc() it was 322 ms. Including or omitting cudaHostAllocWriteCombined made no difference to the time.

What’s the explanation for the difference in performance?

Thanks,