Data transfer from host to two GPUs in a cluster (MultiGPU Programming)

I am trying to write a program that takes advantage of multi-GPU programming. What I want to do is scatter some data among several GPUs, have each GPU compute on its portion, and gather the results back into the CPU thread. But I have an issue: execution on the several devices does not seem to be parallel, even though all calls are made in asynchronous form.
To illustrate the issue I will take as an example the SDK application simpleMultiGPU. Basically, what interests me in this application is the time spent in the sequence:
Select device -> Asynchronous copy H2D -> Launch kernel -> Asynchronous copy D2H

Each GPU has its own stream, different from stream 0.
I have taken several measurements, both with CUDA events and with a CPU timer. With CUDA events I time each of the three operations, i.e.:
start event -> async copy H2D -> stop event -> start event -> kernel -> stop event -> start event -> async copy D2H -> stop event -> stream synchronize -> read times.

With the CPU timer, I have measured:
start CPU timer -> async copy H2D GPU0 -> launch kernel GPU0 -> async copy D2H GPU0 -> async copy H2D GPU1 -> launch kernel GPU1 -> async copy D2H GPU1 -> stream synchronize -> stop CPU timer.
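The measurement sequence above can be sketched roughly as follows (a minimal sketch, not the exact simpleMultiGPU code; `kernel`, `grid`, `block`, the buffer arrays, and the event/stream arrays are assumed to have been set up beforehand, and error checking is omitted):

```cuda
// Issue the whole pipeline asynchronously to each device, then wait.
// Assumes h_data[i] is pinned host memory and d_data[i] was allocated
// with cudaMalloc on device i; stream[i] was created on device i.
for (int i = 0; i < nGPUs; ++i) {
    cudaSetDevice(i);                                   // select device
    cudaEventRecord(startH2D[i], stream[i]);
    cudaMemcpyAsync(d_data[i], h_data[i], bytes,
                    cudaMemcpyHostToDevice, stream[i]); // async copy H2D
    cudaEventRecord(stopH2D[i], stream[i]);

    cudaEventRecord(startK[i], stream[i]);
    kernel<<<grid, block, 0, stream[i]>>>(d_data[i]);   // launch kernel
    cudaEventRecord(stopK[i], stream[i]);

    cudaEventRecord(startD2H[i], stream[i]);
    cudaMemcpyAsync(h_data[i], d_data[i], bytes,
                    cudaMemcpyDeviceToHost, stream[i]); // async copy D2H
    cudaEventRecord(stopD2H[i], stream[i]);
}
for (int i = 0; i < nGPUs; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);                   // wait for each GPU
}
```

After the synchronize, cudaEventElapsedTime() on each start/stop pair gives the per-operation GPU times, while the CPU timer brackets the whole loop.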

So, with CUDA events I can calculate the total time spent on each GPU. With the CPU timer I can calculate the whole time of the task, covering both GPU0 and GPU1. Because everything is asynchronous and the GPUs are two different devices, one would expect the CPU time to be equal to, or slightly greater than, the maximum time among the GPUs.

TIMES FROM THE simpleMultiGPU EXAMPLE (uses pinned memory)
CPU total time: 0.016832 s
GPU0 time: 0.012701 s
Kernel execution: 0.000934 s
H2D memcopy: 0.011731 s
D2H memcopy: 0.000017 s
GPU1 time: 0.016728 s
Kernel execution: 0.000941 s
H2D memcopy: 0.015753 s
D2H memcopy: 0.000017 s

As expected, the CPU time and the GPU1 time are almost equal. Most of the time is spent in the H2D memcopy (~67 MB). Since the CPU time and the GPU1 time are equal, I guess that the async memcopies from the same host to two different devices are done in parallel. But this example uses pinned memory, allocated with cudaMallocHost.
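For reference, the pinned allocation in that example looks roughly like this (a sketch; the element count `N` and the float type are assumptions on my part):

```cuda
float *h_data;
// Page-locked (pinned) host allocation: required for cudaMemcpyAsync
// to actually run asynchronously via DMA.
cudaMallocHost((void **)&h_data, N * sizeof(float));

// ... issue cudaMemcpyAsync transfers using h_data ...

cudaFreeHost(h_data);  // pinned memory must be freed with cudaFreeHost
```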

I didn't get the same behaviour using normal pageable memory (allocated with malloc). The times I got with pageable memory are shown below:

CPU total time: 0.040939 s
GPU0 time: 0.020185 s
Kernel execution: 0.001094 s
H2D memcopy: 0.018990 s
D2H memcopy: 0.000088 s
GPU1 time: 0.020340 s
Kernel execution: 0.001093 s
H2D memcopy: 0.019137 s
D2H memcopy: 0.000091 s

The times are greater, but that is not the important part. The issue is that the CPU time is twice the GPUx time (with pinned memory they were equal). Most of the time is spent in the H2D memcopy, so what I deduce to explain this (because I have no other idea) is that the H2D memcopies are being done sequentially for some reason (I want to stress that I am issuing commands to two different devices). I expected the CPU time to equal the time of a single GPU (plus some overhead).

Is there something in the CUDA runtime, or maybe in the DMA engine, that explains such behaviour?

I am using CUDA 4.0 with the 4.1 CUDA driver version. Each node in the cluster has two Tesla M2090 cards (compute capability 2.0). My OS is Red Hat Enterprise Linux Server release 5.3 (Tikanga) x86_64, and the PCI Express version is 2.0.

Thanks in advance.

I answer myself. I hadn't read the cudaMemcpyAsync documentation carefully. It states that only pinned memory may be passed to that function. However, an error should be thrown when pageable memory is passed; in this case, no error was reported.

As far as I am aware, cudaMemcpyAsync() falls back to the behavior of cudaMemcpy() when a pointer to non-pinned memory is passed, for historical reasons. While returning an error would be more desirable, to help detect the kind of situation you encountered, doing so could cause some older software to break.

When cudaMemcpyAsync() appears to behave in blocking fashion, the first thing to check is whether the host data is in fact pinned.
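If the data has to live in a buffer that was allocated with malloc (for example because it comes from other code), one option available since CUDA 4.0 is to pin the existing allocation after the fact with cudaHostRegister(). A minimal sketch, assuming `N` floats (note that on CUDA 4.0 the pointer and size may need to be page-aligned, a restriction relaxed in later releases):

```cuda
float *h_data = (float *)malloc(N * sizeof(float));

// Pin the existing pageable allocation so cudaMemcpyAsync can use DMA
// and overlap with work on other devices.
cudaError_t err = cudaHostRegister(h_data, N * sizeof(float), 0);
if (err == cudaSuccess) {
    // ... asynchronous copies to both GPUs now overlap as expected ...
    cudaHostUnregister(h_data);  // unpin before freeing
}
free(h_data);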