Full-duplex PCI data exchange: how do I achieve it?

Hi All,

I have a problem that many have probably run into, but I cannot find a good answer… please help.

I need to organize efficient bi-directional data transfer between the host and a discrete device. The data exchange is completely symmetrical in size, so I hope it can be performed concurrently in both directions.
I found in other topics that full-duplex PCI transfer can only be achieved using the “zero-copy” method, in which the kernel works through host pointers mapped into the device address space. The memory accesses must be coalesced.

How to achieve this, however, is not completely clear.

I proceed as follows:

buffer_in_host - “float” buffer on host for H2D transfer, address mapped to device space
buffer_in_dev - “float” buffer on device for H2D transfer
buffer_out_host - “float” buffer on host for D2H transfer, address mapped to device space
buffer_out_dev - “float” buffer on device for D2H transfer
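For completeness, the allocation of these four buffers looks roughly like this (a minimal sketch of the standard mapped/pinned setup; N and the _dptr names are mine, the size is illustrative):

// allow host memory to be mapped into the device address space;
// must be set before the CUDA context is created
cudaSetDeviceFlags(cudaDeviceMapHost);

size_t N = 3 * 1024 * 1024 / sizeof(float);   // ~3 MB per direction (illustrative)

float *buffer_in_host, *buffer_out_host;      // pinned, mapped host buffers
float *buffer_in_dev, *buffer_out_dev;        // ordinary device buffers

// pinned host memory, mapped into the device address space (zero-copy)
cudaHostAlloc((void**)&buffer_in_host,  N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void**)&buffer_out_host, N * sizeof(float), cudaHostAllocMapped);

cudaMalloc((void**)&buffer_in_dev,  N * sizeof(float));
cudaMalloc((void**)&buffer_out_dev, N * sizeof(float));

// device-side aliases of the mapped host buffers, passed to the kernel
float *buffer_in_host_dptr, *buffer_out_host_dptr;
cudaHostGetDevicePointer((void**)&buffer_in_host_dptr,  buffer_in_host,  0);
cudaHostGetDevicePointer((void**)&buffer_out_host_dptr, buffer_out_host, 0);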

I write the kernel as follows:

__global__ void exchange_data(float *buffer_in_host,
                              float *buffer_in_dev,
                              float *buffer_out_host,
                              float *buffer_out_dev,
                              int threadsize)
{
    // one element per thread; consecutive threads access consecutive addresses
    int num = blockIdx.x * threadsize + threadIdx.x;

    buffer_in_dev[num]   = buffer_in_host[num];   // H2D: read from mapped host memory
    buffer_out_host[num] = buffer_out_dev[num];   // D2H: write to mapped host memory
}

The threadsize is 128 threads per block, and the grid is large enough that about 3 MB of data is swapped in one kernel launch.
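The launch looks roughly like this (using the _dptr aliases from the allocation sketch above; the grid size assumes N is a multiple of 128):

int threadsize = 128;
int numblocks = N / threadsize;   // assumes N is a multiple of 128

exchange_data<<<numblocks, threadsize>>>(buffer_in_host_dptr, buffer_in_dev,
                                         buffer_out_host_dptr, buffer_out_dev,
                                         threadsize);
cudaThreadSynchronize();   // the kernel touches mapped host memory until it completes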

In my view, the kernel I wrote performs coalesced data access.
It works, but the performance is lower than what I get from two concurrent memcpy calls.

Did I achieve coalesced memory access?
Did I do anything wrong, or is there a trick I am missing?
If so, how can I fix my kernel?

There is one more concern. Ideally, my bi-directional data transfer should execute concurrently with another kernel. If I am using the zero-copy method, it seems I can no longer have a concurrent kernel running… Is that true?
If so, that means full duplex cannot be fully exploited.

Thank you very much

Look into using streams; they’ll let you overlap kernel execution and data transfer.

If you’re going back and forth between two data buffers, you can just use two streams. Heck, you could even do a little bit-hacking and create a counter (int32) that increments with every kernel launch, then AND it with 1 to get the stream ID (e.g. int streamId = (myCounter & 1)); see the sketch below.
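Something like this (untested sketch; h_buf, d_buf, myKernel, grid, block, size, and numIterations are placeholders for your own buffers and kernel):

// assume h_buf[2] (pinned via cudaHostAlloc) and d_buf[2] (device) are
// already allocated, and grid/block/size/numIterations are defined elsewhere
cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);

int myCounter = 0;
for (int i = 0; i < numIterations; ++i)
{
    int streamId = (myCounter & 1);   // alternates 0, 1, 0, 1, ...

    // async copies require pinned host memory
    cudaMemcpyAsync(d_buf[streamId], h_buf[streamId], size,
                    cudaMemcpyHostToDevice, streams[streamId]);
    myKernel<<<grid, block, 0, streams[streamId]>>>(d_buf[streamId]);
    cudaMemcpyAsync(h_buf[streamId], d_buf[streamId], size,
                    cudaMemcpyDeviceToHost, streams[streamId]);

    ++myCounter;
}
cudaThreadSynchronize();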

Thanks for the reply,

Please check out this topic:

http://forums.nvidia.com/index.php?showtop…rt=#entry552657

It states that memcpy is half-duplex only. Moreover, we know that only one memcpy operation can execute at any given time, regardless of stream. So it appears this approach does not work.

In fact, to check your suggestion, I tried setting up two streams for two memcpy operations (one H2D and one D2H), and it does not increase the performance. It actually performs slightly worse with two streams (probably due to the latency of the additional stream). The transfers in the two directions do not become full-duplex by using two streams.
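For the record, the test I ran was essentially this (both host buffers pinned via cudaHostAlloc, as required for async copies; size is the transfer size in bytes):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(buffer_in_dev,   buffer_in_host, size, cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync(buffer_out_host, buffer_out_dev, size, cudaMemcpyDeviceToHost, s1);

cudaThreadSynchronize();   // timed around this point; no faster than one stream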

I am still convinced that “zero-copy” may be the only way, if one exists, to get full-duplex data transfer together with concurrent kernel execution.

Fermi will have dual DMA engines and thus full-duplex data transfer.

You waste a lot of time on block setup, since each thread only moves a single element and each block therefore does very little work. Try using fewer blocks that each do several moves, e.g. with a grid-stride loop like the one below.
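A sketch of what I mean (same arguments as your kernel, but with the total element count n passed in instead of threadsize):

__global__ void exchange_data_strided(float *buffer_in_host,
                                      float *buffer_in_dev,
                                      float *buffer_out_host,
                                      float *buffer_out_dev,
                                      int n)
{
    // each thread copies several elements, striding by the total thread count;
    // consecutive threads still touch consecutive addresses, so accesses stay coalesced
    int stride = gridDim.x * blockDim.x;
    for (int num = blockIdx.x * blockDim.x + threadIdx.x; num < n; num += stride)
    {
        buffer_in_dev[num] = buffer_in_host[num];
        buffer_out_host[num] = buffer_out_dev[num];
    }
}

// launched with a fixed, modest grid, e.g.:
// exchange_data_strided<<<64, 128>>>(in_hptr, in_dptr, out_hptr, out_dptr, n);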