Hi All,
I have a problem that many have probably met, but I cannot find a good answer… please help.
I need to organize efficient bi-directional data transfer between host and device (a discrete GPU). The exchange is completely symmetrical in size, so I hope it can be performed concurrently in both directions.
I found in other topics that full-duplex PCIe transfer can only be achieved with the “zero-copy” method, i.e. the kernel dereferences host pointers mapped into the device address space. The memory accesses must be coalesced.
How to achieve this, however, is not completely clear.
I proceed as follows:
buffer_in_host - “float” buffer on host for H2D transfer, address mapped to device
buffer_in_dev - “float” buffer on device for H2D transfer
buffer_out_host - “float” buffer on host for D2H transfer, address mapped to device
buffer_out_dev - “float” buffer on device for D2H transfer
I write the kernel as follows:
__global__ void exchange_data(float *buffer_in_host,
                              float *buffer_in_dev,
                              float *buffer_out_host,
                              float *buffer_out_dev,
                              int threadsize)
{
    // one element per thread: do the H2D and D2H copy in the same kernel
    int num = blockIdx.x * threadsize + threadIdx.x;
    buffer_in_dev[num] = buffer_in_host[num];    // read from mapped host memory
    buffer_out_host[num] = buffer_out_dev[num];  // write to mapped host memory
}
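For reference, here is a minimal sketch of how I understand the mapped (zero-copy) buffers must be set up before launching this kernel; N and the missing error checks are placeholders, so this is an assumption of the setup rather than my exact code:

```cuda
// Sketch: allocate mapped host buffers, get their device-side pointers,
// and launch exchange_data. N and threadsize are illustrative values.
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;       // floats per direction (placeholder)
    const int threadsize = 128;  // threads per block

    // must be set before the first mapped allocation
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *buffer_in_host, *buffer_out_host;  // host pointers
    float *buffer_in_hmap, *buffer_out_hmap;  // device views of the host memory
    float *buffer_in_dev,  *buffer_out_dev;   // device memory

    cudaHostAlloc(&buffer_in_host,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&buffer_out_host, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&buffer_in_hmap,  buffer_in_host,  0);
    cudaHostGetDevicePointer(&buffer_out_hmap, buffer_out_host, 0);
    cudaMalloc(&buffer_in_dev,  N * sizeof(float));
    cudaMalloc(&buffer_out_dev, N * sizeof(float));

    exchange_data<<<N / threadsize, threadsize>>>(buffer_in_hmap, buffer_in_dev,
                                                  buffer_out_hmap, buffer_out_dev,
                                                  threadsize);
    // mapped host memory is only safe to read after the kernel completes
    cudaDeviceSynchronize();
    return 0;
}
```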
The threadsize (threads per block) is 128, and the grid is sufficiently large (about 3 MB of data is swapped in one kernel launch).
In my view, this kernel performs coalesced memory accesses.
It works, but performance is lower than what I achieve with 2 concurrent memcpy calls.
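The 2-concurrent-memcpy baseline I compare against looks roughly like this (a sketch, assuming pinned host memory and the same buffer names as above):

```cuda
// Sketch of the baseline: with pinned host memory, an H2D and a D2H copy
// issued in different streams can overlap on GPUs with two copy engines.
cudaStream_t s_h2d, s_d2h;
cudaStreamCreate(&s_h2d);
cudaStreamCreate(&s_d2h);

cudaMemcpyAsync(buffer_in_dev, buffer_in_host, N * sizeof(float),
                cudaMemcpyHostToDevice, s_h2d);
cudaMemcpyAsync(buffer_out_host, buffer_out_dev, N * sizeof(float),
                cudaMemcpyDeviceToHost, s_d2h);

cudaStreamSynchronize(s_h2d);
cudaStreamSynchronize(s_d2h);
```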
Did I achieve coalesced memory access?
Did I do anything wrong, or is there a trick that I am missing?
If so, how can I fix my kernel?
There is one more concern. Ideally, my bi-directional data transfer should execute concurrently with another kernel. If I am using the zero-copy method, it seems I can no longer have a concurrent kernel running… Is that true?
If so, that means full-duplex transfer cannot be fully exploited.
Thank you very much