Optimize data transfer rate from host to device

Dear All

It’s my first time using CUDA. I’m getting a host-to-device transfer rate of about 5.5 GB/s using pinned memory.
The computation in my workload is very fast; the transfer accounts for more than 95% of the total time.
In this situation, streams cannot help me much, so I want to improve the transfer time itself.

Here is my environment :

===Hardware===
Video Card : GTX 1080
CPU : Intel(R)Xeon(R)CPU E5-1620 0 @ 3.6GHz
Ram : 96 GB (DDR3)
MotherBoard : X9SRA

===Software===
OS : Win7 x64
IDE : Visual Studio 2010 SP1
CUDA : 8.0
Driver : version 384.76

I get about 3.5 GB/s (pageable) and 6 GB/s (pinned) in the bandwidthTest sample code, and almost the same numbers
in CUDA-Z.

I have confirmed that my card is in a PCIe 3.0 x16 slot,
and there is only one video card on my motherboard.

I have also read some discussions that mentioned DDR4 memory,
so I ran the same test on a system with DDR4 memory, but it didn’t make a difference.

Here is my question:

What is a reasonable data transfer rate in my case?
I’m thinking anything above 10 GB/s would be fine; am I wrong?

I have read about the 16 MB issue in some discussions, but I don’t understand it.
Could someone provide a detailed explanation?

Did I miss some important setting like BIOS ?

Here is the (simplified) code from my test:

const int SzImg = 2000 * 2048;   // ~4 MB

// host: pinned, write-combined buffer
BYTE *pHostBuffer_PageLocked = NULL;
unsigned int flags = cudaHostAllocWriteCombined;
cudaHostAlloc((void **)&pHostBuffer_PageLocked, SzImg * sizeof(BYTE), flags);

// device buffer
BYTE *pDeviceBuffer = NULL;
cudaMalloc((void **)&pDeviceBuffer, SzImg * sizeof(BYTE));

// run
::QueryPerformanceCounter(&llStart_);

cudaMemcpy(pDeviceBuffer, pHostBuffer_PageLocked, SzImg * sizeof(BYTE), cudaMemcpyHostToDevice);

::QueryPerformanceCounter(&llEnd_);

// end: cost about 0.71 ms

#Modify 20170727 11:58 wrong log time
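For reference, here is a sketch of the same measurement using CUDA events instead of QueryPerformanceCounter; events bracket the copy on the GPU timeline and avoid host-timer and synchronization pitfalls. This is not the original poster's code, just a minimal standalone version of the same test (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t szImg = 2000 * 2048;   // ~4 MB, as in the test above

    // pinned, write-combined host buffer and matching device buffer
    unsigned char *hostBuf = NULL, *devBuf = NULL;
    cudaHostAlloc((void **)&hostBuf, szImg, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&devBuf, szImg);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time the host-to-device copy with events on the GPU timeline
    cudaEventRecord(start);
    cudaMemcpy(devBuf, hostBuf, szImg, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes in %.3f ms : %.2f GB/s\n",
           szImg, ms, szImg / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}
```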

Anything that may help me will be appreciated !

Best Regards,
David

You should be able to achieve transfer rates of 11-12 GB/sec on a PCIe gen3 x16 link with pinned host memory for large transfers (>= 16 MB).

The transfer rate from regular pageable memory will be lower. How much lower depends on the speed of the host’s system memory. Use a platform with as many DDR4 channels and as fast a speed grade of DDR4 as you can afford.

PCIe uses packetized transport and the overhead is higher the smaller the payload. This means you will observe very low effective throughput when performing small transfers and approach full performance as you get into transfer sizes in the MBytes. So you would want to configure traffic between host and device such that the transferred blocks are as large as feasible.
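To see this overhead effect directly, one can sweep the transfer size and watch throughput climb toward the link's peak. A sketch, assuming a pinned host buffer (error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t maxBytes = 256u << 20;   // sweep up to 256 MB
    unsigned char *h = NULL, *d = NULL;
    cudaHostAlloc((void **)&h, maxBytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d, maxBytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // double the size each step: small copies are dominated by per-transfer
    // overhead, large ones approach the link's sustainable peak
    for (size_t bytes = 4u << 10; bytes <= maxBytes; bytes <<= 1) {
        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%10zu bytes : %7.2f GB/s\n", bytes, bytes / (ms * 1.0e6));
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

On a healthy gen3 x16 link one would expect the printed rate to flatten out near 11-12 GB/s once sizes reach the tens of megabytes.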

Thanks for the kind reply. I just found a mistake in the log time: it should be 0.71 ms for transferring the 4 MB array,
so it’s close to 5.5 GB/s.

Following your suggestion, I tried uploading 128 MB at once; the rate went up slightly, to 6.2 GB/s!

I’ll get some DDR4 memory and run the test again.

If these rates are for transfers from/to pinned host memory, it would seem the PCIe link is not properly configured. Either it is operating in x8 mode, or with PCIe gen 2 x16 performance.

Is the GPU plugged into the correct slot? Do you have multiple PCIe devices that are forced to share PCIe lanes (shouldn’t be the case, as your CPU provides 40 PCIe lanes)? Check the system BIOS settings (if any) related to PCIe. Also, make sure that the GPU is properly seated in the PCIe slot, and its bracket securely fastened to the case (eliminating issues with mechanical stress on the connector and/or vibrations).
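One way to verify the negotiated link from software is via NVML; a sketch (link against -lnvidia-ml; device index 0 is an assumption for a single-GPU system):

```c
// Query the current PCIe link generation and width via NVML.
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int gen = 0, width = 0;
    nvmlDevice_t dev;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Note: many GPUs drop the link to gen1 at idle to save power,
    // so query while a transfer or kernel is keeping the GPU busy.
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
    printf("PCIe link: gen%u x%u\n", gen, width);

    nvmlShutdown();
    return 0;
}
```

The same information is available from the command line via `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv`. A gen3 x16 reading under load, combined with the ~6 GB/s measurement, would point elsewhere; a gen2 or x8 reading would confirm the link configuration theory above.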