Optimize data transfer rate from host to device

Dear All

It’s my first time using CUDA. I’m getting a host-to-device transfer rate of about 5.5 GB/s using pinned memory.
The computation in my workload is very fast; the transfer accounts for more than 95% of the total time.
In this situation, streams cannot help me much, so I want to improve the transfer time itself.

Here is my environment :

===Hardware===
Video Card : GTX 1080
CPU : Intel(R)Xeon(R)CPU E5-1620 0 @ 3.6GHz
Ram : 96 GB (DDR3)
MotherBoard : X9SRA

===Software===
OS : Win7 x64
IDE : Visual Studio 2010 SP1
CUDA : 8.0
Driver : version 384.76

I get about 3.5 GB/s (pageable) and 6 GB/s (pinned) in the bandwidthTest sample code, and almost the same numbers
in CUDA-Z.

I have confirmed that my card is in a PCIe 3.0 x16 slot,
and there is only one video card on my motherboard.

I have also read some discussions that mentioned DDR4 memory,
so I ran the same test on a system with DDR4 memory, but it didn’t make a difference.

Here is my question:

What is a reasonable data transfer rate in my case?
I’m thinking anything above 10 GB/s would be fine; am I wrong?

I have read about the 16 MB issue in some discussions, but I don’t understand it.
Could someone provide a detailed explanation?

Did I miss some important setting like BIOS ?

Here is the (simplified) code from my test:

const int SzImg = 2000 * 2048;   // ~4 MB

// host: pinned, write-combined buffer
BYTE *pHostBuffer_PageLocked = NULL;
unsigned int flags = cudaHostAllocWriteCombined;
cudaHostAlloc((void **)&pHostBuffer_PageLocked, SzImg * sizeof(BYTE), flags);

// device buffer
BYTE *pDeviceBuffer = NULL;
cudaMalloc((void **)&pDeviceBuffer, SzImg * sizeof(BYTE));

// run
::QueryPerformanceCounter(&llStart_);

cudaMemcpy(pDeviceBuffer, pHostBuffer_PageLocked, SzImg * sizeof(BYTE), cudaMemcpyHostToDevice);

::QueryPerformanceCounter(&llEnd_);

// end: cost about 0.71 ms

#Modify 20170727 11:58 wrong log time
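For reference, here is a sketch of the same measurement using CUDA events instead of QueryPerformanceCounter; events bracket the copy on the GPU timeline and avoid host-timer and synchronization pitfalls. This is not the original poster's code, just a minimal standalone version of the same test (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t szImg = 2000 * 2048;   // ~4 MB, as in the test above

    // pinned, write-combined host buffer and matching device buffer
    unsigned char *hostBuf = NULL, *devBuf = NULL;
    cudaHostAlloc((void **)&hostBuf, szImg, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&devBuf, szImg);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time the host-to-device copy with events on the GPU timeline
    cudaEventRecord(start);
    cudaMemcpy(devBuf, hostBuf, szImg, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes in %.3f ms : %.2f GB/s\n",
           szImg, ms, szImg / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}
```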

Anything that may help me will be appreciated !

Best Regards,
David

You should be able to achieve transfer rates of 11-12 GB/sec on a PCIe gen3 x16 link with pinned host memory for large transfers (>= 16 MB).

The transfer rate from regular pageable memory will be lower. How much lower depends on the speed of the host’s system memory. Use a platform with as many DDR4 channels and as fast a speed grade of DDR4 as you can afford.

PCIe uses packetized transport and the overhead is higher the smaller the payload. This means you will observe very low effective throughput when performing small transfers and approach full performance as you get into transfer sizes in the MBytes. So you would want to configure traffic between host and device such that the transferred blocks are as large as feasible.
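To see this overhead effect directly, one can sweep the transfer size and watch throughput climb toward the link's peak. A sketch, assuming a pinned host buffer (error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t maxBytes = 256u << 20;   // sweep up to 256 MB
    unsigned char *h = NULL, *d = NULL;
    cudaHostAlloc((void **)&h, maxBytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d, maxBytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // double the size each step: small copies are dominated by per-transfer
    // overhead, large ones approach the link's sustainable peak
    for (size_t bytes = 4u << 10; bytes <= maxBytes; bytes <<= 1) {
        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%10zu bytes : %7.2f GB/s\n", bytes, bytes / (ms * 1.0e6));
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

On a healthy gen3 x16 link one would expect the printed rate to flatten out near 11-12 GB/s once sizes reach the tens of megabytes.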

Thanks for the kind reply. I just found a mistake in the log time: it should be 0.71 ms for transferring the 4 MB array,
so it’s close to 5.5 GB/s.

Following your suggestion, I tried uploading 128 MB at once; the rate went up slightly, to 6.2 GB/s!

I’ll get some DDR4 memory and run the test again.

If these rates are for transfers from/to pinned host memory, it would seem the PCIe link is not properly configured. Either it is operating in x8 mode, or with PCIe gen 2 x16 performance.

Is the GPU plugged into the correct slot? Do you have multiple PCIe devices that are forced to share PCIe lanes (shouldn’t be the case, as your CPU provides 40 PCIe lanes)? Check the system BIOS settings (if any) related to PCIe. Also, make sure that the GPU is properly seated in the PCIe slot, and its bracket securely fastened to the case (eliminating issues with mechanical stress on the connector and/or vibrations).
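One way to verify the negotiated link from software is via NVML; a sketch (link against -lnvidia-ml; device index 0 is an assumption for a single-GPU system):

```c
// Query the current PCIe link generation and width via NVML.
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int gen = 0, width = 0;
    nvmlDevice_t dev;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Note: many GPUs drop the link to gen1 at idle to save power,
    // so query while a transfer or kernel is keeping the GPU busy.
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
    printf("PCIe link: gen%u x%u\n", gen, width);

    nvmlShutdown();
    return 0;
}
```

The same information is available from the command line via `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv`. A gen3 x16 reading under load, combined with the ~6 GB/s measurement, would point elsewhere; a gen2 or x8 reading would confirm the link configuration theory above.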