cudaMemcpy half bandwidthTest --memory=pinned ftfm

I expect everyone else has already stumbled into this, but…
the speed I measure (with cudaThreadSynchronize(); cutStopTimer(hTimer);) over multiple cudaMemcpy calls
averages 1299 million bytes/sec for cudaMemcpyHostToDevice
and 764 million bytes/sec for cudaMemcpyDeviceToHost,
for transfers from 4 KB to 9 MB. This is less than half the numbers reported
for my system by bandwidthTest --memory=pinned
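
For reference, a minimal sketch of this kind of measurement, using CUDA events instead of the cutil timers above; the 8 MB payload and the buffer names are just placeholders, not my actual test code:

[codebox]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t N_BYTES = 8 * 1024 * 1024;   // example payload, 8 MB

    unsigned char *h_buf = (unsigned char*)malloc(N_BYTES);  // pageable host buffer
    unsigned char *d_buf;
    cudaMalloc((void**)&d_buf, N_BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, N_BYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for the copy to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("H2D: %e bytes/sec\n", N_BYTES / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
[/codebox]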

The NVIDIA CUDA Programming Guide (v2.3) suggests this is due to double
copying in the device driver, and that I should be using page-locked memory
and/or zero copy. I have yet to try either…
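
For the zero-copy route the guide describes, something along these lines should work (an untested sketch; N and the buffer names are placeholders, and the device must report canMapHostMemory):

[codebox]
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                     // example element count

    float *h_data, *d_data;                       // host pointer and mapped device view

    cudaSetDeviceFlags(cudaDeviceMapHost);        // must be set before the context is created
    cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // d_data can now be passed straight to a kernel; no explicit cudaMemcpy is needed,
    // the GPU reads/writes the pinned host buffer over PCIe on demand.

    cudaFreeHost(h_data);
    return 0;
}
[/codebox]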

Comments welcome

Bill


There does not seem to be a need to tell cudaMemcpy explicitly that the host buffer is pinned.

Creating it with cudaMallocHost is sufficient.

[codebox] //based on bandwidthTest.cu

unsigned int* AA;

// allocate a page-locked (pinned) host buffer
cutilSafeCall( cudaMallocHost( (void**)&AA, BWNLw * sizeof(unsigned int) ) );

// an ordinary cudaMemcpy; the driver detects that AA is pinned
cutilSafeCall( cudaMemcpy( d_A, AA, NLwsize * sizeof(unsigned int), cudaMemcpyHostToDevice ) );

[/codebox]

Typical uplink speed to the GTX 295 is now 2.38779e+09 bytes/sec,

and back to the CentOS host 1.93928e+09 bytes/sec.


Page-locked memory is also called pinned memory. bandwidthTest --memory=pinned measures bandwidth with page-locked memory, so your comparison is only valid if you also use page-locked memory in your measurements.


Have I still got it wrong? I thought using cudaMallocHost did mean that the host buffer was indeed page-locked.

Bill


You got it right: using cudaMallocHost will give you a page-locked host buffer.
The bandwidth will depend on the payload size.
If you transfer a lot of very small packets, you will get a lower bandwidth than doing the transfer all at once.
To get a better idea of the characteristics of your chipset, make a plot that shows the bandwidth vs. payload size.
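
Something like this would generate size/bandwidth pairs to plot (an untested sketch; the 4 KB to 16 MB range and the buffer names are arbitrary):

[codebox]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t MAX_BYTES = 16u << 20;           // 16 MB upper bound (arbitrary)

    unsigned char *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, MAX_BYTES);    // page-locked host buffer
    cudaMalloc((void**)&d_buf, MAX_BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // double the payload each step and record host-to-device bandwidth
    for (size_t bytes = 4096; bytes <= MAX_BYTES; bytes *= 2) {
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%zu %e\n", bytes, bytes / (ms / 1000.0));  // payload size, bytes/sec
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
[/codebox]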
