I expect everyone else has already stumbled into this, but…
the speed I measure with cudaThreadSynchronize(); cutStopTimer(hTimer);
over multiple cudaMemcpy calls averages 1299 million bytes/sec for cudaMemcpyHostToDevice
and 764 million bytes/sec for cudaMemcpyDeviceToHost, for
transfers from 4K to 9 Mbytes. This is less than half the figures reported
for my system by bandwidthTest --memory=pinned.
The NVIDIA CUDA Programming Guide (v2.3) suggests this is due to double
copying in the device driver and that I should be using page-locked memory
and/or zero copy. I have yet to try either…
Page-locked memory is also called pinned memory. bandwidthTest --memory=pinned measures bandwidth with page-locked memory, so your comparison is only valid if you also use page-locked memory in your own measurements.
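For a like-for-like check, something along these lines should do. It is only a rough sketch: it times the same host-to-device copy from a malloc buffer and from a cudaMallocHost buffer. I have used CUDA events rather than the cutil timer from the original post, and the 9 MB size, the repeat count and the names CHECK and h2dRate are my own choices, not anything taken from bandwidthTest.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Time `reps` host-to-device copies of `bytes` bytes with CUDA events
// and return the average rate in million bytes/sec.
static double h2dRate(void *dst, const void *src, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)); // warm-up, not timed
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (double)bytes * reps / (ms * 1e-3) / 1e6;
}

int main()
{
    const size_t bytes = 9 * 1024 * 1024; // assumed test size
    const int reps = 20;                  // assumed repeat count

    void *dev = NULL, *pinned = NULL;
    void *pageable = malloc(bytes);       // ordinary pageable host memory
    if (pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMallocHost(&pinned, bytes)); // page-locked (pinned) host memory
    memset(pageable, 1, bytes);            // touch the pages before timing
    memset(pinned, 1, bytes);

    printf("pageable H2D: %.0f million bytes/sec\n",
           h2dRate(dev, pageable, bytes, reps));
    printf("pinned   H2D: %.0f million bytes/sec\n",
           h2dRate(dev, pinned, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    free(pageable);
    return 0;
}
```

Compile with nvcc and compare the pinned line against what bandwidthTest --memory=pinned prints for the same direction.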
You got it right: cudaMallocHost will give you a page-locked host buffer.
The bandwidth will depend on the payload size.
If you transfer a lot of very small packets, you will get a lower bandwidth than doing the transfer all at once, because each copy carries a fixed overhead.
To get a better idea of the characteristics of your chipset, make a plot of bandwidth vs. payload size.
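A rough sketch of how the data for such a plot could be collected, assuming pinned host memory and CUDA event timing. The size range (4 KB to 16 MB in powers of two), the repeat count and the names CHECK and copyRate are my own picks for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Average rate in million bytes/sec for `reps` copies of `bytes` bytes
// in the direction given by `kind`.
static double copyRate(void *dst, const void *src, size_t bytes,
                       cudaMemcpyKind kind, int reps)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (double)bytes * reps / (ms * 1e-3) / 1e6;
}

int main()
{
    const size_t minBytes = 4 * 1024;         // assumed smallest payload
    const size_t maxBytes = 16 * 1024 * 1024; // assumed largest payload
    const int reps = 50;                      // assumed repeat count

    void *dev = NULL, *host = NULL;
    CHECK(cudaMalloc(&dev, maxBytes));
    CHECK(cudaMallocHost(&host, maxBytes));   // page-locked host buffer
    memset(host, 1, maxBytes);

    // One untimed copy so start-up costs stay out of the first row.
    CHECK(cudaMemcpy(dev, host, minBytes, cudaMemcpyHostToDevice));

    // CSV output: one row per payload size, ready for any plotting tool.
    printf("bytes,h2d_million_bytes_per_sec,d2h_million_bytes_per_sec\n");
    for (size_t bytes = minBytes; bytes <= maxBytes; bytes *= 2) {
        double h2d = copyRate(dev, host, bytes, cudaMemcpyHostToDevice, reps);
        double d2h = copyRate(host, dev, bytes, cudaMemcpyDeviceToHost, reps);
        printf("%lu,%.0f,%.0f\n", (unsigned long)bytes, h2d, d2h);
    }

    CHECK(cudaFreeHost(host));
    CHECK(cudaFree(dev));
    return 0;
}
```

Redirect the output to a file and plot the two rate columns against the first one, ideally with a log scale on the size axis.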