cudaMemcpy half bandwidthTest --memory=pinned ftfm

I expect everyone else has already stumbled into this, but…
the speed I measure (with cudaThreadSynchronize(); cutStopTimer(hTimer);) over multiple cudaMemcpy calls
averages 1299 million bytes/sec for cudaMemcpyHostToDevice
and 764 million bytes/sec for cudaMemcpyDeviceToHost,
for transfers from 4 KB to 9 MB. This is less than half the numbers reported
for my system by bandwidthTest --memory=pinned
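
For reference, a minimal sketch of this kind of measurement, using CUDA events instead of the cutil timers above; the 8 MB payload and the buffer names are just placeholders, not my actual test code:

[codebox]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t N_BYTES = 8 * 1024 * 1024;   // example payload, 8 MB

    unsigned char *h_buf = (unsigned char*)malloc(N_BYTES);  // pageable host buffer
    unsigned char *d_buf;
    cudaMalloc((void**)&d_buf, N_BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, N_BYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for the copy to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("H2D: %e bytes/sec\n", N_BYTES / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
[/codebox]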

The NVIDIA CUDA Programming Guide (v2.3) suggests this is due to double
copying in the device driver, and that I should be using page-locked memory
and/or zero copy. I have yet to try either…
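
For the zero-copy route the guide describes, something along these lines should work (an untested sketch; N and the buffer names are placeholders, and the device must report canMapHostMemory):

[codebox]
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                     // example element count

    float *h_data, *d_data;                       // host pointer and mapped device view

    cudaSetDeviceFlags(cudaDeviceMapHost);        // must be set before the context is created
    cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // d_data can now be passed straight to a kernel; no explicit cudaMemcpy is needed,
    // the GPU reads/writes the pinned host buffer over PCIe on demand.

    cudaFreeHost(h_data);
    return 0;
}
[/codebox]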

Comments welcome

Bill


There does not seem to be a need to tell cudaMemcpy explicitly that the host buffer is pinned.

Creating it with cudaMallocHost is sufficient.

[codebox] //based on bandwidthTest.cu

unsigned int* AA;

// allocate a page-locked (pinned) host buffer
cutilSafeCall( cudaMallocHost( (void**)&AA, BWNLw * sizeof(unsigned int) ) );

// an ordinary cudaMemcpy; the driver detects that AA is pinned
cutilSafeCall( cudaMemcpy( d_A, AA, NLwsize * sizeof(unsigned int), cudaMemcpyHostToDevice ) );

[/codebox]

Typical uplink speed to the GTX 295 is now 2.38779e+09 bytes/sec,

and back to the CentOS host 1.93928e+09 bytes/sec.


Page-locked memory is also called pinned memory. bandwidthTest --memory=pinned measures bandwidth with page-locked memory, so your comparison is only valid if you also use page-locked memory in your measurements.


Have I still got it wrong? I thought using cudaMallocHost did mean that the host buffer was indeed page-locked.

Bill


You got it right: using cudaMallocHost will give you a page-locked host buffer.
The bandwidth will depend on the payload size.
If you transfer a lot of very small packets, you will get a lower bandwidth than doing the transfer all at once.
To get a better idea of the characteristics of your chipset, make a plot that shows the bandwidth vs. payload size.
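
Something like this would generate size/bandwidth pairs to plot (an untested sketch; the 4 KB to 16 MB range and the buffer names are arbitrary):

[codebox]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t MAX_BYTES = 16u << 20;           // 16 MB upper bound (arbitrary)

    unsigned char *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, MAX_BYTES);    // page-locked host buffer
    cudaMalloc((void**)&d_buf, MAX_BYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // double the payload each step and record host-to-device bandwidth
    for (size_t bytes = 4096; bytes <= MAX_BYTES; bytes *= 2) {
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%zu %e\n", bytes, bytes / (ms / 1000.0));  // payload size, bytes/sec
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
[/codebox]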
