I played around with writing my own device-to-device cudaMemcpy, and managed to improve on it a little. Note that for both my memcpy and the original cudaMemcpy, the scores are much higher than what ./bandwidthTest reports, because the SDK sample doesn't run enough iterations.
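The full source is in the attached archive; what follows is only a minimal sketch of the general approach, with identifiers of my own choosing, not the actual dubyMemcpy code. The core idea is a grid-stride kernel that copies in wide, coalesced 16-byte chunks:

    // Sketch of a custom device-to-device copy kernel (illustrative,
    // not the shipped dubyMemcpy source). Each thread strides through
    // the buffer in 16-byte float4 chunks, keeping accesses coalesced.
    __global__ void copyKernel(float4 *dst, const float4 *src, size_t nWords)
    {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < nWords; i += stride)
            dst[i] = src[i];
    }

Here nWords is the transfer size in float4 units, i.e. bytes / 16; any leftover tail bytes would be handled separately (see the odd-size discussion further down).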
Here is the usage for dubyBandwidthTest.exe:
Usage:
dubyBandwidthTest <threads-per-block> <transfer size, in bytes> <repetitions>
E.g. for GTX260:
dubyBandwidthTest 448 268435456 1000
Please specify the number of threads per block as the first argument.
Best performance is usually achieved when dimBlock.x is a multiple of the DDR bus width:
If you have a 448 bit bus, try 448 or 224
If you have a 128 bit bus, try 128, 256, or 512
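Incidentally, the "blocks per grid" values reported below (4 on the 8600GT, 24 on the GTX260) match those GPUs' multiprocessor counts, so the launch configuration presumably looks something like the following sketch (my guess at the logic, not the shipped code):

    // Hedged guess at the launch configuration: one block per SM,
    // with the thread count taken from the first command-line argument.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = atoi(argv[1]);            // e.g. 448 for a 448-bit bus
    int blocksPerGrid   = prop.multiProcessorCount; // 4 on 8600GT, 24 on GTX260

    copyKernel<<<blocksPerGrid, threadsPerBlock>>>(dst, src, nWords);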
Here are my results on an 8600GT / 32-bit Vista (theoretical memory bandwidth is 25.6 GB/s):
> dubyBandwidthTest.exe 512 67108864 1000
GPU: GeForce 8600 GT
blocks per grid = 4
threads per block = 512
memcpy() size = 64.00 MB
repetitions = 1000
cudaMemcpy() Bandwidth = 18.4 GB/s
dubyMemcpy() Bandwidth = 22.1 GB/s
An improvement of 20%! This gap widens when an odd transfer size is requested.
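For reference, here is roughly how a benchmark like this converts repetitions into GB/s. A device-to-device copy reads and writes every byte, so the conventional figure counts each byte twice. This is a sketch of the usual CUDA-event timing method, not necessarily dubyBandwidthTest's exact code:

    // Time `reps` device-to-device copies with CUDA events and report
    // effective bandwidth; size, reps, dst, and src are assumed set up.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total milliseconds for all reps

    // Factor of 2: each byte is read once and written once.
    double gbps = 2.0 * (double)size * reps / (ms / 1000.0) / 1e9;
    printf("Bandwidth = %.1f GB/s\n", gbps);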
> dubyBandwidthTest.exe 512 64000001 1000
GPU: GeForce 8600 GT
blocks per grid = 4
threads per block = 512
memcpy() size = 61.04 MB
repetitions = 1000
cudaMemcpy() Bandwidth = 16.3 GB/s
dubyMemcpy() Bandwidth = 21.9 GB/s
Here are the results for a GTX260 / 64-bit Vista (theoretical memory bandwidth is 111.8 GB/s):
> dubyBandwidthTest.exe 448 268435456 10000
GPU: GeForce GTX 260
blocks per grid = 24
threads per block = 448
memcpy() size = 256.00 MB
repetitions = 10000
cudaMemcpy() Bandwidth = 105.4 GB/s
dubyMemcpy() Bandwidth = 107.1 GB/s
The gap is much smaller.
> dubyBandwidthTest.exe 448 32000001 1000
GPU: GeForce GTX 260
blocks per grid = 24
threads per block = 448
memcpy() size = 30.52 MB
repetitions = 1000
cudaMemcpy() Bandwidth = 86.9 GB/s
dubyMemcpy() Bandwidth = 106.7 GB/s
But cudaMemcpy() falls behind when the transfer size is odd.
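One plausible explanation: a copy kernel can move the bulk of the buffer as wide, aligned words and mop up the few leftover bytes separately, while a general-purpose cudaMemcpy may fall back to a slower path for awkward sizes. A sketch of that tail-handling idea (again my illustration, not the attached source):

    // Copy `size` bytes: the bulk as aligned 16-byte float4 words,
    // then the leftover tail (size % 16 bytes) one byte at a time.
    // Assumes dst and src are 16-byte aligned, which cudaMalloc guarantees.
    __global__ void copyOddKernel(char *dst, const char *src, size_t size)
    {
        size_t nWords = size / sizeof(float4);
        size_t stride = (size_t)gridDim.x * blockDim.x;
        size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

        float4       *d = (float4 *)dst;
        const float4 *s = (const float4 *)src;
        for (size_t i = tid; i < nWords; i += stride)
            d[i] = s[i];

        // At most 15 tail bytes remain; a few threads pick them up.
        for (size_t i = nWords * sizeof(float4) + tid; i < size; i += stride)
            dst[i] = src[i];
    }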
Note that on the GTX260, both memcpys come very close to theoretical bandwidth, at up to 96% efficiency!
dubyBandwidthTest.rar (119 KB)