my speedy Memcpy()

I played around with writing my own cudaMemcpy() and succeeded in improving on it a little. Note that for both my memcpy and the original cudaMemcpy, the scores here are much higher than in ./bandwidthTest, because the SDK sample doesn't run enough iterations.

Here is the usage of dubyBandwidthTest.exe:

Usage:

  dubyBandwidthTest <threads-per-block> <transfer size, in bytes> <repetitions>

E.g. for GTX260:

  dubyBandwidthTest 448 268435456 1000

Please specify the number of threads per block to run, as the first argument

Best performance is usually hit when dimBlock.x is a multiple of DDR bus width

If you have a 448 bit bus, try 448 or 224

If you have a 128 bit bus, try 128, 256, or 512
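If you don't know your card's bus width offhand, newer CUDA runtimes report it in cudaDeviceProp::memoryBusWidth. Here is a minimal sketch of turning that into a starting thread count (not part of the attached test; pickThreadsPerBlock is just an illustrative helper):

#include <cstdio>
#include <cuda_runtime.h>

// Rough heuristic only: start with one thread per bus bit, as suggested above.
int pickThreadsPerBlock(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int threads = prop.memoryBusWidth;      // bus width in bits, e.g. 448 on GTX260
    if (threads <= 0) return 256;           // fallback if the runtime doesn't report it
    while (threads < 128) threads *= 2;     // narrow-bus cards: try 128, 256, 512
    while (threads > 512) threads /= 2;     // stay within the 512-thread block limit of these GPUs
    return threads;
}

int main()
{
    printf("suggested threads per block: %d\n", pickThreadsPerBlock(0));
    return 0;
}

For a 448-bit GTX260 this gives 448; for a 128-bit 8600 GT it gives 128 (though 256 or 512 are also worth trying, as the usage text says).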

Here are my results on 8600GT / Vista x32 (theoretical is 25.6 GB/s):

> dubyBandwidthTest.exe 512 67108864 1000

GPU: GeForce 8600 GT

blocks per grid = 4

threads per block = 512

memcpy() size = 64.00 MB

repetitions = 1000

cudaMemcpy() Bandwidth = 18.4 GB/s

dubyMemcpy() Bandwidth = 22.1 GB/s

An improvement of 20%! This gap widens when an odd transfer size is requested.

> dubyBandwidthTest.exe 512 64000001 1000

GPU: GeForce 8600 GT

blocks per grid = 4

threads per block = 512

memcpy() size = 61.04 MB

repetitions = 1000

cudaMemcpy() Bandwidth = 16.3 GB/s

dubyMemcpy() Bandwidth = 21.9 GB/s

Here are the results for GTX260 / Vista x64 (theoretical is 111.8 GB/s):

> dubyBandwidthTest.exe 448 268435456 10000

GPU: GeForce GTX 260

blocks per grid = 24

threads per block = 448

memcpy() size = 264.00 MB

repetitions = 10000

cudaMemcpy() Bandwidth = 105.4 GB/s

dubyMemcpy() Bandwidth = 107.1 GB/s

The gap is much smaller.

> dubyBandwidthTest.exe 448 32000001 1000

GPU: GeForce GTX 260

blocks per grid = 24

threads per block = 448

memcpy() size = 30.52 MB

repetitions = 1000

cudaMemcpy() Bandwidth = 86.9 GB/s

dubyMemcpy() Bandwidth = 106.7 GB/s

But cudaMemcpy() falls behind when the transfer size is odd.

Note that on the GTX260, both memcpy()s get very close to the theoretical bandwidth. Up to 96% efficiency!
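As an aside on how the GB/s figures are computed: they must count each copied byte twice, once read and once written, the same convention the SDK bandwidthTest uses for device-to-device copies; otherwise the 8600 GT's 22.1 GB/s of copies would imply 44.2 GB/s of traffic on a 25.6 GB/s bus. A minimal timing sketch under that assumption (d2dBandwidthGBs is a made-up name, not the code in the .rar):

#include <cuda_runtime.h>

// Times `reps` device-to-device copies and reports GB/s, counting read + write.
float d2dBandwidthGBs(void *dst, const void *src, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up copy

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // 2x: every byte is read from device memory and written back to it
    return (float)(2.0 * bytes * reps / (ms * 1.0e6));        // GB/s
}

Timing the kernel version would just swap the cudaMemcpy() line inside the loop for the kernel launch.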
dubyBandwidthTest.rar (119 KB)

Alex:

Now that is COOL!

Why do you think it beats the built-in memcpy speed? Answering that is probably more a matter of speculating about how NV implements it.

In your CUDA code, what did your experiments show as the important features?
One key thing you’re doing is two int2 copies at once… perhaps that helps hide latency by keeping more than one transaction in flight per thread?
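Roughly this shape, I mean (a sketch from reading your description, not necessarily the exact code in the .rar):

__global__ void copy2xInt2(int2 *dst, const int2 *src, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t i      = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread issues both loads before either store, so two memory
    // transactions are outstanding per thread at a time.
    for (; i + stride < n; i += 2 * stride) {
        int2 a = src[i];
        int2 b = src[i + stride];
        dst[i]          = a;
        dst[i + stride] = b;
    }
    // (tail handling for sizes not divisible by 2*stride omitted)
}

With the second load issued before the first one returns, the thread isn’t stalled on a single outstanding request, which is the latency hiding I have in mind.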

It’d be nice to see a graph of transfer speed versus transfer size… for both this code and the native memcopy.
There may be some tweaking left to do for smaller transactions, especially when you drop down to partial copies (even the char copies).
A graph of the small-chunk speeds might show some of that effect… (or maybe not, which is why it’s interesting.)

I wonder if the CUDA launch overhead (~10us) would make native transfers faster for small copies.

How does the xfer speed vary with thread count?

Why do you think some “warmup” transfers are needed for best performance?

I wonder if the native memcopy call actually fires off a CUDA program to do device/device transfers… I think we all expected it to do something lower level, but this project suggests it probably doesn’t!

This is a cool project… even ignoring the use as a memcopy tool, it may be a good baseline for other memory bandwidth limited applications to see how they might pump data faster.

Warm-up transfers are needed because of the cache or, more probably, the TLB, I think. I’ve seen a similar phenomenon.

I bet you’re exactly right. The GPU has no memory cache, but there’s likely a hardware TLB.

Now if that’s true, then a SINGLE warmup memcpy would be enough to fill the TLB entries, repeated calls would not help.

Hmm, maybe it has to do with asynchronous queueing.

When writing this memory copy function, I tried to incorporate my knowledge of DRAM. DRAM is not a simple addressable memory like SRAM. It requires multiple commands, which do things like prepare regions of it for reading, specify a row, and issue a read. A read lasts for multiple clock cycles of the 8-byte bus (think: the transactions mentioned in the Guide). The issuing of commands can overlap data transfer if the command’s latency is brief enough (think: CAS latency and all those BIOS settings). Commands that need to be issued per read burst can be kept to a minimum if readdressing isn’t done. Finally, the DRAM is arranged as channels, or ‘partitions’, with the interleave factor being 256 bytes.
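In the simplest model of that interleave (the real memory controller may well do something more clever; byteAddress and numPartitions are just illustrative names), the channel an address lands in would be:

partition = (byteAddress / 256) % numPartitions;   // 256-byte interleave across channels

so 7 partitions on a 448-bit card and 2 on a 128-bit card, with consecutive 256-byte sections walking round-robin across them.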

The ideal strategy for copying DRAM, generally, is to read in as much of it as possible sequentially into a buffer, then write it back out. I tried to implement this strategy by assigning one multiprocessor to a memory channel, and organizing the threads to read from the channel sequentially into an smem buffer (with the assumption that warps are round-robinned in-order). This strategy failed. Maybe I’ll go back and reanalyze the code using decuda to see if it is inefficient at the assembly level, but the 25% efficiency was discouraging.
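For the record, that abandoned staging approach looked roughly like this (a reconstruction from the description, not the code in the .rar; CHUNK is an arbitrary illustrative size, and the explicit multiprocessor-to-channel mapping is left out):

#define CHUNK 4096   // bytes staged per block per iteration (illustrative)

__global__ void copyViaSmem(int4 *dst, const int4 *src, size_t nInt4)
{
    __shared__ int4 buf[CHUNK / sizeof(int4)];
    const size_t elems = CHUNK / sizeof(int4);

    for (size_t base = (size_t)blockIdx.x * elems;
         base + elems <= nInt4;
         base += (size_t)gridDim.x * elems)
    {
        // read a chunk sequentially into shared memory...
        for (size_t j = threadIdx.x; j < elems; j += blockDim.x)
            buf[j] = src[base + j];
        __syncthreads();

        // ...then write it back out
        for (size_t j = threadIdx.x; j < elems; j += blockDim.x)
            dst[base + j] = buf[j];
        __syncthreads();
    }
}

On paper this should keep each channel streaming reads and then streaming writes, but as mentioned above it only hit about 25% efficiency.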

The approach used in this code is different and rather simple. A multiprocessor issues transactions to all DRAM channels. On the GTX260, this is achieved with one instruction (an int2 load) executed over 224 threads. This takes place over 28 cycles and reads 256x7 = 1792 bytes. When both int2 loads are issued for the full 448 threads, exactly four sequential 256-byte sections are read from each DRAM channel over the course of 112 cycles. 512 threads do not run efficiently on the GTX260, since each MP would be issuing four reads to most channels, and six reads to one. During operation, 24 multiprocessors are doing this in parallel, each reading from a slightly different part of each DRAM channel.
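Spelling out that arithmetic (assuming the 448-bit bus is 7 x 64-bit partitions interleaved every 256 bytes):

224 threads x 8 B (int2)      = 1792 B = 7 partitions x 256 B       -> one section per partition per load instruction
448 threads x 2 loads x 8 B   = 7168 B = 7 partitions x 4 x 256 B   -> four sequential sections per partition
224 threads / 8 SPs per MP    = 28 issue cycles per load; 448 x 2 / 8 = 112 cycles for the whole pattern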

How this achieves 96% efficiency I do not know, nor why the efficiency falls with specific changes (like using 512 threads). Theories are welcome. Possibly the slightly random access pattern is not a problem for the DRAM, which can still do the partial readdressing simultaneously with the data transfer. I’ll have to go through the datasheets to check whether this is possible. But what bothers me is: if the transactions were so neat and atomic, why would it matter to have 448 threads instead of 512? Another possibility is that the memory controller does some magical reordering of transactions to issue them sequentially, and peculiarities of the reordering buffers give the observed behavior. In either case, doing a double read is better because doing a write involves a drastic readdressing. Yet why doing reads in groups of four or more actually hurts efficiency, I do not know.

Moreover, the mechanics of the observed behavior change between my 8600GT and GTX260. For example, a warm-up seems less necessary on the 8600 and it’s not so sensitive to changes in thread count. On the other hand, the 8600 cannot achieve >90% efficiency. More testing is needed, particularly on a card with more multiprocessors and channels. I honestly did not investigate deeply enough.

But this could be a sign of interesting architectural differences. (Perhaps it wasn’t just better coalescing that spruced up the G200’s memory controller and wasted untold transistors.)

my results:

GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 448
memcpy() size = 256.00 MB
repetitions = 1000

cudaMemcpy() Bandwidth = 82.4 GB/s
dubyMemcpy() Bandwidth = 88.1 GB/s

the memory of the FX 5800 is slow :) but you get 4 GB of it …

Btw, please also state your OS and driver.

Also, erdoom, what happens when you run with 512 threads? Your memory bus is actually 512 bits.

I’m running on an XP SP3 machine, driver version 180.48.

some more results:

C:\Documents and Settings\erir.OPTITEX.000>“C:\Documents and Settings\erir.OPTITEX.000\Desktop\dubyBandwidthTest.exe” 512 320000001 1000
GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 512
memcpy() size = 305.18 MB
repetitions = 1000

cudaMemcpy() Bandwidth = 83.6 GB/s
dubyMemcpy() Bandwidth = 88.1 GB/s

C:\Documents and Settings\erir.OPTITEX.000>“C:\Documents and Settings\erir.OPTITEX.000\Desktop\dubyBandwidthTest.exe” 512 3200001 1000
GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 512
memcpy() size = 3.05 MB
repetitions = 1000

cudaMemcpy() Bandwidth = 76.6 GB/s
dubyMemcpy() Bandwidth = 78.1 GB/s

C:\Documents and Settings\erir.OPTITEX.000>“C:\Documents and Settings\erir.OPTITEX.000\Desktop\dubyBandwidthTest.exe” 512 32000001 1000
GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 512
memcpy() size = 30.52 MB
repetitions = 1000

cudaMemcpy() Bandwidth = 83.0 GB/s
dubyMemcpy() Bandwidth = 89.3 GB/s

C:\Documents and Settings\erir.OPTITEX.000>“C:\Documents and Settings\erir.OPTITEX.000\Desktop\dubyBandwidthTest.exe” 512 82000001 1000
GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 512
memcpy() size = 78.20 MB
repetitions = 1000

cudaMemcpy() Bandwidth = 83.4 GB/s
dubyMemcpy() Bandwidth = 89.8 GB/s

C:\Documents and Settings\erir.OPTITEX.000>“C:\Documents and Settings\erir.OPTITEX.000\Desktop\dubyBandwidthTest.exe” 512 820000001 1000
GPU: Quadro FX 5800
blocks per grid = 30
threads per block = 512
memcpy() size = 782.01 MB
repetitions = 1000