Device-to-device memory bandwidth (of aligned types): need CUDA source for a cudaMemcpy analog

Hi, I am getting 80 GB/s from the internal bandwidth test on an 8800 Ultra. Cool. This is what I get when I run the “bandwidthTest” example:

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 80124.4

HOWEVER

The alignedTypes program reports at best about 31 GB/s, for aligned I32 or RGBA8.
How can I achieve full bandwidth when I am really accessing the data, not just memcpy-ing it?

Testing aligned types…
RGBA8…
Avg. time: 1.607352 ms / Copy throughput: 28.970705 GB/s.
TEST PASSED
I32…
Avg. time: 1.467472 ms / Copy throughput: 31.732218 GB/s.
TEST PASSED
LA32…
Avg. time: 1.497365 ms / Copy throughput: 31.098713 GB/s.
TEST PASSED
RGB32…
Avg. time: 20.520443 ms / Copy throughput: 2.269256 GB/s.
TEST PASSED
RGBA32…
Avg. time: 1.744643 ms / Copy throughput: 26.690925 GB/s.
TEST PASSED
RGBA32_2…
Avg. time: 3.617115 ms / Copy throughput: 12.873833 GB/s.
TEST PASSED
Shutting down…

The alignedTypes.cu file has

const int totalMemSizeAligned = iAlignDown(MEM_SIZE, sizeof(TData));

...

printf(
    "Avg. time: %f ms / Copy throughput: %f GB/s.\n", gpuTime,
    (double)totalMemSizeAligned / (gpuTime * 0.001 * 1073741824.0)
);

Notice the lack of a *2 for the number of bytes. Thus, it is only counting the number of bytes copied, not the total read and written. Your actual memory bandwidth numbers are twice what you are seeing.
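To make the factor of two concrete, here is a minimal sketch in plain C of the same arithmetic as the printf above, next to a version that counts both the read and the write. The function names and the report helper are hypothetical and are not part of alignedTypes.cu; the only thing taken from the sample is the formula and its 2^30 bytes-per-GB divisor.

```c
#include <assert.h>
#include <stdio.h>

/* Bytes per "GB" as used by the printf quoted above (2^30). */
#define BYTES_PER_GB 1073741824.0

/* Throughput the way the sample prints it: bytes copied per second.
 * (Hypothetical helper name, same formula as the quoted printf.) */
double copy_throughput_gbs(double bytes_copied, double time_ms)
{
    assert(time_ms > 0.0);
    return bytes_copied / (time_ms * 0.001 * BYTES_PER_GB);
}

/* Memory traffic for a copy: each byte is read once and written once,
 * so the bus moves twice the number of bytes copied. */
double memory_traffic_gbs(double bytes_copied, double time_ms)
{
    return 2.0 * copy_throughput_gbs(bytes_copied, time_ms);
}

/* Hypothetical helper that prints both figures for one measurement. */
void report(const char *label, double bytes_copied, double time_ms)
{
    printf("%s: copy %f GB/s, read+write %f GB/s\n", label,
           copy_throughput_gbs(bytes_copied, time_ms),
           memory_traffic_gbs(bytes_copied, time_ms));
}
```

Applied to the I32 line above, the printed 31.732218 GB/s copy throughput corresponds to roughly 2 * 31.732218 = 63.46 GB/s of combined read and write traffic.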

Big thanks, I suspected that.

The other issue I have is that the device-to-device bandwidthTest on an Albatron 8600 GT returns only 7 GB/s.

What could be the reason for that? By overclocking to 900 MHz I was able to get at most 10 GB/s, but not the 20 that is advertised…

Any ideas?

And one more thing: 32*2 = 64, not 80… so where did the other 16 GB/s go?