How to achieve the highest possible global memory bandwidth?

I am using an 8800GT with a theoretical global memory bandwidth of 57.6 GB/s. However, with a 64-bit aligned access pattern, I could only achieve about 42 GB/s, roughly 73% of the theoretical peak. I’ve seen papers claiming over 80%, and a previous post claiming 93%. Where did I go wrong?

Is this result benchmarked from your app or from bandwidthTest.exe?

Benchmarked with cudaMemcpy().

My own app achieves almost the same

I remember when trying my hand at beating cudaMemcpy(), I had to call both my kernel and cudaMemcpy() many times (1000) with a sizeable transfer size (32 MB) before I could get optimal bandwidth.

Although my kernel was half a percent slower than cudaMemcpy(), I got both running much faster than what’s reported in bandwidthTest. I hit >90% efficiency.

I do not know what explains this behavior.

Can you post your code?

I updated it a bit too.

Thanks! I tried your code. Some questions:

1. Shouldn’t we use 1024 * 1024 * 1024 instead of 1.e9? Using 1.e9 alone gives you a 7% apparent speedup.

2. How effective is the warm-up run? You run 1/5 of the repetitions as warm-up, but I think one may be enough. Actually, if the repetition count is large enough, there is no need for a warm-up at all. However, that only holds if the warm-up benefit comes from the TLB.

3. What were the parameters of your 93% run? And your card’s specs?

Hmm, that’s a good catch. However, it seems the “theoretical” numbers are also off. E.g., for the GTX260, 2 GHz * 448/8 = 112 billion bytes per second, which is the figure I used instead of 104.3. (Apparently Wikipedia makes the same mistake.) So the efficiency calculation isn’t affected; it’s still 96%.

No, one repetition is not enough. It seems the more the better. The reason why is a big unanswered question.

I gave the exact arguments used.

Thanks for your clarification. I finally understand the difference between “GB” and “GiB”, and Wikipedia is surely wrong. Actually, 57.6 for the 8800GT is in GB/s, not GiB/s, so the utilization is now about 80%. However, I wasn’t able to achieve any higher bandwidth with your code. Using 512 threads per block and copying 128 MB 1000 times, cudaMemcpy() achieves 46.3 GB/s and dubyMemcpy() achieves 44.9 GB/s. The 8800GT has 512 MB of memory on a 256-bit bus. Maybe my parameters are wrong?

no, no, no, NO. GB and GiB do NOT mean different things; they BOTH mean 1073741824 bytes. Do NOT listen to the asswipes who think it’s worth throwing confusion on the whole field of computer science just for the sake of some anal consistency with completely unrelated units of measure. It was my mistake (and Wikipedia’s) to use decimal units.

You can try running different thread counts, like 256 or 384. But, there’s no fundamental reason my memcpy should be faster. It just was on the two systems I tested. Your datapoint is welcome. What OS/driver are you using?

I’m using Ubuntu 7, driver version 177.82. I had to modify your code a little to get it running (commenting out the QueryPerformance*** calls and timing with the runtime API instead).

Actually, I’d already done all the tricks you did in my own previous code: 64-bit alignment, hand-unrolling (I unrolled more times than you did, and even ignored the remainder part, since it’s not needed for the test), and a little warm-up (although only one run). They delivered equal utilization on my card. Or is it related to my card?
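For what it’s worth, the combination of tricks described in this thread (64-bit aligned accesses, hand-unrolling, and ignoring the remainder) could be sketched roughly as below. This is an illustrative CUDA kernel, not the actual code from either poster; the unroll factor and grid-stride pattern are assumptions:

```cuda
// Copy using int2 (8-byte) elements so every load/store is a 64-bit,
// naturally aligned access; consecutive threads touch consecutive
// 8-byte words, so the accesses coalesce.
__global__ void copy64(const int2 *src, int2 *dst, int n2 /* count of int2 */)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Hand-unrolled x4 grid-stride loop.
    for (; i + 3 * stride < n2; i += 4 * stride) {
        dst[i             ] = src[i             ];
        dst[i +     stride] = src[i +     stride];
        dst[i + 2 * stride] = src[i + 2 * stride];
        dst[i + 3 * stride] = src[i + 3 * stride];
    }
    // Remainder deliberately ignored -- fine for a bandwidth benchmark,
    // not for a real memcpy.
}
```

A warm-up launch before timing, and averaging over many repetitions of a large (tens of MB) transfer, matches the methodology discussed above.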