Just guessing, but maybe you can read and write at the same time, so you should actually be multiplying by 3, not 4. That would still give you 90 GB/s, though (which I thought was more or less what CUDA cards generally achieve).
I’ve tried the bandwidth test (in the CUDA SDK projects), and I get 46 GB/s device-to-device bandwidth. That’s what I’m comparing against. By the way, this is quite far from the theoretical 67.2 GB/s… :S
Yeah, reaching theoretical bandwidth is rarely, if ever, possible. From what I’ve gathered, about 2/3 of theoretical is normal. I get around 48 GB/s. The differences can come from motherboard, driver, and other hardware issues. I presume there are other reasons as well, but these are the common ones.
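To put the "about 2/3" rule of thumb next to the numbers quoted above, here is a minimal sketch in Python. It only uses the 46 GB/s and 67.2 GB/s figures from this thread; the function name is made up for illustration.

```python
def bandwidth_efficiency(measured_gb_s, theoretical_gb_s):
    """Fraction of the theoretical peak that was actually measured."""
    return measured_gb_s / theoretical_gb_s

# Figures quoted in this thread: SDK bandwidth test vs. theoretical peak.
sdk_fraction = bandwidth_efficiency(46.0, 67.2)

print(f"SDK test reaches {sdk_fraction:.1%} of theoretical")
print(f"Rule of thumb (2/3) would be {2 / 3:.1%}")
```

So the 46 GB/s result sits very close to the 2/3 mark.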
I have tried it on a friend’s computer. He’s got an 8800 GTX and he gets 80% of its theoretical bandwidth with the SDK bandwidth example. Anyway, as you say, it may be coming from somewhere else…
You are right, I made a mistake…
The theoretical bandwidth is 67.2 GB/s. Getting 56.3 GB/s is quite realistic!
The last thing I don’t get is why I get a higher result with this kernel than with the SDK bandwidth test?
Anyway… Sorry for my mistake, and thanks a lot for your help, guys =)
Yep, the original 8800 GTX consistently gets 70 GiB/s bandwidth in kernels like this and the peak is 86 GiB/s.
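For anyone wondering where a peak figure like that comes from, here is a rough sketch in Python of the usual memory-clock times bus-width calculation. The 900 MHz memory clock and 384-bit bus are the commonly listed 8800 GTX specs and are my assumption, not something stated in this thread. The function name is hypothetical, and the result is in decimal GB/s.

```python
def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 assumes double-data-rate memory.
    """
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

# Assumed 8800 GTX specs (not taken from this thread): 900 MHz, 384-bit.
peak = peak_bandwidth_gb_s(900, 384)

# Treating the quoted 70 figure as decimal GB/s, which is also an assumption.
fraction = 70 / peak

print(f"peak = {peak:.1f} GB/s, kernel reaches {fraction:.0%} of it")
```

Under those assumptions the peak comes out at about 86.4 GB/s, and 70 is roughly 81% of it.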
The SDK bandwidth test benchmarks a device-to-device cudaMemcpy, which is a little different from running a kernel, so the two results don’t have to be the same.
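One more thing worth checking when comparing two bandwidth numbers is how the bytes were counted. A copy both reads and writes every byte, so the reported figure doubles if you count both directions. The sketch below (Python, hypothetical helper name, made-up buffer size and timing) just shows that arithmetic. It does not claim which convention the SDK test or the kernel above uses.

```python
def effective_bandwidth_gb_s(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return (bytes_read + bytes_written) / seconds / 1e9

# Hypothetical values for illustration only, not measurements.
buffer_bytes = 64 * 1024 * 1024   # one 64 MiB buffer copied once
elapsed_s = 2.5e-3                # made-up elapsed time

# Counting both the read and the write of every byte.
both_directions = effective_bandwidth_gb_s(buffer_bytes, buffer_bytes, elapsed_s)

# Counting only the bytes written (i.e. the size of the copy).
one_direction = effective_bandwidth_gb_s(0, buffer_bytes, elapsed_s)

print(f"read + write: {both_directions:.1f} GB/s")
print(f"write only:   {one_direction:.1f} GB/s")
```

If two benchmarks disagree by something close to a factor of 2, the counting convention is the first thing to look at. Smaller gaps, like the one in this thread, are more likely down to the copy and the kernel simply being different code paths.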