I have two different kernels. The first just performs a copy. The second performs a copy plus a division. The bandwidth of the second kernel seems to be higher. How is that possible?
For the first kernel I got 58723 MB/s (the official figure is 57.6 GB/s). For the second kernel I got 80744 MB/s. Device: 8800GT.
This is a guess, since I’m not sure what cuda.Launch is, but: Are you launching with 1 thread per block? Did you mean to use BlockSize as the number of threads instead of 1?
Yeah, I realize that. But from the way you are calling the cuda.Launch function, it looks like you are passing 1 as the number of threads when you should be passing blockSize.
Well, the problem was with the formula I was using: size * sizeof(int) * Time / numIterations
It should be: 0.000001 * numIterations * size * sizeof(int) / Time
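As a rough sketch of the corrected formula in plain C++: the function name and the accessesPerElement parameter are hypothetical, and it assumes Time is measured in seconds, which the thread does not state. With accessesPerElement = 1 it is the formula above, giving MB/s with 1 MB = 10^6 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in MB/s (1 MB = 10^6 bytes).
// elements:           number of ints touched per kernel launch ("size" above)
// iterations:         number of timed kernel launches ("numIterations" above)
// totalSeconds:       total time for all launches, ASSUMED to be in seconds
// accessesPerElement: hypothetical knob; 1 reproduces the formula above,
//                     2 counts both the read and the write of a copy
double effectiveBandwidthMBps(std::size_t elements, int iterations,
                              double totalSeconds, int accessesPerElement = 1) {
    assert(totalSeconds > 0.0 && iterations > 0 && accessesPerElement > 0);
    const double bytes = static_cast<double>(elements) * sizeof(int)
                       * iterations * accessesPerElement;
    return 0.000001 * bytes / totalSeconds;
}
```

A plain copy reads every element once and writes it once, so counting both directions doubles the reported figure. Whether the numbers in this thread count one direction or both is not stated, so that may be worth checking before comparing against other tools.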
Now I get the following results: 60 GB/s for both kernels on a GTX 275.
Why is it so far from the theoretical figure (127 GB/s) or from what the bandwidth test shows (105000 MB/s)?
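For reference, theoretical figures like these are usually derived from the memory clock and the bus width. The sketch below is only an illustration: the function name is hypothetical, and the clock and bus-width values in the usage note are assumptions taken from the published card specifications, not from this thread.

```cpp
#include <cassert>

// Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes).
// memoryClockMHz:    base memory clock in MHz
// busWidthBits:      memory interface width in bits
// transfersPerClock: 2 for double-data-rate memory (assumed default)
double peakBandwidthGBps(double memoryClockMHz, int busWidthBits,
                         int transfersPerClock = 2) {
    assert(memoryClockMHz > 0.0 && busWidthBits > 0 && transfersPerClock > 0);
    const double bytesPerTransfer = busWidthBits / 8.0;
    return memoryClockMHz * 1e6 * transfersPerClock * bytesPerTransfer / 1e9;
}
```

Assuming 1134 MHz on a 448-bit bus for the GTX 275, peakBandwidthGBps(1134, 448) comes out at about 127 GB/s. Assuming 900 MHz on a 256-bit bus for the 8800GT, peakBandwidthGBps(900, 256) gives 57.6 GB/s. Both match the official numbers quoted in the thread. Measured bandwidth normally lands below this peak.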