Hi all,
I am running a custom N-body kernel on a Tesla C1060 card.

I am seeing some strange behavior and cannot understand why.

If the number of bodies is 8192, performance is 393 GFLOPS with a block size of 64 threads (I am faster than NVIDIA's implementation; I use the same memory model as in the GPGPU paper). But if I run the kernel for 7680 bodies, performance increases to 470 GFLOPS (intuitively, it should decrease).

This is confusing because the number of threads per block is the same (64) for both 8192 and 7680 bodies. If anything, I would expect a performance decrease, not an increase, when I reduce the number of bodies.

The same behavior shows up for other 2^n body counts, as below.

Number of bodies:
4096 bodies: 290 GFLOPS
3840 bodies: 370 GFLOPS

7680 bodies: 470 GFLOPS
8192 bodies: 393 GFLOPS

15360 bodies: 485 GFLOPS
16384 bodies: 390 GFLOPS

etc…

The number of threads per block is the same for each pair above, and is chosen so that the total number of blocks is > 100.

I noticed a pattern in the above: 3840 is 256 less than 4096, 7680 is 512 (2×256) less than 8192, and 15360 is 1024 (4×256) less than 16384, etc. In each case the difference is N/16, so there is definitely something fishy.

I am not a CS guy, so I am having a hard time figuring out what exactly is happening. Any help would be greatly appreciated…

Okay, I ran NVIDIA's built-in n-body code in benchmark mode and it showed the same pattern as my code. So the NVIDIA people must know about this; please tell me why this is happening?

I don't understand why GFLOPS should "intuitively" decrease for a smaller N.

I would expect it to fluctuate as N varies, because memory access speed varies depending on alignment and coalescing, and possibly contention between the multiprocessors.

A potentially interesting experiment: lay out the buffers just as you would for computing N=7680, but only calculate the portion corresponding to N=4096.