On my new GTX 285 card, I’m puzzled by the fact that this simple program runs almost 50% faster with 20 blocks than it does with 30. Since this card has 30 SMs, I would expect 30 blocks to perform at least as well.
Can someone else verify that they are seeing the same behavior on their card?
The file just contains the source code. You can make it and run it as follows:
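make
./avg 20
./avg 30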
Each of these will sum 100,000,000 floating-point numbers (the argument is the number of blocks to launch).
You can also run it as:
./avg --test 2> test.csv
test.csv will then contain timings for block counts from 1 to 250.
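In case it helps anyone reason about the numbers without downloading the file, here is a simplified sketch of the idea: a grid-stride sum over the 100,000,000 floats, timed with CUDA events. The names, thread count, and reduction details here are illustrative and may differ from the attached source.

// Simplified sketch of the benchmark idea; constants and details
// are illustrative, not necessarily those of the attached source.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N (100 * 1000 * 1000)  // 100,000,000 floats
#define THREADS 256            // threads per block (assumed)

// Each block accumulates a grid-stride partial sum, then reduces it
// in shared memory and writes one float per block to out[].
__global__ void sumKernel(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS];
    float t = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        t += in[i];
    cache[threadIdx.x] = t;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

int main(int argc, char **argv)
{
    int blocks = (argc > 1) ? atoi(argv[1]) : 30;  // block count from argv

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));  // stand-in data

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    sumKernel<<<blocks, THREADS>>>(d_in, d_out, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d blocks: %.3f ms\n", blocks, ms);
    // (The per-block partial sums in d_out would be added on the host.)

    cudaFree(d_in);
    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

With a kernel of this shape, the block count directly controls how the 100M additions are spread across the SMs, which is why the timing sweep from 1 to 250 blocks is interesting.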