On my new GTX 285 card, I’m puzzled by the fact that this simple program runs almost 50% faster with 20 blocks than it does with 30. Since this card has 30 SMs, 30 blocks should be faster.
Can someone else verify that they are seeing the same behavior on their card?
The file just contains the source code. You can make it and run it as follows:
./avg --blocks=20
./avg --blocks=30
Each of these will sum 100,000,000 floating point numbers.
You can also run it as:
./avg --test 2> test.csv
test.csv will contain timing results for 1 to 250 blocks.
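For anyone who can’t grab the attachment, the core of a program like this is a block-wise parallel sum reduction. Here’s a minimal sketch of that pattern — the kernel name, shared-memory size, and layout are my own illustration, not the actual attached source:

```cuda
#include <cuda_runtime.h>

// Each block reduces a grid-strided slice of the input in shared memory
// and writes one partial sum per block; the host (or a second kernel)
// sums the per-block partials.
__global__ void sumKernel(const float *in, float *partial, int n)
{
    __shared__ float cache[512];          // one slot per thread (512 threads/block)
    int tid    = threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        acc += in[i];
    cache[tid] = acc;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}
```

With a grid-strided loop like this, the block count only changes how the 100M elements are partitioned, which is why varying --blocks is a clean test of scheduling behavior.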
Sure thing, now that I have a working 285 thanks to you :-)
Indeed, if I interpret your output correctly, 20 blocks is 80% faster than 30 on my system, and the CSV file contains values so discontinuous it is positively weird. I’ll email it to you, as I am not allowed to upload .csv files for some weird reason.
./avg --blocks=20
100000000 Values
512 threads
20 blocks
Device 0: "GeForce GTX 285"
data 2c04000
d_data 3aa0000
Print out results
CPU Sum: -48407
GPU Sum: -48409
Copy data time: 128.994003 msecs.
CPU time: 401.312012 msecs.
GPU time: 2.773000 msecs.
Total time: 131.766998 msecs.
GPU is 144.721237X faster w/out copy
GPU is 3.045619X faster w/ copy
Transfer rate 3.10092 MB/s
./avg --blocks=30
100000000 Values
512 threads
30 blocks
Device 0: "GeForce GTX 285"
data 2c04000
d_data 3aa0000
Print out results
CPU Sum: -48407
GPU Sum: -48409
Copy data time: 128.942993 msecs.
CPU time: 402.808014 msecs.
GPU time: 4.938000 msecs.
Total time: 133.880997 msecs.
GPU is 81.573105X faster w/out copy
GPU is 3.008702X faster w/ copy
Transfer rate 3.10215 MB/s
Thanks for running this. Looks like your CPU time is much slower than mine. What system are you running on?
So at least this indicates it is not a problem with my card. Maybe it is a driver problem? Unless NVIDIA can explain why we’d see this behavior. I’m trying to get someone on a PC platform to run it as well.
It needs to be rewritten before its results mean anything at all (the cutil timers are useless, among other things), so if I have a spare hour this afternoon I’ll do that.
cudaEvents are the only meaningful timers for GPU code. You also shouldn’t be timing cudaMemcpy if you only care about performance differences between block configurations, and you should run several (preferably about a hundred) iterations of each kernel to eliminate any timing weirdness.
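A sketch of that timing pattern, in case it helps — the kernel signature and buffer names here are placeholders standing in for whatever the real program uses:

```cuda
#include <cuda_runtime.h>

// Stand-in for the program's actual reduction kernel.
__global__ void sumKernel(const float *in, float *partial, int n);

// Time a kernel configuration with cudaEvents, averaged over many launches.
float timeKernel(int blocks, int threads,
                 const float *d_in, float *d_partial, int n)
{
    const int ITERS = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < ITERS; ++i)
        sumKernel<<<blocks, threads>>>(d_in, d_partial, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait for all launches to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / ITERS;                    // average msecs per launch
}
```

Since cudaEventElapsedTime measures on the GPU’s own timeline, this excludes the cudaMemcpy entirely and averages out launch-to-launch jitter.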
Attached is a zip file with a new version that uses events to track time, runs each block size 100 times, and averages the time per try. This version no longer takes any command line options, to keep things simple; it just creates an output file named avg.csv.
I pulled this output into Excel and added a graph. Interestingly, the times are all over the place. The results are consistent with the old version’s, including the fact that 20 blocks is much faster than 30; 50 is just a little faster.