Would someone run this program on their GTX 285?

On my new GTX 285 card, I'm puzzled by the fact that this simple program runs almost 50% faster with 20 blocks than it does with 30. Since this card has 30 SMs, I'd expect 30 blocks to be at least as fast.

Can someone else verify that they are seeing the same behavior on their card?

The file just contains the source code. You can make it and run it as follows:

./avg --blocks=20
./avg --blocks=30

Each of these will sum up 100,000,000 floating point numbers.

You can also run it as:

./avg --test 2> test.csv

test.csv will contain the results of a sweep from 1 to 250 blocks.
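
For anyone who would rather not download the file: the kernel is a straightforward sum reduction, and a kernel that sums a large array with a variable number of blocks generally follows the pattern sketched below. This is just an illustration of the idea, not necessarily line-for-line what is in the attachment (512 threads per block, as in the runs above).

__global__ void sumKernel(const float *in, float *blockSums, int n)
{
    __shared__ float sdata[512];          // one slot per thread (512 threads per block)
    int tid = threadIdx.x;
    float acc = 0.0f;

    // Grid-stride loop: with fewer blocks, each thread simply covers more elements.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += in[i];

    sdata[tid] = acc;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; the partials are added up afterwards.
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}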

/Chris

Sure thing, now that I have a working 285 thanks to you :-)

Indeed, if I'm interpreting your output correctly, 20 blocks is 80% faster than 30 on my system, and the CSV file has values so discontinuous it is positively weird. I'll email it to you, as I am not allowed to upload .csv files for some weird reason.

./avg --blocks=20

100000000 Values
512 threads
20 blocks

Device 0: "GeForce GTX 285"
data 2c04000
d_data 3aa0000

Print out results
CPU Sum: -48407
GPU Sum: -48409

Copy data time: 128.994003 msecs.

CPU time: 401.312012 msecs.
GPU time: 2.773000 msecs.
Total time: 131.766998 msecs.
GPU is 144.721237X faster w/out copy
GPU is 3.045619X faster w/ copy
Transfer rate 3.10092 MB/s

./avg --blocks=30

100000000 Values
512 threads
30 blocks

Device 0: "GeForce GTX 285"
data 2c04000
d_data 3aa0000

Print out results
CPU Sum: -48407
GPU Sum: -48409

Copy data time: 128.942993 msecs.

CPU time: 402.808014 msecs.
GPU time: 4.938000 msecs.
Total time: 133.880997 msecs.
GPU is 81.573105X faster w/out copy
GPU is 3.008702X faster w/ copy
Transfer rate 3.10215 MB/s

Thanks for running this. Looks like your CPU time is much slower than mine. What system are you running on?

At least this indicates it is not a problem with my card. So maybe it is a driver problem? Unless NVIDIA can explain why we'd see this behavior. I'm trying to get someone on a PC platform to run it as well.

Thx.

/Chris

It needs to be rewritten before its results mean anything at all (because cutil timers are useless, among other things), so if I have a spare hour this afternoon I'll do that.

So what should I be using instead for the timers? You don't need to work on that; I can do it.

I used the timers the same way I saw them used in the examples. What should I be using instead?

/Chris

Accurate or not, the results are fairly consistent. It seems the finding that 20 blocks is faster than 30 is at least directionally correct.

Chris

cudaEvents are the only meaningful timers for GPU code. You also shouldn’t be timing cudaMemcpy if you just care about differences in performance between different block configurations, and you should be running several (preferably about a hundred) iterations of each kernel to eliminate any timing weirdness.
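
For example, something along these lines (a sketch only; sumKernel, d_in, d_blockSums and so on are stand-ins for whatever your code actually calls them):

float timeKernelMs(int numBlocks, int threads, const float *d_in,
                   float *d_blockSums, int n, int iterations = 100)
{
    // Time 'iterations' back-to-back launches with cudaEvents; the host<->device
    // copies are deliberately left out of the measurement.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        sumKernel<<<numBlocks, threads>>>(d_in, d_blockSums, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                  // wait until every launch has finished

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop); // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / iterations;                 // average time per launch
}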

Attached is a zip file with a new version that uses events to track time, runs each block count 100 times, and averages the time per run. To keep things simple, this version no longer takes any command line options; it just creates an output file named avg.csv.
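
In rough outline, the main loop of the new version does the following (timeKernelMs is just a placeholder for the 100-iteration event timing sketched above, and the variable names are illustrative rather than the exact ones in avg2.zip):

FILE *csv = fopen("avg.csv", "w");
fprintf(csv, "blocks,avg_msecs\n");
for (int blocks = 1; blocks <= maxBlocks; ++blocks) {
    // Average kernel time for this block count, copies excluded.
    float ms = timeKernelMs(blocks, 512, d_data, d_blockSums, numValues);
    fprintf(csv, "%d,%f\n", blocks, ms);
}
fclose(csv);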

I pulled this output into Excel and added a graph. Interestingly, the times are all over the place. The results are consistent with the old version, including the fact that 20 blocks is much faster than 30; 50 is just a little faster.

Thanks,

/Chris

avg2.zip (16.2 KB)