Would someone run this on a high-end Mac card?

Still trying to understand why the optimal number of blocks is not a multiple of the number of SMs on the 285. Would someone mind running this on one of the higher-end NVIDIA cards and sending me the avg.csv output file?

/Chris

avg2.zip (16.2 KB)

Sorry this and the question are repeated here and in the general computing section, but I finally figured out why I was observing such different behavior at different block sizes.

It turns out the performance was really the result of the starting address alignment of each block. By forcing the address each block starts on to a 64-byte boundary (sixteen 4-byte floats), I now get much more predictable results. Here is the graph of performance versus the number of blocks.
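For anyone curious what that alignment trick looks like in practice, here is a minimal sketch (not Chris's actual code; the kernel and variable names are my own) of rounding the per-block chunk size up to a multiple of 16 floats so that every block's first element lands on a 64-byte boundary, given that cudaMalloc already returns an aligned base pointer:

```cpp
// Sketch: each block averages a contiguous chunk whose start is 64-byte aligned.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockAverage(const float *in, float *out, int n, int chunk)
{
    // 'chunk' is already a multiple of 16 floats (64 bytes), so in + start
    // is 64-byte aligned because cudaMalloc returns an aligned base pointer.
    int start = blockIdx.x * chunk;
    int end   = min(start + chunk, n);

    float sum = 0.0f;
    for (int i = start + threadIdx.x; i < end; i += blockDim.x)
        sum += in[i];

    // Simple shared-memory reduction to one partial average per block.
    __shared__ float partial[256];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0 && start < n)
        out[blockIdx.x] = partial[0] / (end - start);
}

int main()
{
    const int n = 1 << 22, threads = 256, blocks = 178;  // 178 from the post

    // Round the per-block chunk up to a multiple of 16 floats (64 bytes).
    int chunk = (n + blocks - 1) / blocks;
    chunk = (chunk + 15) & ~15;

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemset(dIn, 0, n * sizeof(float));

    blockAverage<<<blocks, threads>>>(dIn, dOut, n, chunk);
    cudaDeviceSynchronize();
    printf("launched %d blocks, chunk = %d floats\n", blocks, chunk);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Without the `(chunk + 15) & ~15` rounding, a block whose chunk starts mid-way through a 64-byte segment forces its warps to issue extra memory transactions, which would explain the erratic timings at certain block counts.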

Thanks,

/Chris

Can someone compile that into an app / command-line app?
Thanks

You want me to build it and upload the executable? If so, I can do it tonight when I get home.

Thanks,

/Chris

Assuming that’s what you were interested in, attached is the final version of my code with the executable.

The best time I’m seeing is 2.71163 with 178 blocks.

I’ve learned a lot through doing this. I hope it’s helpful.

/Chris

avg4.zip (95.5 KB)