Got the GTX 285 working on my Mac Pro (2009), but I was expecting better performance.
I have a simple kernel that adds a large array of floating point numbers, similar to the scan sample program. On the GT 120 it can add 1,000,000 floating point numbers in 2.84ms; on the GTX 285 it only took 2.40ms. That’s only about a 15% increase in performance. Going from 32 cores to 240, I was expecting a bigger boost. :-(
The program uses a single 512-thread block. Would more than one block improve performance?
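For reference, the kernel is roughly along these lines — a simplified sketch, not my exact code:

__global__ void sumSingleBlock(const float *in, int n, float *out)
{
    __shared__ float partial[512];          // one slot per thread
    float acc = 0.0f;

    // Each of the 512 threads strides over the array and accumulates its share.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory within the single block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *out = partial[0];
}

// Launched as a single block of 512 threads:
//   sumSingleBlock<<<1, 512>>>(d_in, n, d_sum);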
Another observation: on the GT 120, performance is much better (twice as fast) if the display is attached to it, but with the GTX 285 it doesn’t seem to matter. I also notice that with both the GT 120 and the GTX 285 in the machine, the GT 120 never gets me more than 1.5 GB/s on copies to pinned memory. With just the GT 120 installed and the display attached, I can get over 5.5 GB/s.
Chris - see the posts on throttling back. This card has multiple speed states and CUDA does not automatically kick it into high-speed mode. NVIDIA said yesterday they are working on a fix. In the meantime, try running ./nbody just before your other code, as that takes the card to the full 1.48 GHz and full memory speed. In the default state the card is at 0.6 GHz and the memory speed is way down. Also look at deviceQuery before and after running nbody. We are having early-adopter pain, and I think it will get sorted.
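If you’d rather check from code than eyeball the deviceQuery output, something along these lines (a rough sketch) prints the clock and multiprocessor count the runtime reports — clockRate comes back in kHz:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // clockRate is in kHz, so 1,480,000 prints as 1.48 GHz
        printf("%s: %.2f GHz, %d multiprocessors\n",
               prop.name, prop.clockRate / 1.0e6, prop.multiProcessorCount);
    }
    return 0;
}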
deviceQuery reports 1.48 GHz, so that’s not the issue.
As I mentioned, I was using just a single block of size 512.
Using 2 or 4 blocks gives the best performance on the GT 120, and 10 blocks on the GTX 285. With 4 multiprocessors on the GT 120, 2 or 4 makes sense, but 10 surprises me on the GTX 285; it is about 25% faster than 15 blocks. Using multiple blocks, the GTX 285 performs about twice as fast as the GT 120.
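The multi-block version I’m experimenting with looks roughly like this (simplified — the real code may differ in details):

#define THREADS 512

__global__ void sumMultiBlock(const float *in, int n, float *partials)
{
    __shared__ float sm[THREADS];
    float acc = 0.0f;

    // Grid-stride loop: the threads of all blocks together cover the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    sm[threadIdx.x] = acc;
    __syncthreads();

    // Per-block tree reduction, leaving one partial sum per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sm[threadIdx.x] += sm[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partials[blockIdx.x] = sm[0];
}

// e.g. sumMultiBlock<<<10, THREADS>>>(d_in, n, d_partials);
// then the 10 partial sums are copied back and added on the CPU.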
Do you not have enough work to use all of those blocks, or something? With 10 blocks you’re using a third of the card, so the only thing I can think of is just the additional overhead from spawning blocks that aren’t contributing meaningful work.
(and you almost certainly don’t have enough work to hide memory latency)
I’m adding 1,000,000 floating point numbers. I think I tried 100,000,000 as well and got the same results, but I’ll verify tonight. So each block is adding 100,000 floating point numbers.
It turns out 20 is the optimal number of blocks. With 20 blocks, the 100,000,000 floats can be added in 2.73ms on the GTX 285.
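(For reference, timing just the kernel with CUDA events would look roughly like this — a sketch, not necessarily how my numbers above were taken:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
sumMultiBlock<<<20, 512>>>(d_in, n, d_partials);   // 20 blocks of 512 threads
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);            // elapsed time in milliseconds
printf("kernel time: %.2f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);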
Interestingly enough, 20 appears to be the optimal number of blocks for the GT 120 as well. With 20 blocks, the 100,000,000 floats can be added in 18ms.
So it looks like I’m getting 6.5x better performance out of the GTX 285, which is much better than where I started, but with the clock being 3x faster and there being over 7x as many cores, I’d still expect more of a boost.
I’m even more curious now why using 20 blocks appears to be optimal for both cards. :-)
Do I need to worry about any kind of issues having both cards in the machine?
Since I have 30 multiprocessors on the GTX 285, 30 blocks should perform faster than 20, but 30 is almost 50% slower.
Could there be something wrong with the driver that’s causing it to use only 20 of the 30 multiprocessors? Is there any way to verify which multiprocessors are being used?
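One thing I might try: reading the %smid special register from inside the kernel with inline PTX, to record which multiprocessor each block actually ran on (an untested sketch):

__global__ void recordSmid(unsigned int *smids)
{
    if (threadIdx.x == 0) {
        unsigned int id;
        asm("mov.u32 %0, %%smid;" : "=r"(id));   // SM this block is running on
        smids[blockIdx.x] = id;
    }
}

// e.g. recordSmid<<<30, 512>>>(d_smids); copy d_smids back and see which of
// the 30 multiprocessors actually show up.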