GTX 285 Performance vs GT 120

Got the GTX 285 working on my Mac Pro (2009), but I was expecting better performance.

I have a simple kernel that adds a large array of floating point numbers, similar to the scan sample program. The GT 120 can add 1,000,000 floating point numbers in 2.84ms; the GTX 285 only took 2.40ms. That’s only about a 15% reduction in run time. Going from 32 cores to 240, I was expecting a bigger boost. :-(

The program uses a single 512-thread block. Would more than one block improve performance?

Another observation: on the GT 120, the performance is much better (twice as fast) if the display is attached to the GT 120, but with the GTX 285 it doesn’t seem to matter. I also notice that with both the GT 120 and the GTX 285 in the machine, the GT 120 never gets me more than 1.5GB/s for memory copies to pinned memory. With just the GT 120 and the display attached I can get over 5.5GB/s.

/Chris

Chris - see the posts on throttling back. This card has multiple speed states and CUDA does not automatically kick it into high speed mode. Nvidia said yesterday they are working on a fix. In the meantime, try running ./nbody just before your other code, as that takes the card to the full 1.48 GHz and full memory speed. In the default state the card is at 0.6 GHz and memory speed is way down. Also look at deviceQuery before and after running nbody. We are having early adopter pain and I think it will get sorted.

deviceQuery reports 1.48 GHz, so that’s not the issue.

As I mentioned, I was using just a single block of size 512.

Using 2 or 4 blocks on the GT 120 gets the best performance, and 10 blocks on the GTX 285. With 4 multiprocessors on the GT 120, the 2 or 4 makes sense, but 10 surprises me on the GTX 285. It is about 25% faster than 15 blocks. Using multiple blocks, the GTX 285 is performing about twice as fast as the GT 120.

Thanks,

/Chris

I would love to know why your system reports that but my 08 Pro reports 0.6 GHz, unless the card is kicked with something 3D.

That is a useful clue.

Ta

Any thoughts on why 10 blocks would be the ideal number of blocks for the GTX 285 with 30 processors?

Thanks,

/Chris

Do you not have enough work to use all of those blocks or something? With 10 blocks you’re using a third of the card, so the only thing I can think of is additional overhead from spawning blocks that aren’t contributing meaningful work.

(and you almost certainly don’t have enough work to hide memory latency)

I’m adding 1,000,000 floating point numbers. I think I tried 100,000,000 as well and got the same results, but I’ll verify tonight. So each block is adding 100,000 floating point numbers.
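As a rough illustration of how a multi-block sum can split the array, here is a CPU-side Python emulation of a grid-stride loop. The actual kernel was not posted in this thread, so the indexing scheme, the function name `grid_stride_partial_sums`, and the small sizes used here are all assumptions chosen only to show the idea: each block produces one partial sum, and a final pass combines them.

```python
# Emulates, on the CPU, how a grid-stride loop assigns array elements to CUDA threads.
# The kernel in this thread was not posted, so this layout is an assumption.

def grid_stride_partial_sums(data, num_blocks, threads_per_block):
    """Return one partial sum per block, as a multi-block reduction might produce."""
    total_threads = num_blocks * threads_per_block
    partial = [0.0] * num_blocks
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            global_id = block * threads_per_block + thread
            # Each thread starts at its global index and jumps by the grid size,
            # so every element is visited exactly once across the whole grid.
            for i in range(global_id, len(data), total_threads):
                partial[block] += data[i]
    return partial

# Small hypothetical sizes so the emulation runs quickly.
data = [1.0] * 10_000
partials = grid_stride_partial_sums(data, num_blocks=10, threads_per_block=8)
total = sum(partials)  # final pass that combines the per-block results
print(partials[0], total)
```

With this layout the work per block is simply the array length divided by the number of blocks, which matches the 100,000 elements per block quoted above for 1,000,000 floats and 10 blocks.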

Thanks, :-)

/Chris

So making sure there is enough work, I’m adding 100,000,000 floats. Here is the timing based on the number of blocks:

Blocks   Time (ms)
1        26.728
2        14.010
4        6.699
5        5.414
8        3.583
10       3.064
11       6.776
15       6.085
30       5.854

I expected to see 15 and 30 as the best performers. Each block is 512 threads.
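One way to read these timings is to convert them into speedup over the single-block run and into effective read bandwidth. The short Python sketch below does that arithmetic; it assumes 4-byte floats and that each element is read from global memory exactly once, neither of which is stated explicitly in the post.

```python
# Speedup and effective read bandwidth implied by the posted GTX 285 timings.
# Assumptions: 4-byte floats, each element read from global memory exactly once.

N_FLOATS = 100_000_000
BYTES_PER_FLOAT = 4

# Blocks -> time in ms, copied from the timings posted above.
timings_ms = {1: 26.728, 2: 14.010, 4: 6.699, 5: 5.414, 8: 3.583,
              10: 3.064, 11: 6.776, 15: 6.085, 30: 5.854}

def bandwidth_gb_s(time_ms):
    """Bytes read divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return N_FLOATS * BYTES_PER_FLOAT / (time_ms * 1e-3) / 1e9

bandwidth = {blocks: bandwidth_gb_s(t) for blocks, t in timings_ms.items()}
speedup = {blocks: timings_ms[1] / t for blocks, t in timings_ms.items()}
best_blocks = min(timings_ms, key=timings_ms.get)

for blocks, t in timings_ms.items():
    print(f"{blocks:>3} blocks  {t:7.3f} ms  "
          f"{speedup[blocks]:5.2f}x  {bandwidth[blocks]:6.1f} GB/s")
print("fastest:", best_blocks, "blocks")
```

Under those assumptions, scaling is close to linear up to 10 blocks (about 8.7x over one block, roughly 130 GB/s), and the time then more than doubles at 11 blocks.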

/Chris

It turns out 20 is the optimal number of blocks. With 20 blocks, the 100,000,000 floats can be added in 2.73ms on the GTX 285.

Interestingly enough 20 appears to be the optimal number of blocks for the GT 120 as well. With 20 blocks, the 100,000,000 floats can be added in 18ms.

So it looks like I’m getting 6.5x better performance out of the GTX 285, which is much better than where I started out, but with the clock being 3x faster and there being over 7x as many cores, I’d still expect more of a boost.
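The comparison can be checked with a few lines of Python. This sketch uses only the timings and core counts quoted in the thread, plus two assumptions that are not from the thread: floats are 4 bytes and each is read once, and the GTX 285 has a theoretical peak memory bandwidth of about 159 GB/s (taken from NVIDIA's published specification).

```python
# Compare the measured GTX 285 / GT 120 speedup with the core-count ratio.
gt120_ms = 18.0    # 20 blocks, 100,000,000 floats (from the post)
gtx285_ms = 2.73   # 20 blocks, 100,000,000 floats (from the post)

measured_speedup = gt120_ms / gtx285_ms
core_ratio = 240 / 32  # core counts quoted in the first post

# Effective bandwidth each card reaches (assumes 4-byte floats, read once each).
bytes_read = 100_000_000 * 4
gt120_gb_s = bytes_read / (gt120_ms * 1e-3) / 1e9
gtx285_gb_s = bytes_read / (gtx285_ms * 1e-3) / 1e9

# Assumption: published theoretical peak for the GTX 285, not a figure from this thread.
GTX285_PEAK_GB_S = 159.0
fraction_of_peak = gtx285_gb_s / GTX285_PEAK_GB_S

print(f"measured speedup: {measured_speedup:.2f}x (core ratio {core_ratio:.1f}x)")
print(f"GT 120:  {gt120_gb_s:.1f} GB/s")
print(f"GTX 285: {gtx285_gb_s:.1f} GB/s ({fraction_of_peak:.0%} of assumed peak)")
```

If those assumptions hold, the GTX 285 is already moving data at over 90% of its peak memory bandwidth. A sum does only one add per 4 bytes loaded, so it is likely limited by memory rather than by core count or clock, which would explain why the speedup tracks bandwidth instead of the 7.5x core ratio.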

I’m even more curious now why using 20 blocks appears to be optimal for both cards. :-)

Do I need to worry about any kind of issues with having both cards in the machine?

/Chris

Since I have 30 processors on the GTX 285, 30 blocks should perform faster than 20, but 30 runs at roughly half the speed (5.854ms vs. 2.73ms).

Could there be something wrong with the driver that’s causing it to only use 20 of the 30 processors? Is there any way to verify what processors are being used?

Thanks,

/Chris