I have the same question as Mr. Knapsack.

My code almost always runs on multi-GPU, often with a mix of cards in one machine. The machine I’m typing on now has a GT240, a GTX295, and a GTX480.

Before the core simulation runs, I need to do some preprocessing and setup of data structures: sorting geometry into buckets, determining voxel data, and so on. That preprocessing happens on the GPU, and every GPU needs a copy of the results. I could have each GPU redundantly compute the same data itself, but it's more efficient and faster to have one GPU do the work, share the results with the others, and then have everyone start the real compute. Otherwise the GT240 would still be preprocessing for many minutes while the faster cards had already moved on to real work.

I examine each device's properties, try to identify the fastest card, and elect it to be the preprocessor.

I do this by looking at compute capability (2.0 wins). If two or more cards are at 2.0, then clock rate times SM count is the tiebreaker.

The GTX 460 breaks this heuristic because of its 48 SPs per SM (versus 32 on the GTX 470/480). If the 460 turns out to be faster than the 470 for CUDA (quite possible, we need to bench it!), the clock * SM strategy will pick the wrong card.

This is a really minor concern, to be honest, but it's interesting to bring up now that the performance of the Fermi derivatives is no longer characterized simply by SM count and clock rate.