Not to keep harping on this, but your measurement method or interpretation might be misleading you here.
CUDA does not run kernels on multiple devices unless you do a little work to set this up. When you call your first CUDA function in your program (assuming you are using the runtime API), the library establishes a CUDA context for you, bound by default to device 0, whichever GPU enumerates first in your computer. All kernels you invoke will execute on this device for the entire life of the host thread. (With the driver API, you can destroy a context and create a new one associated with a different device.) You can pick a different device by calling cudaSetDevice() before any other CUDA function.
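Something like this, roughly (a minimal, untested sketch; it assumes a second GPU exists at index 1):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() { }

int main()
{
    // Must be the first CUDA runtime call in the program; otherwise a
    // context has already been created on device 0 and this call errors out.
    cudaSetDevice(1);          // assumption: a second GPU exists at index 1

    dummy<<<1, 1>>>();
    cudaThreadSynchronize();   // wait for the kernel to finish

    int dev;
    cudaGetDevice(&dev);
    printf("kernel ran on device %d\n", dev);
    return 0;
}
```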
To run kernels (different invocations, not one giant grid) on multiple devices, you have to first create one host thread per CUDA device, then call cudaSetDevice() in each host thread with a different value. Each host thread is then free to queue up kernels, which automatically go into the queue of its respective CUDA device; see the sketch below. At the moment, a single-threaded program cannot use multiple CUDA devices at the same time.
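A bare-bones sketch of that pattern with pthreads (the worker function and the kernel are mine, invented for illustration):

```cpp
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void work() { /* per-device work goes here */ }

// Each host thread binds itself to one device, then launches kernels there.
void *worker(void *arg)
{
    int dev = (int)(size_t)arg;
    cudaSetDevice(dev);        // first CUDA call in this thread: context on 'dev'
    work<<<64, 256>>>();       // queued on this thread's device
    cudaThreadSynchronize();
    return NULL;
}

int main()
{
    int count;
    cudaGetDeviceCount(&count);

    pthread_t threads[16];     // assumes at most 16 devices, for brevity
    for (int d = 0; d < count; ++d)
        pthread_create(&threads[d], NULL, worker, (void *)(size_t)d);
    for (int d = 0; d < count; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}
```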
If this is what you are doing, feel free to tell me to buzz off. :) Just wanted to be clear.
Yes, this is called the “fat kernel” trick, where you run completely disjoint code paths selected by an if statement on the value of blockIdx.x. The disadvantage of this approach is that if your two tasks take different amounts of time, the fat kernel runs for the longer of the two, leaving part of the chip idle. In addition, the block scheduler handles blocks with very different runtimes poorly, so the multiprocessors can be underutilized even while both tasks are still executing.
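In case it helps anyone else reading, the trick looks roughly like this (a toy sketch; fatKernel, the two tasks, and nBlocksA are all made up for illustration):

```cpp
// Two unrelated tasks fused into one launch: blocks 0..nBlocksA-1 run
// task A, the remaining blocks run task B. The whole grid occupies the
// GPU until the slower task finishes.
__global__ void fatKernel(float *a, float *b, int nBlocksA)
{
    if (blockIdx.x < nBlocksA) {
        // task A: one element per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] *= 2.0f;
    } else {
        // task B: indexed relative to its own first block
        int i = (blockIdx.x - nBlocksA) * blockDim.x + threadIdx.x;
        b[i] += 1.0f;
    }
}

// Launch with the combined grid:
//   fatKernel<<<nBlocksA + nBlocksB, 256>>>(d_a, d_b, nBlocksA);
```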
This is why Fermi lifting this restriction will be a big deal. :) I suspect that NVIDIA will expose the multi-kernel ability through their existing CUDA stream concept. Current GPUs can already overlap kernel execution on one stream with a memory copy on another stream. Fermi would just enable kernels to run concurrently on different streams, which is the perfect abstraction for this. What remains to be seen is how they let the programmer control the relative multiprocessor allocation between streams. Probably the easiest way would be a function that lets you set the number of multiprocessors a given stream could use. Then you could decide on the relative importance of the tasks you were running and assign multiprocessors to each accordingly. (Hopefully this will make PhysX and 3D coexist happily on one card.)
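For reference, the overlap we already get today looks like this with the existing stream API (an untested sketch; the async copy needs pinned host memory to actually overlap):

```cpp
#include <cuda_runtime.h>

__global__ void busy(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    float *h_buf, *d_buf, *d_data;
    cudaMallocHost((void **)&h_buf, N * sizeof(float)); // pinned host memory
    cudaMalloc((void **)&d_buf,  N * sizeof(float));
    cudaMalloc((void **)&d_data, N * sizeof(float));

    // The kernel on s1 and the copy on s2 can proceed concurrently.
    busy<<<(N + 255) / 256, 256, 0, s1>>>(d_data, N);
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, s2);

    cudaThreadSynchronize(); // wait for both streams to drain
    return 0;
}
```

If Fermi does expose concurrent kernels through streams, I'd expect it to look just like this, except with two kernel launches on two streams instead of a kernel and a copy.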