Is it possible to send data from the host to the device on the fly? For example, a stream of tasks that the host generates and passes to the device through some kind of buffer?
You could implement something like that using streams. You would need two or more streams (you can have up to eight, I think), and for each task you copy it to the device and launch its kernel in the same stream. This way newer hardware (G92 onward) can overlap copying one task with running the kernel for another.
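Here is a rough sketch of what that could look like, not a definitive implementation. The kernel processTask, the function run_streamed_tasks, the task size, and the choice of two streams are all made up for illustration. It also assumes a reasonably current CUDA runtime (cudaDeviceSynchronize) and uses pinned host memory from cudaMallocHost, which asynchronous copies need in order to overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-task kernel: doubles every element of one task buffer.
__global__ void processTask(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Feeds numTasks tasks to the device round-robin over numStreams streams.
// Returns the number of wrong output elements, or -1 on a CUDA error.
int run_streamed_tasks(void)
{
    const int numStreams = 2, numTasks = 8, n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    float *hBuf = NULL;
    float *dBuf[numStreams] = {NULL};
    cudaStream_t stream[numStreams];

    // Pinned (page-locked) host memory, one slot per task.
    if (cudaMallocHost((void **)&hBuf, numTasks * bytes) != cudaSuccess) return -1;
    for (int s = 0; s < numStreams; ++s) {
        if (cudaStreamCreate(&stream[s]) != cudaSuccess) return -1;
        if (cudaMalloc((void **)&dBuf[s], bytes) != cudaSuccess) return -1;
    }

    for (int t = 0; t < numTasks; ++t) {
        float *task = hBuf + (size_t)t * n;
        for (int i = 0; i < n; ++i) task[i] = (float)t;  // host "generates" task t

        int s = t % numStreams;
        // Copy in, kernel, and copy out are queued in the same stream, so they
        // run in order there while work in the other stream may overlap.
        cudaMemcpyAsync(dBuf[s], task, bytes, cudaMemcpyHostToDevice, stream[s]);
        processTask<<<(n + 255) / 256, 256, 0, stream[s]>>>(dBuf[s], n);
        cudaMemcpyAsync(task, dBuf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for all streams before reading results on the host.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int errors = 0;
    for (int t = 0; t < numTasks; ++t)
        for (int i = 0; i < n; ++i)
            if (hBuf[(size_t)t * n + i] != 2.0f * t) ++errors;

    for (int s = 0; s < numStreams; ++s) {
        cudaFree(dBuf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(hBuf);
    return errors;
}

int main(void)
{
    int errors = run_streamed_tasks();
    printf("mismatched elements: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Each stream reuses its own device buffer, which is safe because operations within one stream execute in issue order. Whether the copies actually overlap with kernels depends on the hardware.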
For my application, which relies on the streaming API, the limit seems to be 24 streams (CUDA 2.2 running on a GTX 260). Any more and the program will eventually encounter launch failures.
Using streams, you will primarily get the advantage of having the CPU thread return quickly to perform more work, launch jobs in new streams, etc. To gain much from concurrent copy and execution, you will have to balance the time spent on memory copies against the time spent in kernels. If one greatly outweighs the other, there is little left to overlap.
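To see how the two compare for a given task size, one option is to time a single copy and a single kernel launch with CUDA events. This is only a sketch under assumptions: processTask, time_copy_and_kernel, and the buffer size are hypothetical stand-ins for your own kernel and data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-task kernel.
__global__ void processTask(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Measures the host-to-device copy time and the kernel time of one task,
// both in milliseconds. Returns 0 on success, -1 on a CUDA error.
int time_copy_and_kernel(float *copyMs, float *kernelMs)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;
    cudaEvent_t e0, e1, e2;

    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaEventCreate(&e0);
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);

    // Events recorded in the same stream bracket the copy and the kernel.
    cudaEventRecord(e0, 0);
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(e1, 0);
    processTask<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(e2, 0);
    cudaError_t err = cudaEventSynchronize(e2);

    if (err == cudaSuccess) {
        cudaEventElapsedTime(copyMs, e0, e1);
        cudaEventElapsedTime(kernelMs, e1, e2);
    }

    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
    cudaFree(d);
    cudaFreeHost(h);
    return err == cudaSuccess ? 0 : -1;
}

int main(void)
{
    float copyMs = 0.0f, kernelMs = 0.0f;
    if (time_copy_and_kernel(&copyMs, &kernelMs) != 0) return 1;
    printf("copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);
    return 0;
}
```

If the two numbers are far apart, changing the task size or the amount of work per kernel launch may bring them closer, which is when overlapping pays off most.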