Giving up? That word is not known to me.
If you take one card with, say, 12 SMs like the 9600 GSO, you can easily launch a grid of hundreds of blocks and hardcode the first N blocks to do nothing but wait while the rest do the real work (for example by spinning on a flag in global memory with atomics).
if (blockIdx.x < N) {
    // spin on a global flag until the algorithm is finished
} else {
    // do real work, using blockIdx.x - N as the "true" block index
}
N should be < 12 of course, so your algorithm then executes on 12-N SMs.
However, you have to make sure that each SM only executes one block at a time, for example by allocating more than 8192 bytes of shared memory per block (more than half of the 16 KB each SM has on that card).
Or by using enough registers per block that a second block does not fit on the SM.
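Here is a rough sketch of what I mean, not a tested recipe. The names reserved_sm_kernel, launch_with_idle_blocks, done_count, kThreads and kSharedInts are all made up for illustration. It assumes the hardware hands out blocks roughly in index order and that the oversized shared array really limits each SM to one block, which holds on a 16 KB-per-SM card but not on newer GPUs with more shared memory. The "real work" is just a placeholder write.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical constants, chosen only for this sketch.
constexpr int kThreads = 64;
// Slightly more than 8192 bytes of shared memory per block, so that a second
// block cannot be resident on an SM that has only 16 KB of shared memory.
constexpr int kSharedInts = 8192 / sizeof(int) + 1;

__global__ void reserved_sm_kernel(int n_idle, int n_work,
                                   unsigned int *done_count, int *out)
{
    // Touch the padding array so the compiler keeps the allocation.
    __shared__ int pad[kSharedInts];
    pad[threadIdx.x] = threadIdx.x;
    __syncthreads();

    if (blockIdx.x < n_idle) {
        // Idle block: thread 0 spins until every worker block has reported
        // completion. atomicAdd with 0 is used as an atomic read.
        if (threadIdx.x == 0) {
            while (atomicAdd(done_count, 0u) < (unsigned int)n_work) { }
        }
        return;
    }

    // Worker block: blockIdx.x - n_idle is the "true" block index.
    int work_idx = blockIdx.x - n_idle;
    if (threadIdx.x == 0) {
        out[work_idx] = 2 * work_idx + pad[0];  // placeholder work, pad[0] is 0
        __threadfence();                        // publish the result first
        atomicAdd(done_count, 1u);              // then report completion
    }
}

// Launches n_idle spinning blocks plus n_work worker blocks and returns the
// number of worker results that came back correct, or -1 on a CUDA error.
int launch_with_idle_blocks(int n_idle, int n_work)
{
    unsigned int *done_count = nullptr;
    int *out = nullptr;
    if (cudaMalloc(&done_count, sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, n_work * sizeof(int)) != cudaSuccess) {
        cudaFree(done_count);
        return -1;
    }
    cudaMemset(done_count, 0, sizeof(unsigned int));

    reserved_sm_kernel<<<n_idle + n_work, kThreads>>>(n_idle, n_work,
                                                      done_count, out);
    cudaError_t err = cudaDeviceSynchronize();

    int correct = -1;
    if (err == cudaSuccess) {
        std::vector<int> host(n_work);
        cudaMemcpy(host.data(), out, n_work * sizeof(int),
                   cudaMemcpyDeviceToHost);
        correct = 0;
        for (int i = 0; i < n_work; ++i) {
            if (host[i] == 2 * i) ++correct;
        }
    }
    cudaFree(out);
    cudaFree(done_count);
    return correct;
}
```

Call it from your own main, e.g. launch_with_idle_blocks(4, 100) to park 4 blocks and run 100 workers. Keep n_idle below the number of SMs, otherwise the spinning blocks can hog every SM and the workers never get scheduled, which hangs the kernel.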
Cheers, you can quote me when you get the Nobel prize for your work.