Request clarification on CUDA runtime scheduling


I’m using CUDA for a research project, and I’m launching a kernel with 64 blocks of 384 threads each. One would expect that, on average, each SM handles about 4 blocks. However, the profiler reports between 8 and 11 CTAs being launched, across different runs of the same kernel with the same configuration. I understand that the profiler collects counter values from ONE SM only, but I consistently get this figure each time I execute the kernel.

Note that I try to ensure (through some preprocessing) that each thread block has a similar amount of work, so each block ought to take the same amount of time to execute (within reasonable error margins) :)

Could someone please give me some insight into how thread blocks are scheduled onto multiprocessors by the CUDA runtime? And can this be controlled (directly, or in some contorted way)?

EDIT: I’m using 64-bit Debian Etch with a GeForce 8800 GTS 512. The machine is a dual-processor quad-core Xeon @ 2.83 GHz with 16 GB of RAM. I’m using the RHEL 4.x version of the CUDA toolchain, which seems to work well on Debian Etch (at least I’ve had no problems with it so far :) )

Thanks in advance,

As far as I know, N blocks get loaded onto each multiprocessor at launch, where N depends on the registers and shared memory each block uses (with N ≤ 8 on compute capability 1.x hardware). After that, whenever a block finishes, another block is scheduled in its place. There is no way to control this.