Request clarification on CUDA runtime scheduling

Metalash · September 5, 2008, 10:09am

Hi,

I’m using CUDA for a research project, and I’m launching a kernel with 64 blocks of 384 threads each. One would expect that, on average, each SM handles about 4 blocks. However, the profiler reports that between 8 and 11 CTAs are launched across different runs of the same kernel with the same configuration. I understand that the profiler collects counter values from only ONE SM, but I consistently get this figure each time I execute the kernel.

Note that I try to ensure (through some preprocessing) that each thread block has a similar amount of work, so each block ought to take the same amount of time to execute (within reasonable error margins) :)

Could someone please give me some insight into how thread blocks are scheduled on multiprocessors by the CUDA runtime, and whether this can be controlled (directly or in some contorted way)?

EDIT: I’m using Debian Etch 64-bit with a GeForce 8800 GTS 512. The machine is a dual-processor quad-core Xeon @ 2.83 GHz with 16 GB of RAM. I’m using the RHEL 4.x version of the CUDA toolchain, which seems to work well with Debian Etch (at least I’ve had no problems with it so far :) )

Thanks in advance,
Metalash

E.D_Riedijk · September 5, 2008, 1:58pm

As far as I know, N blocks get loaded onto each multiprocessor at the start (N depends on the amount of registers and shared memory used, with N ≤ 8 on hardware with compute capability < 1.3). After that, whenever a block finishes, another block is scheduled. There is no way to control this.
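
To make the "fill every slot, then refill as blocks finish" behaviour concrete, here is a minimal host-side sketch in plain C++. It is a toy model of the description above, not NVIDIA's actual scheduler. The function name `simulateDispatch`, the round-robin initial fill, and the tie-breaking order are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy model (an assumption, not the real hardware policy): each SM can hold
// up to `residentPerSM` blocks at once; when a block finishes, the next
// pending block takes the freed slot on that same SM.
// Returns the number of blocks launched on each SM.
std::vector<int> simulateDispatch(const std::vector<double>& blockTime,
                                  int numSMs, int residentPerSM) {
    assert(numSMs > 0 && residentPerSM > 0);
    std::vector<int> launched(numSMs, 0);

    // Min-heap of (finish time, SM index), one entry per resident block.
    using Event = std::pair<double, int>;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> running;

    std::size_t next = 0;
    // Initial fill: hand out blocks round-robin until every slot is taken.
    for (int slot = 0; slot < residentPerSM; ++slot) {
        for (int sm = 0; sm < numSMs && next < blockTime.size(); ++sm) {
            running.push({blockTime[next++], sm});
            ++launched[sm];
        }
    }
    // Steady state: the earliest-finishing block frees its slot for the
    // next pending block.
    while (next < blockTime.size()) {
        Event done = running.top();
        running.pop();
        running.push({done.first + blockTime[next++], done.second});
        ++launched[done.second];
    }
    return launched;
}
```

Calling `simulateDispatch(std::vector<double>(64, 1.0), 16, 2)` models 64 equal-length blocks on 16 multiprocessors with 2 resident blocks each (384 threads per block against an assumed limit of 768 threads per multiprocessor), and in this model every multiprocessor ends up with 4 launches. If one block runs much longer than the others, its multiprocessor receives fewer blocks and the rest pick up the slack, so per-SM counts can differ even though nothing in the launch configuration changed.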
