Launch Bounds Problem

Hello to all,

I have two questions about launch bounds.

Page 143 of the CUDA C Programming Guide for CUDA 4.0 RC2 reads:

“If launch bounds are specified, the compiler first derives from them the upper limit L on the number of
registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if
minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor
(see Section 4.2 for the relationship between the number of registers used by a kernel and the number of
registers allocated per block). The compiler then optimizes register usage in the following way:…”

Q1.) What if the upper limit L on the number of registers the kernel would use exceeds the number of registers
available per multiprocessor? Nothing is mentioned about this in the guide.

Q2.) If the launch bounds are evaluated and the optimization is done at compile time, will the compile fail even if the execution configuration
<<<n_blocks_grid, m_threads_block>>> is such that n_blocks_grid = minBlocksPerMultiprocessor and m_threads_block < maxThreadsPerBlock?

TIA,
AShish

Q1
If the number of variables live at any point in time exceeds the number of registers a thread can have, the excess is spilled to local memory (which is actually a portion of global memory and has a relatively long latency).
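One way to check what the compiler actually did is to ask the runtime for the kernel's attributes. The following is only a minimal sketch: the kernel `scale` and the bounds (256, 4) are made up for illustration, and a non-zero local memory size can also come from per-thread arrays, not only from spills. Compiling with `nvcc -Xptxas -v` should print similar per-kernel resource information at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: at most 256 threads per block, and the compiler is
// asked to keep register use low enough for 4 such blocks per multiprocessor.
__global__ void __launch_bounds__(256, 4) scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Registers the compiler settled on for each thread.
    printf("registers per thread:    %d\n", attr.numRegs);
    // Per-thread local memory; spilled registers end up here.
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```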

Q2
<<< n_blocks_grid, m_threads_block>>>
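n_blocks_grid can be up to 65535 in each of the x and y dimensions (a dimension that is not used is 1, not 0); it has nothing to do with minBlocksPerMultiprocessor.
Launch bounds are applied at compile time and the compiler never sees the execution configuration, so the kernel compiles either way. At run time, a launch with m_threads_block less than or equal to maxThreadsPerBlock works; a launch with more threads per block than maxThreadsPerBlock fails.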
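To see the run-time behaviour for yourself, something like the sketch below may help. The kernel `fill`, the helper `try_launch`, and the bound of 256 are hypothetical names and values chosen for the example. The first launch stays under the bound; the second exceeds it and is expected to be rejected at launch, with the exact error code depending on the toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 256  // assumed bound, for illustration only

__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK) fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

// Launch fill with the given block size and return the resulting status.
static cudaError_t try_launch(int *d_out, int n, int threads)
{
    int blocks = (n + threads - 1) / threads;  // round up to cover n elements
    fill<<<blocks, threads>>>(d_out, n);
    cudaError_t err = cudaGetLastError();      // reports launch-time errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // reports execution errors
    return err;
}

int main()
{
    const int n = 1024;
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess)
        return 1;

    // 128 <= 256: within the launch bound.
    printf("128 threads per block: %s\n",
           cudaGetErrorString(try_launch(d_out, n, 128)));
    // 512 > 256: exceeds the launch bound.
    printf("512 threads per block: %s\n",
           cudaGetErrorString(try_launch(d_out, n, 512)));

    cudaFree(d_out);
    return 0;
}
```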

NB: m_threads_block is best chosen as a multiple of 64. If that is not possible, and your data or application lets you choose, make it slightly under a multiple of 64 rather than slightly over.