I was looking through the ‘reduction’ SDK example and noticed something strange. If you run the example with, say, kernel 1 (change line 421 in reduction.cpp), the first launch of the kernel uses a grid dimension of (0x10000, 1, 1) with the triple-angle-bracket (<<< >>>) notation. If you try to launch the exact same kernel with the same dimensions through cuLaunchGrid() instead, the call immediately returns CUDA_ERROR_INVALID_VALUE.
Section G.1 of the CUDA 3.2 Programming Guide (page 164/179) says “Maximum x- or y-dimension of a grid of thread blocks” is 65535 (0x0ffff).
Why does the <<< >>> syntax allow gridDim.x = 0x10000 when the driver API calls only support values up to 0x0ffff?
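To make the arithmetic explicit, here is a minimal host-only sketch of the limit check I have in mind. It uses no CUDA calls at all: `kMaxGridDimXY` and `gridDimFits` are names I made up for illustration, and the 65535 value is just the figure quoted from Section G.1 above. It says nothing about what the runtime or the driver actually do internally.

```cpp
#include <cstdint>

// Limit quoted from Section G.1 of the CUDA 3.2 Programming Guide:
// maximum x- or y-dimension of a grid of thread blocks.
constexpr std::uint32_t kMaxGridDimXY = 65535;  // 0xFFFF

// Hypothetical helper (not part of any CUDA API): returns true when both
// grid dimensions are at least 1 and no larger than the documented maximum.
constexpr bool gridDimFits(std::uint32_t x, std::uint32_t y) {
    return x >= 1 && y >= 1 && x <= kMaxGridDimXY && y <= kMaxGridDimXY;
}

// 0x10000 is 65536, one more than the documented maximum.
static_assert(!gridDimFits(0x10000, 1), "0x10000 exceeds the documented limit");
static_assert(gridDimFits(0xFFFF, 1), "0xFFFF is the largest documented value");
```

By this check, the (0x10000, 1, 1) grid is one block over the documented maximum, which matches the CUDA_ERROR_INVALID_VALUE from cuLaunchGrid() but not the apparently accepted <<< >>> launch.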