Total number of threads corresponding to launch configuration

Hi, I’ve been profiling our ray-tracing kernel with Nsight Compute and I’m having trouble interpreting the ‘Launch Statistics’ metrics.

Specifically, I tried reordering my launch dimensions to see how that affects kernel performance, and Nsight Compute reported a noticeably different total number of threads launched for the kernel.

Let’s assume I have a [dim_x, dim_y, dim_z] launch configuration. Nsight reported a total of (dim_x * dim_y * dim_z) threads, with a block size of 64 threads. Shuffling the order to, e.g., [dim_z, dim_y, dim_x] results in a larger number of threads, still with a block size of 64.

The former launch configuration is quite a bit faster, which I initially attributed to reduced thread divergence (according to Nsight Compute, the number of instructions issued during kernel execution dropped substantially). But now I am wondering whether it has anything to do with the extra threads that show up in the latter configuration.

Where do these extra threads come from?
Are these dummy threads, or do they actually do work, and could they therefore be responsible for the higher number of issued instructions in the second configuration?

EDIT: I am using OptiX 6.0 with driver 436.30.

How exactly are you launching in both cases? With rtContextLaunch3D()?

The launch size should be exactly the product of the launch dimensions, no more, no less.
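
As a rough sanity check, the arithmetic can be sketched in a few lines of Python. The product of the dimensions does not depend on their order, so a thread count that changes after shuffling cannot be the launch size itself. The second helper shows one hypothetical way extra threads could appear: if each dimension were rounded up to a multiple of a tile edge. That tiling, the 8x8x1 tile shape, and the example dimensions are all assumptions made for illustration; nothing in this thread establishes how OptiX 6.0 actually maps launch indices to CUDA threads.

```python
from math import ceil, prod


def launch_size(dims):
    """Number of launch indices: the plain product of the launch dimensions."""
    return prod(dims)


def padded_threads_tiled(dims, tile):
    """Thread count if every dimension is rounded up to a multiple of the
    matching tile edge. ASSUMPTION: this padding scheme is hypothetical and
    is not confirmed OptiX behavior."""
    return prod(ceil(d / t) * t for d, t in zip(dims, tile))


# Hypothetical values, not taken from the original post.
dims = (1000, 600, 3)      # [dim_x, dim_y, dim_z]
shuffled = dims[::-1]      # [dim_z, dim_y, dim_x]
tile = (8, 8, 1)           # assumed tile shape: 8 * 8 * 1 = 64 threads per block

print(launch_size(dims), launch_size(shuffled))   # identical: order does not matter
print(padded_threads_tiled(dims, tile))           # no padding needed for this order
print(padded_threads_tiled(shuffled, tile))       # dim of size 3 is padded up to 8
```

If the count Nsight Compute reports for the shuffled launch exceeds launch_size(), comparing it against a padding model like this one may help narrow down whether the extra threads are rounding overhead rather than additional launch indices.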