rtContextLaunch1D with multiple GPUs

I’ve just started using a system with multiple Tesla cards. While most of my OptiX kernels seem to take advantage of both cards, I have one 1D kernel that gets run on only one card and eats up the majority of the computation time.

Am I right in assuming that the kernels get split between cards based on launchIndex.y, and so all of the threads in the 1D launch get forced to the first card because there is no y-dimension? Is there any way around this? For this particular kernel, there’s no logical way to partition the input into two dimensions.
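For context, here’s roughly what the 1D setup looks like in the (pre-OptiX 7) C API; the entry point index and variable names are placeholders rather than my actual code:

```cuda
#include <optix_world.h>

// Device side: for a 1D launch the launch index is a plain
// unsigned int, so there is no .y component to split on.
rtDeclareVariable(unsigned int, launch_index, rtLaunchIndex, );
rtBuffer<float, 1> output;

RT_PROGRAM void ray_gen()
{
    output[launch_index] = 0.0f;  // placeholder for the real work
}

// Host side:
//   rtContextLaunch1D(context, 0 /* entry point */, 4096);
```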

Perhaps the number of rays is too small to justify splitting the launch (even if its computation time is consistently high). The workload is distributed across multiple GPUs, taking into account their computational capabilities, regardless of the dimensions of the input data.

Do you mean that I’m not filling up a warp? My 1D launch had a dimension of 4096.

Anyway, I changed my 1D launch to a 2D (1 x 4096) launch, and this did use both of my cards.
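In case it helps anyone else, the change amounted to just the launch call plus the launch-index type (a sketch; the entry point index is assumed to be 0):

```cuda
// Host side: before and after.
// rtContextLaunch1D(context, 0, 4096);
rtContextLaunch2D(context, 0, 1, 4096);  // width 1, height 4096

// Device side: the launch index becomes a uint2; what used to
// be the 1D index is now launch_index.y, since the width is 1.
rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
```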

I also noticed that when I do a 1 x n launch, I don’t seem to fill the warps with active threads. (I saw this by observing the ray generation program’s elapsed time, as in one of the SDK examples.) If each warp normally covers an 8x4 chunk of threads, then a launch that is only 1 thread wide leaves each warp with just 4 active threads. It seems like it would be more efficient in this case for each warp to handle a 1x32 chunk of threads.
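For anyone who wants to reproduce the measurement, here’s a sketch of the per-thread timing approach using the device clock(); the buffer name is my own placeholder, and the SDK example does essentially the same thing:

```cuda
#include <optix_world.h>

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
rtBuffer<float, 2> time_buffer;  // hypothetical output buffer for elapsed cycles

RT_PROGRAM void ray_gen()
{
    clock_t t0 = clock();  // per-SM cycle counter

    // ... trace rays / real work here ...

    clock_t t1 = clock();
    // Store elapsed cycles per thread; comparing these across
    // launch shapes shows how threads are grouped into warps.
    time_buffer[launch_index] = static_cast<float>(t1 - t0);
}
```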