Hi @wenzel.jakob, nice to see you!
These are good questions. It depends a bit on what you need, but I’ll try to clarify several scenarios.
For multiple samples, do you want/need each sample’s result to end up in a separate pixel? Or can you average the result of multiple samples in a single thread at the end of raygen before writing the result?
Your question suggests you may want the former. Our SDK sample optixPathTracer does the latter. If you can avoid writing individual samples, and instead perform the reduction in a single thread, then we would generally expect this to be faster than storing the results for each sample separately, regardless of the thread-pixel mapping. This does assume each pixel takes the same number of samples, to minimize thread divergence.
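For concreteness, here's a minimal sketch of that raygen structure, loosely following optixPathTracer. The Params layout and the computeCameraRay/tracePath helpers are stand-ins for your own code (they don't exist in the SDK under those names), while the float3 operators and the tea/rnd RNG come from the SDK helper headers:

```cuda
#include <optix.h>
#include <sutil/vec_math.h>  // SDK float3 operators (adjust the include path to your setup)
#include <cuda/random.h>     // SDK tea<>/rnd per-thread RNG helpers

// Hypothetical launch params -- match these to your own pipeline.
struct Params
{
    float3*                accumBuffer;
    unsigned int           samplesPerPixel;
    unsigned int           frame;
    OptixTraversableHandle handle;
};
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__rg()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    unsigned int seed = tea<4>( idx.y * dim.x + idx.x, params.frame );

    float3 result = make_float3( 0.0f );
    for( unsigned int s = 0; s < params.samplesPerPixel; ++s )
    {
        // Jitter the sub-pixel position, build the camera ray, then trace the
        // whole path for this sample. tracePath() stands in for your own loop
        // around optixTrace() (see optixPathTracer for a complete version).
        const float2 jitter = make_float2( rnd( seed ), rnd( seed ) );
        float3 origin, direction;
        computeCameraRay( idx, dim, jitter, origin, direction );  // hypothetical helper
        result += tracePath( origin, direction, seed );           // hypothetical helper
    }

    // One thread per pixel: reduce here instead of writing per-sample results.
    params.accumBuffer[idx.y * dim.x + idx.x] = result / (float)params.samplesPerPixel;
}
```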
We tile the 2d launch into warp-sized tiles in OptiX because there is usually some performance benefit from doing so for primary rays, and it almost never hurts. And yes, the benefit comes from ray coherence: memory loads and cache hit rates for BVH, geometry, shaders, and textures are all usually higher for rays that are very similar than for rays heading in different directions, simply because similar rays tend to traverse the same parts of the scene and hit nearby locations. For multi-bounce path tracing, especially with diffuse materials, this tiling scheme doesn't provide much benefit, and I don't have any clever tricks or good advice to offer in that case: the speedup is mostly determined by the camera rays, and the amount of speedup you get may be limited by your average path depth.
With a 1d launch, we don’t do any tiling, but you can do this tiling yourself, and verify whether you see some benefit. If you need to store each sample separately, then whatever your mapping is, try to arrange it so that consecutive thread ids correspond to very similar rays. This would mean putting (for example) jittered primary ray samples consecutively.
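If you do want to try that, here's one possible mapping, as an untested sketch: it packs consecutive thread ids into 8x4 (warp-sized) tiles, assuming for brevity that the image width is a multiple of 8 and the height a multiple of 4. The tile shape is a free parameter; 8x4 just makes one tile exactly one warp:

```cuda
// Map a linear 1d launch index to a pixel inside an 8x4 tile, so that the 32
// consecutive thread ids of a warp trace nearby, coherent rays.
__forceinline__ __device__ uint2 tiledPixelFrom1dIndex(
    unsigned int linearIdx, unsigned int width )
{
    const unsigned int TILE_W = 8, TILE_H = 4;  // 32 threads == 1 warp
    const unsigned int tilesPerRow = width / TILE_W;

    const unsigned int tileId = linearIdx / ( TILE_W * TILE_H );
    const unsigned int inTile = linearIdx % ( TILE_W * TILE_H );

    return make_uint2( ( tileId % tilesPerRow ) * TILE_W + inTile % TILE_W,
                       ( tileId / tilesPerRow ) * TILE_H + inTile / TILE_W );
}
```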
With an OptiX 2d launch, if you use width*height*numSamples threads, then you would probably want to arrange your samples into blocks or tiles so that all samples for a pixel sit in one tile, which happens automatically if you simply render a larger image. For example, if you render a 1080p image with 16 samples per pixel, you could use a 2d launch of (4*1920, 4*1080) == (7680, 4320). This way you get the tiling benefit without any change to your indexing at all. The only thing you'd need to do is run a separate reduction kernel over the 4x4 image tiles.
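That reduction is just a plain CUDA kernel that averages each 4x4 block into one output pixel. A sketch, assuming a row-major float3 sample buffer (buffer names are illustrative):

```cuda
#include <cuda_runtime.h>

// Average each 4x4 block of the (outWidth*4) x (outHeight*4) sample buffer
// into one pixel of the outWidth x outHeight output image.
__global__ void reduceTiles4x4( const float3* samples, float3* out,
                                unsigned int outWidth, unsigned int outHeight )
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if( x >= outWidth || y >= outHeight )
        return;

    const unsigned int inWidth = outWidth * 4;
    float3 sum = make_float3( 0.0f, 0.0f, 0.0f );
    for( unsigned int ty = 0; ty < 4; ++ty )
        for( unsigned int tx = 0; tx < 4; ++tx )
        {
            const float3 s = samples[( y * 4 + ty ) * inWidth + ( x * 4 + tx )];
            sum.x += s.x;  sum.y += s.y;  sum.z += s.z;
        }

    out[y * outWidth + x] = make_float3( sum.x / 16.0f, sum.y / 16.0f, sum.z / 16.0f );
}
```

For the 1080p example you'd launch it after optixLaunch with something like reduceTiles4x4<<<dim3( (1920+15)/16, (1080+15)/16 ), dim3( 16, 16 )>>>( d_samples, d_out, 1920, 1080 );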
We don’t tile the 3d launch. With a 3d launch, given a launch size specified by (width, height, depth), by default the launch is depth-major, followed by row-major, so the X (or width) coordinate is the innermost index. This means you’d want to map your multiple camera ray samples to the X / width parameter in order to have coherent rays grouped into warps.
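In raygen that mapping could look like the following sketch, with a launch sized (numSamples, width, height); the per-sample buffer layout is an assumption:

```cuda
// With a 3d launch of (numSamples, width, height), the X coordinate is the
// innermost index, so consecutive threads in a warp differ only in their
// sample index and trace near-identical camera rays.
extern "C" __global__ void __raygen__rg3d()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    const unsigned int sample = idx.x;  // varies fastest within a warp
    const unsigned int px     = idx.y;  // pixel x
    const unsigned int py     = idx.z;  // pixel y

    // ... jitter by `sample`, build and trace the camera ray, then write this
    // sample's result to, e.g.:
    // params.sampleBuffer[( py * dim.y + px ) * dim.x + sample];
}
```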
So, my advice is to not map samples to individual threads if you can avoid it (understanding that this might not be possible in your case, if your post-processing reduction is more sophisticated than a simple average and/or if pixels need data from neighboring pixels). Not mapping samples to threads means using width*height threads for your launch and looping over samples in raygen. The next easiest fallback is to render a larger image with a 2d launch and downsample/reduce it with a post-kernel, though this comes with the downside of needing a constant, rectangular number of sub-samples. Third would be a 1d or 3d launch where you pay attention to the indexing and put neighboring samples into neighboring threads.
Does that help? Let me know if any of it is unclear or reveals more questions.