How to understand and optimize register usage for OptiX megakernels?

Hi,

According to NVIDIA Nsight, in the optixPathTracer example, megaKernel0 uses 64 registers per thread, which is one of the bottlenecks. After stubbing out pathtrace_camera() and emptying all the other function bodies, it still uses 28 registers per thread. The stubbed pathtrace_camera body looks like this:

RT_PROGRAM void pathtrace_camera()
{
    float zero = 0.0f;
    Ray ray = make_Ray(make_float3(zero, zero, zero), make_float3(zero, zero, zero), zero, zero, RT_DEFAULT_MAX);
    rtTrace(top_object, ray, zero);
}

As far as I know, OptiX has some built-in code, such as jump tables, that consumes some registers. However, the cost of OptiX's black box is sometimes large and hard to control. For example, I then made both GeometryGroups empty in optixPathTracer.cpp:

//GeometryGroup shadow_group = context->createGeometryGroup(gis.begin(), gis.end());
GeometryGroup shadow_group = context->createGeometryGroup();

and

//GeometryGroup geometry_group = context->createGeometryGroup(gis.begin(), gis.end());
GeometryGroup geometry_group = context->createGeometryGroup();

With this change, the register usage increases from 28 to 55 registers per thread!? That is 55 registers used for doing nothing. Of course, this change is only meant to narrow down the problem, but it also shows that some OptiX API usage on the host side can affect register usage a lot. Even trickier: if we create a non-empty geometry group but do not use it (do not assign it to any variable or parent node), the register usage goes back to 28! I don't understand this at all. This example may reveal just one of many such problems.

Moreover, after removing the rtTrace call, 16 registers per thread are still in use. This means:

  1. 16 registers are used regardless, even when all functions in the user's PTX are empty.
  2. 28 - 16 = 12 more registers are used because of rtTrace.
  3. 55 - 16 = 39 more registers are used because of rtTrace if the GeometryGroup is empty.

The baseline of 16 registers seems constant. But the additional register usage caused by rtTrace can vary a lot for the same user PTX code, depending on the OptiX host code. The problem is that users may be able to optimize their own PTX code, but if most of the register usage comes from OptiX's black box, then we (users) cannot do much about it.
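
To put the register counts above in context, here is a rough sketch of how registers per thread cap the number of resident threads per SM. It assumes the per-SM limits for compute capability 6.1 (GTX 1080 Ti): 65536 registers and 2048 resident threads. The function names are made up for illustration, and the estimate ignores register allocation granularity, shared memory, and block-size limits, so treat the results as an upper bound rather than what Nsight would report.

```cpp
#include <algorithm>

// Assumed per-SM limits for compute capability 6.1 (e.g. GTX 1080 Ti).
constexpr int kRegistersPerSM  = 65536;
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kWarpSize        = 32;

// Hypothetical helper: upper bound on resident threads per SM when
// registers are the only limiting resource. Simplified on purpose:
// it ignores allocation granularity, shared memory, and block limits.
int maxResidentThreads(int regsPerThread) {
    int byRegisters = kRegistersPerSM / regsPerThread;
    byRegisters -= byRegisters % kWarpSize;  // threads are scheduled in whole warps
    return std::min(byRegisters, kMaxThreadsPerSM);
}

// Hypothetical helper: the same bound as a percentage of the
// maximum number of resident threads (integer division, rounds down).
int occupancyPercent(int regsPerThread) {
    return 100 * maxResidentThreads(regsPerThread) / kMaxThreadsPerSM;
}
```

Under these assumptions, occupancyPercent(64) gives 50 for the full megaKernel0, while occupancyPercent(28) and occupancyPercent(16) both give 100, so the jump from 28 to 55 registers is the point where registers start to limit occupancy.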

Register usage can easily become a bottleneck. Are there any debuggers or profilers that can dump the stack and register details? I read https://devtalk.nvidia.com/default/topic/1026591/optix/how-could-i-debug-the-code-in-cuda-files-in-a-optix-project-/ and was sad to learn that Nsight cannot help with this. It seems rtPrintf and rtThrow cannot help either. Maybe I missed something.

I hope there is a guide or set of best practices for understanding and optimizing register usage. For example, it could explain what happens after the first call to rtTrace in a ray generation program, why so many registers are used, and how users can reduce them.

Please help and thanks a lot.

Spec:

2 x GTX 1080 Ti
GPU Driver 417.35
NVIDIA Nsight 6.0
OptiX 4.1.1
CUDA 10.0