I’m trying to profile and optimise our ray tracer, built with OptiX 5.0. In particular, we’re seeing almost no performance increase going from my laptop, which has a GeForce GTX 970M, to the production server, which has a Tesla K80. The OptiX documentation has some simple, useful bullet points on optimisation, but it’s hard to find more detailed information on how the structure of an OptiX application affects CUDA kernel performance.
After a session with the Visual Profiler, it seems quite clear that GPU utilisation is being limited by register usage. The megakernel uses 72 registers per thread, so only 2 blocks can execute concurrently per SM out of a maximum of 32 (on my 970M). This may be why we see no speedup on devices with more CUDA cores. My first thought was that it might be because we perform iterative tracing rather than recursive, so I profiled some of the precompiled examples and found the same thing; the lowest I saw was the path tracer example, at 64 registers per thread.
So my questions are: is this a problem with the base megakernel? Is it preventing better GPU utilisation? And is it stopping us from seeing performance improvements on better GPUs? (I realise that last one is potentially difficult to answer given the lack of info.) Is better GPU utilisation even possible with the OptiX API, or only with the OptiX Prime API?
A GPU utilisation guide for OptiX would be super useful.