Profiling - 3x K6000 not fully utilized

Hello,

When I run my application, the nvidia-settings panel shows that each of my three K6000s is only at around 25% load, and the single Quadro 5000 that drives the projection is at a similar level. PCIe load is shown at around 30%. I did a quick test with nvprof, but it seems of limited use with OptiX, since almost all of the time (>95%) is reported in __globfunc__Z7trace_0v, which I assume is the kernel assembled from my PTX programs. Is there a better way to find out where my application spends most of its time?

System: OptiX 3.8.0, CUDA 7.0 on Fedora 21 x86_64. 3x K6000, 1x Quadro 5000 (not used in the OptiX context), driver 346.87.

I’ve found Nsight to be a very useful tool for profiling my application using OptiX. It will also not give you resolution below the main kernel, though.

Sounds as if your situation has not changed since your last question on this topic.
https://devtalk.nvidia.com/default/topic/747393/3x-k6000-same-performance-as-3x-quadro-5000-/#4227293

That could still be anything at this time.

The first big thing to determine is which of two cases you are in:
either the data transfer is the limiting factor, or the ray tracing itself is not efficient.

That can easily be determined.

  • Test if the GPU load changes when not transferring things to the (different architecture) GPU when ray tracing

      • on a single GPU,

      • on all three combinations of two of your K6000 (to determine if they are connected differently, e.g. 16 vs. 8 PCI-E lanes),

      • on all three of them.

  • Check if you’re able to get >90% load on the ray tracing GPU when only using one GPU with no transfer.
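One way to walk through those device combinations without touching the application code is to restrict which GPUs CUDA (and therefore OptiX) can see via the CUDA_VISIBLE_DEVICES environment variable. The following is only a minimal sketch: the ordinals 0, 1, 2 for the three K6000s and the `run_config` helper are assumptions for illustration, so check the real ordinals on your system (for example with the CUDA deviceQuery sample) before relying on it.

```python
import itertools
import os
import subprocess


def device_configs(devices):
    """Return every subset to test: single GPUs, pairs, then the full set."""
    configs = []
    for size in range(1, len(devices) + 1):
        configs.extend(itertools.combinations(devices, size))
    return configs


def run_config(command, devices):
    """Hypothetical helper: run `command` with only `devices` visible to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return subprocess.run(command, env=env).returncode


if __name__ == "__main__":
    # ASSUMPTION: CUDA ordinals 0, 1, 2 are the three K6000s on this machine.
    k6000 = [0, 1, 2]
    for cfg in device_configs(k6000):
        # Print the setting to use for each run of the test matrix.
        print("CUDA_VISIBLE_DEVICES=" + ",".join(str(d) for d in cfg))
```

For three boards this yields seven runs (three singles, three pairs, one triple). Note the GPU load for each run, once with and once without the transfer to the display GPU, and compare.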

If you are not able to get the ray tracing load >85-90% on a single board when not transferring data, you shouldn’t expect multiple GPUs to reach that load either. Keep working on your ray tracing algorithm until that is actually limited by the GPU, then add more boards.
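If you would rather log the load than watch the nvidia-settings panel, something along these lines could sample it from nvidia-smi during a run. This is a rough sketch, not a recommended tool: the function names are made up for illustration, the 85% threshold is simply the lower bound mentioned above, and it assumes your driver reports `utilization.gpu` as a plain number for these boards (it will fail on a "[Not Supported]" entry).

```python
import subprocess

# nvidia-smi query that prints one "index, utilization" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Parse 'index, utilization' lines into {gpu_index: percent}."""
    loads = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        loads[int(index)] = int(util)
    return loads


def sample_utilization():
    """Take one sample from nvidia-smi (needs the NVIDIA driver tools)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(out.stdout)


def underutilized(loads, threshold=85):
    """Return the GPU indices whose load is below `threshold` percent."""
    return sorted(i for i, load in loads.items() if load < threshold)
```

Calling `underutilized(sample_utilization())` a few times while the renderer is running gives a quick list of boards that are not yet GPU-limited.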

If it’s the data transfer, you’d need to find ways to speed up the transfer or reduce the traffic.
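To get a feeling for how much traffic the output buffer alone causes, a back-of-the-envelope estimate can help. The numbers below (1920x1080, 60 frames per second, a float4 versus a uchar4 output buffer) are purely hypothetical and not taken from your setup; plug in your own resolution, format and frame rate.

```python
def bytes_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Bytes that must be moved per second to transfer every rendered frame."""
    return width * height * bytes_per_pixel * frames_per_second


if __name__ == "__main__":
    # ASSUMPTION: hypothetical resolution, frame rate and buffer formats.
    float4 = bytes_per_second(1920, 1080, 16, 60)  # 4 floats per pixel
    uchar4 = bytes_per_second(1920, 1080, 4, 60)   # 4 bytes per pixel
    print("float4: %.2f GB/s" % (float4 / 1e9))
    print("uchar4: %.2f GB/s" % (uchar4 / 1e9))
```

In this made-up case, switching the output buffer from float4 to uchar4 cuts the traffic to a quarter, which is the kind of reduction meant by "reduce the traffic". Compare the result against what your PCIe links can actually sustain.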