3x K6000 same performance as 3x Quadro 5000?

We’ve recently upgraded our evaluation system from Fermi to Kepler-based GPUs (3x Quadro 5000 -> 3x Quadro K6000), running under Linux x86_64 with OptiX 3.5.1, CUDA 5.5 and current drivers.
Unfortunately there has been little (if any) improvement in tracing performance, and we are wondering whether our application is doing something silly that slows things down. At its core we have a recursive ray tracer, mostly primary and shadow rays (~3 light sources) with some recursion for transparent or reflective surfaces. The main render buffer is an OpenGL buffer object (usage: GL_STREAM_DRAW) shared with OptiX, created with the RT_BUFFER_OUTPUT flag and format RT_FORMAT_FLOAT4 (tone mapping is done by the OpenGL shader drawing this buffer to the screen). For an average frame around 50 variables are assigned (rtVariableSet*), including the ones that hold the output buffer.

I know this is a fairly vague description, but perhaps there is something jumping out as a bad idea for a multi-GPU setup? Suggestions on how to analyze this better are also very much appreciated. Thanks!

  • How does it perform when comparing one Quadro 5000 against one Quadro K6000, and two vs. two?

There is no OpenGL interop to a PBO when using multiple GPUs.
Checking on PCI-E limits:

  • What’s your resulting image size and absolute rendering frame rate?
  • How many PCI-E lanes are assigned to the individual boards? (All three on 16x slots?)
  • Does it change when changing the output format from float4 (RGBA32F internal format, RGBA FLOAT user format and type) to uchar4 (RGBA8 internal format, BGRA(!) and UNSIGNED_BYTE user format and type)? Mind the BGRA user format to hit the fast path!

Testing if there is anything else going on:

  • Are you saying you’re setting the output buffer every frame? Does it change?
  • When you set these 50 variables per frame, is any of them changing buffers?
  • Do you add any new variables?
  • Are you changing the scene?

If yes, that might involve a validation and recompile which would be slow.

To verify whether something like that happens, call validate(), compile(), and launch(0,0) before the real launch and check whether any of them takes an unexpected amount of time. (Debug only! Don’t ever do that when nothing changed, except once at the start.)
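A minimal sketch of how that timing could be collected, in plain C++ so it runs without OptiX. The helper names `time_ms` and `time_steps` are made up for illustration. The OptiX calls appear only in a comment, on the assumption that the C++ wrapper's `context->validate()`, `context->compile()` and `context->launch(0, 0)` are what is meant above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Wall-clock milliseconds spent inside one call of `step`.
inline double time_ms(const std::function<void()>& step) {
    assert(step && "step must be callable");
    const auto start = std::chrono::steady_clock::now();
    step();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct StepTiming {
    std::string name;  // label chosen by the caller
    double ms;         // measured wall-clock time in milliseconds
};

// Runs the named steps in order and records how long each one took.
inline std::vector<StepTiming> time_steps(
    const std::vector<std::pair<std::string, std::function<void()>>>& steps) {
    std::vector<StepTiming> out;
    for (const auto& s : steps) {
        out.push_back({s.first, time_ms(s.second)});
    }
    return out;
}

// Hypothetical debug-only usage with an OptiX context (assumed wrapper calls):
//   auto t = time_steps({
//       {"validate",    [&] { context->validate(); }},
//       {"compile",     [&] { context->compile(); }},
//       {"launch(0,0)", [&] { context->launch(0, 0); }},
//   });
```

If one of the three steps is slow on frames where nothing structural changed, that would point at a revalidation or recompile being triggered by the per-frame updates.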

  • How about acceleration structure refitting?
    Are you marking any acceleration structures dirty for an animation etc.?

Thank you for the valuable suggestions for discovering the bottleneck!

The machine is currently being shipped for a demo, so it’ll take a little time until I can do performance tests again - sorry, bad timing on my part :-/

I’m not sure I fully understand the point about there being no OpenGL interop to a PBO with multiple GPUs (or rather what its consequences are). Are you saying I shouldn’t bother with creating an OpenGL buffer object as output for the rendered image?

The image is 1280x1024 and we get about 25 FPS.
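For scale, here is a back-of-the-envelope sketch of how much pixel data 1280x1024 at about 25 FPS amounts to. It assumes every output pixel crosses the bus exactly once per frame, which may not match what actually happens with three GPUs. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one frame of `width` x `height` pixels.
constexpr std::size_t frame_bytes(std::size_t width, std::size_t height,
                                  std::size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Mebibytes per second when `bytes_per_frame` is moved `fps` times a second.
constexpr double mib_per_second(std::size_t bytes_per_frame, double fps) {
    return static_cast<double>(bytes_per_frame) * fps / (1024.0 * 1024.0);
}

// Pixel sizes for the two formats discussed in this thread.
constexpr std::size_t kFloat4Bytes = 16;  // 4 channels x 32-bit float
constexpr std::size_t kUchar4Bytes = 4;   // 4 channels x 8-bit

// float4: 1280 * 1024 * 16 bytes = 20 MiB per frame, 500 MiB/s at 25 FPS.
// uchar4: 1280 * 1024 * 4 bytes  =  5 MiB per frame, 125 MiB/s at 25 FPS.
```

Under that assumption the float4 buffer moves four times as much data as uchar4, so comparing the two formats is a cheap way to see whether the transfer matters at all.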

I’ll give the uchar4 output format a try and report back.

It is the same buffer every frame; I’m currently not avoiding redundant rtVariableSet*() calls. Is that a generally valuable/recommended optimization? Is it more important to do for buffers than for other variable types?
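Whether skipping redundant sets helps is exactly the open question here, so the following is only a sketch of what such filtering could look like, not a recommendation. `FloatVariableCache` and its setter callback are hypothetical. In a real application the callback would forward to something like `rtVariableSet1f`, and a similar wrapper would be needed for each variable type.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: forwards a float value to `setter` only when it
// differs from the last value forwarded under the same name.
class FloatVariableCache {
public:
    using Setter = std::function<void(const std::string&, float)>;

    explicit FloatVariableCache(Setter setter) : setter_(std::move(setter)) {
        assert(setter_ && "setter must be callable");
    }

    // Returns true if the value was forwarded, false if it was skipped
    // because it equals the previously forwarded value.
    // NaN never compares equal to itself, so NaN is always forwarded.
    bool set(const std::string& name, float value) {
        const auto it = last_.find(name);
        if (it != last_.end() && it->second == value) {
            return false;  // unchanged since last frame, nothing to do
        }
        last_[name] = value;
        setter_(name, value);
        return true;
    }

private:
    Setter setter_;
    std::map<std::string, float> last_;  // last forwarded value per name
};
```

Counting how many of the ~50 per-frame assignments such a cache skips would at least show how many of them are redundant.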

2-3 variables refer to buffers (depending on what anti aliasing methods are active), however, they are mostly assigned the same buffer as in the previous frame.

No, only values change; there are no changes to the scene structure.

There is one moving object (i.e. its transform changes between frames), and the root node’s acceleration structure is rebuilt each frame. The root has a number of children in the low hundreds. Thanks again!