multihead rendering performance issue

This is a bit long, so bear with me…

We are currently in the final stages of a simulator project and are optimizing to get the system performance to where we want it. We have already made significant progress but have hit the figurative wall where rendering times are concerned.

The target hardware is a multi-headed openSUSE Leap 42.1 server with a 7900X 10-core CPU, 32 GB of DDR4 RAM and four GTX 1080 Ti graphics cards. X is currently set up with independent X server/screen layouts, one for each graphics card. Each card is configured to output two 1920x1080 signals, with each X screen being 1920x2160.

The Unigine Heaven 4.0 benchmark on its most extreme settings at 1920x2160 runs at ~41 frames/sec when four simultaneous instances run on this hardware. Running fewer instances results in higher frame rates: 45 for three, 49 for two and 53 for a single instance. This is not yet troublesome, but it is noteworthy, since each instance has its own dedicated GPU and there should be enough CPU cores to keep the instances from competing for CPU time.
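
For completeness, this is roughly how the simultaneous benchmark instances are launched, each bound to its own X display and with vsync off. The binary path and display names below are placeholders for what we actually use:

    #!/usr/bin/env python3
    # Launch N instances, one per X display, with vsync disabled.
    # BINARY and DISPLAYS are placeholders for our actual setup.
    import os
    import subprocess
    import sys

    BINARY = "./heaven"                  # placeholder for the benchmark/app binary
    DISPLAYS = [":0", ":1", ":2", ":3"]  # one X server per graphics card

    def launch(count):
        procs = []
        for display in DISPLAYS[:count]:
            env = dict(os.environ,
                       DISPLAY=display,
                       __GL_SYNC_TO_VBLANK="0")  # vsync off, as we do for our own application
            procs.append(subprocess.Popen([BINARY], env=env))
        for p in procs:
            p.wait()

    if __name__ == "__main__":
        launch(int(sys.argv[1]) if len(sys.argv) > 1 else 4)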

Kernel version 4.1.36
Nvidia driver 384.90

Our application setup (also using the Unigine engine) consists of a master application instance and three slave application instances, each using one of the four X server/graphics card combos. Communication between the instances is done through shared memory. The main thread of each instance is responsible for rendering, while the simulation itself and other CPU-heavy work is performed in other threads in the master instance. So the frame rate is mostly directly dependent on Unigine rendering speed.
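
For context, the inter-instance communication is conceptually simple. A stripped-down Python stand-in for the kind of shared-memory lockstep involved (names are made up, and this is a simplification of what the engine code actually does) looks like this:

    # Illustration of a master/slave lockstep over shared memory (simplified stand-in).
    import struct
    import time
    from multiprocessing import shared_memory

    SHM_NAME = "sim_frame_sync"   # placeholder name
    FMT = "Q"                     # a single 64-bit frame counter

    def master_create():
        return shared_memory.SharedMemory(name=SHM_NAME, create=True,
                                          size=struct.calcsize(FMT))

    def slave_attach():
        return shared_memory.SharedMemory(name=SHM_NAME)

    def master_publish(shm, frame):
        # master: simulate and render its own view, then publish the frame number
        struct.pack_into(FMT, shm.buf, 0, frame)

    def slave_wait_for(shm, last_frame):
        # slave: render as soon as the master has published a newer frame
        while struct.unpack_from(FMT, shm.buf, 0)[0] <= last_frame:
            time.sleep(0)         # yield; the real code blocks on its own primitive
        return struct.unpack_from(FMT, shm.buf, 0)[0]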

When run as a single instance, our application currently runs at 45-75 frames/sec in general. At a specific test location it runs at 60 frames/sec (all ms numbers below are from this location).

Now for the conundrum: each time an instance is added, the frame rate drops considerably. Adding one drops it below 50, adding two drops it below 30, and with three added it is crawling along at 15 frames/sec.

Using microprofiler we confirmed that rendering accounts for 95% of the time spent. With four instances about 60 ms per frame is spent on rendering in the master instance; with three it's 45 ms, with two 30 ms and with just the one 15 ms. It actually looks like something is scaling perfectly, aside from the fact that there are four independent GPUs… The slave instances have shorter render times (but still too long compared to the single-instance rate of 60 Hz) and spend the remainder waiting on the master. Note that the drops are not caused by the shared-memory syncing: four independent instances without any syncing exhibit the same performance degradation.

Note as well:

  • a development laptop (MSI GS63VR) with a 4-core CPU and a mobile GTX 1060 is able to run all four instances at about 18 frames/sec. So a couple of frames faster!

  • a development PC with a 6-core CPU and a GTX 1070 is able to run all four instances at roughly 30 frames/sec. So considerably faster.

  • disabling a warping module (eliminating the warping/blending shader and halving the total resolution from 1920x2160 to 1920x1080) has absolutely no effect on the frame rate. Frame time remains at 60 ms. (!)

  • the GPUs are idling. In fact their utilisation is so low that the fans don't even spin up, because the core temperatures stay around 55 degrees. (!) Running the Heaven benchmarks keeps temperatures at over 85 degrees constantly.

  • a visual estimate (using xosview) puts CPU utilisation at 25-50% at most, with not even a single core being maxed constantly. (A script to put precise numbers on both the GPU and CPU utilisation is sketched right after this list.)

  • if we enable a mirror (a complete extra scene render pass) in a slave instance, this has the same effect on the overall frame rate as adding a slave instance. So one slave with a mirror equals two slaves without mirrors.

  • application is started with vsync disabled (__GL_SYNC_TO_VBLANK=0)

  • changing the X server configuration to a single head with four screens makes no difference. Updating to the latest Nvidia driver: no difference either.
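
As mentioned above, a small sampler like the one below can put precise numbers on the GPU and CPU utilisation instead of the xosview eyeball estimate. It logs GPU utilisation via nvidia-smi and per-core CPU load from /proc/stat once per second (it assumes nvidia-smi is on the PATH):

    #!/usr/bin/env python3
    # Sample GPU utilisation (nvidia-smi) and per-core CPU load (/proc/stat) once per second.
    import subprocess
    import time

    def gpu_utilisation():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"], text=True)
        return [int(x) for x in out.split()]

    def cpu_times():
        # Returns {"cpu0": (busy_jiffies, total_jiffies), ...}
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu") and not line.startswith("cpu "):
                    name, *fields = line.split()
                    fields = [int(x) for x in fields]
                    idle = fields[3] + fields[4]   # idle + iowait
                    stats[name] = (sum(fields) - idle, sum(fields))
        return stats

    if __name__ == "__main__":
        prev = cpu_times()
        while True:
            time.sleep(1)
            cur = cpu_times()
            cores = sorted(cur, key=lambda c: int(c[3:]))
            loads = [round(100 * (cur[c][0] - prev[c][0]) /
                           max(1, cur[c][1] - prev[c][1])) for c in cores]
            print("GPU %:", gpu_utilisation(), "| CPU % per core:", loads)
            prev = cur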

We’re now officially stumped. All our heavy lifting is out of the render-time-critical path, so we cannot optimize this away in our code. And even though it seems unlikely, both the perfect scaling (15 -> 30 -> 45 -> 60 ms) and the fact that halving the GPU (pixel) workload has no effect whatsoever lead us to the thought that Unigine is possibly spending time doing nothing somewhere, even though the GPU is not doing much either.
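
One way to test the "spending time doing nothing" suspicion is to compare each instance's consumed CPU time against wall-clock time over a short window: if a process only burns a small fraction of one core while its frame takes 60 ms, its render thread is blocked rather than busy. A rough sketch (instance PIDs passed on the command line):

    #!/usr/bin/env python3
    # Compare per-process CPU time (utime + stime from /proc/<pid>/stat) to wall-clock time.
    # Far less than one core used while frame times are long means the process is blocked, not busy.
    import os
    import sys
    import time

    CLK_TCK = os.sysconf("SC_CLK_TCK")

    def cpu_seconds(pid):
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().rsplit(")", 1)[1].split()  # skip pid and (comm)
        utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of the stat line
        return (utime + stime) / CLK_TCK

    if __name__ == "__main__":
        pids = [int(p) for p in sys.argv[1:]]
        before = {p: cpu_seconds(p) for p in pids}
        t0 = time.time()
        time.sleep(5)
        wall = time.time() - t0
        for p in pids:
            used = cpu_seconds(p) - before[p]
            print(f"pid {p}: {used:.2f} s CPU over {wall:.2f} s wall "
                  f"({100 * used / wall:.0f}% of one core)")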

Could the driver (kernel module) be a bottleneck here? How would multiple processes using separate GPUs on separate X servers still impact each other?
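
To at least see whether the instances end up sleeping in the same kernel code path, something like the following can dump the wait channel and the context-switch counters for each instance. If every instance keeps showing the same wait channel inside the Nvidia module, that would point at serialisation in the driver rather than in our code (what /proc/<pid>/wchan reports depends on kernel configuration):

    #!/usr/bin/env python3
    # Print the kernel wait channel and context-switch counters for the given PIDs.
    import sys

    def wchan(pid):
        try:
            with open(f"/proc/{pid}/wchan") as f:
                return f.read().strip() or "0"
        except OSError:
            return "?"

    def ctxt_switches(pid):
        vol = nonvol = "?"
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("voluntary_ctxt_switches"):
                    vol = line.split()[1]
                elif line.startswith("nonvoluntary_ctxt_switches"):
                    nonvol = line.split()[1]
        return vol, nonvol

    if __name__ == "__main__":
        for pid in sys.argv[1:]:
            vol, nonvol = ctxt_switches(pid)
            print(f"pid {pid}: wchan={wchan(pid)} voluntary={vol} nonvoluntary={nonvol}")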

Any insights or advice, both as to the cause and how to proceed in finding it, would be greatly appreciated!