Using multiple GPUs to scale an existing Cuda application - failing to allocate memory

I work on a Cuda code base for image processing and am working on scaling up an existing application as we start using larger imagers. I’m having trouble using multiple GPUs and am unsure what is going wrong.

I am using a Windows 10 system with 2 x Titan XP, Cuda 9.2, driver version 399.07, Visual Studio 2015.

Due to confidentiality policies at my company, I have to be careful about how much detail I give about the application. The general summary is that we receive anywhere from 10 to 100+ images as input, do some processing, and output a single result image. We allocate one large chunk of memory once, during initialization, and re-use it every time a new set of images is processed. The C++ code for the GPU processing is wrapped in C# code for the user application and image acquisition.

Traditionally, there has been one instance of this processing pipeline initialized at a time. Recently, it’s been changed so that there can now be multiple instances, each on a separate C# thread. The goal was to have each thread be able to use an arbitrary GPU in a multi-GPU system, but the problem I’m running into is that the total memory allocations seem limited to 12GB, regardless of which GPU is in use.

To try and explain, let’s use a 3 instance setup on our two 12GB Titan XP cards. If we need 3.5GB of memory per instance, we can set it up on one card like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 0 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
10.5GB total on GPU 0 → Success
Or on two cards like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 1 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
7GB on GPU 0 and 3.5GB on GPU 1 → Success

Both of these setups work as expected, including the one that uses both GPUs.

However, if we require 4.5GB per instance:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 0 → 4.5GB
Instance 2 → GPU 0 → 4.5GB → can’t allocate, fails as expected
Two cards:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 1 → 4.5GB → Fails right here
Instance 2 → GPU 0 → 4.5GB

It seems like I should be able to use 9GB on one card and 4.5GB independently on the other, but I can’t. What’s particularly weird is that it fails during the initialization of Instance 1, on the second card, before the memory on the first card would even have been filled up.
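To make the arithmetic behind that expectation explicit, here is a minimal sketch in plain C++ (no CUDA calls) that sums the requested memory per GPU for a given assignment. The `InstancePlan` struct and both function names are hypothetical, made up for illustration, and sizes are in MiB (4.5GB = 4608, 12GB = 12288).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one pipeline instance: the GPU it is assigned
// to and how much device memory it reserves during initialization.
struct InstancePlan {
    int device;          // GPU index (0 or 1 in the setups above)
    long long mebibytes; // requested allocation in MiB
};

// Sum the requested memory per GPU for a given assignment.
std::vector<long long> perDeviceTotals(const std::vector<InstancePlan>& plan,
                                       int deviceCount) {
    std::vector<long long> totals(static_cast<std::size_t>(deviceCount), 0LL);
    for (const InstancePlan& p : plan) {
        assert(p.device >= 0 && p.device < deviceCount);
        totals[static_cast<std::size_t>(p.device)] += p.mebibytes;
    }
    return totals;
}

// True if every GPU's total stays within the given per-GPU capacity.
// This treats each GPU as an independent pool, which is the assumption
// the failing two-card setup appears to violate.
bool fitsPerDevice(const std::vector<InstancePlan>& plan, int deviceCount,
                   long long capacityMiB) {
    for (long long total : perDeviceTotals(plan, deviceCount)) {
        if (total > capacityMiB) {
            return false;
        }
    }
    return true;
}
```

For the two-card 4.5GB plan `{{0, 4608}, {1, 4608}, {0, 4608}}`, the per-GPU totals are 9216 MiB and 4608 MiB, so `fitsPerDevice(plan, 2, 12288)` returns true, while the same three instances all on GPU 0 do not fit. The nominal 12GB is an upper bound, since the usable memory on each card is somewhat less, but on paper the two-card plan should succeed.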

Does anyone have any suggestions? I’m going to work on creating a small program to demonstrate this issue, but any help would be greatly appreciated. I also have to track down the exact error message at this spot; it’s currently being translated into a less useful error code in our application.

Are you using the TCC driver or the WDDM 2.x driver (which is default on Windows 10)? If you are not using the TCC driver, try switching to that. Note that by definition, a GPU under control of the TCC driver cannot service the GUI, so you may need a (low-end) GPU for that, reserving the two Titans for compute.

How much system memory is in this system? Windows 10 WDDM 2.0 requires backing store for all GPU memory (I think, I don’t know the details).

Make sure you don’t have any SLI hardware installed or software settings enabled. Disable SLI completely, from both a hardware and a software perspective.

Thank you both for your replies.

There is no SLI hardware installed and it is disabled in the Nvidia control panel.

The computer has 32GB of memory but I am able to upgrade to 64 and will try that. After looking at WDDM 2.x documentation, I have a strong suspicion that this is the culprit. I will try configuring the cards to run in TCC mode this afternoon and see what happens.

Regardless of the memory allocation issue, I think you will be happier with the TCC driver because it also eliminates the overhead associated with the use of WDDM drivers. While TCC is not supported for all GPUs, there is no good reason I can think of not to use it where supported.

General recommendation for high-performance GPU-accelerated systems: system memory should be four times the total amount of GPU memory. Depending on the application, you may be able to get away with system memory only twice the size of GPU memory.
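As a rough illustration of that rule of thumb, the sketch below computes both figures from the GPU count and per-GPU memory. The `SystemMemoryAdvice` struct and `adviseSystemMemory` function are hypothetical names; only the 2x and 4x multipliers come from the recommendation above.

```cpp
#include <cassert>

// Hypothetical result type: both figures are in GiB of system memory.
struct SystemMemoryAdvice {
    long long minimumGiB;     // 2x total GPU memory ("may get away with")
    long long recommendedGiB; // 4x total GPU memory (general recommendation)
};

// Apply the 2x / 4x rule of thumb to a system with identical GPUs.
SystemMemoryAdvice adviseSystemMemory(int gpuCount, long long gibPerGpu) {
    assert(gpuCount >= 0 && gibPerGpu >= 0);
    const long long totalGpuGiB = gpuCount * gibPerGpu;
    return SystemMemoryAdvice{2 * totalGpuGiB, 4 * totalGpuGiB};
}
```

For the system in this thread, `adviseSystemMemory(2, 12)` gives 48GB as the lower figure and 96GB as the recommended one. The installed 32GB is below both, and the planned 64GB upgrade would land between them.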

Looks like switching the drivers to TCC mode did the trick. Thank you so much!