Using multiple GPUs to scale an existing CUDA application - failing to allocate memory

bmikkila · September 4, 2018, 3:54pm

I work on a CUDA code base for image processing and am scaling up an existing application as we start using larger imagers. I’m having trouble using multiple GPUs and am unsure what is going wrong.

I am using a Windows 10 system with 2 x Titan XP, CUDA 9.2, driver version 399.07, and Visual Studio 2015.

Due to confidentiality policies at my company, I have to be careful about how much detail I give about the application. The general summary is that we receive anywhere from 10 to 100+ images as input, do some processing, and output a single result image. We allocate one large chunk of memory once, during initialization, and re-use it every time a new set of images is processed. The C++ code for the GPU processing is wrapped in C# code for the user application and image acquisition.

Traditionally, only one instance of this processing pipeline has been initialized at a time. Recently, this was changed so that there can be multiple instances, each on a separate C# thread. The goal was to have each thread be able to use an arbitrary GPU in a multi-GPU system, but the problem I’m running into is that the total memory allocations seem to be limited to 12GB, regardless of which GPU is in use.

To try and explain, let’s use a 3 instance setup on our two 12GB Titan XP cards. If we need 3.5GB of memory per instance, we can set it up on one card like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 0 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
10.5GB total on GPU 0 → Success
Or on two cards like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 1 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
7GB on GPU 0 and 3.5GB on GPU 1 → Success

And both of these setups work as expected, using both GPUs.

However, if we require 4.5GB per instance:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 0 → 4.5GB
Instance 2 → GPU 0 → 4.5GB → can’t allocate, fails as expected
Two cards:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 1 → 4.5GB → Fails right here
Instance 2 → GPU 0 → 4.5GB

It seems like I should be able to use 9GB on one card and 4.5GB independently on the other, but I can’t. What’s particularly weird is that it fails during the initialization of Instance 1, on the second card, before the memory on the first card would even have been filled up.

Does anyone have any suggestions? I’m going to work on creating a small program to demonstrate this issue, but any help would be greatly appreciated. I also have to track down the exact error message at this spot; it’s currently getting translated into a less useful error code in our application.
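
For reference, a small repro along these lines might look like the sketch below. It is only an outline under stated assumptions: the names `Allocator`, `first_failing_instance` and `make_per_card_model` are made up for illustration, sizes are in MiB (4.5GB = 4608, 12GB = 12288), and the allocator is injected so the harness can be compiled and run without a GPU. The comment shows how the real CUDA runtime calls could be plugged in. The stand-in model treats each card as having its own independent capacity, which is the behaviour expected above, not the behaviour observed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical allocator hook: returns true if `mib` MiB could be allocated
// on `device`. With the CUDA runtime it could wrap the real calls, e.g.:
//
//   [](int device, std::size_t mib) {
//       void* p = nullptr;  // deliberately never freed, like the long-lived buffer
//       if (cudaSetDevice(device) != cudaSuccess) return false;
//       cudaError_t err = cudaMalloc(&p, mib << 20);
//       if (err != cudaSuccess) std::printf("%s\n", cudaGetErrorString(err));
//       return err == cudaSuccess;
//   }
using Allocator = std::function<bool(int device, std::size_t mib)>;

// Walks the instances in order and returns the index of the first one whose
// allocation fails, or -1 if every instance gets its buffer.
int first_failing_instance(const std::vector<int>& device_of_instance,
                           std::size_t mib_per_instance,
                           const Allocator& alloc) {
    for (std::size_t i = 0; i < device_of_instance.size(); ++i) {
        if (!alloc(device_of_instance[i], mib_per_instance)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Stand-in allocator for running without a GPU. ASSUMPTION: every card has
// its own independent capacity and nothing else is using it.
Allocator make_per_card_model(std::size_t capacity_mib, int device_count) {
    auto used = std::make_shared<std::vector<std::size_t>>(
        static_cast<std::size_t>(device_count), std::size_t{0});
    return [used, capacity_mib](int device, std::size_t mib) {
        assert(device >= 0 && static_cast<std::size_t>(device) < used->size());
        std::size_t& in_use = (*used)[static_cast<std::size_t>(device)];
        if (in_use + mib > capacity_mib) {
            return false;  // this card alone is out of room
        }
        in_use += mib;
        return true;
    };
}
```

Under the per-card model, `first_failing_instance({0, 1, 0}, 4608, make_per_card_model(12288, 2))` returns -1 (everything fits), while the single-card layout `{0, 0, 0}` fails at instance 2. Swapping in the CUDA-backed allocator on the real machine would show whether the two-card layout instead fails at instance 1, and would print the actual error string. Since the current device is per-thread state in the CUDA runtime, each host thread would need to call `cudaSetDevice` itself before allocating.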

njuffa · September 4, 2018, 4:05pm

Are you using the TCC driver or the WDDM 2.x driver (which is the default on Windows 10)? If you are not using the TCC driver, try switching to that. Note that by definition, a GPU under control of the TCC driver cannot service the GUI, so you may need a (low-end) GPU for that, reserving the two Titans for compute.

How much system memory is in this system? Windows 10 WDDM 2.0 requires backing store for all GPU memory (I think, I don’t know the details).

Robert_Crovella · September 4, 2018, 4:12pm

Make sure you don’t have any SLI hardware installed or SW settings enabled. Disable SLI completely, from both a HW and SW perspective.

bmikkila · September 4, 2018, 4:36pm

Thank you both for your replies.

There is no SLI hardware installed and it is disabled in the Nvidia control panel.

The computer has 32GB of memory, but I am able to upgrade to 64GB and will try that. After looking at the WDDM 2.x documentation, I have a strong suspicion that WDDM is the culprit. I will try configuring the cards to run in TCC mode this afternoon and see what happens.

njuffa · September 4, 2018, 5:15pm

Regardless of the memory allocation issue, I think you will be happier with the TCC driver because it also eliminates the overhead associated with the use of WDDM drivers. While TCC is not supported for all GPUs, there is no good reason I can think of not to use it where supported.

General recommendation for high-performance GPU-accelerated systems: system memory should be four times the total amount of GPU memory. Depending on the application, you may be able to get away with system memory only twice the size of GPU memory.
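
Applied to the system in this thread, that rule of thumb works out as in the sketch below. The struct and function names are illustrative only, and the figures use nominal capacities (2 x 12GB cards) in whole GB.

```cpp
#include <cassert>
#include <cstdio>

// Host-memory sizing derived from the rule of thumb above.
struct HostMemoryGuideline {
    int gpu_total_gb;    // combined memory of all GPUs
    int lower_bound_gb;  // 2x GPU memory: may be enough, depending on the application
    int recommended_gb;  // 4x GPU memory: the general recommendation
};

inline HostMemoryGuideline host_memory_guideline(int gpu_count, int gb_per_gpu) {
    assert(gpu_count >= 0 && gb_per_gpu >= 0);
    const int total = gpu_count * gb_per_gpu;
    return HostMemoryGuideline{total, 2 * total, 4 * total};
}

// Prints how an installed amount of system memory compares to the guideline.
inline void print_guideline(int gpu_count, int gb_per_gpu, int installed_gb) {
    const HostMemoryGuideline g = host_memory_guideline(gpu_count, gb_per_gpu);
    std::printf("GPU memory: %dGB, 2x: %dGB, 4x: %dGB, installed: %dGB (%s)\n",
                g.gpu_total_gb, g.lower_bound_gb, g.recommended_gb, installed_gb,
                installed_gb >= g.recommended_gb   ? "meets 4x"
                : installed_gb >= g.lower_bound_gb ? "meets 2x only"
                                                   : "below 2x");
}
```

For two 12GB cards this gives 24GB of GPU memory, so 48GB at the 2x level and 96GB at the 4x level. The 32GB currently installed falls below both, and the planned 64GB upgrade would meet the 2x figure but not the 4x one.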

bmikkila · September 4, 2018, 7:09pm

Looks like switching the drivers to TCC mode did the trick. Thank you so much!
