I work on a CUDA code base for image processing and am scaling up an existing application as we start using larger imagers. I’m having trouble using multiple GPUs and am unsure what is going wrong.
I am using a Windows 10 system with 2 x Titan XP cards, CUDA 9.2, driver version 399.07, and Visual Studio 2015.
Due to confidentiality policies at my company, I have to be careful about how much detail I give about the application. The general summary is that we receive anywhere from 10 to 100+ images as input, do some processing, and output a single result image. We allocate one large chunk of memory once, during initialization, and re-use it every time a new set of images is processed. The C++ code for the GPU processing is wrapped in C# code for the user application and image acquisition.
Traditionally, only one instance of this processing pipeline was initialized at a time. Recently, it’s been changed so that there can be multiple instances, each on its own C# thread. The goal was for each thread to be able to use an arbitrary GPU in a multi-GPU system, but the problem I’m running into is that the total memory allocated across all instances seems limited to 12GB, regardless of how the instances are distributed across the GPUs.
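To make the setup concrete, here is roughly what each instance does at initialization. This is a simplified sketch with placeholder names (initInstance, workspace), not our actual code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Simplified sketch of per-instance initialization (placeholder names).
// Each C# thread calls into native code that does roughly this once,
// then reuses the workspace for every image set it processes.
void* initInstance(int device, size_t bytes)
{
    // Bind this host thread to its assigned GPU; all subsequent runtime
    // calls on this thread target that device.
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d): %s\n", device, cudaGetErrorString(err));
        return nullptr;
    }

    // One large allocation at init time, reused for the life of the instance.
    void* workspace = nullptr;
    err = cudaMalloc(&workspace, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %zu bytes on GPU %d: %s\n",
                bytes, device, cudaGetErrorString(err));
        return nullptr;
    }
    return workspace;
}
```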
To explain, let’s use a three-instance setup on our two 12GB Titan XP cards. If we need 3.5GB of memory per instance, we can set it up on one card like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 0 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
10.5GB total on GPU 0 → success
Or on two cards like this:
Instance 0 → GPU 0 → 3.5GB
Instance 1 → GPU 1 → 3.5GB
Instance 2 → GPU 0 → 3.5GB
7GB on GPU 0 and 3.5GB on GPU 1 → success
Both of these setups work as expected, including the one that spreads instances across both GPUs.
However, if we require 4.5GB per instance:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 0 → 4.5GB
Instance 2 → GPU 0 → 4.5GB → can’t allocate, fails as expected
Two cards:
Instance 0 → GPU 0 → 4.5GB
Instance 1 → GPU 1 → 4.5GB → fails right here
Instance 2 → GPU 0 → 4.5GB
It seems like I should be able to use 9GB on one card and another 4.5GB independently on the other, but I can’t. What’s particularly weird is that it fails during the initialization of Instance 1, on the second card, at a point where only 4.5GB has been allocated on the first card.
Anyone have any suggestions? I’m going to work on a small standalone program to demonstrate the issue, but any help in the meantime would be greatly appreciated. I also have to track down the exact error message at this spot; it’s currently getting translated into a less useful error code by our application. A rough sketch of the repro I have in mind is below.
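This is a sketch under my assumption that the failure is a plain cudaMalloc error, not a verified program from our code base. It allocates the failing layout and prints the raw CUDA error string plus per-device free memory:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Print free/total memory for a device via cudaMemGetInfo.
static void printMemInfo(int device)
{
    cudaSetDevice(device);
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("  GPU %d: %.2f GiB free of %.2f GiB\n",
           device, freeB / (1024.0 * 1024.0 * 1024.0),
           totalB / (1024.0 * 1024.0 * 1024.0));
}

int main()
{
    const size_t chunk = 4608ull << 20;  // 4.5GiB per instance
    const int devices[] = { 0, 1, 0 };   // the layout that fails for us

    for (int i = 0; i < 3; ++i) {
        cudaSetDevice(devices[i]);
        void* p = nullptr;
        cudaError_t err = cudaMalloc(&p, chunk);
        printf("Instance %d on GPU %d: %s\n",
               i, devices[i], cudaGetErrorString(err));
        printMemInfo(0);
        printMemInfo(1);
        if (err != cudaSuccess) return 1;  // stop at the first failure
    }
    // Allocations are intentionally leaked; process exit cleans them up.
    return 0;
}
```

If this reproduces the failure, the printed error string should at least tell us whether it’s a plain cudaErrorMemoryAllocation or something else.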