Trouble Selecting single GPU on dual GPU system

This issue has already been posted about at cuda - 'cudaMalloc' unintentionally allocating memory on multiple GPUs instead of just 1 - Stack Overflow and linux - Unwanted Pytorch data duplication across multiple GPUs - Stack Overflow, but I'll write a summary here.

At my work we have 3 machines with very similar setups, let’s call them a, b, and c. On machine b we don’t seem to be able to get CUDA code to run on a single GPU.

The person who first noticed it helpfully wrote the following code to isolate the issue:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <GPU_ID> <Memory_Size_GB>\n", argv[0]);
        return 1;
    }

    int gpu_id = atoi(argv[1]);
    size_t memory_size_gb = atoll(argv[2]);

    
    // Set the GPU
    cudaError_t cudaStatus = cudaSetDevice(gpu_id);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?\n");
        return 1;
    }

    // Convert GB to bytes for memory allocation
    size_t size = memory_size_gb * 1024 * 1024 * 1024;

    // Allocate memory on the GPU
    void *gpu_memory;
    cudaStatus = cudaMalloc(&gpu_memory, size);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed to allocate %zu bytes!\n", size);
        return 1;
    }

    printf("Allocated %zu GB of memory on GPU %d\n", memory_size_gb, gpu_id);
    printf("Press Enter to free memory and exit...\n");

    getchar(); // Wait for Enter

    // Free the memory when done
    cudaStatus = cudaFree(gpu_memory);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaFree failed!\n");
        return 1;
    }

    return 0;
}

If this is compiled as gtest (e.g. nvcc gtest.cu -o gtest) and run as ./gtest 0 5 or ./gtest 1 5 on machines a and c, then 5 gigabytes of memory are allocated specifically on GPU 0 or GPU 1 respectively, as expected. On machine b, however, both commands cause 5 gigabytes to be allocated on both graphics cards.
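For anyone wanting to cross-check where the allocation actually lands without relying on nvidia-smi, a second process can query free memory per device while gtest sits at its getchar() prompt. This is just a sketch using standard CUDA runtime calls; note that querying a device this way creates a context on it, which itself shows up as a couple of hundred megabytes of usage in nvidia-smi.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Run this while gtest is waiting at its prompt: the ~5 GB drop in free
// memory should appear only on the device that was passed to gtest.
int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaSetDevice(i);
        size_t free_b = 0, total_b = 0;
        if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess)
            continue;
        printf("GPU %d: %.2f GB free of %.2f GB\n",
               i, free_b / 1073741824.0, total_b / 1073741824.0);
    }
    return 0;
}
```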

If used in conjunction with CUDA_VISIBLE_DEVICES, running CUDA_VISIBLE_DEVICES=0 ./gtest 0 5 and CUDA_VISIBLE_DEVICES=1 ./gtest 0 5 works as you'd expect on machines a and c, but on b specifying 0 gives a segmentation fault and specifying 1 allocates 5 gigabytes on both graphics cards.
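For reference, my understanding of how these variables interact (worth double-checking against the CUDA docs): CUDA_VISIBLE_DEVICES filters which GPUs the runtime enumerates, and the visible devices are renumbered from 0, so the device ID passed to gtest is an index into the visible set, not the physical ID. The runtime also orders devices "fastest first" by default, which can differ from the PCI order that nvidia-smi and lspci use, so forcing PCI order rules out a numbering mismatch between tools:

```shell
# Device IDs are renumbered within the visible set, so "0" here
# refers to the physical GPU 1.
CUDA_VISIBLE_DEVICES=1 ./gtest 0 5

# Force CUDA's enumeration to match the PCI bus order reported by
# nvidia-smi/lspci before interpreting the device IDs.
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ./gtest 0 5
```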

Not sure if it's relevant, but the driver versions and cards on machine a are:

NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7
$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)

on machine b

NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7
$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP102GL [Quadro P6000] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102GL [Quadro P6000] (rev a1)

and on machine c

NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6
$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)

In a reply to one of those posts, someone suggested checking SLI. I noticed that on machine b there was an error in Xorg.0.log about Auto not being a supported mode. I changed it to Mosaic with no effect, but I also later noticed that machine c logs the same error and works fine.
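In case it helps anyone reproduce, a small enumeration check (standard CUDA runtime calls, nothing specific to our setup) can confirm whether the runtime on machine b actually reports the two cards as distinct devices at distinct PCI addresses:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA sees %d device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        // The PCI location should match what lspci reports (bus:device.function)
        printf("Device %d: %s, PCI %02x:%02x.0\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

If both Quadro P6000s show up with the PCI addresses lspci reports (01:00.0 and 02:00.0), the duplication is presumably happening somewhere below the enumeration level.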