cudaMalloc Allocating On All GPUs

I have a dead simple program that calls cudaSetDevice(0), then cudaMalloc, and then waits ten seconds. When I run it on a system with four GPUs, nvidia-smi shows the memory allocated on all four GPUs. What the heck is going on?

Here is the program and nvidia-smi output: Dumb CUDA · GitHub
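(The gist isn't reproduced inline, but a minimal program along these lines matches the description above; the 256 MiB buffer size and the ten-second sleep are just placeholders.)

#include <cuda_runtime.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Explicitly select the first GPU.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Allocate a buffer on (what should be) device 0 only.
    void *buf = NULL;
    err = cudaMalloc(&buf, 256 * 1024 * 1024);  // 256 MiB, arbitrary size
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Keep the allocation alive long enough to inspect with nvidia-smi.
    sleep(10);

    cudaFree(buf);
    return 0;
}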

Hi, welcome to the NVIDIA forums!

Does it make any difference if you restrict the devices visible to CUDA before launching your binary? For example, one of these:

CUDA_VISIBLE_DEVICES=3 ./a.out

CUDA_VISIBLE_DEVICES=0 ./a.out

CUDA_VISIBLE_DEVICES=1,2 ./a.out
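If you want to double-check what the runtime actually sees once the mask is applied, a quick sketch like this (just cudaGetDeviceCount and the device names) would tell you:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA sees %d device(s)\n", count);

    // List the devices that survive the CUDA_VISIBLE_DEVICES mask.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}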

I just found a decent article that goes in depth on multi-GPU programming, NVLink connections, and peer-to-peer memory access, although it doesn’t explain the effect you’re seeing: Multi-GPU programming with CUDA. A complete guide to NVLink. | GPGPU

Christian

It does not! Very, very interesting.

Are these GPUs configured in SLI mode? I notice you seem to have X configured to use all of them.

It seems that Base Mosaic mirrors every allocation across all cards. It is on by default, at least on CentOS. You can disable it with:

nvidia-xconfig --no-base-mosaic

Is this a bug? I can see why Base Mosaic might clone its own buffers, but I don’t see the logic for why it extends to CUDA. In any case, glad to have a workaround.
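For anyone who wants to check whether the memory is genuinely consumed on every card, or only reported that way, comparing cudaMemGetInfo per device before and after the allocation should show it. A rough sketch (256 MiB is an arbitrary size; the first loop is there only so that context-creation overhead doesn't skew the second reading):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    size_t before[16], after[16], total;

    // First pass: establish a context on each device and record free memory.
    for (int i = 0; i < count && i < 16; ++i) {
        cudaSetDevice(i);
        cudaMemGetInfo(&before[i], &total);
    }

    // Allocate 256 MiB on device 0 only.
    cudaSetDevice(0);
    void *buf = NULL;
    cudaMalloc(&buf, 256 * 1024 * 1024);

    // Second pass: see which devices actually lost free memory.
    for (int i = 0; i < count && i < 16; ++i) {
        cudaSetDevice(i);
        cudaMemGetInfo(&after[i], &total);
        printf("device %d: free went from %zu to %zu MiB\n",
               i, before[i] >> 20, after[i] >> 20);
    }

    cudaSetDevice(0);
    cudaFree(buf);
    return 0;
}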