Change ID assigned to GPUs

Hello,

One of the seven M2070s in our GPU cluster is acting strangely, and I would like to bypass it for the time being by giving it a high device ID.

Unfortunately, its current device ID is 1, so whenever an application uses 2 or more GPUs, it picks the bad one and crashes. I can identify which one is bad because I have a test app that can run on a selected device ID: id=1 fails, while all others succeed (it fails on cuCtxSynchronize with the driver API and on cudaMemcpy with the runtime API). I agree it’s strange for a device to be visible, accessible, and print the same info as all the others in nvidia-smi -q output, and still crash, but what can I do. The bigger problem at the moment is that this one GPU is blocking 6 GPUs from being used, just by having a low device ID.

I know cudaSetDevice(cpu_thread_id) is not the best way to do things, but it’s deep inside a huge application, and I was wondering if I could allow users to use up to 6 GPUs until we replace the bad one, simply by assigning it id=6.

Unfortunately, I couldn’t find a way to change the numbers assigned to the GPUs:

In /proc/driver/nvidia/gpus/, I have
0 1 2 3 4 5 6

These are folders, each holding two files, registry and information. The information file looks like this:

Model: Tesla M2070
IRQ: 24
Video BIOS: ??.??.??.??.??
Card Type: PCI-E
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:11.00.0

I tried switching the contents of folders 1 and 6 as root, but it won’t let me write there, and it seems sketchy anyway.

If I do nvidia-smi -L, I get

GPU 0: Tesla M2070 (S/N: 0323810023943)
GPU 1: Tesla M2070 (S/N: 0323910067088)
GPU 2: Tesla M2070 (S/N: 0323810024610)
GPU 3: Tesla M2070 (S/N: 0323910066122)
GPU 4: Tesla M2070 (S/N: 0323810024517)
GPU 5: Tesla M2070 (S/N: 0323810023629)
GPU 6: Tesla M2070 (S/N: 0323810011984)

My gut feeling tells me it MUST be possible to enumerate the GPUs myself, and I would appreciate it a lot if someone could tell me how…

My next guess is opening the blade and taking out the GPUs one by one until I have the bad guy… that’s 7 times, worst case… MAN do I NOT wanna do that…

Cheers

Igor

Setting the environment variable CUDA_VISIBLE_DEVICES to 0,2,3,4,5,6 should do what you want.
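
With the bad device hidden, CUDA renumbers the remaining six as 0 to 5, so cudaSetDevice(cpu_thread_id) inside the application keeps working for up to 6 threads. If you want to script this for your users, here is a rough sketch in Python of a launcher that builds the variable and starts the application with it. The helper names (visible_devices_excluding, run_without_bad_gpus) and the launch.py file name are made up for illustration, and the count of 7 devices with device 1 being faulty is taken from the post above.

```python
import os
import subprocess
import sys


def visible_devices_excluding(total_devices, bad_ids):
    """Build a CUDA_VISIBLE_DEVICES value listing every ID except bad_ids."""
    bad = set(bad_ids)
    return ",".join(str(i) for i in range(total_devices) if i not in bad)


def run_without_bad_gpus(command, total_devices, bad_ids):
    """Run command with the bad devices hidden from CUDA (hypothetical helper)."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["CUDA_VISIBLE_DEVICES"] = visible_devices_excluding(total_devices, bad_ids)
    return subprocess.run(command, env=env, check=False)


if __name__ == "__main__":
    # Assumption from the thread: seven GPUs (IDs 0..6), device 1 is faulty.
    print(visible_devices_excluding(7, [1]))
    # Hypothetical usage: python launch.py ./my_cuda_app arg1 arg2
    if len(sys.argv) > 1:
        sys.exit(run_without_bad_gpus(sys.argv[1:], 7, [1]).returncode)
```

The variable only affects processes started with it in their environment, so the bad card still shows up in nvidia-smi and under /proc/driver/nvidia/gpus/.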

Thank you very much, that works fine!