One of the seven M2070s in our GPU cluster is acting strange, and I would like to bypass it for the time being by giving it a high device ID.
Unfortunately, its current device ID is 1, so whenever an application uses 2 or more GPUs, it picks the bad one and crashes. I can identify which one is bad, because I have a test app that runs on a selected device ID: id=1 fails while all the others succeed (it fails on cuCtxSynchronize with the driver API and on cudaMemcpy with the runtime API). I agree it’s strange for a card to be visible, accessible, and print the same info as all the others in nvidia-smi -q output, and still crash, but what can I do. The bigger problem at the moment is that this one GPU is blocking six GPUs from being used, just by having a low device ID.
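For reference, the test app does little more than the following (a minimal sketch, not my actual code; buffer sizes and names are illustrative). On the bad board, the cudaMemcpy is where it blows up:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal per-device smoke test: allocate, copy in, copy out, compare.
// Usage: ./gputest <device_id>
int main(int argc, char **argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    if (cudaSetDevice(dev) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", dev);
        return 1;
    }

    const size_t n = 1 << 20;
    float *h_in  = (float *)malloc(n * sizeof(float));
    float *h_out = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_buf = NULL;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaError_t err = cudaMemcpy(d_buf, h_in, n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_buf, n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess);
    for (size_t i = 0; ok && i < n; ++i)
        if (h_out[i] != h_in[i]) ok = 0;

    printf("device %d: %s\n", dev, ok ? "OK" : cudaGetErrorString(err));
    cudaFree(d_buf);
    free(h_in);
    free(h_out);
    return ok ? 0 : 1;
}
```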
I know cudaSetDevice(cpu_thread_id) is not the best way to do things, but it’s buried deep inside a huge application, and I was wondering if I could let users use up to six GPUs until we replace the bad one, simply by assigning it id=6.
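One workaround I’m considering, if renumbering at the driver level turns out to be impossible: wrap cudaSetDevice behind a small remap table that parks the bad board at the highest logical ID. This is only a sketch; the table and wrapper are hypothetical and would still have to be threaded through the real application:

```cuda
#include <cuda_runtime.h>

// Hypothetical remapping layer: logical device ID -> physical device ID.
// Physical device 1 (the bad board) is parked at logical slot 6, so code
// that only ever asks for devices 0..5 never touches it.
static const int g_remap[7] = { 0, 2, 3, 4, 5, 6, 1 };

cudaError_t remappedSetDevice(int logical_id)
{
    return cudaSetDevice(g_remap[logical_id]);
}
```

Every call site of cudaSetDevice(cpu_thread_id) would then become remappedSetDevice(cpu_thread_id), which is one mechanical search-and-replace rather than a redesign.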
Unfortunately, I couldn’t find a way to change the number assigned to the GPUs:
In /proc/driver/nvidia/gpus/, I have folders
0 1 2 3 4 5 6
each holding two files, registry and information. The information file looks like:
Model: Tesla M2070
Video BIOS: ??.??.??.??.??
Card Type: PCI-E
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:11.00.0
I tried switching the contents of folders 1 and 6 as root, but it won’t let me write there, and it seems sketchy anyway.
If I do nvidia-smi -L, I get
GPU 0: Tesla M2070 (S/N: 0323810023943)
GPU 1: Tesla M2070 (S/N: 0323910067088)
GPU 2: Tesla M2070 (S/N: 0323810024610)
GPU 3: Tesla M2070 (S/N: 0323910066122)
GPU 4: Tesla M2070 (S/N: 0323810024517)
GPU 5: Tesla M2070 (S/N: 0323810023629)
GPU 6: Tesla M2070 (S/N: 0323810011984)
My gut feeling tells me it MUST be possible to enumerate the GPUs myself, and I would appreciate it a lot if someone could tell me how…
My next guess is opening the blade and taking out the GPUs one by one until I find the bad guy… that’s seven times, worst case… MAN do I NOT want to do that…
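At minimum, I should be able to match the failing device ordinal to a physical slot before pulling any cards, by printing each ordinal next to its PCI location. A sketch along these lines, assuming the pciDomainID/pciBusID/pciDeviceID fields of cudaDeviceProp line up with the "Bus Location" entries under /proc/driver/nvidia/gpus/:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print each runtime device ordinal next to its PCI location, so the
// failing ordinal (id=1) can be traced to a physical slot in the blade.
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s at %04x:%02x:%02x.0\n",
               dev, prop.name,
               prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

Then I would only have to pull the one card whose bus location matches device 1, instead of up to seven.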