Change id assigned to gpu's

ipodladtchikov · July 5, 2011, 11:32pm

Hello,

one of the seven M2070’s in our gpu cluster is acting strange, and I would like to bypass it for the time being by giving it a high device id.

Unfortunately, its current device id is 1, so whenever an application uses 2 or more gpu’s, it picks the bad one and crashes. I can identify which one is bad, because I have a test app that can run on a selected device id, and id=1 fails, while all others succeed (it fails on cuCtxSynchronize with the driver API and on cudaMemcpy with the runtime API). I agree it’s strange for it to be visible, accessible, print the same info as all the others in nvidia-smi -q output, and still crash, but what can I do. The bigger problem at the moment is that this one gpu is blocking 6 gpu’s from being used, just by having a low device id.

I know cudaSetDevice(cpu_thread_id) is not the best way to do things, but it’s deep inside a huge application and I was wondering if I could allow users to use up to 6 gpu’s until we replace the bad one by simply assigning it id=6.

Unfortunately, I couldn’t find a way to change the number assigned to gpu’s:

In /proc/driver/nvidia/gpus/, I have
0 1 2 3 4 5 6

which are folders, each holding two files, registry and information:

Model: Tesla M2070
IRQ: 24
Video BIOS: ??.??.??.??.??
Card Type: PCI-E
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:11.00.0

I tried switching contents of folders 1 and 6 as root but it won’t let me write there and it seems sketchy anyway.

If I do nvidia-smi -L, I get

GPU 0: Tesla M2070 (S/N: 0323810023943)
GPU 1: Tesla M2070 (S/N: 0323910067088)
GPU 2: Tesla M2070 (S/N: 0323810024610)
GPU 3: Tesla M2070 (S/N: 0323910066122)
GPU 4: Tesla M2070 (S/N: 0323810024517)
GPU 5: Tesla M2070 (S/N: 0323810023629)
GPU 6: Tesla M2070 (S/N: 0323810011984)

My gut feeling tells me, it MUST be possible to enumerate the gpu’s myself, and I would appreciate it a lot if someone could tell me how…

My next guess is opening the blade and taking out the gpu’s one by one until I have the bad guy… that’s 7 times, worst case… MAN do I NOT wanna do that…

Cheers

Igor

tmurray · July 6, 2011, 5:18am

Setting the environment variable CUDA_VISIBLE_DEVICES to 0,2,3,4,5,6 should do what you want.
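
As a rough sketch of how that setting could be built and passed to an application, assuming 7 devices with id 1 being the bad one as described above: the helper names `visible_devices` and `remapped_ids` and the binary `./my_cuda_app` are made up for illustration. The variable has to be in the environment before the application initializes CUDA, and the application then sees the listed devices renumbered from 0.

```python
import os

def visible_devices(total, bad):
    """Build a CUDA_VISIBLE_DEVICES value listing every id in 0..total-1 except the bad ones."""
    return ",".join(str(i) for i in range(total) if i not in bad)

def remapped_ids(value):
    """Map the ids the application will see (0, 1, 2, ...) to the original ids."""
    return {new: int(old) for new, old in enumerate(value.split(","))}

# Assumption from the thread: 7 GPUs, id 1 is the faulty one.
value = visible_devices(7, {1})
print(value)                # 0,2,3,4,5,6
print(remapped_ids(value))  # {0: 0, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6}

# Copy of the current environment with the variable added, for a child process.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=value)
# Hypothetical launch of the real application (needs `import subprocess`):
# subprocess.run(["./my_cuda_app"], env=env)
```

With this in place, code that calls cudaSetDevice(cpu_thread_id) for thread ids 0 to 5 should land on the six remaining cards, since id 1 inside the application now refers to what was originally GPU 2.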

ipodladtchikov · July 6, 2011, 2:42pm

Thank you very much, that works fine!
