Hi,
to be clear, this question is not really about running “multi-GPU” programs, but about running separate programs, each on its own GPU – with all of the GPUs sitting in the same machine.
I’ve been prototyping all my Theano optimizations successfully on single-GPU machines. Now I’m trying to run the same code on a cluster. The cluster consists of several machines, each of which has several Titan Black or Tesla K80 GPUs. Naturally, I would like to run multiple instances of the same optimization (with different parameters) on each machine, each with its own GPU.
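In case it matters, each instance is pinned to a single GPU. Here is a minimal sketch of the kind of launcher I mean, assuming the GPU is selected via THEANO_FLAGS with the old CUDA backend’s device flag; the script name and parameters are placeholders, not my actual code:

# Minimal sketch: one optimization process per GPU, each pinned via THEANO_FLAGS.
# (run_optimization.py and --param-set are placeholders.)
import os
import subprocess

procs = []
for gpu_id in range(8):  # one process per GPU on the machine
    env = os.environ.copy()
    # pin this process to a single GPU
    env["THEANO_FLAGS"] = "device=gpu%d,floatX=float32" % gpu_id
    procs.append(subprocess.Popen(
        ["python", "run_optimization.py", "--param-set", str(gpu_id)],
        env=env))

for p in procs:
    p.wait()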
However, there seems to be some kind of synchronization issue between the GPUs (even though there is no data transfer between the GPUs or, for that matter, between host and GPUs). As soon as a second job is started, the GPU utilization reported by nvidia-smi goes down, and it drops further with each additional job running on the same machine. I have verified that the computations are actually running slower.
At the same time, each of the processes shows up in “top” as waiting for I/O (state D, uninterruptible sleep):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 jb4726 20 0 166g 1.4g 115m D 31.8 1.1 98:09.74 python
3457 jb4726 20 0 165g 693m 115m D 31.8 0.5 118:32.62 python
4682 jb4726 20 0 166g 2.0g 115m D 31.8 1.5 118:13.68 python
26253 jb4726 20 0 167g 2.0g 115m D 31.8 1.6 136:47.90 python
4833 jb4726 20 0 165g 693m 115m D 31.5 0.5 118:13.95 python
3294 jb4726 20 0 167g 2.0g 115m R 31.2 1.6 118:33.73 python
2351 jb4726 20 0 165g 693m 115m D 30.2 0.5 117:55.23 python
1864 jb4726 20 0 166g 1.4g 115m D 28.9 1.1 88:42.00 python
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:06:00.0 Off | 0 |
| N/A 44C P0 68W / 149W | 1899MiB / 11519MiB | 11% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:07:00.0 Off | 0 |
| N/A 33C P0 72W / 149W | 1881MiB / 11519MiB | 10% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:0A:00.0 Off | 0 |
| N/A 44C P0 63W / 149W | 1892MiB / 11519MiB | 8% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:0B:00.0 Off | 0 |
| N/A 35C P0 76W / 149W | 1880MiB / 11519MiB | 10% E. Process |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 0000:0E:00.0 Off | 0 |
| N/A 41C P0 69W / 149W | 1827MiB / 11519MiB | 15% E. Process |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 0000:0F:00.0 Off | 0 |
| N/A 31C P0 79W / 149W | 1840MiB / 11519MiB | 15% E. Process |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 0000:12:00.0 Off | 0 |
| N/A 39C P0 65W / 149W | 1877MiB / 11519MiB | 14% E. Process |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 0000:13:00.0 Off | 0 |
| N/A 33C P0 81W / 149W | 1831MiB / 11519MiB | 14% E. Process |
+-------------------------------+----------------------+----------------------+
I ran strace on one of the jobs to see what I/O it might be waiting for. All I can see are ioctl() calls on /dev/nvidiactl (file descriptor 5 below), interleaved with clock_gettime() calls:
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89887326}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89904633}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89922082}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89939628}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89957034}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed1a0) = 0
ioctl(5, 0xc0284658, 0x7fffc51ed1b0) = 0
ioctl(5, 0xc0104629, 0x7fffc51ed240) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed1a0) = 0
ioctl(5, 0xc0284658, 0x7fffc51ed1b0) = 0
ioctl(5, 0xc0104629, 0x7fffc51ed240) = 0
ioctl(5, 0xc0b0464a, 0x7fffc51eced0) = 0
ioctl(5, 0xc0384657, 0x7fffc51ecdf0) = 0
ioctl(5, 0xc020462a, 0x7fffc51ece10) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91231512}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91264735}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91283265}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91300924}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91318361}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed350) = 0
ioctl(5, 0xc0284658, 0x7fffc51ed360) = 0
ioctl(5, 0xc0104629, 0x7fffc51ed3f0) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91986521}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92005889}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92023152}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92040385}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92057535}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92074420}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed5f0) = 0
ioctl(5, 0xc0284658, 0x7fffc51ed600) = 0
ioctl(5, 0xc0104629, 0x7fffc51ed690) = 0
ioctl(5, 0xc0b0464a, 0x7fffc51ec790) = 0
ioctl(5, 0xc0384657, 0x7fffc51ec6b0) = 0
ioctl(5, 0xc020462a, 0x7fffc51ec6d0) = 0
ioctl(5, 0xc0b0464a, 0x7fffc51ec710) = 0
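(File descriptor 5 in the trace is /dev/nvidiactl; a quick way to confirm the mapping, using one of the PIDs from the top listing above as an example:)

# check which file the ioctl()s above are hitting (the PID is just an example)
import os
pid, fd = 2508, 5
print(os.readlink("/proc/%d/fd/%d" % (pid, fd)))  # -> /dev/nvidiactl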
I’m using CUDA 7.5.18, cuDNN 5.0, and Theano 0.8.2 with Python 2.7.
The weird thing is that
- the person who manages the cluster says that there are other jobs (not using Theano) that don’t have this problem.
- on my single-GPU machine with a Tesla K40c, I can run several instances of the same optimization on one GPU without this kind of synchronization issue:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
34104 balle 20 0 90.315g 768316 119328 R 98.7 1.2 1:04.60 python
34120 balle 20 0 90.317g 766068 119324 R 98.7 1.2 1:16.13 python
33876 balle 20 0 90.437g 890236 119400 R 98.3 1.4 4:58.90 python
Thu Aug 18 18:55:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c Off | 0000:02:00.0 Off | 0 |
| 39% 74C P0 140W / 235W | 2011MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GT 730 Off | 0000:03:00.0 N/A | N/A |
| 0% 54C P8 N/A / N/A | 5MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
I reinstalled my entire Python/Theano setup to make sure that it is the same code I’m running on both machines.
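(For that sanity check, something along these lines at startup confirms the version and selected device – just an illustration:)

# quick sanity check that both machines run the same Theano on the intended GPU
import sys
import theano
print(sys.version)            # Python 2.7.x
print(theano.__version__)     # should be 0.8.2 on both machines
print(theano.config.device)   # the gpuN selected via THEANO_FLAGS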
Any ideas what could be causing this?
Thanks,
Johannes