Weird multi-GPU problem

Hi,

To be clear, this question is not really about running “multi-GPU” programs, but about running separate programs, each on its own GPU, where all of those GPUs sit in the same machine.

I’ve been prototyping all my Theano optimizations successfully on single-GPU machines. Now I’m trying to run the same code on a cluster. The cluster consists of several machines, each of which has several Titan Black or Tesla K80 GPUs. Naturally, I would like to run multiple instances of the same optimization (with different parameters) on each machine, each with its own GPU.

However, there seems to be some kind of synchronization issue between the GPUs (even though there is no data transfer between the GPUs, or between host and GPUs, for that matter). As soon as the second job is started, the GPU utilization as reported by nvidia-smi goes down (and further down if more jobs are running on the same machine). I have verified that the computations are actually running slower.

At the same time, each of the processes shows up as waiting for I/O (state D) in “top”.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                          
 2508 jb4726    20   0  166g 1.4g 115m D 31.8  1.1  98:09.74 python                                                                                                                                                                                                             
 3457 jb4726    20   0  165g 693m 115m D 31.8  0.5 118:32.62 python                                                                                                                                                                                                             
 4682 jb4726    20   0  166g 2.0g 115m D 31.8  1.5 118:13.68 python                                                                                                                                                                                                             
26253 jb4726    20   0  167g 2.0g 115m D 31.8  1.6 136:47.90 python                                                                                                                                                                                                             
 4833 jb4726    20   0  165g 693m 115m D 31.5  0.5 118:13.95 python                                                                                                                                                                                                             
 3294 jb4726    20   0  167g 2.0g 115m R 31.2  1.6 118:33.73 python                                                                                                                                                                                                             
 2351 jb4726    20   0  165g 693m 115m D 30.2  0.5 117:55.23 python                                                                                                                                                                                                             
 1864 jb4726    20   0  166g 1.4g 115m D 28.9  1.1  88:42.00 python
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:06:00.0     Off |                    0 |
| N/A   44C    P0    68W / 149W |   1899MiB / 11519MiB |     11%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:07:00.0     Off |                    0 |
| N/A   33C    P0    72W / 149W |   1881MiB / 11519MiB |     10%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:0A:00.0     Off |                    0 |
| N/A   44C    P0    63W / 149W |   1892MiB / 11519MiB |      8%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    76W / 149W |   1880MiB / 11519MiB |     10%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:0E:00.0     Off |                    0 |
| N/A   41C    P0    69W / 149W |   1827MiB / 11519MiB |     15%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:0F:00.0     Off |                    0 |
| N/A   31C    P0    79W / 149W |   1840MiB / 11519MiB |     15%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:12:00.0     Off |                    0 |
| N/A   39C    P0    65W / 149W |   1877MiB / 11519MiB |     14%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:13:00.0     Off |                    0 |
| N/A   33C    P0    81W / 149W |   1831MiB / 11519MiB |     14%   E. Process |
+-------------------------------+----------------------+----------------------+

I did an strace on one of the jobs to see what I/O it might possibly be waiting for. All I can see are accesses to /dev/nvidiactl:

clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89887326}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89904633}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89922082}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89939628}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 89957034}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed1a0)    = 0
ioctl(5, 0xc0284658, 0x7fffc51ed1b0)    = 0
ioctl(5, 0xc0104629, 0x7fffc51ed240)    = 0
ioctl(5, 0xc020462a, 0x7fffc51ed1a0)    = 0
ioctl(5, 0xc0284658, 0x7fffc51ed1b0)    = 0
ioctl(5, 0xc0104629, 0x7fffc51ed240)    = 0
ioctl(5, 0xc0b0464a, 0x7fffc51eced0)    = 0
ioctl(5, 0xc0384657, 0x7fffc51ecdf0)    = 0
ioctl(5, 0xc020462a, 0x7fffc51ece10)    = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91231512}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91264735}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91283265}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91300924}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91318361}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed350)    = 0
ioctl(5, 0xc0284658, 0x7fffc51ed360)    = 0
ioctl(5, 0xc0104629, 0x7fffc51ed3f0)    = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 91986521}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92005889}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92023152}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92040385}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92057535}) = 0
clock_gettime(0x4 /* CLOCK_??? */, {8113235, 92074420}) = 0
ioctl(5, 0xc020462a, 0x7fffc51ed5f0)    = 0
ioctl(5, 0xc0284658, 0x7fffc51ed600)    = 0
ioctl(5, 0xc0104629, 0x7fffc51ed690)    = 0
ioctl(5, 0xc0b0464a, 0x7fffc51ec790)    = 0
ioctl(5, 0xc0384657, 0x7fffc51ec6b0)    = 0
ioctl(5, 0xc020462a, 0x7fffc51ec6d0)    = 0
ioctl(5, 0xc0b0464a, 0x7fffc51ec710)    = 0
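
(A trace like the one above can be captured by attaching strace to one of the running jobs, roughly as follows; the PID is of course a placeholder.)

# attach to a running python job and log its system calls to a file
strace -f -p "$PID" -o strace_gpu_job.log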

I’m using CUDA 7.5.18, cuDNN 5.0 and Theano 0.8.2 with Python 2.7.

The weird thing is that

  1. the person who manages the cluster says that there are other jobs (not using Theano) that don’t have this problem.
  2. on my single-GPU machine with a Tesla K40c, I can run several instances of the same optimization on one GPU without this kind of synchronization issue:
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                   
34104 balle     20   0 90.315g 768316 119328 R  98.7  1.2   1:04.60 python                                                                                                                                                                                                    
34120 balle     20   0 90.317g 766068 119324 R  98.7  1.2   1:16.13 python                                                                                                                                                                                                    
33876 balle     20   0 90.437g 890236 119400 R  98.3  1.4   4:58.90 python                                                                                                                                                                                                    

Thu Aug 18 18:55:22 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:02:00.0     Off |                    0 |
| 39%   74C    P0   140W / 235W |   2011MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 730      Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   54C    P8    N/A /  N/A |      5MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

I reinstalled my whole Python/Theano stack to verify that it really is the same code I’m running in both places.
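
As a sanity check that both machines pick up the same stack, something along these lines can be used to compare versions (just a sketch; exact commands depend on the setup):

# print the versions the interpreter actually picks up
python -c "import theano, numpy; print theano.__version__, numpy.__version__"
# CUDA toolkit and driver versions
nvcc --version
cat /proc/driver/nvidia/version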

Any ideas what could be causing this?

Thanks,
Johannes

When you start a particular job and you want to run it on a particular GPU, preface your command line with:

CUDA_VISIBLE_DEVICES="X" ./myjob ...

where X is the GPU number on the cluster node you are using, e.g. from 0 to 7 for the first nvidia-smi output you have shown.
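
For example, a minimal bash sketch for launching one instance per GPU (the script name run_optimization.py and its parameters are placeholders, not from the original post):

# start one copy of the optimization per GPU; each process sees "its" GPU as device 0
for GPU_ID in 0 1 2 3 4 5 6 7; do
    CUDA_VISIBLE_DEVICES="$GPU_ID" python run_optimization.py --params "set_$GPU_ID" &
done
wait  # block until all background jobs have finished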

nvidia-smi reports the process-to-GPU assignment. Why would you cut that off when posting your nvidia-smi output for this question? That could provide useful information for understanding what is happening.
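
(If useful, the process-to-GPU mapping can also be queried directly, e.g. with something like the following, assuming a reasonably recent nvidia-smi.)

# list compute processes together with the GPU they are attached to
nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name --format=csv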

I believe that is exactly what the job scheduler on the cluster does to make sure jobs get assigned to the right GPUs.

I didn’t post it because I copied and pasted the info from an email exchange I had earlier, where it happened to be missing. Also: yes, the jobs are assigned to the right GPUs – I would be happy if it were that simple. Here’s another example from a different machine; it basically looks like this on whatever machine in the cluster I try:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 0000:02:00.0     Off |                  N/A |
| 42%   67C    P2    97W / 250W |   1852MiB /  6143MiB |     19%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 0000:03:00.0     Off |                  N/A |
| 43%   69C    P2    92W / 250W |   1855MiB /  6143MiB |     19%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  On   | 0000:83:00.0     Off |                  N/A |
| 41%   66C    P2    95W / 250W |   1823MiB /  6143MiB |     19%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:84:00.0     Off |                  N/A |
| 43%   69C    P2   100W / 250W |   1872MiB /  6143MiB |     20%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     30345    C   python                                        1856MiB |
|    1     21811    C   python                                        1855MiB |
|    2     24004    C   python                                        1833MiB |
|    3     20842    C   python                                        1840MiB |
+-----------------------------------------------------------------------------+

I don’t want to interfere with txbob’s analysis (he has the necessary domain knowledge when it comes to multi-GPU configurations; I don’t), but just as a quick check: do the machines in question have adequate power supplies to drive all these GPUs in the box?

That question I cannot answer, since I’m not responsible for administering the cluster, but let’s say I trust them to have figured this out correctly. After all, it’s a large-scale installation with over 20 GPU machines, and they didn’t start doing this yesterday …

The description of every engineering problem I ever heard about started with “I/we assumed that …”. So my attitude as an engineer has always been “Never assume anything”. Or, as a more famous person stated it: “Trust, but verify”. I am not saying a power issue is likely here, but it is one thing one can cross off the list. That’s why first-line customer service people usually ask: “Is the device plugged in and turned on?”

Sure. I’m not ruling out that there is some kind of a problem in the installation (like a driver issue or a bug, for instance).

But with regards to the power supply specifically:

  • I simply can’t verify this in person. There’s nothing I can do about it.
  • I would think that if this installation has been used and saturated with jobs from several groups of users simultaneously for months or years, this problem would have surfaced earlier.

Alright, I see that you edited your original answer, so my philosophical comment isn’t needed any more :)
I don’t think we can cross the power supply off the list, but it seems unlikely.

OK, I think I might have just found a workaround, or even something related to the root of the problem. I’ll go ahead and describe it for the benefit of other people who might run into the same issue. It’s funny that I found this now, only a few hours after I finally decided to seek help because I hadn’t been able to figure it out for days.

In Theano 0.8.2, there are two config settings named gpuarray.sync and gpuelemwise.sync; the first is False by default, and the second, for some reason, is True by default. Neither of them is mentioned in the docs; I only found them while looking for something completely different. I had always followed the instructions given in the Theano documentation for optimizing your code. Apparently, that info is outdated …

It turns out that setting the second one to False not only improves the GPU utilization of a standalone process by quite a bit, it also, miraculously, seems to solve the cross-GPU synchronization issue. Why this flag should ever cause synchronization between multiple GPUs when the code is only talking to one GPU at a time is beyond me. It must be either some side effect that’s desirable in another situation (computations using multiple GPUs?) or a bug in Theano. Maybe someone with a better understanding of how the driver talks to CUDA code can explain this.
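
For anyone who wants to try the same thing: the flag can be set per run without touching any code, e.g. via THEANO_FLAGS on the command line, or more permanently in ~/.theanorc (the script name below is a placeholder, and the .theanorc section mapping is my understanding of how dotted Theano config names translate):

# one-off, on the command line (combined with pinning the GPU):
THEANO_FLAGS='gpuelemwise.sync=False' CUDA_VISIBLE_DEVICES="$GPU_ID" python run_optimization.py

# or permanently, by adding to ~/.theanorc:
#   [gpuelemwise]
#   sync = False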

It also doesn’t really explain why the slowdown doesn’t happen when I run multiple process instances on a single GPU. Maybe it is only triggered when CUDA_VISIBLE_DEVICES is set. For now, this looks like a viable workaround, although I’m not quite sure why the flag is set to True by default. Maybe it’s experimental code? If I find any other useful info, I’ll post it here.