I am trying to troubleshoot an issue where multiple Python/Theano processes running on different GPUs in the same box are slowing each other down.
The code copies all data to the GPU once at the start of the program; after that, only very small chunks of data (8 bytes) are sent back and forth, e.g. every 10 seconds. Because of this, I would not expect code running on one GPU to affect another GPU (is that a correct assumption?).
To troubleshoot the issue, I've tried starting multiple processes at the same time under nvprof and comparing the time per CUDA call when I use just one card against when I use 2, 3, or 4. The goal was to see whether all CUDA primitives slow down by the same factor, or whether certain operations slow down more than others. I was hoping this would point me toward the cause of the slowdown.
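To make the comparison concrete, here is a minimal sketch of what I have in mind. It assumes the average time per call from each nvprof summary has already been collected into a dictionary keyed by kernel or API call name. The function names are hypothetical, and the nvprof log parsing is not shown.

```python
def slowdown_by_call(baseline_avg, contended_avg):
    """Return {call_name: contended / baseline} for calls present in both runs.

    Both arguments map a call name to its average time per call, in the
    same unit. Calls missing from either run, or with a non-positive
    baseline, are skipped.
    """
    ratios = {}
    for name, base in baseline_avg.items():
        other = contended_avg.get(name)
        if other is None or base <= 0:
            continue
        ratios[name] = other / base
    return ratios


def rank_slowdowns(ratios):
    """Sort (call_name, ratio) pairs from the largest slowdown to the smallest."""
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```

If every ratio comes out roughly equal, the slowdown is uniform. If a few calls dominate the top of the ranking, those are the ones worth looking at first.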
The trouble is that if I run 3 nvprof instances at the same time, two of the runs succeed, but the third one (consistently) fails mid-execution. There is no error in the logs: the process just exits.
Is that expected? I would not be surprised if nvprof does not like having multiple simultaneous runs. If that is the case, how would you suggest debugging the issue? Any help will be much appreciated.
I run the code as follows:
CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=mode=FAST_RUN,device=gpu1,profile=True,floatX=float32 /usr/local/cuda-7.5/bin/nvprof --cpu-profiling on --print-gpu-summary --log-file nvprof_%p.log python code.py
The cards are Maxwell Titan Xs (in case that is relevant), with CUDA 7.5 and Theano 0.8.
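Since the failing run leaves nothing in the logs, one thing I could do is launch the runs from a small wrapper that records how each process ended. Below is a minimal sketch, assuming Python 3 on Linux for the wrapper itself. The function names are hypothetical, and the THEANO_FLAGS value is the one from my command above with only the device changed.

```python
import os
import signal
import subprocess


def launch_per_device(base_cmd, devices, extra_env=None):
    """Start one copy of base_cmd per device and return {device: Popen}.

    base_cmd is an argument list, e.g.
    ["/usr/local/cuda-7.5/bin/nvprof", "--cpu-profiling", "on",
     "--print-gpu-summary", "--log-file", "nvprof_%p.log",
     "python", "code.py"]
    extra_env can carry settings such as {"CUDA_LAUNCH_BLOCKING": "1"}.
    """
    procs = {}
    for dev in devices:
        env = dict(os.environ)
        env.update(extra_env or {})
        env["THEANO_FLAGS"] = (
            "mode=FAST_RUN,device=%s,profile=True,floatX=float32" % dev
        )
        procs[dev] = subprocess.Popen(base_cmd, env=env)
    return procs


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code means the child was
        # terminated by the signal with that number.
        return "killed by signal %s" % signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def wait_and_report(procs):
    """Wait for every process and return {device: description}."""
    return {dev: describe_exit(proc.wait()) for dev, proc in procs.items()}
```

Calling `wait_and_report(launch_per_device(cmd, ["gpu0", "gpu1", "gpu2"], {"CUDA_LAUNCH_BLOCKING": "1"}))` would at least show whether the third run returned a non-zero status or was terminated by a signal.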