GPU slows down when the CPU is busy processing

Hi,
I have a multi-threaded Python script running on my TX2. The threads perform the following:

thread1:

  1. in an infinite loop, loads images from file and appends them (.append()) to a list, S_thumbnails
  2. computes NetVLAD for each image. This is a TensorFlow operation that runs on the GPU and takes about 150 ms.
  3. picks 2 indices at random from [0, len(S_thumbnails)) and puts them in a queue, Q’

thread2:

  1. consumes Q’
  2. also references S_thumbnails to do image operations using only the CPU. S_thumbnails is a global variable. (A rough sketch of this two-thread setup is given below.)
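
Here load_next_image, compute_netvlad and process_pair are only placeholders for the real operations, not my actual code:

import random
import threading
from Queue import Queue   # Python 2, as in the snippet further below

S_thumbnails = []          # shared global list of loaded images
Q = Queue()                # the queue of index pairs, Q'

def thread1():
    while True:
        img = load_next_image()          # placeholder: load the next image from file
        S_thumbnails.append(img)
        compute_netvlad(img)             # placeholder: TensorFlow op on the GPU, ~150 ms
        i = random.randint(0, len(S_thumbnails) - 1)
        j = random.randint(0, len(S_thumbnails) - 1)
        Q.put((i, j))

def thread2():
    while True:
        i, j = Q.get()
        process_pair(S_thumbnails[i], S_thumbnails[j])   # placeholder: CPU-only image ops

threading.Thread(target=thread1).start()
threading.Thread(target=thread2).start()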

As this runs, I am monitoring the execution time of the TensorFlow operation in thread1 and of the image-processing operation in thread2. I observe that the GPU slows down when it runs alongside the CPU work: the GPU operation then takes anywhere from 200 ms to 500 ms (fluctuating execution time). In other words, if I do no processing on thread2, the GPU operations are faster.

I reckon this is a memory issue with a race condition, since S_thumbnails is used in both threads. How can I get past this issue? Any suggestions are appreciated.

PS: Before I start my program, I run the following as root:

nvpmodel -m 0

/home/nvidia/jetson_clocks.sh

Hi,

We cannot confirm where the issue comes from, since the workflow is complicated.
Instead of a race condition, another possibility is TensorFlow's own mechanism.

Could you write pure CUDA code to simulate this use case?

For example,
thread 1:

  1. in an infinite loop, loads images from file and appends them (.append()) to a list, S_thumbnails
  2. runs a dummy CUDA kernel for each image (see the sketch after this list)
  3. picks 2 indices at random from [0, len(S_thumbnails)) and puts them in a queue, Q’

thread 2:

  1. consumes Q’
  2. also references S_thumbnails to do image operations using only the CPU. S_thumbnails is a global variable.
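
A minimal sketch of such a dummy GPU step (written here with PyCUDA for brevity rather than as a standalone CUDA C++ program; the kernel itself is arbitrary and only stands in for the NetVLAD op):

import numpy as np
import pycuda.autoinit                 # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Trivial kernel that just scales every pixel.
mod = SourceModule("""
__global__ void dummy_op(float *img, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        img[idx] *= 0.5f;
}
""")
dummy_op = mod.get_function("dummy_op")

def dummy_gpu_step(img):
    # copy the image to the GPU, run the dummy kernel, copy the result back
    data = img.astype(np.float32).ravel()
    n = data.size
    dummy_op(cuda.InOut(data), np.int32(n),
             block=(256, 1, 1), grid=((n + 255) // 256, 1))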

If the issue persists, let's debug with the pure CUDA sample first.

Thanks.

Actually, I did another trial, not with CUDA kernels but with the Python SWIG wrapper that my code depends on. This wrapper computes DAISY descriptors.
https://github.com/mpkuse/daisy_py_wrapper Please check out the README of this repo for a brief overview of how it works.

In this new trial, I replaced the KMeans code with the DAISY computation and I notice a slowdown of the TensorFlow operation. However, with KMeans (the OpenCV implementation) there is no slowdown.

Code for thread2 is as follows:

def consume_queue():
    global S_thumbnails
    global task_queue
    global pub_qsize, pub_time_kmeans
    global XFT
    global dai1, dai2

    while XFT:  # XFT: global run flag controlling the loop
        qsize = task_queue.qsize()
        publish_time( pub_qsize, qsize )  # report the current queue size (publish_time / pub_qsize defined elsewhere)
        try:
            g = task_queue.get(timeout=1)  # g = (i_curr, i_prev), the index pair produced by thread1
            xprint( 'qsize: %d, i_curr: %d, i_prev: %d' %( qsize, g[0], g[1] ), 1 )

            startKMeans = time.time()  # start timestamp for the CPU-side computation (name kept from the KMeans version)
            im_curr = S_thumbnails[ g[0] ]
            im_prev = S_thumbnails[ g[1] ]

            # Daisy starts: channel 0 is copied into a fresh, contiguous float32
            # buffer because the SWIG wrapper maps the numpy memory into C++.
            im_curr32 = im_curr[:,:,0].copy().astype( 'float32' )
            dai1.do_daisy_computation( im_curr32 )
            vi1 = dai1.get_daisy_view()

            im_prev32 = im_prev[:,:,0].copy().astype( 'float32' )
            dai2.do_daisy_computation( im_prev32 )
            vi2 = dai2.get_daisy_view()
            # Daisy ends
        except:
            # any failure (most commonly an empty queue after the 1 s timeout) lands here
            print 'thread:', 'empty'

I do a .copy() there because the SWIG wrapper maps the numpy memory directly into the C++ code. Is this causing the GPU to slow down?
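
For reference, a small standalone check (with a dummy array in place of an actual thumbnail) of what the copy changes:

import numpy as np

im = np.zeros((240, 320, 3), dtype='uint8')               # dummy image standing in for S_thumbnails[g[0]]
view = im[:, :, 0]                                         # channel slice: a strided view into im's memory
print view.flags['C_CONTIGUOUS'], view.flags['OWNDATA']    # False False
buf = view.copy().astype('float32')                        # fresh, contiguous float32 buffer handed to the C++ side
print buf.flags['C_CONTIGUOUS'], buf.flags['OWNDATA']      # True True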

The attachment is a plot of the running time as the iterations progress. It was made with ROS (Robot Operating System) and Python's time module. The red curve is the GPU computation.

Hi,

Our guess is that there is some synchronization mechanism inside TensorFlow.
For more information, could you profile your use case with nvprof and share the output file with us?
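
For example, something along these lines should produce a profile file you can attach (the script and output names here are just placeholders):

nvprof -o profile.nvvp python your_script.py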

Thanks.