Hi,
I have a multi-threaded Python script running on my TX2. The threads perform the following:
thread1:
1. In an infinite loop, loads images from file and appends them (.append()) to a list, S_thumbnails.
2. Computes NetVLAD for each image. This is a TensorFlow operation on the GPU and takes about 150 ms.
3. Picks 2 indices at random from [0, len(S_thumbnails)) and puts them in a queue, Q’.
thread2:
1. Consumes Q’.
2. Also references S_thumbnails to do image operations using only the CPU. S_thumbnails is a global variable.
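The two-thread structure above can be sketched in pure Python as follows. The NetVLAD GPU op is replaced by a short sleep, the CPU image operation by a dummy sum, and the infinite loop is bounded so the sketch terminates; all names here are placeholders, not the real code.

```python
import queue
import random
import threading
import time

S_thumbnails = []        # global list shared by both threads
Q = queue.Queue()        # index pairs passed from thread1 to thread2
N_ITERS = 20             # the real script loops forever; bounded here

def thread1_fn():
    for i in range(N_ITERS):
        img = [i] * 16                       # stand-in for an image loaded from file
        S_thumbnails.append(img)
        time.sleep(0.001)                    # stand-in for the ~150 ms NetVLAD GPU op
        n = len(S_thumbnails)
        Q.put((random.randrange(n), random.randrange(n)))

def thread2_fn(out):
    for _ in range(N_ITERS):
        i, j = Q.get()
        # stand-in for the CPU-only image operation on the referenced images
        out.append(sum(S_thumbnails[i]) + sum(S_thumbnails[j]))

results = []
t1 = threading.Thread(target=thread1_fn)
t2 = threading.Thread(target=thread2_fn, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 20
```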
As this runs, I monitor the execution time of the TensorFlow operation in thread1 and of the image-processing operation in thread2. I observe that the GPU slows down when used alongside the CPU work: the GPU op takes anywhere from 200 ms to 500 ms, with a fluctuating execution time. In other words, if I do no processing in thread2, the GPU operations are faster.
I reckon this is a memory issue with a race condition, as S_thumbnails is used by both threads. How can I get past this issue? Any suggestions appreciated.
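One standard way to rule out a race on the shared list is to touch it only inside a short critical section and hand the consumer private copies. A minimal sketch, where the lock and helper names are hypothetical; note that CPython's GIL already serializes individual list operations, so this guards correctness rather than guaranteeing the speedup:

```python
import threading

S_thumbnails = []          # the shared global list from the script
S_lock = threading.Lock()  # hypothetical lock guarding all access to it

def append_image(img):
    """thread1 would wrap its .append() like this."""
    with S_lock:
        S_thumbnails.append(img)

def snapshot_pair(i, j):
    """thread2 copies the two images it needs under a brief critical section."""
    with S_lock:
        # shallow copies suffice for flat rows; deep-copy nested images
        return list(S_thumbnails[i]), list(S_thumbnails[j])

append_image([1, 2, 3])
append_image([4, 5, 6])
a, b = snapshot_pair(0, 1)
a[0] = 99                      # mutating the private copy...
print(S_thumbnails[0][0])      # 1 -> the shared list is untouched
```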
We cannot confirm where the issue comes from since the workflow is complicated.
Instead of a race condition, another possibility is TensorFlow's internal mechanism.
Could you write pure CUDA code to simulate this use case?
For example,
thread 1:
1. In an infinite loop, loads images from file and appends them (.append()) to a list, S_thumbnails.
2. Runs a dummy kernel for each image.
3. Picks 2 indices at random from [0, len(S_thumbnails)) and puts them in a queue, Q’.
thread 2:
1. Consumes Q’.
2. Also references S_thumbnails to do image operations using only the CPU. S_thumbnails is a global variable.
If the error persists, let’s debug the pure CUDA sample first.
Actually, I did another trial: not with CUDA kernels, but with the Python SWIG wrapper that my code depends on. This wrapper computes DAISY descriptors: https://github.com/mpkuse/daisy_py_wrapper Please check out the README of that repo for a brief description of how it works.
In this new trial, I replaced the KMeans code with the DAISY computation, and I notice a slowdown of the TensorFlow operation. However, with KMeans (the OpenCV implementation) there is no slowdown.
I do a .copy() there because the SWIG wrapper maps memory to the C++ code. Is this causing the GPU to slow down?
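On the .copy() question: a SWIG wrapper that maps C++ memory typically hands Python a zero-copy view, so copying is what decouples the result from the C++-owned buffer. A small stdlib illustration of that difference, with array/memoryview standing in for the wrapper's buffer (whether the extra copy itself is what slows the GPU is a separate question):

```python
import array

backing = array.array('d', [1.0, 2.0, 3.0])  # stand-in for the C++-owned DAISY buffer
view = memoryview(backing)                   # zero-copy view, like the SWIG mapping
copied = array.array('d', backing)           # independent copy, like .copy()

backing[0] = 99.0                            # C++ side mutates its buffer
print(view[0])    # 99.0 -> the view tracks the underlying memory
print(copied[0])  # 1.0  -> the copy is decoupled
```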
The attachment is a plot of running time as the iterations progress. It was made with ROS (Robot Operating System) and Python’s time module. The red curve is the GPU computation.
My guess is that there is some synchronization mechanism inside TensorFlow.
For more information, could you profile your use case with nvprof and share the file with us?