My application works as follows:
-> initialize data on the CPU
-> create 4 threads (one for each GPU)
Each thread then does the following:
-> loop over groups of data
-> launch the GPU computation for the current group of data
-> spawn a new thread to finish the computation for the current group on an idle CPU core
The whole thing takes about 7 seconds to compute when I time it via a bash script.
I'm trying to optimize my application, so I ran it under the CUDA profiler and noticed that it took only about 4 seconds. What's even more surprising is that if I keep the CUDA profiler window open and relaunch my bash script, I get a similar timing of about 4 seconds, with correct results.
I would very much appreciate it if someone could explain this discrepancy. My (humble) guess is that opening the CUDA profiler already sets up some sort of connection with all the GPUs, but that alone does not obviously account for a 3-second gap.