I would be grateful if you could answer the following questions:
I have a program whose GPU-related parts take twice as long when I run it with double the number of threads/processes, which suggests that the GPU is the bottleneck. However, when I run the "nvidia-smi -q" command, the reported GPU utilization is only around 50%.
My guess is that this is due to cudaMemcpy. Is it possible that a synchronous cudaMemcpy blocks GPU kernels launched from other CPU threads/programs, or does it only contend with other cudaMemcpy calls?
If the answer is yes, can asynchronous cudaMemcpy calls help resolve this conflict, or does my bottleneck actually lie in the amount of data transferred between the host and device?
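To make the question concrete, here is a minimal sketch of what I mean by asynchronous copies. It is not my actual code: the kernel `scale`, the buffer names, and the sizes are all hypothetical placeholders. It assumes the host buffer is pinned with cudaMallocHost, since as far as I understand a copy from ordinary pageable memory may not actually run asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                  // placeholder problem size
    const size_t bytes = n * sizeof(float);

    float *h_buf = nullptr, *d_buf = nullptr;
    cudaStream_t stream;

    // Pinned (page-locked) host memory, assumed necessary for the
    // copies below to overlap with work in other streams.
    cudaMallocHost(&h_buf, bytes);
    cudaMalloc(&d_buf, bytes);
    cudaStreamCreate(&stream);

    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    // Copy in, compute, copy out: all queued on the same non-default
    // stream, so they run in order relative to each other without
    // blocking the host thread.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 2.0f);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // Wait only for this stream before reading the result on the host.
    cudaStreamSynchronize(stream);
    printf("h_buf[0] = %f\n", h_buf[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

In my case each CPU thread would use its own stream like this; whether that removes the contention, or whether the transfer volume itself is the limit, is exactly what I am unsure about.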
Can the nvidia-smi command show the time spent in cudaMemcpy? Is my assumption correct that this time is not reflected in the reported GPU utilization?
Thanks in advance,