Hello,
I would be grateful if you could answer the following questions:
- I have a program whose GPU-related parts double in run time when I run it with two threads/processes, which suggests the GPU is the heavy part. However, when I run the “nvidia-smi -q” command, the reported GPU utilization is only around 50%. My guess is that this is due to cudaMemcpy. Can a synchronous cudaMemcpy block GPU kernels launched from other CPU threads/programs, or does it only collide with other cudaMemcpy calls?
- If the answer is yes, can asynchronous cudaMemcpy calls (cudaMemcpyAsync) resolve this contention, or does my bottleneck actually lie in the amount of data transferred between the host and device?
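For context, here is a minimal sketch of the asynchronous variant I have in mind, using pinned host memory and a dedicated stream so the transfer does not block the host thread (the sizes and the kernel are placeholders, not my real code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to actually run asynchronously with respect to the host.
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue copy and kernel on our own stream; the host thread returns
    // immediately, and work issued to other streams can proceed meanwhile.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait only for this stream's work

    printf("h[0] = %f\n", h[0]);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Is this the right pattern for avoiding the contention, assuming the copies are the problem?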
- Does nvidia-smi account for cudaMemcpy time at all? Is my assumption correct that copy time is not reflected in the reported GPU utilization?
Thanks in advance,
Oren