Does cudaMemcpy lock the device?

Hello,

I would be grateful if you could answer the following questions:

  1. I have a program whose GPU-related parts double in run time when I run it with twice as many threads/processes (which suggests the GPU is the heavy part); however, when I run the “nvidia-smi -q” command, the reported GPU utilization is around 50%.
    My guess is that this is due to cudaMemcpy. Is it possible that a synchronous cudaMemcpy blocks GPU kernels launched from other CPU threads/programs, or do cudaMemcpy calls only collide with other cudaMemcpy calls?

  2. If the answer is yes, can asynchronous cudaMemcpy calls resolve this conflict, or does my bottleneck actually lie in the amount of data transferred between the host and the device?

  3. Can the nvidia-smi command show the cudaMemcpy time at all? Is my assumption correct that it is not included in the GPU utilization figure?

Thanks in advance,
Oren

In general, a single CUDA device can only service one context at a time and the CUDA driver can only switch between contexts between operations. If you have multiple host processes accessing the GPU, then a CUDA device operation in one context will block other processes. Asynchronous memory copies will only overlap with kernel execution if they happen in the same CUDA context (i.e. the same host process and possibly the same host thread).
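
To make the overlap point concrete, here is a minimal sketch (the kernel `scale` and the buffer sizes are placeholders, not anything from the original program): a cudaMemcpyAsync issued on one stream of a context can overlap with a kernel running on another stream of the same context, provided the host buffer is pinned with cudaMallocHost.

```
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;   // placeholder size
    float *h_a, *d_a, *d_b;

    // Pinned host memory is what lets cudaMemcpyAsync actually run
    // asynchronously with respect to the device.
    cudaMallocHost((void **)&h_a, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMemset(d_b, 0, n * sizeof(float));

    cudaStream_t copyStream, execStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&execStream);

    // Same context, different streams, different buffers: the copy and
    // the kernel are free to overlap on hardware with a copy engine.
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    scale<<<(n + 255) / 256, 256, 0, execStream>>>(d_b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(execStream);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    return 0;
}
```

Keep in mind that if the real limit is the total volume of host-device traffic, overlapping only hides the transfers behind compute; it does not make the bus any faster, so the second alternative in your question 2 could still apply.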

As for the GPU utilization percentage, I don’t know how that is calculated, so I don’t know if your hypothesis for why you get 50% is correct. Are you sure that cudaMemcpy time is not included? (You could check by writing a program that did nothing but cudaMemcpy() in a loop…)
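For that check, a throwaway test along these lines would do (the 64 MB transfer size is arbitrary); run it and watch the utilization reading in a second terminal, e.g. with “nvidia-smi -l 1”:

```
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;   // 64 MB per transfer, chosen arbitrarily
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc((void **)&d, bytes);

    // Nothing but synchronous copies, forever; kill the process when done.
    for (;;) {
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    }
    return 0;
}
```

If the utilization figure climbs while this runs, cudaMemcpy time is being counted; if it stays near zero, your assumption in question 3 holds.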