GPU concurrency and how to monitor GPU utilization. The nvidia-smi tool always shows utilization as either 0% or 100%.

Hi guys:

I'm asking for your help. My CUDA application runs four threads, each with its own CUDA stream. The kernel execution duration is the same in all four threads. When I reduce my application to one thread, the kernel execution duration is the same as in the four-thread case. That's the point that confuses me. When my CUDA application runs with one thread, nvidia-smi shows GPU utilization at 100%, so I assumed that with four threads the kernel execution duration would increase, but the result is different from what I expected. What is the reason? By the way, nvidia-smi only ever shows two utilization values, 0% or 100%; I never see anything in between, like 10% or 20%. I'm confused about this as well.
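In case it helps, here is a simplified sketch of the setup I described (the kernel body, sizes, and iteration count here are just placeholders, not my real code):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void work(float *data, int n)            // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void worker(int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);                       // each CPU thread owns one stream

    float *d;
    cudaMalloc(&d, n * sizeof(float));

    for (int iter = 0; iter < 100; ++iter)           // repeatedly enqueue kernels into this stream
        work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    cudaStreamSynchronize(stream);
    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main()
{
    const int n = 1 << 20;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)                      // four host threads, four streams
        threads.emplace_back(worker, n);
    for (auto &t : threads) t.join();
    return 0;
}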

Please don't hesitate to correct me if I have made any mistakes.

Read this:

https://stackoverflow.com/questions/16617796/gpu-utilization/16618010#16618010

and this:

https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696

Your observations are entirely plausible. For example, in your 4-thread scenario, if the kernel execution timeline looked like this:

|k1|k2|k3|k4|

and in the one-thread scenario it looked like this:

|k1|k1|k1|k1|

or

|   k1      |

Any of those scenarios would show 100% utilization. You can probably get a better understanding by comparing the visual profiler timeline in each case.
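If you don't want to fire up the profiler, you can also bracket a launch with CUDA events in its stream to measure the kernel duration directly. This is just a minimal sketch with a placeholder kernel, not your application:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n)                   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);                        // timestamp before the launch
    work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaEventRecord(stop, stream);                         // timestamp after the kernel
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                // elapsed time in milliseconds
    printf("kernel duration: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}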

That's the point that confuses me.

  1. Each kernel is a massively parallel job. While a job is executing, it usually utilizes the GPU at 100%; between jobs, utilization is 0%.

  2. The GPU dispatcher usually runs each job to completion before starting the next one, so the execution time of a job doesn't depend on the number of streams (CPU threads) simultaneously enqueuing new jobs.

  3. But of course, the overall performance (throughput) of the GPU is divided between those 4 streams, so each stream will complete 4x fewer jobs in a (large enough) fixed amount of time; see the sketch after this list.
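Here is a rough way to see point 3 in practice (illustrative only, not your application): launch the same total number of GPU-filling kernels once with 1 stream and once with 4 streams. The total wall time stays about the same, so each of the 4 streams effectively gets about 1/4 of the throughput:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

__global__ void work(float *data, int n)                   // placeholder kernel, big enough to fill the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 200; ++k)                      // arbitrary per-element compute
            data[i] = data[i] * 1.0001f + 0.5f;
}

double runBatch(int numStreams, int totalJobs, float *d, int n)
{
    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < totalJobs; ++j)                    // round-robin the jobs over the streams
        work<<<(n + 255) / 256, 256, 0, streams[j % numStreams]>>>(d, n);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    for (auto &s : streams) cudaStreamDestroy(s);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 1 << 22, totalJobs = 64;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    printf("1 stream : %.1f ms\n", runBatch(1, totalJobs, d, n));
    printf("4 streams: %.1f ms\n", runBatch(4, totalJobs, d, n));

    cudaFree(d);
    return 0;
}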

Overall, streams are useful for keeping the GPU 100% loaded. With a single stream, the GPU has to wait for full completion of one job before starting the next, since there may be data dependencies. That is the so-called tail effect, and it means less than 100% GPU utilization during the tail of a job (its last grid blocks).

With multiple streams, jobs in the other streams are independent of the current job, so the next job can start while the current one is still executing its tail, and the GPU can stay 100% utilized. This is independent of whether you use a single CPU thread or multiple CPU threads to handle the streams, since the CPU only enqueues jobs to the GPU.
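A quick way to see this effect (again, just a sketch, not your code) is to launch deliberately small kernels that use only a few blocks: with one stream they run back to back and leave most of the GPU idle, while with several independent streams they overlap and finish much sooner in total:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

__global__ void smallJob(float *data, int n)               // deliberately tiny grid: most of the GPU stays idle
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 20000; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

double runJobs(int numStreams, int jobs, float *d, int n)
{
    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < jobs; ++j)                          // independent jobs spread over the streams
        smallJob<<<4, 256, 0, streams[j % numStreams]>>>(d, n);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    for (auto &s : streams) cudaStreamDestroy(s);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 4 * 256;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    printf("1 stream : %.1f ms\n", runJobs(1, 32, d, n));  // kernels serialize in one stream
    printf("8 streams: %.1f ms\n", runJobs(8, 32, d, n));  // kernels from different streams overlap
    cudaFree(d);
    return 0;
}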