Can someone help me understand the profiling metric "Compute Utilization"?

So like, what does the stat “Compute Utilization” mean? What’s considered a “good” compute utilization? Or rather, what value is considered performant?

I tried looking this up on my own, and the gist of what I found is that it’s basically the ratio of the total number of instructions executed by the process to the total number of cycles the process took. Is this accurate?

Is your post mistakenly empty?

I wrote something, and decided it didn’t make sense, so I deleted it. I don’t know how to delete a posting entirely.

If you mean the cycles during which the process actually did work, vs. the maximum number of cycles during the execution time, I agree.
Instructions might be a bit misleading, since that would mean you would not be able to achieve 100% compute utilization with multi-cycle instructions.
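To see why an instructions-per-cycle definition would be misleading, here is a quick back-of-the-envelope calculation with made-up numbers (the 4-cycle latency is just an assumption for illustration):

```python
# Hypothetical: a kernel issues 100 instructions, each occupying its
# execution unit for 4 cycles. Numbers are illustrative only.
instructions = 100
cycles_per_instruction = 4
total_cycles = instructions * cycles_per_instruction  # 400 cycles

# If "compute utilization" were instructions / cycles, multi-cycle
# instructions would cap the metric well below 100% even with no stalls:
instruction_ratio = instructions / total_cycles
print(instruction_ratio)  # 0.25

# Counting busy cycles vs. elapsed cycles instead can reach 100%,
# because the unit is occupied for the entire duration:
busy_cycles = total_cycles
cycle_ratio = busy_cycles / total_cycles
print(cycle_ratio)  # 1.0
```

So a cycles-based definition matches the intuition of “how busy was the hardware,” while an instruction count does not.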

I think, as so often, the answer to your question is: It depends.

If you have no divergent branches, the compute utilization should converge to 100%.
In a kernel with one divergent branch, the theoretical compute utilization will be 50%, if I remember correctly: the profiler assumes each side of the branch is taken 50% of the time. In reality you will sometimes see a higher compute utilization there. This is because real data often places similar values in consecutive memory regions, so full warps end up taking the same branch.
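A toy model of that effect (this is an illustration of SIMT branch serialization, not how the profiler actually computes the metric): when both sides of a branch are taken within one 32-thread warp, the two sides run one after the other with the inactive lanes masked off.

```python
# Toy model of warp divergence at a single 2-way branch.
WARP_SIZE = 32

def warp_utilization(lanes_taking_if: int) -> float:
    """Fraction of lane-slots doing useful work for one 2-way branch."""
    lanes_taking_else = WARP_SIZE - lanes_taking_if
    if lanes_taking_if == 0 or lanes_taking_else == 0:
        # The whole warp agrees: one pass, no lanes masked off.
        return 1.0
    # Divergence: both sides execute serially, each with some lanes masked.
    useful_lane_slots = lanes_taking_if + lanes_taking_else  # always 32
    total_lane_slots = 2 * WARP_SIZE                         # two passes
    return useful_lane_slots / total_lane_slots

print(warp_utilization(32))  # 1.0 -> all lanes take the same branch
print(warp_utilization(16))  # 0.5 -> the 50% figure mentioned above
```

Note that any split at all (even 31 vs. 1) costs the full second pass in this model, which is why data that keeps whole warps on the same side recovers the utilization.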

The major problem with saying what a good compute utilization is, is that it most likely depends on the type of kernel you are building and the amount of effort you are putting into it. Some applications might not allow writing a kernel that can be processed efficiently on a GPU. Other applications can become faster if you optimize the processing flow of your data.

Additionally, if you have a task that is very load/store intensive and does little arithmetic with the data it reads, your compute utilization won’t look very good either. If you do not have to do anything else with the data, that is okay; it’s just a limit you will need to accept.

I would suggest critically reviewing the compute utilization percentage the profiler shows you, and always comparing it to what you would expect for your specific application.

That’s interesting.

I only ask because it’s roughly 90% for an application I’ve been working on.

To me, that’s amazing! I literally went from around 10% utilization to something that close to 100%.

I will always strive towards 100%, but I think 90% is good enough (for now). I hope this means the code scales to other systems as well… It’d be neat to see the times on a better GPU.

90% sounds pretty good to me. Just bear in mind that this doesn’t automatically mean you wrote efficient code.

By the way, Nsight gave me the answer to your question today. Occupancy/utilization is defined as follows: the ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor.
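Plugging some hypothetical numbers into that definition makes it concrete. The maximum number of warps per multiprocessor varies by GPU architecture (often 48 or 64); 64 below is just an assumed value, and the cycle counts are made up for illustration:

```python
# Assumed hardware limit; check your GPU's architecture for the real value.
MAX_WARPS_PER_SM = 64

def occupancy(active_warp_cycles: int, active_cycles: int) -> float:
    """Ratio of the average active warps per active cycle to the
    maximum number of warps supported on a multiprocessor."""
    avg_active_warps = active_warp_cycles / active_cycles
    return avg_active_warps / MAX_WARPS_PER_SM

# E.g. 57,600 warp-cycles of work spread over 1,000 active cycles
# means 57.6 warps active on average:
print(occupancy(57_600, 1_000))  # 0.9 -> the ~90% discussed above
```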