About latency-bound kernels


I would appreciate your help in understanding what "latency bound" means.

My current understanding is this: if a kernel (with compute and memory instructions) has many data dependencies and no instruction-level parallelism, each instruction has to wait for the previous one to finish, so the latency of the instructions cannot be hidden. As a result, compute utilization and memory bandwidth utilization are both low. The only solution I know of is to increase instruction-level parallelism.

Is this correct? And are there other cases that can make a kernel latency bound?


While instruction-level parallelism is traditionally one of the most important latency-covering mechanisms in CPUs, the primary latency-covering approach used by GPUs is thread-level parallelism. GPUs provide what is essentially zero-cost context switching between threads, so as soon as one thread stalls, another can be switched in and use the otherwise idle execution resources.

So as long as you have a sufficient number of active threads running per SM, basic latencies in the execution pipelines and memory subsystem should be covered. I have not looked into the minimum number of threads required for this on recent GPUs, but would guess it is in the range of 160-256 active threads per SM. Given the increased amount of resources available in modern GPUs, that should not be too difficult to achieve.
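One way to check how many threads can actually be resident per SM for a given kernel is the CUDA runtime occupancy API. The following is a minimal sketch, not a definitive recipe: `scaleKernel`, `residentThreadsPerSM`, and `maxThreadsPerSM` are hypothetical names made up for illustration, and device 0 is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel; any __global__ function can be queried the same way.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Number of threads that can be resident on one SM for scaleKernel at the
// given block size, or -1 if the CUDA call fails (e.g. no usable device).
int residentThreadsPerSM(int blockSize)
{
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, scaleKernel, blockSize, 0 /* dynamic shared memory bytes */);
    if (err != cudaSuccess) return -1;
    return blocksPerSM * blockSize;
}

// Hardware limit on resident threads per SM for device 0, or -1 on failure.
int maxThreadsPerSM()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.maxThreadsPerMultiProcessor;
}
```

Calling `residentThreadsPerSM(256)` from host code and comparing the result with `maxThreadsPerSM()` gives a rough idea of how much thread-level parallelism is available for covering latency. The value is an upper bound set by register and shared memory usage; the launch still has to supply enough blocks to fill the SMs.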

Instruction-level parallelism is certainly helpful as a latency-covering tool in recent GPUs, but in my experience it has never been more than a minor factor compared to thread-level parallelism.
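For reference, here is a sketch of what adding instruction-level parallelism within a thread can look like. Both kernels are hypothetical illustrations (the names `sumDependent` and `sumIndependent` are made up), each thread writes one partial sum, and `out` is assumed to hold `gridDim.x * blockDim.x` floats. The second kernel keeps four independent accumulators, so its additions do not all wait on one another.

```cuda
#include <cuda_runtime.h>

// One accumulator: every add depends on the result of the previous add.
__global__ void sumDependent(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += stride)
        acc += in[i];
    out[tid] = acc;
}

// Four accumulators: the four adds in each iteration are independent of
// each other, so the hardware can overlap their latencies.
__global__ void sumIndependent(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int i = tid;
    // Main loop: four elements per iteration, one per accumulator.
    for (; i + 3 * stride < n; i += 4 * stride) {
        a0 += in[i];
        a1 += in[i + stride];
        a2 += in[i + 2 * stride];
        a3 += in[i + 3 * stride];
    }
    // Tail loop: remaining elements for this thread.
    for (; i < n; i += stride)
        a0 += in[i];
    out[tid] = (a0 + a1) + (a2 + a3);
}
```

Because the order of the floating-point additions differs, the two kernels can produce slightly different partial sums. Whether the second version is measurably faster depends on the GPU and on how many threads are already resident per SM.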