I have a kernel that runs slower than expected. Nsight Compute gave me this profiling information. Could anybody help me improve it? In particular, how do I address the Low Utilization issue in the Compute Workload Analysis section? Thanks.
It seems you forgot to post the kernel code and its launch configuration as well as details about the GPU you are running it on. What have you tried to improve kernel performance?
Generally speaking, the LSU (load/store unit) seems to be the bottleneck. You can optimize your shared memory accesses and your loads and stores, not only in terms of memory throughput but also in terms of the number of instructions issued.
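Since the kernel itself was not posted, here is only a generic sketch of one common way to cut the number of load/store instructions: vectorized accesses. The kernel names `scale_scalar` and `scale_vec4` and the host helper `vec4_matches_scalar` are made up for illustration, and it assumes the data is 16-byte aligned and the element count is a multiple of 4. Whether this applies to your kernel depends on its access pattern.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Scalar version: one 32-bit load and one 32-bit store per element.
__global__ void scale_scalar(const float* in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

// Vectorized version: one 128-bit load and one 128-bit store per four
// elements, so the LSU handles roughly a quarter of the instructions.
// Assumes in/out are 16-byte aligned (cudaMalloc guarantees this) and
// n4 is the element count divided by 4.
__global__ void scale_vec4(const float4* in, float4* out, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[i] = v;
    }
}

// Hypothetical host helper: runs both kernels on the same input and
// returns true if their outputs are identical. n must be a positive
// multiple of 4.
bool vec4_matches_scalar(int n) {
    if (n <= 0 || n % 4 != 0) return false;

    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * static_cast<float>(i % 1024);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc((void**)&d_in, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int n4 = n / 4;
        scale_scalar<<<(n + threads - 1) / threads, threads>>>(d_in, d_a, 2.0f, n);
        scale_vec4<<<(n4 + threads - 1) / threads, threads>>>(
            reinterpret_cast<const float4*>(d_in),
            reinterpret_cast<float4*>(d_b), 2.0f, n4);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Both kernels perform the same single multiply per element,
    // so the results should match exactly.
    for (int i = 0; ok && i < n; ++i) ok = (h_a[i] == h_b[i]);

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok;
}
```

Calling `vec4_matches_scalar(1 << 20)` from `main` is enough to check that both variants agree. Profiling the two kernels separately in Nsight Compute should then show whether the LSU pressure actually drops on your GPU.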