I have a kernel that reads data from global memory into shared memory, does some calculation, and then writes the results to global memory. The profiler shows that the global memory accesses are coalesced and that there are no bank conflicts. Occupancy is 25%: each block uses about 2 KB of shared memory and I launch 32 threads per block. The profiler reports 15 active threads but only 1.5 eligible threads. The code has some branching, but the application requires it. The shared memory statistics show about 210 GB/s of bandwidth used between the SM and shared memory. Issued and executed IPC are very close (1.25), and instruction serialization is 9%.
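To sanity-check the 25% occupancy figure, here is a rough sketch of the arithmetic in plain C++. The per-SM limits in the comments (16 resident blocks, 64 resident warps, 48 KB of shared memory) are assumptions for illustration only, not values read from my device summary, and the names `SmLimits`, `resident_blocks`, and `occupancy` are hypothetical helpers. Register limits are ignored.

```cpp
#include <algorithm>

// Hypothetical per-SM limits. The real values depend on the device and
// must be taken from the device summary; e.g. {16, 64, 48 * 1024, 32}.
struct SmLimits {
    int max_blocks;    // maximum resident blocks per SM
    int max_warps;     // maximum resident warps per SM
    int shared_bytes;  // shared memory available per SM, in bytes
    int warp_size;     // threads per warp
};

// Number of warps one block occupies (rounded up).
int warps_per_block(const SmLimits& sm, int threads_per_block) {
    return (threads_per_block + sm.warp_size - 1) / sm.warp_size;
}

// Blocks that can be resident on one SM, taking the tightest of the
// block-count, warp-count, and shared-memory limits (registers ignored).
int resident_blocks(const SmLimits& sm, int threads_per_block,
                    int shared_per_block) {
    int by_warps = sm.max_warps / warps_per_block(sm, threads_per_block);
    int by_shared = shared_per_block > 0
                        ? sm.shared_bytes / shared_per_block
                        : sm.max_blocks;
    return std::min({sm.max_blocks, by_warps, by_shared});
}

// Theoretical occupancy: resident warps divided by the warp limit.
double occupancy(const SmLimits& sm, int threads_per_block,
                 int shared_per_block) {
    int blocks = resident_blocks(sm, threads_per_block, shared_per_block);
    int warps = blocks * warps_per_block(sm, threads_per_block);
    return static_cast<double>(warps) / sm.max_warps;
}
```

With the assumed limits, `occupancy(sm, 32, 2048)` gives 16 resident blocks of one warp each out of 64 warps, i.e. 0.25. Under those assumptions the cap would come from the block-count limit rather than from shared memory (2 KB per block would allow 24 blocks), but that only holds if my device actually has these limits.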
Is there any room to improve performance based on these numbers? Is shared memory bandwidth the bottleneck? Or do instruction dependencies keep the number of eligible threads low (and is there a way to tell)?
I have attached the exported profiler results and the device summary.