According to Nsight Compute, the bottleneck is visible in the scheduler statistics:
The scheduler statistics show that I should increase the number of eligible warps by reducing the time the active warps are stalled.
Is there a simple solution to reduce the stalled time for active warps?
The warp state statistics say:
[Warning] On average each warp of this kernel spends 185.8 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about 75.4% of the total average of 246.3 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality or by changing the cache configuration, and consider moving frequently used data to shared memory.
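To make the first suggestion in that warning (checking the memory access pattern) more concrete, here is a rough host-side sketch in C++ that counts how many memory sectors a single warp-wide load would touch for a given stride. It is only a model, not something taken from my kernel: the 32-thread warp and 32-byte sector size are assumed values typical of recent NVIDIA GPUs, the base address is assumed to be sector-aligned, and `sectors_per_warp_load` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <set>

// Assumed hardware parameters (typical for recent NVIDIA GPUs, not from the profiler output).
constexpr std::size_t kWarpSize = 32;     // threads per warp
constexpr std::size_t kSectorBytes = 32;  // bytes per memory sector

// Hypothetical helper: number of distinct sectors touched when thread t of one
// warp loads one element of elem_bytes bytes at element index t * stride_elems.
// The base address is assumed to be aligned to a sector boundary.
std::size_t sectors_per_warp_load(std::size_t elem_bytes, std::size_t stride_elems) {
    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t first_byte = t * stride_elems * elem_bytes;
        const std::size_t last_byte = first_byte + elem_bytes - 1;
        // An element may straddle a sector boundary, so record every sector it covers.
        for (std::size_t s = first_byte / kSectorBytes; s <= last_byte / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    return sectors.size();
}
```

Under these assumptions, a warp reading consecutive 4-byte floats (`sectors_per_warp_load(4, 1)`) touches 4 sectors, while a stride of 8 or more floats touches 32 sectors for the same amount of useful data. A high sector count per request in the memory workload analysis would point to this kind of uncoalesced access as a likely contributor to the L1TEX stalls.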