Thank you, Greg. Could you expand a little on a few points? I will try to put together a minimal reproducer, which I know will help a lot with debugging, but that will take a while, and in the meantime it’d be nice to understand things more deeply.
Why wouldn’t the ideal vs. actual calls show up if the conflict counter is firing? You mention not patching syscalls, but I’m not sure what that means in this context. If you mean calls from within the kernel to, e.g., printf or the CUB library, I don’t have any of those. The kernel is all my own code in a single file; the only external pieces are header intrinsics such as __ldg and __half2float, none of which touch shared memory.
The other oddity is that the target # of blocks in the launch bounds affects the result. For a given # of threads, if I target 1 block I get no conflicts, whereas if I target 2 I get conflicts. I don’t see how that’s possible under the programming model unless the compiler is doing something really weird, since the target # of blocks is used only in the launch bounds and nowhere else in the code.
One observation: my code requires an amount of shared memory proportional to the number of threads, and I only see conflicts when the combination of target blocks and # of threads requires the 64 KB shared memory configuration (this is on TU102). But there aren’t many configurations I can build that need only 32 KB, so that might be a red herring.
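
To spell out the arithmetic behind that observation, here is a rough host-side sketch in C++. The 64 bytes per thread figure and the function names are made up for illustration and are not taken from my kernel. It also assumes that the only shared memory carveouts on Turing are 32 KB and 64 KB per SM, and that the smallest carveout that fits the targeted number of resident blocks is the one that gets picked, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>

// Shared memory carveouts per SM on Turing (compute capability 7.5).
const std::size_t kSmallCarveoutBytes = 32 * 1024;
const std::size_t kLargeCarveoutBytes = 64 * 1024;

// Shared memory one SM must hold so that `targetBlocks` blocks can be
// resident at once, when usage scales linearly with the thread count.
// `bytesPerThread` is a hypothetical figure, not my kernel's real one.
inline std::size_t sharedBytesPerSM(std::size_t bytesPerThread,
                                    std::size_t threadsPerBlock,
                                    std::size_t targetBlocks) {
    assert(threadsPerBlock > 0 && targetBlocks > 0);
    return bytesPerThread * threadsPerBlock * targetBlocks;
}

// Smallest carveout (in KB) that fits the request, or 0 if neither fits.
// Assumption: the smallest sufficient carveout is the one selected.
inline int requiredCarveoutKB(std::size_t bytesPerSM) {
    if (bytesPerSM <= kSmallCarveoutBytes) return 32;
    if (bytesPerSM <= kLargeCarveoutBytes) return 64;
    return 0;
}
```

With the made-up 64 bytes per thread and 512 threads, targeting 1 block needs 32 KB per SM and fits the small carveout, while targeting 2 blocks needs 64 KB and forces the large one. That matches the pattern I see, where only the second case shows conflicts.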