When I got started in CUDA, a seasoned veteran told me that one important decision was whether to interpret
__shared__ memory in chunks of four or eight bytes. The 32 banks, and the rule of one chunk per bank per clock cycle, remain significant, I know, and I seem to recall that back in 2018 I could definitely see a difference depending on which bank width the kernels were set to use.
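For reference, here is a minimal sketch of the kind of setup I mean, using the runtime calls cudaDeviceSetSharedMemConfig and cudaFuncSetSharedMemConfig (the kernel itself is just a placeholder to illustrate the per-kernel variant):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: stage data through shared memory with a strided read,
// the kind of access pattern whose conflict behavior depends on bank width.
__global__ void touchShared(float *out)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[threadIdx.x] = tile[(threadIdx.x * 2) % 256];
}

int main()
{
    // Device-wide default: request eight-byte banks.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Per-kernel override: request four-byte banks for this one kernel.
    cudaFuncSetSharedMemConfig(touchShared, cudaSharedMemBankSizeFourByte);

    // Read back the device-wide setting to see what actually took effect.
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("Device shared memory config: %d\n", static_cast<int>(cfg));
    return 0;
}
```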
So, in my new code base, I have been pretty scrupulous about having functions to set these things for each kernel, especially because so much of the code is templated and I am making the precision model very flexible (the user will even be able to control what happens, down to the number of fixed-precision bits after the decimal point). However, I don't seem to find much (if any) performance difference based on the
cudaSharedMemConfig setting in my tests over the past year, on architectures 7.5 and 8.6 in particular. I could test on 8.0 and 6.1 with cards we have in house, but I have limited bandwidth of my own, and I was hoping someone might be able to tell me whether it's really even something I should bother with. Perhaps it's like
__launch_bounds__, something the compiler already handles just fine on its own and is best left alone?
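To be concrete about the comparison, this is the kind of __launch_bounds__ annotation I have in mind (an illustrative kernel, not one from my code base):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// a hint telling the compiler this kernel will never launch with more
// than 256 threads per block, and that it should budget registers so
// at least two blocks can be resident per SM.
__global__ void __launch_bounds__(256, 2)
scaleArray(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= alpha;
    }
}
```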