Do I need to set cudaSharedMemConfig anymore?

When I got started in CUDA, a seasoned veteran told me that deciding whether to interpret __shared__ memory in chunks of four or eight bytes was a big deal. I know the 32 banks and the rule of one chunk per bank per clock cycle remain significant, and I recall that back in 2018 I could definitely see a difference between kernels set to use cudaSharedMemBankSizeEightByte and kernels set to cudaSharedMemBankSizeFourByte.

So, in my new code base, I have been pretty scrupulous about having functions to set these things for each kernel, especially because so much of it is templated and I am making the precision model very flexible (the user will even be able to control the number of fixed-precision bits after the decimal point). However, I don't seem to find much (if any) performance difference based on the cudaSharedMemConfig setting in my tests over the past year, on architectures 7.5 and 8.6 in particular. I could test on 8.0 and 6.1 with cards we have in house, but my own bandwidth is limited, and I was hoping someone could tell me whether it's even worth bothering with. Perhaps it's like __launch_bounds__: something the compiler knows perfectly well how to apply and is best left alone.
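For reference, the calls I mean are the device-wide and per-kernel bank-size settings in the runtime API. A minimal sketch (`myKernel` is a placeholder name, and the exact kernel body is just for illustration):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) {   // placeholder kernel
    __shared__ float tile[256];
    int i = threadIdx.x;
    tile[i] = data[i];
    __syncthreads();
    data[i] = tile[255 - i];
}

int main() {
    // Device-wide preference for 8-byte banks:
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Per-kernel preference, overriding the device-wide setting:
    cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeFourByte);

    // Query what the device reports:
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("shared memory bank size config: %d\n", (int)cfg);
    return 0;
}
```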

This is irrelevant post-Kepler. All architectures after Kepler have fixed-size 4-byte banks, so the cudaSharedMemConfig setting has no effect on them. This is discoverable with a careful read of the “compute capabilities” section of a programming guide that still had Kepler in view, e.g. CUDA 11.x or earlier.

