Selecting the 8-byte bank mode of shared memory

Reading the Kepler tuning guide, section 1.4.3.1 says the following:

This bandwidth increase is exposed to the application through a configurable new 8-byte shared memory bank mode. When this mode is enabled, 64-bit (8-byte) shared memory accesses (such as loading a double-precision floating point number from shared memory)…

And then the Pascal tuning guide has, in section 1.4.5.1:

Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell and Pascal.

Then, in section 1.4.5.2:

"Kepler provided an optional 8-byte shared memory banking mode, which had the potential to increase shared memory bandwidth per SM for shared memory accesses of 8 or 16 bytes. However, applications could only benefit from this when storing these larger elements in shared memory (i.e., integers and fp32 values saw no benefit), and only when the developer explicitly opted in to the 8-byte bank mode via the API.

To simplify this, Pascal follows Maxwell in returning to fixed four-byte banks. This allows all applications using shared memory to benefit from the higher bandwidth, without specifying any particular preference via the API."

I will need to declare the shared memory array as double to avoid possible overflow during the computation. If I understand section 1.4.5.2 of the Pascal tuning guide correctly, I don’t need to specify anything through the API to store 8-byte elements in shared memory; I only need to declare the shared array as double instead of float.

Finally, let me ask you the following:

- Is this understanding correct?
- If so, this applies to Maxwell and Pascal, but not Kepler, which means I would still need to select the cache type in the program. The function cudaFuncSetCacheConfig is mentioned here: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/ . But I can’t find it in the API docs: https://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__EXECUTION_1g4f35d04be20a41c5df96613a748eecc1 . Any idea how I could get 8-byte shared memory working in a program that runs on Kepler and above?

Yes. You don’t need to modify the shared memory configuration anywhere for correctness, and as already pointed out, on architectures other than Kepler setting it is a no-op anyway. The purpose of setting it on Kepler is performance, not correctness. All other architectures have a maximum shared memory throughput of 32 bits (4 bytes) per bank per transaction. This means that if you request a double quantity warp-wide (i.e. the typical pattern), the request will be broken into two transactions. On Kepler, if you set the proper mode, such a request can be serviced in a single transaction. This is the “bandwidth increase” referred to in the quote at the beginning of your post.
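To make the two-transaction claim concrete, here is a small host-side sketch of the bank arithmetic. It rests on a simplified model that is an assumption, not NVIDIA's documented scheduling: 32 banks, consecutive bank-width words map to consecutive banks, and each pass lets every bank serve one word. The name warp_passes and its parameters are made up for illustration.

```cuda
#include <algorithm>
#include <cstddef>
#include <set>

// Simplified model (assumption): 32 banks, 32 threads per warp,
// one bank-width word served per bank per pass.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Returns the number of passes needed when thread t of a warp reads the
// element at index t * stride_elems from a shared array.
// elem_bytes: size of each element (4 for float, 8 for double).
// bank_bytes: bank width (4 everywhere, optionally 8 on Kepler).
int warp_passes(int elem_bytes, int bank_bytes, int stride_elems = 1) {
    std::set<long> words_per_bank[kBanks];
    for (int t = 0; t < kWarpSize; ++t) {
        long first = static_cast<long>(t) * stride_elems * elem_bytes;
        for (long b = first; b < first + elem_bytes; ++b) {
            long word = b / bank_bytes;                  // bank-width word index
            words_per_bank[word % kBanks].insert(word);  // distinct words only
        }
    }
    std::size_t passes = 0;
    for (const auto& words : words_per_bank) {
        passes = std::max(passes, words.size());
    }
    return static_cast<int>(passes);
}
```

Under this model, warp_passes(8, 4) gives 2 (doubles on 4-byte banks), while warp_passes(8, 8) and warp_passes(4, 4) both give 1, which matches the description above.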

cudaFuncSetCacheConfig is documented here:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g6699ca1943ac2655effa0d571b2f4f15

However, that function is not where you set 8-byte mode, and I don’t think it has a direct bearing on this topic.

The function to set the 8-byte/4-byte mode is cudaFuncSetSharedMemConfig documented here:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g3ef735b45b7549e936a60cb084740754
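A minimal sketch of how that call might be used follows. The kernel stage_doubles and the wrapper request_eight_byte_banks are hypothetical names invented for this example; only cudaFuncSetSharedMemConfig and cudaSharedMemBankSizeEightByte come from the runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: stages doubles through dynamic shared memory and
// writes them back unchanged. It exists only to give the call a target.
__global__ void stage_doubles(const double* in, double* out, int n) {
    extern __shared__ double staged[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) staged[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = staged[threadIdx.x];
}

// Requests 8-byte banks for stage_doubles. Call it once before launching.
// Per the discussion above, this only changes behavior on Kepler.
cudaError_t request_eight_byte_banks() {
    return cudaFuncSetSharedMemConfig(stage_doubles,
                                      cudaSharedMemBankSizeEightByte);
}
```

The setting is per kernel function, so each kernel that keeps doubles in shared memory would need its own call.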

Thanks for the explanation and for linking these docs, txbob.

To put it into context: I intend to change the shared memory array of the reduction you previously assisted me with from a 4-byte type to an 8-byte type, to avoid or at least delay overflow.
To prevent bank conflicts caused by the change in element size, do you believe this is a better approach than resorting to padding?

Yes, if you are using only doubles in shared memory, I think it makes sense to put things in 8-byte bank mode on Kepler.

Thanks!
Time to get the hands dirty a bit…

I managed to get the thing to work after the changes, the final reduction matches that of the CPU, so I didn’t break anything.
But I came across some strange behaviors, at least to my eyes.

1 - After adjusting everything to work with double while also using a float array as first input (I split into 2 kernels with different parameters instead of overloading it), the process was taking exactly 2x as long to finish. Then I replaced this line:

array_in[i] >= 0 ? sdata[tid] += array_in[i] : sdata[tid] += array_in[i] * -1;

with this:

sdata[tid] += (array_in[i] >= 0 ? array_in[i] : -array_in[i]);

And had the performance back. I have no idea how these translate to ASM.

2 - By using double instead of float in shared memory, for some reason the program became much, much more sensitive to kernel launch parameters. With float, kernel launches such as <<<128, 128>>>, <<<200, 64>>> and <<<100, 256>>> used to perform the same (time taken to complete). With double, these mostly perform 2x slower, and the best performance was achieved with <<<200, 256>>>. Reminds me of Vasily Volkov’s article on low(er) occupancy, where he launches kernels with smaller numbers and gets better performance. So it is definitely case-by-case.

3 - Using cudaFuncSetSharedMemConfig(kernel_func, cudaSharedMemBankSizeEightByte) didn’t have any impact on the program, positive or negative, so I will leave it there in case it is run on a Kepler. However, in NVIDIA Visual Profiler, under Analysis/Properties of the reduction function, the Shared Memory Bank Size line shows 4 bytes if I don’t call cudaFuncSetSharedMemConfig, even with the shared memory declared as double. If I call it, it shows 8 bytes, but as I said, it didn’t affect the program running on a Pascal.
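For readers without the original code, here is a rough sketch of the kind of kernel being discussed: a sum of absolute values over a float input, accumulated in double shared memory. It is a reconstruction under assumptions, not the poster's actual code. The names abs_sum_kernel and abs_sum are hypothetical, the block size is assumed to be a power of two, each thread accumulates in a register before writing to sdata (the original adds into sdata[tid] directly), and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each block writes one partial sum of |array_in[i]| to block_sums.
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(double) bytes of dynamic shared memory.
__global__ void abs_sum_kernel(const float* array_in, double* block_sums,
                               size_t n) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    double acc = 0.0;
    // Grid-stride loop, using the ternary form from the post above.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + tid; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float v = array_in[i];
        acc += (v >= 0 ? v : -v);
    }
    sdata[tid] = acc;
    __syncthreads();
    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = sdata[0];
}

// Host helper: copies the input over, launches the kernel, and adds the
// per-block partial sums on the CPU.
double abs_sum(const float* h_in, size_t n, int blocks, int threads) {
    float* d_in = nullptr;
    double* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    abs_sum_kernel<<<blocks, threads, threads * sizeof(double)>>>(
        d_in, d_partial, n);
    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

A launch such as abs_sum(data, n, 200, 256) would correspond to the <<<200, 256>>> configuration mentioned in point 2; whether it shows the same timing sensitivity depends on the hardware and on details of the original code that are not shown in the thread.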

I know this isn’t really what you’re asking for help with, and without seeing the rest of the code, I was wondering whether you had tried fabsf for the above calc to help the compiler out :)

__device__ float fabsf(float x)
(Calculate the absolute value of its argument.)

sdata[tid] += (array_in[i] >= 0 ? array_in[i] : -array_in[i]);

->

sdata[tid] += fabsf(array_in[i]);

The only reason I mention it is that you said that line had material impact to runtime - can’t hurt to try something else.

I had tried that, and using the built-in abs() or fabs() slowed things down considerably. Since I didn’t generate the ASM code to see the difference, I have no idea why it was so much slower.

Then, surprisingly, I saw another drop just from switching the shared memory array from the standard 4-byte type to an 8-byte type so I could work in double precision and delay overflow of the accumulator. I replaced the assignment using the ternary operator with the form you copied (which looks cleaner anyway), and the 8-byte absolute reduction then performed as fast as the conventional 4-byte reduction.

Looking at NVVP, the reduction is more memory-bound than compute-bound, so I thought using the abs()/fabs() functions wouldn’t have an impact, but that was not the case. I don’t really know why, but hey, if it works, it works.

Hi!
Sorry for replying to this topic over 2 years later. I have run into a problem with shared memory bank conflicts and would really appreciate your help.

I understand from the above that setting the bank width to 64 bits is only useful for double data. What if I pack eight uint8_t values into one 8-byte element, such as a double, and store that in shared memory? Each access would then fetch the full 64 bits, and I could unpack the element into eight uint8_t values and use them to speed up the computation. Would this scheme avoid bank conflicts and make the most of shared memory?

One of the takeaways from above is that the 8-byte bank mode is only present on Kepler devices. Those are the oldest currently supported GPUs (and support for cc 3.0 Kepler devices has already been dropped from CUDA 11). Unless you have a Kepler device, you should disregard this discussion. Since no GPUs beyond Kepler have incorporated this feature, it may not be worth your time to build production code around it.

Otherwise, yes, for a Kepler device, you could set 8-byte bank mode and cast any 8-byte type (such as a uint8_t vector type) to any other 8-byte type (such as unsigned long long).
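As an illustration of that idea, the sketch below packs eight uint8_t values into one unsigned long long with shifts, so that each thread fetches its eight bytes with a single 8-byte shared memory access. The helpers pack8 and unpack8 and the kernel byte_sum_kernel are hypothetical names for this example, and the byte order (element 0 in the low byte) is an arbitrary choice. Staging through shared memory here is purely illustrative.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Packs eight bytes into one 64-bit word; b[0] lands in the low byte.
__host__ __device__ inline unsigned long long pack8(const uint8_t b[8]) {
    unsigned long long w = 0;
    for (int k = 0; k < 8; ++k) {
        w |= static_cast<unsigned long long>(b[k]) << (8 * k);
    }
    return w;
}

// Extracts byte k (0 = low byte) from a packed word.
__host__ __device__ inline uint8_t unpack8(unsigned long long w, int k) {
    return static_cast<uint8_t>(w >> (8 * k));
}

// Hypothetical kernel: each thread stages one packed word in shared memory,
// reads it back with one 8-byte access, and sums its eight bytes.
// Launch with blockDim.x * sizeof(unsigned long long) bytes of dynamic
// shared memory.
__global__ void byte_sum_kernel(const unsigned long long* packed,
                                unsigned int* sums, int n_words) {
    extern __shared__ unsigned long long words[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words) words[threadIdx.x] = packed[i];
    __syncthreads();
    if (i < n_words) {
        unsigned long long w = words[threadIdx.x];
        unsigned int s = 0;
        for (int k = 0; k < 8; ++k) s += unpack8(w, k);
        sums[i] = s;
    }
}
```

On Kepler, this kernel would be paired with a cudaFuncSetSharedMemConfig call requesting cudaSharedMemBankSizeEightByte; on other architectures the packing still works but the bank mode request has no effect.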

Also, this isn’t primarily about eliminating bank conflicts, but instead is about maximizing shared memory bandwidth for Kepler devices.