I won’t be able to answer all these questions. However:
- One principal motivator to use shared memory is data reuse: moving the data to shared memory may save the cost of going to global memory multiple times for the same data.
- Another principal motivator for shared memory is inter-thread communication. Finding the max of a set of data is a reduction, and some realizations of a reduction use shared memory for that “communication”, even if there is no data reuse.
- Reductions over the columns of the data can be handled trivially by assigning one thread per column. If you have enough columns to saturate the GPU (at first glance, you don’t), this avoids the need for inter-thread communication; a minimal sketch follows this list.
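As a sketch of that last point (integer data assumed; colmax_naive, d_in, d_out and the launch shown are illustrative, not taken from your code), the one-thread-per-column version needs no shared memory or synchronization at all; its drawback here is that 32 threads are far too few to keep a GPU busy:

```
#include <climits>

#define COLS 32

// one thread per column: each thread walks its own column and keeps a running max,
// so no inter-thread communication is required
__global__ void colmax_naive(const int *in, int *out, int rows)
{
    int col = threadIdx.x;   // launch as colmax_naive<<<1, COLS>>>(d_in, d_out, rows);
    if (col < COLS) {
        int m = INT_MIN;
        for (int r = 0; r < rows; r++)
            m = max(m, in[r * COLS + col]);
        out[col] = m;
    }
}
```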
It’s very likely that the exact nature of the optimal design for this problem will vary slightly by the GPU type you are running on.
A problem like this is memory-bound; there is not enough compute per byte of data for it to ever be compute-bound. Given that observation, our goal as CUDA programmers should be, if possible, to load the data from global memory only once.
With 32 columns, we can assign a warp to efficiently load each row of that data. The next question is how many warps (or warp-loads) I need to saturate my GPU (which you haven’t indicated). Let’s pretend we have a V100. That GPU has 80 SMs, each of which can handle 2048 threads, i.e. 64 warps, so saturating it may require as many as 80 x 64 = 5120 warps in flight. Dividing your total warp-loads (12,000) by 5120 gives 2-3 loads per warp.

Therefore I would have each warp do 2-3 row loads, reduce those 2-3 column elements directly, then perform as much of the remaining reduction as possible in shared memory. Instead of dividing your matrix into 24 sub-matrices, on a V100 I would consider a larger number, like 12000/3 = 4000 sub-matrices, and assign a warp to each sub-matrix. Since each threadblock can hold up to 32 warps, those 32 warps can cooperate in shared memory to create a partial reduction result for 32 x 3 = 96 rows at a time, so you will need around 100-160 threadblocks.

Each threadblock produces a partial result, and these partial results need to be combined in some fashion in a second reduction phase. Canonical methods are to write the partial results to global memory and run a second-phase reduction, to use atomics at the end of the block reduction so it all happens in a single kernel launch, or to use cooperative groups to do it all in a single kernel launch. A sketch of the atomics variant follows.
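Here is a minimal sketch of the warp-per-sub-matrix design with the atomics finish, assuming integer data in a row-major rows x 32 matrix; the names (colmax_kernel, rows_per_warp, d_in, d_out) are illustrative, and the output array must be initialized to INT_MIN before launch. For float data there is no native atomicMax, so the second-phase-kernel approach would be the natural substitute.

```
#include <climits>

#define COLS            32
#define WARPS_PER_BLOCK 32   // 1024 threads per block

__global__ void colmax_kernel(const int *in, int *out, int rows, int rows_per_warp)
{
    __shared__ int smax[WARPS_PER_BLOCK][COLS];
    int lane = threadIdx.x;   // 0..31, also the column this thread owns
    int warp = threadIdx.y;   // warp index within the block
    int row0 = (blockIdx.x * WARPS_PER_BLOCK + warp) * rows_per_warp;

    // each lane reduces its column over this warp's strip of rows (the 2-3 loads);
    // lane k always reads column k, so each row load is one coalesced 32-element access
    int v = INT_MIN;
    for (int r = row0; r < row0 + rows_per_warp && r < rows; r++)
        v = max(v, in[r * COLS + lane]);
    smax[warp][lane] = v;
    __syncthreads();

    // warp 0 combines the per-warp partials held in shared memory, then merges this
    // block's 32-column partial into the global result with atomicMax
    if (warp == 0) {
        int m = smax[0][lane];
        for (int w = 1; w < WARPS_PER_BLOCK; w++)
            m = max(m, smax[w][lane]);
        atomicMax(&out[lane], m);
    }
}

// launch sketch: 3 rows per warp -> 96 rows per block -> ~125 blocks for 12,000 rows
// int rpw = 3;
// int blocks = (rows + WARPS_PER_BLOCK * rpw - 1) / (WARPS_PER_BLOCK * rpw);
// colmax_kernel<<<blocks, dim3(COLS, WARPS_PER_BLOCK)>>>(d_in, d_out, rows, rpw);
```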
The dependency of each column on the previous one will certainly make this harder, but if possible you want to partially break that dependency, so that most of the work can still be done in parallel. For example, suppose you want the result for each column to be the maximum of that column and the previous column. I would consider finding the max of each column first, then computing the necessary cumulative max over those per-column results, which keeps the serialized portion of the work small; see the snippet below.
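As a sketch of that idea (assuming the 32 per-column maxima from the kernel above have been copied back into a hypothetical host array colmax), the cumulative max is a trivial serial pass over just 32 values, so the part that cannot be parallelized is negligible:

```
#include <algorithm>

// running ("cumulative") max across the 32 per-column results, on the host
void cumulative_max(int *colmax, int cols)
{
    for (int c = 1; c < cols; c++)
        colmax[c] = std::max(colmax[c], colmax[c - 1]);
}
```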
A final note: the suggestion to use 4000 sub-matrices might seem like a lot, but I’m fairly sure 24 is too small a subdivision (for a V100). With appropriate code design you can make this an adjustable parameter and find the point of best performance using a “shmoo”: re-run your code many times with different settings for the number of sub-matrices. For people looking for the last ounce of performance, this is a common technique; a sketch of such a sweep follows.
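A hedged sketch of such a shmoo, assuming the colmax_kernel from the earlier sketch and device buffers d_in/d_out that are already allocated: varying rows_per_warp is equivalent to varying the number of sub-matrices, so the loop below times each candidate setting and reports the fastest.

```
#include <cstdio>
#include <climits>

#define COLS            32
#define WARPS_PER_BLOCK 32

// defined in the earlier sketch
__global__ void colmax_kernel(const int *in, int *out, int rows, int rows_per_warp);

void shmoo(const int *d_in, int *d_out, int rows)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int init[COLS];
    for (int c = 0; c < COLS; c++) init[c] = INT_MIN;   // identity value for max
    float best_ms = 1e30f;
    int best_rpw = 0;

    for (int rpw = 1; rpw <= 16; rpw *= 2) {            // candidate rows-per-warp settings
        int rows_per_block = WARPS_PER_BLOCK * rpw;
        int blocks = (rows + rows_per_block - 1) / rows_per_block;
        cudaMemcpy(d_out, init, sizeof(init), cudaMemcpyHostToDevice);  // reset the result
        cudaEventRecord(start);
        colmax_kernel<<<blocks, dim3(COLS, WARPS_PER_BLOCK)>>>(d_in, d_out, rows, rpw);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("rows per warp = %2d : %.3f ms\n", rpw, ms);
        if (ms < best_ms) { best_ms = ms; best_rpw = rpw; }
    }
    printf("best rows per warp: %d (%.3f ms)\n", best_rpw, best_ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```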