Trade-offs between the cost of loading data into shared memory and working directly on global memory

I have a tall matrix (32 columns by 12000 rows). I want to find the max element of each column, one column at a time (I mean there is a dependency between the columns).

I have divided the original matrix into sub-matrices by rows, e.g. sub0 -> (col 0 to col 32, row 0 to row 512). So I have 24 sub-matrices.

Right now I am working on the matrix elements in global memory. I have some questions.

  1. Is it better to split the matrix into sub-matrices with more than 1 column, or to work on it column by column (each thread block handling an entire column)?

  2. Can I load each sub-matrix as a grid into shared memory? I think if the data type is float, the shared memory I need for each block is 4*512 bytes (to work on one column of a sub-matrix), and for a grid (i.e. a full sub-matrix containing all 32 columns) it should be 4*32*512 bytes (~64 KB). So I can store it inside the shared memory of one SM (GV100 or GA100). Is this deduction correct?

  3. Is it optimal to load data from global memory to shared memory + find the max + store back to global memory, compared to just working in global memory to find the max element of each column?

  4. What about registers? Can I move data from global memory to registers faster?

  5. As far as I can see, declaring a variable that lives in a register looks the same as declaring one in global memory; just defining it inside the thread is enough for it to be placed in a register. Is it possible to do some vectorization when moving data from global memory to registers? Like this line of code for vectorized memory access:

reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];

I won’t be able to answer all these questions. However:

  1. One principal motivator to use shared memory is when there is data reuse. Moving the data to shared memory may save the cost of going to global memory multiple times for the same data.

  2. Another principal motivator for shared memory is when inter-thread communication is required. Finding the max of a set of data is a reduction, and some possible realizations of reduction may involve shared memory for “communication”, even if there is no data reuse.

  3. Performing reductions on columns of data can be designed in a trivial fashion by having one thread per column. If you have enough columns to saturate the GPU (at first glance, you don’t), then this avoids the need for inter-thread communication.
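
To illustrate point 3, a minimal sketch of that trivial approach might look like the following (my own illustration, assuming row-major float storage; the kernel name is made up). Each thread walks down its own column, so no shared memory or inter-thread communication is needed, but only 32 threads do any work:

__global__ void column_max_naive(const float *in, float *out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per column
    if (col < cols) {
        float m = in[col];                              // element (0, col)
        for (int r = 1; r < rows; ++r)
            m = fmaxf(m, in[r * cols + col]);           // stride down the column; reads are coalesced across the warp
        out[col] = m;
    }
}
// e.g. column_max_naive<<<1, 32>>>(d_in, d_colmax, 12000, 32);  // 32 threads: nowhere near saturating a GPU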

It’s very likely that the exact nature of the optimal design for this problem will vary slightly by the GPU type you are running on.

A problem like this is a memory-bound problem. There is not enough compute density per byte for it ever to be compute bound. Given that observation, our goal as CUDA programmers should be, if possible, to load the data from global memory only once.

With 32 columns, we can assign a warp to efficiently load each row of that data. The next question is how many warps (or warp-loads) I need to saturate the GPU (which you haven’t indicated). Let’s pretend we have a V100. That GPU has 80 SMs, each of which can handle 2048 threads or 64 warps. To saturate that GPU, I need (or may need as many as) 80x64 = 5120 warps. If we take your total warp-loads (12,000) and divide by 5120, we have 2-3 loads per warp. Therefore I would have each warp do 2-3 loads, perform its reduction on those 2-3 column elements directly, then perform the remaining reductions as much as possible in shared memory.

Therefore, instead of dividing your matrix initially into 24 sub-matrices, on a V100 I would consider a larger number, like 12000/3 = 4000 sub-matrices. Assign a warp to each sub-matrix. Since each threadblock can have up to 32 warps, those 32 warps can then cooperate in shared memory to create a partial reduction result for 32*3 = 96 rows at a time. Therefore you will need around 125 threadblocks (4000 warps / 32 warps per block).

These threadblocks will all produce a partial result, which will need to be combined in some fashion in a 2nd reduction phase. Canonical methods for this would be to either write the partial results to global memory and have a second-phase reduction, or use atomics at the end of the block reduction to do it in a single kernel launch, or use cooperative groups (CG) to do it all in a single kernel launch.
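
For concreteness, here is a minimal sketch of that scheme (my own code, not from this discussion: the kernel name, the runtime rowsPerWarp parameter, and row-major float storage are all assumptions). Each warp loads rowsPerWarp full rows with coalesced 32-float reads, each thread keeps a running max for its own column in a register, the 32 warps of a block then combine their results through shared memory, and each block writes one row of 32 partial column maxima to global memory for a second-phase reduction (the first of the three canonical methods above):

#include <float.h>

#define COLS 32              // fixed number of columns, per the problem statement
#define WARPS_PER_BLOCK 32   // 1024 threads per block

__global__ void column_max_partial(const float *in, float *partial,
                                   int rows, int rowsPerWarp)
{
    __shared__ float smax[WARPS_PER_BLOCK][COLS];

    int lane = threadIdx.x;                        // 0..31, also this thread's column
    int warp = threadIdx.y;                        // 0..WARPS_PER_BLOCK-1
    int firstRow = (blockIdx.x * WARPS_PER_BLOCK + warp) * rowsPerWarp;

    float m = -FLT_MAX;                            // identity value for max
    for (int i = 0; i < rowsPerWarp; ++i) {
        int r = firstRow + i;
        if (r < rows)
            m = fmaxf(m, in[r * COLS + lane]);     // the whole warp reads one row: coalesced
    }
    smax[warp][lane] = m;                          // each warp deposits its 32 partial maxima
    __syncthreads();

    if (warp == 0) {                               // warp 0 folds in the other warps' results
        for (int w = 1; w < WARPS_PER_BLOCK; ++w)
            m = fmaxf(m, smax[w][lane]);
        partial[blockIdx.x * COLS + lane] = m;     // one row of 32 partial maxima per block
    }
}

// Example launch with the V100 numbers above: 12000 rows / (32 warps * 3 rows) = 125 blocks.
//   dim3 block(COLS, WARPS_PER_BLOCK);
//   column_max_partial<<<125, block>>>(d_in, d_partial, 12000, 3);
// A second, much smaller reduction over the 125 partial rows then yields the 32 column maxima.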

The dependency of each column on the previous one is certainly going to make this harder, but if possible you want to partially break that dependency, so that much of the work can be done in parallel. For example, suppose you want the result for each column to be the maximum of that column and the previous column. I would consider finding the max of each column first, and then applying the necessary cumulative max, so that the bulk of the work parallelizes.
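
As a tiny illustration of that split (using the hypothetical "max of this column and the previous column" rule above, not necessarily your actual recurrence): compute the 32 per-column maxima in parallel first, which is the expensive part, then apply the dependency over just those 32 values:

#include <math.h>

// colMax[c] = independently computed max of column c (e.g. from the kernel sketches above)
void apply_column_dependency(const float *colMax, float *result, int cols)
{
    result[0] = colMax[0];
    for (int c = 1; c < cols; ++c)
        result[c] = fmaxf(colMax[c], colMax[c - 1]);   // cheap serial step over only 32 values
}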

A final note: the suggestion to use 4000 sub-matrices might seem like a lot. I’m fairly sure 24 is too small a subdivision (for a V100). With appropriate code design, you can make this an adjustable parameter and find the point of best performance using a “shmoo”: re-run your code many times with different settings for the number of sub-matrices. For people looking for the last ounce of performance, this is a common technique.
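
A rough sketch of what such a sweep could look like, using the column_max_partial kernel sketched earlier (the candidate values, function name, and cudaEvent-based timing are just one way to do it; d_partial must be sized for the largest block count in the sweep):

#include <cstdio>

void shmoo_rows_per_warp(const float *d_in, float *d_partial, int rows)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {1, 2, 3, 4, 6, 8, 12};             // rows handled per warp
    for (int rpw : candidates) {
        int blocks = (rows + 32 * rpw - 1) / (32 * rpw);   // 32 warps per block cover 32*rpw rows
        dim3 block(32, 32);

        cudaEventRecord(start);
        column_max_partial<<<blocks, block>>>(d_in, d_partial, rows, rpw);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("rows-per-warp = %2d -> %4d blocks, %.3f ms\n", rpw, blocks, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
// (For a real measurement, add a warm-up launch and average several runs.)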

Robert, thank you for your complete and accurate explanation. I have some doubts.

  1. I did not understand the part about “perform its reduction on those 2-3 column elements directly”. Why 2-3 columns? What does “directly” mean here?

Therefore I would have each warp do 2-3 loads, perform its reduction on those 2-3 column elements directly, then perform the remaining reductions as much as possible in shared memory.

  2. Let me explain more about the column dependency. After the reduction of the first column (i = 1) I have to do a rank-one update on the remaining 31 columns (31 = 32 - i). In that case, won’t 2-3 loads per warp reduce efficiency?

  3. With such small sub-matrices, won’t finding the max element of each column be slow? As I understand it, I have to find the max across the 3 warp-loads (each warp-load contains 1 element of each column), and after that do a second reduction between the 4000 sub-matrices to get the max of max_i.

Not 2-3 columns. 2-3 column elements. I previously stated that each warp was responsible for reading one row (per warp-load). That means each thread is responsible for 1 column. A thread reading 2-3 column elements means that one thread will read column 0, row 0, then read column 0, row 1, then read column 0, row 2, and perform the necessary reduction op across those 3 column elements.
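
Spelled out in code, the “directly” part is nothing more than a running max held in a register, before any shared memory is involved (a hypothetical helper; a row-major layout with 32 columns is assumed):

__device__ float thread_partial_max(const float *in, int firstRow, int col)
{
    float m =    in[(firstRow + 0) * 32 + col];   // this thread's column element at row firstRow
    m = fmaxf(m, in[(firstRow + 1) * 32 + col]);  // same column, next row
    m = fmaxf(m, in[(firstRow + 2) * 32 + col]);  // same column, next row
    return m;                                     // per-thread partial result, held in a register
}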

Thereafter, the remaining reduction ops in the threadblock that the warp belongs to would be performed using shared memory techniques. Then you would use one of the 3 canonical methods I mentioned to produce a single reduction result per column.

However I was also saying it was very advisable to see if the reduction work could be done somewhat independently. If the second column depends on the reduction of the first column in a non-simple way, then this may not be possible, effectively serializing the work across columns.

If so, it’s quite possible that none of my suggestions will be useful for you. In that case you have a very ugly problem, characterized by:

  • not much work to be done in each step (a reduction across 12,000 elements, fairly small work by GPU standards)
  • canonically bad data organization for the work (data stored striding through memory, columnar)

For these reasons I would seek either a data transformation (transpose of the underlying matrix, if possible in a previous step) or careful study of the algorithm to see if the serial nature can be somehow partially broken, which is what I referred to previously.

On that very last point, your description “After reduction of first column (i=1) I have to do rank one update to the rest 31 columns (31=32-i).” doesn’t educate me enough to make any suggestions. According to what I understand about a reduction, it would produce a single number. According to what I know about a rank-1 update it requires two vectors. So I’m not able to connect the dots for you. I have no idea what it means. However I would still seek to:

  1. Discover mathematically if the reduction operation per column could be done independently of the “rank-1 update” somehow. That is, if the rank-1 update could be applied in some fashion after the reduction. Yes, this may not be possible. I don’t really understand the math you are suggesting, anyway.

  2. Seek in an earlier step to see if the data could be deposited in a transposed fashion. This would at least allow the reduction per row to be handled in a coalesced fashion.
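
To make that concrete, if the data could be produced as its 32 x 12000 transpose (each original column becoming one contiguous row of 12000 floats), each column max becomes a plain contiguous reduction. A rough sketch of my own (kernel name and the 256-thread block size are arbitrary choices):

#include <float.h>

__global__ void row_max_transposed(const float *inT, float *out, int len)
{
    __shared__ float smax[256];                         // blockDim.x must be 256 here

    const float *row = inT + (size_t)blockIdx.x * len;  // one block per original column
    float m = -FLT_MAX;
    for (int i = threadIdx.x; i < len; i += blockDim.x)
        m = fmaxf(m, row[i]);                           // fully coalesced reads
    smax[threadIdx.x] = m;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {      // standard shared-memory tree reduction
        if (threadIdx.x < s)
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = smax[0];                      // max of original column blockIdx.x
}
// e.g. row_max_transposed<<<32, 256>>>(d_inT, d_colmax, 12000);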

In any event, I think I’ve shared with you most of the thoughts I have on how to tackle this problem. I may not be able to respond to further questions here. As an aside, your questions might be improved with a precise description and/or example of the work you are doing. I probably should avoid responding to questions that provide only a general description of the work, so as not to waste your time, and mine.

I would strongly encourage the use of hands-on experiments, in particular also involving use of the CUDA profiler, to explore the design space and develop some experience / intuition.

It is advisable to have a rough plan / design before embarking on coding up an algorithm. But up-front gedankenexperiments will only carry programmers part of the way to the final goal of a high-performance implementation, so there is risk in overly detailed paper designs. IMHO.