Efficient use of shared memory

I am new to CUDA and I am wondering what would be the most efficient way of solving my problem. This is not a question about implementation but more about the method.

My problem can be simplified as follows: I have a 2D array stored in consecutive memory, one row after the other. This matrix has a size of 1500x1500. I want to divide it into a 15x15 grid of equally sized sub-matrices (each 100x100) and compute the sum of all elements of each sub-matrix. The result of this operation should be a matrix of size 15x15.

In reality my matrix will be 3D and the same operation has to be performed for each 2D slice.

I have read https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ and https://devblogs.nvidia.com/using-shared-memory-cuda-cc/.
Therefore, I know that in a kernel I should try, as much as possible, to use aligned and coalesced memory accesses for each warp of 32 threads and, if possible, make use of shared memory. I also understood that when accessing strided data in global memory I should consider copying the data into aligned shared memory, but I don't know how to apply all those recommendations to my case.

I suppose that I could have one thread per submatrix that would access all elements of this submatrix and compute the sum, but I do not see how I can combine that with the aligned and coalesced access of global memory within a warp.

What would be the best way to organise my blocks/warps to get efficient global memory accesses while avoiding race conditions? As each element of the global memory only needs to be read once, is it worth the time to copy each submatrix into shared memory to do my computation?

I am using a Tesla P100 GPU (12GB of RAM, compute capability 6.0).
Thank you!

Use the NPP sum function
https://docs.nvidia.com/cuda/archive/9.1/npp/group__signal__sum.html

Thanks for the hint!
I think the function https://docs.nvidia.com/cuda/npp/group__image__sum.html could do the job, as it supports ROI and strided inputs, but I don't see how I could run all my sum operations in parallel. In total I will have 15x15x600 submatrices for which I need to compute the sum (600 being the number of 2D slices that I have), so I thought that executing all those sums in parallel would be much more efficient.
This question might be stupid but is it possible to call those NPP functions from inside a kernel?

Calling an NPP function from inside a kernel is not possible.

Unfortunately, all NPP functions use one CUDA stream, so multiple NPP function calls cannot be run concurrently on multiple CUDA streams.

OK, so if I understand correctly, your suggestion would be to run all the sums sequentially?

That will not utilize your high-end P100 GPU very well … But you could give it a first try to see whether it is fast enough for your purpose.

Regarding the NPP library: it would be good if, in one of the next CUDA Toolkits, it got an additional set of functions that take an extra parameter with the CUDA stream (so that one can run the NPP functions concurrently in different streams) … in the same way as it is done in the cuDNN library …

It is not possible to call NPP functions from inside a kernel.

The top 2 optimization priorities for any CUDA programmer are:

  • make efficient use of the memory subsystems
  • launch enough blocks/threads to saturate the GPU (expose enough parallelism)

Shared memory provides value in 2 situations:

  1. When it’s necessary to communicate data from one thread to another (for capability)
  2. When there is data reuse (as an optimization - to make things run faster)

Therefore, there is no benefit to using shared memory with respect to item 2 (as you stated, you only need to load each item from global memory once). However, for item 1 it may be useful; see below.

The problem you are describing is a segmented reduction. Various GPU libraries such as Thrust and CUB have segmented reduction facilities (e.g. thrust::reduce_by_key). It would only require one such call to produce the entire 15x15x600 output. For simplicity, and perhaps as a “baseline”, you may wish to try this first. Since you are asking just for the method, not an implementation, googling “thrust::reduce_by_key” and/or “segmented reduction” should get you going.
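
For reference, a minimal, untested sketch of that thrust::reduce_by_key baseline could look like the following. It assumes fixed 100x100 sub-blocks as described in the original question, and the functor names (key_of, gmem_of) are purely illustrative. The idea is to iterate in segment-major order via a permutation iterator, so that all elements of one sub-block carry the same consecutive key and a single call produces all 15x15x600 sums:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Assumed problem dimensions: 1500x1500x600 volume, 100x100 sub-blocks
constexpr int sbw = 100, sbh = 100;      // sub-block width / height
constexpr int sbiw = 15, sbih = 15;      // sub-blocks per slice in each direction
constexpr int mwidth = sbiw * sbw, mheight = sbih * sbh, mdepth = 600;

struct key_of {   // segment id for a segment-major element index
  __host__ __device__ int operator()(long long i) const { return (int)(i / (sbw * sbh)); }
};
struct gmem_of {  // map a segment-major element index to the row-major memory index
  __host__ __device__ long long operator()(long long i) const {
    long long seg = i / (sbw * sbh), within = i % (sbw * sbh);
    long long slice = seg / (sbiw * sbih), s = seg % (sbiw * sbih);
    long long sbRow = s / sbiw, sbCol = s % sbiw;
    long long r = within / sbw, c = within % sbw;
    return slice * (long long)mwidth * mheight
         + (sbRow * sbh + r) * (long long)mwidth + sbCol * sbw + c;
  }
};

int main() {
  thrust::device_vector<float> data((long long)mwidth * mheight * mdepth, 1.0f);
  thrust::device_vector<float> sums(sbiw * sbih * mdepth);   // 15x15x600 results
  thrust::counting_iterator<long long> idx(0);
  auto keys = thrust::make_transform_iterator(idx, key_of());
  auto vals = thrust::make_permutation_iterator(data.begin(),
                thrust::make_transform_iterator(idx, gmem_of()));
  // Elements of the same sub-block are visited consecutively, so one call suffices
  thrust::reduce_by_key(keys, keys + data.size(), vals,
                        thrust::make_discard_iterator(), sums.begin());
  return 0;
}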

You can also write your own kernel. For this problem size, where each reduction (result point) involves summing 100x100 elements, the problem is large enough to dedicate either a warp or a block to each result. Your next objective (for speed) is to saturate the GPU. Choose a method that will result in a number of threads equal to or larger than the GPU carrying capacity, which is 2048 * the number of SMs (56 on your P100). So you want a minimum of 56*2048 = ~115,000 threads, total, in your kernel launch to saturate the P100. 15*15*600 is 135,000, so even if we dedicated one thread per output point, we would saturate the GPU.

However, one thread per output point will not be optimal for coalesced loads from global memory (our primary concern here). For that we want to dedicate at least a warp to each output point, and we could choose larger entities, such as a threadblock per output point.

With either a warp or threadblock per output point, a very efficient segmented reduction can be written in ordinary CUDA kernel code. I have written several of those on these forums, you could use one as a starting point.
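
For illustration only, here is a hypothetical sketch of the warp-per-output-point idea (not one of the forum codes referred to above), assuming fixed-size sub-blocks. It would be launched with, e.g., 128-thread blocks and a grid of ((sbiw*sbih*32 + 127)/128, number_of_slices) blocks, with blockIdx.y selecting the 2D slice:

// One warp per sub-block: lanes read adjacent columns (coalesced), march down
// the rows, then combine the 32 per-lane partial sums with warp shuffles.
__global__ void warp_per_subblock_sum(const float* __restrict__ in, float* __restrict__ out,
                                      int width, int sbw, int sbh, int sbiw, int sbih)
{
    int warps_per_block = blockDim.x / 32;
    int warp_id = blockIdx.x * warps_per_block + threadIdx.x / 32; // sub-block index within the slice
    int lane    = threadIdx.x % 32;
    if (warp_id >= sbiw * sbih) return;                            // whole warp exits together
    int sb_row = warp_id / sbiw, sb_col = warp_id % sbiw;
    const float* base = in + (size_t)blockIdx.y * width * sbh * sbih       // start of this slice
                           + (size_t)sb_row * sbh * width + sb_col * sbw;  // top-left corner of the sub-block
    float sum = 0.0f;
    for (int r = 0; r < sbh; r++)                    // march down the rows of the sub-block
        for (int c = lane; c < sbw; c += 32)         // adjacent lanes read adjacent addresses (coalesced)
            sum += base[(size_t)r * width + c];
    for (int offset = 16; offset > 0; offset >>= 1)  // warp-level tree reduction of the 32 partial sums
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (lane == 0)
        out[(size_t)blockIdx.y * sbiw * sbih + warp_id] = sum;
}

The only shared state here is within a warp, so the final combination can be done with shuffles instead of shared memory; a block-per-output-point variant would use a shared-memory sweep instead, as in the code posted further down.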

Thanks Robert_Crovella for your very clear explanation.
Would you maybe have a link to such a post you have written in the past?

After reading your reply, what I have in mind is to use one thread per row of each submatrix. If we take one block per result point, that would mean something like 100 threads per block; each of those threads would store its row sum in an array in shared memory, and then this array would be summed.
The problem I see with this approach (and I don't know if it really is a problem) is that the global memory accesses within a warp (32 rows of the submatrix) won't be coalesced at all. On the other hand, each thread will read consecutive memory, which looks interesting.

What do you think of this approach?

That is a bad idea. It will lead to inefficient use of the global memory read pipe.

Use a warp, or more, to read adjacent values in memory.

https://devtalk.nvidia.com/default/topic/1050216/cuda-programming-and-performance/how-to-add-pointer-array-value/post/5335141/#5335141

https://devtalk.nvidia.com/default/topic/1027177/cuda-programming-and-performance/compute-segmented-sum-using-cuda/post/5224636/#5224636

I do not understand how I should do this, but it might be due to a misunderstanding of how memory accesses work.

So let's consider that I assign one block per output point. Suppose that I don't use one thread per element of the submatrix, so I have to loop over some matrix elements in my kernel. The code would look like this (not functional, but it gives the idea):

__global__ void myKernel(float* data)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float result = 0;
    int N; // number of elements that need to be read by each thread
    for (int k = 0; k < N; k++) // loop over one row (consecutive accesses)
    {
        result += data[i+k];
    }
}

Are the memory accesses to data efficient in that case (cached from one loop iteration to the next)?
If yes, I do not see why my previous suggestion is wrong. If no, I suppose that I should use one thread per column, so that at each iteration all threads within a warp read consecutive elements.
The previous code would then look like this:

__global__ void myKernel(float* data)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float result = 0;
    int N; // number of elements that need to be read by each thread
    for (int k = 0; k < N; k++) // loop over one column (strided accesses)
    {
        result += data[i+k*stride]; // where stride is the number of elements in one row of the big (1500x1500) matrix
    }
}

What am I missing?

Yes, they are efficient.

Caching is entirely unnecessary and irrelevant for that formulation.

Your first realization is already doing that.

If a given thread were reading consecutive elements within a row, the realization would be:

int idx = threadIdx.x+blockDim.x*blockIdx.x;

for (int i = 0; i < width_of_row; i++)
  result += data[idx*width_of_row + i];

However both of your realizations involve adjacent threads in a warp reading adjacent elements in memory. That is the efficient way.

Thanks for the clarification. Just to be sure that I get it right:

From your previous post, I understand that in the case of a matrix reduction (summing all values of a submatrix, and therefore accessing strided data), it is more efficient to fetch non-consecutive data between two iterations of the for loop, and to have each thread within a warp access consecutive data at each iteration of the loop.
In my case, where the consecutive data runs along the rows of the submatrix, it means that each thread will compute the sum over one column.

Am I right?

Just another related question: is there an impact on performance depending on whether the data accessed by consecutive iterations of a for loop is consecutive or not?
I thought that it was better to access consecutive data between iterations of the loop, but from the previous post I realize that I might be wrong…

Thanks a lot anyway for your help Robert!

Yes. The exact realization can vary depending on what you are doing, and your data shapes, but a typical approach is to have each thread sum a column, and at the end, add the column sums together, to create the reduction result element. It might be one thread doing this (producing column sums) or one warp or threadblock doing this, perhaps, for a sum of a matrix or sub-matrix (or even the entire grid).
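
As a bare-bones illustration of that pattern (hypothetical names, a single row-major tile of width x height, one thread per column):

// Each thread owns one column; at every iteration of the row loop, adjacent
// threads of a warp read adjacent addresses, so the loads coalesce.
__global__ void column_sums(const float* data, float* col_sum, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float sum = 0.0f;
    for (int row = 0; row < height; row++)
        sum += data[(size_t)row * width + col];  // stride of one full row per iteration
    col_sum[col] = sum;   // per-column sums; combine them afterwards for the tile total
}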

There probably is, yes, depending on the exact two cases you are comparing. How much performance effect there might be is difficult to say; you could learn by benchmarking. And I'm confident that any performance effects (losses) are less than the loss you will experience if you try fundamentally uncoalesced access. Warp-level coalesced access is your number 1 goal (one of the 2 top-level optimization priorities, anyway). Make sure you do that first. Then you can look at further optimization.

https://stackoverflow.com/questions/58780710/dont-understand-why-column-addition-faster-than-row-in-cuda/58780836#58780836

I tried to implement my kernel following your suggestions and I came to a rather strange result.

I used one block per submatrix with one warp per block. Each thread of a warp computes the sum over one column, as agreed before.

Therefore I have the following code:

__global__ void segmented_reduction(fp* data, int* data_size, uint16_t* zones, int* zone_size)
{
	/**
	 * Computes the segmented reduction of the 3D matrix data
	 * Segments are described in the zones matrix where the 4 elements are:
	 * 	start width, end width, start height, end height (in terms of index)
	 *
	 * data_size is the size of the matrix data, first horizontal, then vertical, then depth (from most consecutive to most strided)
	 * blockIdx.x is along the horizontal direction, blockIdx.y vertical, blockIdx.z in depth
	 *
	 * zone_size represents the number of zones in each direction
	 * The zone index is determined by blockIdx.x, blockIdx.y
	 * The slice is determined by blockIdx.z.
	 *
	 * Once the zone ID is determined, the pointer to the top left corner of the zone is created
	 *
	 */

	__shared__ uint16_t zoneLim[4];
	fp temp = 0;
	uint16_t nCols, nRows;

	if (threadIdx.x < 4)
	{
		zoneLim[threadIdx.x] = zones[blockIdx.y*zone_size[0]*4 + blockIdx.x*4 + threadIdx.x];
	}
	__syncthreads(); // make sure zoneLim is populated before all threads read it
	nCols = zoneLim[1]-zoneLim[0];
	nRows = zoneLim[3]-zoneLim[2];

	__shared__ fp tempSum[64];

	int sliceDim = data_size[0]*data_size[1];
	fp* zonePtr = data + blockIdx.z * sliceDim + zoneLim[2] * data_size[0] + zoneLim[0]; // Pointer to the top left corner of the zone


    for (int iCol=threadIdx.x; iCol< zoneLim[1]-zoneLim[0]; iCol += 32) // We use 32 threads, but the zone can be larger. Some threads have to compute 2 columns
    {
        for (int iRow = 0; iRow < nRows; iRow++)
        {
            temp += zonePtr[iRow*data_size[0] + iCol];
        }
        tempSum[iCol] = temp;
        temp=0;
    }


	__syncthreads();
	warpReduce(tempSum, threadIdx.x);
}
__device__ void warpReduce(volatile fp* data, int tid)
{
	data[tid] += data[tid+32];
	data[tid] += data[tid+16];
	data[tid] += data[tid+8];
	data[tid] += data[tid+4];
	data[tid] += data[tid+2];
	data[tid] += data[tid+1];
}

This code gives me the correct result in 65 ms. However, I have an occupancy of only 50%, as I reach the maximum number of blocks per SM without using the maximum number of warps per SM.
Therefore I had the idea of splitting the computation of the sum over one column between 2 threads per column, to reach 64 threads per block and therefore 2048 threads per SM. I did it and was surprised that my computation time doubled, even though the (theoretical) occupancy went to 100%, without any branch divergence. Here is my code (updated to use 64 threads per block):

__global__ void segmented_reduction(fp* data, int* data_size, uint16_t* zones, int* zone_size)
{
	__shared__ uint16_t zoneLim[4];
	fp temp = 0;
	uint16_t nCols, nRows;

	if (threadIdx.x < 4)
	{
		zoneLim[threadIdx.x] = zones[blockIdx.y*zone_size[0]*4 + blockIdx.x*4 + threadIdx.x];
	}
	__syncthreads();
	nCols = zoneLim[1]-zoneLim[0];
	nRows = zoneLim[3]-zoneLim[2];

	__shared__ fp tempSum[128]; // 64 is better for reduction as we use 32 threads

	int sliceDim = data_size[0]*data_size[1];
	fp* zonePtr = data + blockIdx.z * sliceDim + zoneLim[2] * data_size[0] + zoneLim[0]; // Pointer to the top left corner of the zone

	if (threadIdx.x < 32)
	{
		for (int iCol=threadIdx.x; iCol< zoneLim[1]-zoneLim[0]; iCol += 32)
		{
			for (int iRow = 0; iRow < (int) nRows/2; iRow++)
			{
				temp += zonePtr[iRow*data_size[0] + iCol];
			}
			tempSum[iCol] = temp;
			temp=0;
		}
	}
	else
	{
		for (int iCol=threadIdx.x-32; iCol< zoneLim[1]-zoneLim[0]; iCol += 32)
		{
			for (int iRow = (int) nRows/2; iRow < nRows; iRow++)
			{
				temp += zonePtr[iRow*data_size[0] + iCol];
			}
			tempSum[iCol+64] = temp;
			temp=0;
		}
	}

	__syncthreads();
	tempSum[threadIdx.x] += tempSum[threadIdx.x + 64];
	__syncthreads();

	if (threadIdx.x <32)
	{
		warpReduce(tempSum, threadIdx.x);
	}
}

I would expect such a difference in computing time if I had branch divergence, but that is not the case here, as each warp takes either the if branch or the else branch.

Furthermore, the global load efficiency that I get is 62%, which I find pretty low. Is that normal, and can I do something about it? I have to admit that I did not use aligned memory when allocating my matrices; this is the next step I will try.
What am I missing here (especially regarding the change in computing time)?

No, don't do that. That will limit your max occupancy to 50% on most GPUs. Use as many warps as needed to cover the width of your sub-block; for example, it appeared that your sub-block was 100x100, so use 4 warps: 4x32 = 128 threads. You can still keep one thread per column that way, and the code should be fairly simple.

if (threadIdx.x < 4)

You don't ever want to see that in code that is doing bulk loading from global memory. Arrange a group of threads as wide as your sub-block, and let them march down the columns to load the entire sub-block.

The thing is that the sub-block size might change to values around 40-50, but OK, in that case I will use 2 warps. I was afraid of having too many unused warps.
I still do not understand why I get twice the computing time when I go from one thread per column to 2 threads per column, going from 50% to 100% occupancy. My intuition would be that the computing time should decrease, as there are fewer operations per thread.

In that part of the code, I just have to load 4 values from global to shared memory; those 4 values are the indices of the 4 corners of the submatrix (the submatrices do not all have exactly the same size). Those are not the actual data values. So I do not really understand how to put your remark into practice.

This is what I had in mind:

$ cat t1602.cu
#include <iostream>

const int sub_block_width  = 100;  // must be 1024 or less
const int sub_block_height = 100;
const int mwidth  = 15*sub_block_width;
const int mheight = 15*sub_block_height;
const int mdepth  = 600;
const int sub_blocks_in_width  = mwidth/sub_block_width;
const int sub_blocks_in_height = mheight/sub_block_height;

template <typename T>
__global__ void k(const T * __restrict__ in, T * __restrict__ out, int h, int w, int d, int sbw, int sbh, int sbiw){

  unsigned block_id = blockIdx.x;
  extern __shared__ T sdata[];
  size_t start_idx = (size_t)block_id * w * sbh;
  for (int i = 0; i < sbiw; i++){
    T myval = 0;
    if (threadIdx.x < sbw){// sum columns
      const T *cols = in + start_idx;
      for (int j = 0; j < sbh; j++){
        myval += cols[threadIdx.x];
        cols += w;}
      }
    sdata[threadIdx.x] = myval; // shared sweep reduction
    for (int k = blockDim.x>>1; k > 0; k>>=1){
      __syncthreads();
      if (threadIdx.x < k) sdata[threadIdx.x] += sdata[threadIdx.x+k];}
    if (!threadIdx.x) out[(block_id*sbiw)+i] = sdata[0]; // write sub-block result
    start_idx += sbw; // do the next horizontal sub-block
    }
}

typedef float mt;

int main(){

  const int block_width = ((sub_block_width + 31)>>5)<<5; // round up to next multiple of 32
  if (block_width > 1024) {std::cout << "sub_block_width of " << sub_block_width << "is greater than 1024, too large for this code design" << std::endl; return 0;}
  const size_t ds = mwidth*mheight*mdepth;
  const size_t rs = sub_blocks_in_width*sub_blocks_in_height*mdepth;
  mt *h_data, *d_data, *h_result, *d_result;
  h_data = new mt[ds];
  h_result = new mt[rs];
  cudaMalloc(&d_data, ds*sizeof(mt));
  cudaMalloc(&d_result, rs*sizeof(mt));
  for (int i = 0; i < ds; i++) h_data[i] = 1;
  cudaMemcpy(d_data, h_data, ds*sizeof(mt), cudaMemcpyHostToDevice);
  k<<<sub_blocks_in_height*mdepth,block_width, block_width*sizeof(mt)>>>(d_data, d_result, mheight, mwidth, mdepth, sub_block_width, sub_block_height, sub_blocks_in_width);
  cudaMemcpy(h_result, d_result, rs*sizeof(mt), cudaMemcpyDeviceToHost);
  for (int i = 0; i < rs; i++)
    if (h_result[i] != sub_block_width*sub_block_height) { std::cout << "mismatch at: " << i << " was: " << h_result[i] << " should be: " << sub_block_width*sub_block_height << std::endl; return 0;}
}
$ nvcc -o t1602 t1602.cu
$ nvprof ./t1602
==25742== NVPROF is profiling process 25742, command: ./t1602
==25742== Profiling application: ./t1602
==25742== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.44%  1.42115s         1  1.42115s  1.42115s  1.42115s  [CUDA memcpy HtoD]
                    0.55%  7.9189ms         1  7.9189ms  7.9189ms  7.9189ms  void k<float>(float const *, float*, int, int, int, int, int, int)
                    0.00%  44.065us         1  44.065us  44.065us  44.065us  [CUDA memcpy DtoH]
      API calls:   79.77%  1.42998s         2  714.99ms  8.2951ms  1.42169s  cudaMemcpy
                   19.60%  351.31ms         2  175.65ms  345.43us  350.96ms  cudaMalloc
                    0.30%  5.4315ms         4  1.3579ms  689.26us  3.3206ms  cuDeviceTotalMem
                    0.28%  5.0121ms       388  12.917us     379ns  531.24us  cuDeviceGetAttribute
                    0.04%  654.48us         4  163.62us  102.83us  336.49us  cuDeviceGetName
                    0.01%  148.10us         1  148.10us  148.10us  148.10us  cudaLaunchKernel
                    0.00%  26.319us         4  6.5790us  3.5510us  11.711us  cuDeviceGetPCIBusId
                    0.00%  7.3040us         8     913ns     415ns  1.5200us  cuDeviceGet
                    0.00%  3.5450us         3  1.1810us     445ns  1.8090us  cuDeviceGetCount
                    0.00%  2.6100us         4     652ns     486ns     874ns  cuDeviceGetUuid
$

Not guaranteed to be defect-free.
It runs in about 8 ms on my Tesla V100, CUDA 10.1.243, RHEL 7, using nvprof. (What GPU are you running on?)
I did not grasp until now that the submatrices all have different sizes.

It's not obvious to me either. I generally prefer not to analyze code simply by staring at it; I like to use tools (compilers, profilers, cuda-memcheck, etc.). Furthermore, let's just posit that it's entirely possible you made a mistake in code you haven't shown. The net of these statements is that performance questions (in my opinion, really, I think that should be implicit) should be accompanied by a complete test case. What I have (now) posted by way of example is:

  1. A complete, compilable code.
  2. A description of the system (GPU, CUDA version, OS)
  3. The complete compilation instructions
  4. An indication of how measurements are being made, if it's not obvious.

Thanks for your advice. I will work on it and come back later with a more detailed code sample and compilation and profiling information.

For now, the major difference I see is that each of your blocks performs the sum of a full row of horizontal submatrices, one after the other, whereas in my case I had one block per submatrix.

I ran your code on my GPU (Tesla P100, CUDA 10.1) and it takes 17 ms.