I tried to implements my kernel following your suggestions and I came to a rather strange result.

I used one block per submatrix with one warp per block. Each thread of a warp is computing the sum over one column. as agreed before.

Therefore I have the following code :

```
__global__ void segmented_reduction(fp* data, int* data_size, uint16_t* zones, int* zone_size)
{
/**
* Computes the segmented reduction of the 3D matrix data
* Segments are described in the zones matrix where the 4 elements are:
* start width, end width, start height, end height (in terms of index)
*
* data_size is the size of the matrix data, first in horizontal, then vertical, then in depth (from consecutive from strided)
* blockIdx.x is along the horizontal direction, blockIdx.y vertical, blockIdx.z in depth
*
* zone_size represent the number of zones in each direction
* The zone index is determined by blockIdx.x, blockIdx.y
* The slice is determined by blockIdx.z.
*
* Once the zone ID is determined, the pointer to the top left corner of the zone is created
*
*/
__shared__ uint16_t zoneLim[4];
fp temp = 0;
uint16_t nCols, nRows;
if (threadIdx.x < 4)
{
zoneLim[threadIdx.x] = zones[blockIdx.y*zone_size[0]*4 + blockIdx.x*4 + threadIdx.x];
}
nCols = zoneLim[1]-zoneLim[0];
nRows = zoneLim[3]-zoneLim[2];
__shared__ fp tempSum[64];
int sliceDim = data_size[0]*data_size[1];
fp* zonePtr = data + blockIdx.z * sliceDim + zoneLim[2] * data_size[0] + zoneLim[0]; // Pointer to the top left corner of the zone
for (int iCol=threadIdx.x; iCol< zoneLim[1]-zoneLim[0]; iCol += 32) // We use 32 threads, but the zone can be larger. Some threads have to compute 2 columns
{
for (int iRow = 0; iRow < nRows; iRow++)
{
temp += zonePtr[iRow*data_size[0] + iCol];
}
tempSum[iCol] = temp;
temp=0;
}
__syncthreads();
warpReduce(tempSum, threadIdx.x);
}
```

```
__device__ void warpReduce(volatile fp* data, int tid)
{
data[tid] += data[tid+32];
data[tid] += data[tid+16];
data[tid] += data[tid+8];
data[tid] += data[tid+4];
data[tid] += data[tid+2];
data[tid] += data[tid+1];
}
```

This code gives me the correct result in 65ms. However I have an occupancy of 50% only as I reach the maximum number of blocks per SM without using the maximum number of warp per SM.

Threfore I had the idea to split the computation of the sum over one column by dedicating 2 threads per columns, to achieve 64 threads per block and therefore 2048 threads per SM. I did it and I was surprised that my computation time doubled, even though the (theoretical) occupancy went to 100%, without any branch divergence. Here is my code (updated to take into account the 64 thread per warp):

```
__global__ void segmented_reduction(fp* data, int* data_size, uint16_t* zones, int* zone_size)
{
__shared__ uint16_t zoneLim[4];
fp temp = 0;
uint16_t nCols, nRows;
if (threadIdx.x < 4)
{
zoneLim[threadIdx.x] = zones[blockIdx.y*zone_size[0]*4 + blockIdx.x*4 + threadIdx.x];
}
__syncthreads();
nCols = zoneLim[1]-zoneLim[0];
nRows = zoneLim[3]-zoneLim[2];
__shared__ fp tempSum[128]; // 64 is better for reduction as we use 32 threads
int sliceDim = data_size[0]*data_size[1];
fp* zonePtr = data + blockIdx.z * sliceDim + zoneLim[2] * data_size[0] + zoneLim[0]; // Pointer to the top left corner of the zone
if (threadIdx.x < 32)
{
for (int iCol=threadIdx.x; iCol< zoneLim[1]-zoneLim[0]; iCol += 32)
{
for (int iRow = 0; iRow < (int) nRows/2; iRow++)
{
temp += zonePtr[iRow*data_size[0] + iCol];
}
tempSum[iCol] = temp;
temp=0;
}
}
else
{
for (int iCol=threadIdx.x-32; iCol< zoneLim[1]-zoneLim[0]; iCol += 32)
{
for (int iRow = (int) nRows/2; iRow < nRows; iRow++)
{
temp += zonePtr[iRow*data_size[0] + iCol];
}
tempSum[iCol+64] = temp;
temp=0;
}
}
__syncthreads();
tempSum[threadIdx.x] += tempSum[threadIdx.x + 64];
__syncthreads();
if (threadIdx.x <32)
{
warpReduce(tempSum, threadIdx.x);
}
}
```

I would expect such a difference in computing time if I had branch divergence but it is not the case here as each warp is taking either the *if *or *else *statement.

Furthermore, the global load efficiency efficiency that I have is 62%., which I find pretty low. Is that normal and can I do something about it. I have to admit that I did not use aligned memory when assigning my matrices. This is the next step I will try.

What I am missing here (especially for the change in computing time)?