Query regarding improving the gld_efficiency metric

Below is the code for a simple maxpool2D kernel. Because of the strided access pattern, the gld_efficiency metric is low. Each thread also reads multiple elements because of the indexing. Is there a better way to write this code to improve gld_efficiency?

Please post code instead of pictures of code. Precede and follow the code with ``` to achieve code markup.

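```
#include <climits>  // INT_MIN

// filter_size and stride are not defined in this snippet; they are assumed to
// be compile-time constants (for example, macros) defined elsewhere.
__global__ void maxpool_2d(int *data, int *out, int n, int out_width, int out_height)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if ((row < out_height) && (col < out_width))
    {
        // Start from INT_MIN rather than 0 so that all-negative windows work.
        int max_value = INT_MIN;
        for (int i = 0; i < filter_size; i++)
        {
            for (int j = 0; j < filter_size; j++)
            {
                // Adjacent threads (col, col + 1) read addresses that are
                // `stride` elements apart, so the loads are not coalesced.
                int value = data[(stride * row + i) * n + (stride * col + j)];
                max_value = (max_value > value) ? max_value : value;
            }
        }

        out[row * out_width + col] = max_value;
        //printf("row = %d , col = %d , val = %d \n" , row , col, out[row * out_width + col]);
    }
}
```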

Two approaches come to mind for improving the gld_efficiency metric:

  1. Rather than have one thread handle each output data point, have a warp handle it, and implement a warp-stride loop to cover the patch indicated by the filter dimensions (filter_size, filter_size). If filter_size is large, on the order of 32 or larger, this may be an effective technique both to improve the metric and to improve performance. If filter_size is small, this may improve the metric somewhat, but probably won’t have a significant effect on performance, and may make performance worse.

  2. Reorganize data storage. For example, rearrange the column storage so that the columns read by adjacent threads, which are currently separated by the stride, are stored at adjacent locations. This would allow coalesced access.
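As a rough illustration of suggestion 1, here is a minimal sketch of a warp-per-output kernel. It is not a tuned implementation. The names `maxpool_2d_warp` and `maxpool_warp_host` are made up for this example, `filter_size` and `stride` are passed as arguments (the original snippet does not show where they come from), the input is assumed to be square (n x n), and the block size is assumed to be a multiple of 32 so that every warp maps to exactly one output element. Error checking is omitted for brevity.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant: one warp (32 threads) computes one output element.
__global__ void maxpool_2d_warp(const int *data, int *out, int n,
                                int out_width, int out_height,
                                int filter_size, int stride)
{
    const int lane = threadIdx.x % warpSize;
    // Global warp index doubles as the flat index of the output element.
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    if (warp_id >= out_width * out_height) return;  // uniform across the warp

    const int row = warp_id / out_width;
    const int col = warp_id % out_width;

    int max_value = INT_MIN;
    // Warp-stride loop over the flattened filter patch: consecutive lanes
    // read consecutive elements within a patch row.
    for (int k = lane; k < filter_size * filter_size; k += warpSize) {
        const int i = k / filter_size;
        const int j = k % filter_size;
        const int value = data[(stride * row + i) * n + (stride * col + j)];
        max_value = max(max_value, value);
    }

    // Combine the 32 per-lane partial maxima with a shuffle reduction.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        max_value = max(max_value,
                        __shfl_down_sync(0xffffffff, max_value, offset));

    if (lane == 0) out[row * out_width + col] = max_value;
}

// Illustrative host helper, assuming a square n x n input.
std::vector<int> maxpool_warp_host(const std::vector<int> &in, int n,
                                   int filter_size, int stride)
{
    const int out_w = (n - filter_size) / stride + 1;
    std::vector<int> result(out_w * out_w);
    int *d_in, *d_out;
    cudaMalloc(&d_in, in.size() * sizeof(int));
    cudaMalloc(&d_out, result.size() * sizeof(int));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    const int threads = 128;                 // multiple of 32: whole warps
    const int warps_per_block = threads / 32;
    const int blocks = (out_w * out_w + warps_per_block - 1) / warps_per_block;
    maxpool_2d_warp<<<blocks, threads>>>(d_in, d_out, n, out_w, out_w,
                                         filter_size, stride);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, result.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With a small filter (say 2x2 or 3x3) most of the 32 lanes in each warp sit idle in the loop, which is why this variant is only likely to pay off for large filter_size.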

Neither of these suggestions is going to be easy or trivial to implement. Also, it’s possible that refactoring efforts in this area may not improve performance. I’m not able to determine what the limiting factors are in your overall code, or whether this activity represents an important work area for your code. For example, your code execution could be dominated by the previous weight-update layer, and not by the pooling layer.
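To make suggestion 2 a bit more concrete, here is one possible layout, sketched under several assumptions: each input row is split into `stride` "planes", so that logical column c is stored in plane c % stride at slot c / stride. With that layout, the threads for col and col + 1 read adjacent addresses. The names `deinterleave_columns`, `maxpool_2d_planes` and `maxpool_planes_host` are invented for this example, `filter_size` and `stride` are passed as arguments, the input is assumed to be square, and error checking is omitted.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical reordering pass: logical column c of each row is written to
// plane (c % stride), slot (c / stride). Each plane holds plane_w slots.
__global__ void deinterleave_columns(const int *data, int *planes, int n_rows,
                                     int n, int stride, int plane_w)
{
    const int r = blockIdx.y * blockDim.y + threadIdx.y;
    const int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < n_rows && c < n)
        planes[r * stride * plane_w + (c % stride) * plane_w + c / stride] =
            data[r * n + c];
}

// Same thread-per-output mapping as the original kernel, reading the
// reordered layout instead.
__global__ void maxpool_2d_planes(const int *planes, int *out, int plane_w,
                                  int out_width, int out_height,
                                  int filter_size, int stride)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < out_height && col < out_width) {
        int max_value = INT_MIN;
        for (int i = 0; i < filter_size; i++) {
            const int *in_row = planes + (stride * row + i) * stride * plane_w;
            for (int j = 0; j < filter_size; j++) {
                // Logical column stride*col + j sits in plane (j % stride)
                // at slot col + j / stride: adjacent threads, adjacent loads.
                const int value =
                    in_row[(j % stride) * plane_w + col + j / stride];
                max_value = max(max_value, value);
            }
        }
        out[row * out_width + col] = max_value;
    }
}

// Illustrative host helper, assuming a square n x n input.
std::vector<int> maxpool_planes_host(const std::vector<int> &in, int n,
                                     int filter_size, int stride)
{
    const int out_w = (n - filter_size) / stride + 1;
    const int plane_w = (n + stride - 1) / stride;  // slots per plane
    std::vector<int> result(out_w * out_w);
    int *d_in, *d_planes, *d_out;
    cudaMalloc(&d_in, in.size() * sizeof(int));
    cudaMalloc(&d_planes, (size_t)n * stride * plane_w * sizeof(int));
    cudaMalloc(&d_out, result.size() * sizeof(int));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid_in((n + 15) / 16, (n + 15) / 16);
    deinterleave_columns<<<grid_in, block>>>(d_in, d_planes, n, n, stride,
                                             plane_w);
    const dim3 grid_out((out_w + 15) / 16, (out_w + 15) / 16);
    maxpool_2d_planes<<<grid_out, block>>>(d_planes, d_out, plane_w, out_w,
                                           out_w, filter_size, stride);

    // The blocking copy also waits for both kernels to finish.
    cudaMemcpy(result.data(), d_out, result.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_planes);
    cudaFree(d_out);
    return result;
}
```

Note that the reordering pass itself writes with a strided pattern, so it only moves the uncoalesced traffic elsewhere. It is likely to help only if the layer that produces the data can write this layout directly, or if the reordered data is reused several times.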
