DRAM Excessive Read Sectors

I am working on a kernel that merges a set of images into one base image by taking the per-pixel max across the set and writing it into the final image. I have managed to get the runtime down to 8 ms, which is great; however, when I profile it, Nsight Compute seems to suggest that a further 77% improvement is possible, attributed to DRAM excessive read sectors.
Here is the kernel code:

__global__ void max_subpixel_coaleseced(unsigned char* baseImage, unsigned char* images, imageInfo* imageInfo) {
    int width = blockDim.x;
    int height = gridDim.x;
    size_t offset = ((blockIdx.x) * width + threadIdx.x) * 4;
    int numChannels = imageInfo->numChannels;
    int imageSize = imageInfo->imageSize;
    if (offset < (width * height)*4) {
        //Reinterpret pointer to grab 4 channels
        unsigned int pixel = *(reinterpret_cast<unsigned int*>(baseImage + offset));
        size_t upperLimit = static_cast<size_t>(imageInfo->numImages) * (imageSize);
        for (size_t vecOffset = offset; vecOffset < upperLimit;  vecOffset+=imageSize){
            //grab 4 channels to compare to
            unsigned int compareto = *(reinterpret_cast<unsigned int*>(images + vecOffset));
            //isolate channels, take the max and add them up
            pixel = (max((pixel & 0xff000000), (compareto & 0xff000000)) |
                max((pixel & 0x00ff0000), (compareto & 0x00ff0000)) |
                max((pixel & 0x0000ff00), (compareto & 0x0000ff00)) |
                max((pixel & 0x000000ff), (compareto & 0x000000ff))
                );
        }
        //store values back into the base image
        *reinterpret_cast<unsigned int*>(baseImage + offset) = pixel;
    }
}
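
The indexing above implies one block per image row and one thread per pixel within a row (so width can be at most 1024), i.e. a launch along roughly these lines (simplified; the struct fields are just the ones the kernel reads, the real definition may differ):

    // Hypothetical host-side setup matching the kernel's indexing
    struct imageInfo { int numChannels; int imageSize; int numImages; };

    void launchMerge(unsigned char* d_baseImage, unsigned char* d_images,
                     imageInfo* d_imageInfo, int width, int height)
    {
        // one block per row, one thread per pixel in the row (requires width <= 1024)
        max_subpixel_coaleseced<<<height, width>>>(d_baseImage, d_images, d_imageInfo);
        cudaDeviceSynchronize();
    }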

And here is the memory chart:

Fundamentally, my question is: are these suggested performance gains really possible? Or is it the case that Nsight Compute just believes that it's a memory-bound task, and that theoretically, if memory could keep up with this kernel at 100% compute, it would be 77% faster?

This might be better asked on the Nsight Compute forum. However, DRAM excessive read sectors would normally indicate uncoalesced reads (inefficient use of the available memory bandwidth). Since your code sounds like it is/would be/should be memory bound, this consideration is important. Based on your description of your algorithm (taking the pixel-wise max across the images), I don't see any obvious need for uncoalesced reads, unless your storage format is very strange.
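
One storage-format detail worth ruling out: the kernel reinterprets unsigned char* addresses as unsigned int*, and those 4-byte loads only stay naturally aligned (which CUDA requires for word accesses, and which keeps the sectors-per-request minimal) if imageSize is a multiple of 4; the base pointers from cudaMalloc are already well aligned. With 4 channels per pixel that is almost certainly already true, but a quick host-side sanity check, just as a sketch using your field names, could be:

    #include <cassert>

    // Sketch: confirm the uint accesses in the kernel remain 4-byte aligned.
    // imageSize/numChannels are assumed to come from your imageInfo struct.
    void checkLayout(int imageSize, int numChannels)
    {
        assert(numChannels == 4);      // kernel packs exactly 4 channels into each uint load
        assert(imageSize % 4 == 0);    // per-image stride must preserve 4-byte alignment
    }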

I think the speedup factor is presuming that:

  1. This is the limiter to performance
  2. If the “speed of light” could be achieved, this would still be the limiter to performance. “speed of light” here means the same amount of requested data, loaded in a 100% fully/perfectly coalesced fashion.

With those 2 presumptions, the estimated speed up should be approximately achievable. Stated another way, the performance would be dictated by reading the necessary data in the most efficient fashion.
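
If you want to sanity-check that estimate yourself, the "speed of light" time for a memory-bound kernel is simply the bytes that must be moved divided by the peak DRAM bandwidth. A back-of-envelope sketch (every number here is a placeholder; substitute your image dimensions, image count, and your GPU's bandwidth spec):

    #include <cstdio>

    // Lower bound for a memory-bound kernel: time >= bytes moved / peak DRAM bandwidth
    int main()
    {
        const size_t width = 4096, height = 4096, channels = 4;  // placeholder image size
        const size_t numImages = 32;                             // placeholder image count
        const double peakBW = 600e9;                             // bytes/s, placeholder GPU spec

        const size_t imageBytes = width * height * channels;
        const size_t bytesMoved = numImages * imageBytes   // read every input image once
                                + imageBytes               // read the base image
                                + imageBytes;              // write the result
        printf("speed-of-light time: %.3f ms\n", 1e3 * bytesMoved / peakBW);
        return 0;
    }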

Side remark:

        pixel = (max((pixel & 0xff000000), (compareto & 0xff000000)) |
                 max((pixel & 0x00ff0000), (compareto & 0x00ff0000)) |
                 max((pixel & 0x0000ff00), (compareto & 0x0000ff00)) |
                 max((pixel & 0x000000ff), (compareto & 0x000000ff))
                );

would seem to be functionally identical to

pixel = __vmaxu4 (pixel, compareto);

The various byte-parallel device function intrinsics in CUDA were added specifically with image processing in mind. Only a few of them are hardware accelerated at this point, but they are all quite efficient. If your code is completely limited by memory bandwidth it is unlikely to matter, but I thought I would point these out just in case.
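
If it is useful, a drop-in version of the whole kernel using the intrinsic would look roughly like this (just a sketch following your original indexing, not compiled or tested):

    __global__ void max_subpixel_vmax(unsigned char* baseImage, unsigned char* images, imageInfo* imageInfo) {
        int width = blockDim.x;
        int height = gridDim.x;
        size_t offset = ((blockIdx.x) * width + threadIdx.x) * 4;
        int imageSize = imageInfo->imageSize;
        if (offset < (width * height) * 4) {
            unsigned int pixel = *(reinterpret_cast<unsigned int*>(baseImage + offset));
            size_t upperLimit = static_cast<size_t>(imageInfo->numImages) * imageSize;
            for (size_t vecOffset = offset; vecOffset < upperLimit; vecOffset += imageSize) {
                unsigned int compareto = *(reinterpret_cast<unsigned int*>(images + vecOffset));
                pixel = __vmaxu4(pixel, compareto);   // per-byte unsigned max, 4 channels at once
            }
            *reinterpret_cast<unsigned int*>(baseImage + offset) = pixel;
        }
    }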