DRAM Excessive Read Sectors

I am working on a kernel that merges a set of images into one base image by taking the per-pixel max across the set and writing it into the final image. I have managed to get the runtime down to 8 ms, which is great; however, when I profile it, Nsight Compute seems to suggest that a further 77% improvement is possible, attributed to DRAM excessive read sectors.
Here is the kernel code:

__global__ void max_subpixel_coaleseced(unsigned char* baseImage, unsigned char* images, imageInfo* imageInfo) {
    int width = blockDim.x;
    int height = gridDim.x;
    size_t offset = ((blockIdx.x) * width + threadIdx.x) * 4;
    int numChannels = imageInfo->numChannels;
    int imageSize = imageInfo->imageSize;
    if (offset < (width * height)*4) {
        //Reinterpret pointer to grab 4 channels
        unsigned int pixel = *(reinterpret_cast<unsigned int*>(baseImage + offset));
        size_t upperLimit = static_cast<size_t>(imageInfo->numImages) * (imageSize);
        for (size_t vecOffset = offset; vecOffset < upperLimit;  vecOffset+=imageSize){
            //grab 4 channels to compare to
            unsigned int compareto = *(reinterpret_cast<unsigned int*>(images + vecOffset));
            //isolate channels, take the max and add them up
            pixel = (max((pixel & 0xff000000), (compareto & 0xff000000)) |
                max((pixel & 0x00ff0000), (compareto & 0x00ff0000)) |
                max((pixel & 0x0000ff00), (compareto & 0x0000ff00)) |
                max((pixel & 0x000000ff), (compareto & 0x000000ff))
                );
        }
        //store values back into the base image
        *reinterpret_cast<unsigned int*>(baseImage + offset) = pixel;
    }
}
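
The indexing above implies one block per image row and one thread per pixel within a row (so width can be at most 1024), i.e. a launch along roughly these lines (simplified; the struct fields are just the ones the kernel reads, the real definition may differ):

    // Hypothetical host-side setup matching the kernel's indexing
    struct imageInfo { int numChannels; int imageSize; int numImages; };

    void launchMerge(unsigned char* d_baseImage, unsigned char* d_images,
                     imageInfo* d_imageInfo, int width, int height)
    {
        // one block per row, one thread per pixel in the row (requires width <= 1024)
        max_subpixel_coaleseced<<<height, width>>>(d_baseImage, d_images, d_imageInfo);
        cudaDeviceSynchronize();
    }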

And here is the memory chart:

Fundamentally, my question is: are these suggested performance gains really possible? Or is it the case that Nsight Compute just believes that it's a memory-bound task, and that theoretically, if memory could keep up with this kernel at 100% compute, it would be 77% faster?

This might be better asked on the Nsight Compute forum. However, DRAM excessive read sectors would normally indicate uncoalesced reads (inefficient use of the available memory bandwidth). Since your code sounds like it is/would be/should be memory bound, this consideration is important. Based on your description of your algorithm (taking the pixel-wise max across the images), I don't see any obvious need for uncoalesced reads, unless your storage format is very strange.
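
One storage-format detail worth ruling out: the kernel reinterprets unsigned char* addresses as unsigned int*, and those 4-byte loads only stay naturally aligned (which CUDA requires for word accesses, and which keeps the sectors-per-request minimal) if imageSize is a multiple of 4; the base pointers from cudaMalloc are already well aligned. With 4 channels per pixel that is almost certainly already true, but a quick host-side sanity check, just as a sketch using your field names, could be:

    #include <cassert>

    // Sketch: confirm the uint accesses in the kernel remain 4-byte aligned.
    // imageSize/numChannels are assumed to come from your imageInfo struct.
    void checkLayout(int imageSize, int numChannels)
    {
        assert(numChannels == 4);      // kernel packs exactly 4 channels into each uint load
        assert(imageSize % 4 == 0);    // per-image stride must preserve 4-byte alignment
    }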

I think the speedup factor is presuming that:

  1. This is the limiter to performance
  2. If the “speed of light” could be achieved, this would still be the limiter to performance. “speed of light” here means the same amount of requested data, loaded in a 100% fully/perfectly coalesced fashion.

With those 2 presumptions, the estimated speed up should be approximately achievable. Stated another way, the performance would be dictated by reading the necessary data in the most efficient fashion.
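
If you want to sanity-check that estimate yourself, the "speed of light" time for a memory-bound kernel is simply the bytes that must be moved divided by the peak DRAM bandwidth. A back-of-envelope sketch (every number here is a placeholder; substitute your image dimensions, image count, and your GPU's bandwidth spec):

    #include <cstdio>

    // Lower bound for a memory-bound kernel: time >= bytes moved / peak DRAM bandwidth
    int main()
    {
        const size_t width = 4096, height = 4096, channels = 4;  // placeholder image size
        const size_t numImages = 32;                             // placeholder image count
        const double peakBW = 600e9;                             // bytes/s, placeholder GPU spec

        const size_t imageBytes = width * height * channels;
        const size_t bytesMoved = numImages * imageBytes   // read every input image once
                                + imageBytes               // read the base image
                                + imageBytes;              // write the result
        printf("speed-of-light time: %.3f ms\n", 1e3 * bytesMoved / peakBW);
        return 0;
    }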

Side remark:

        pixel = (max((pixel & 0xff000000), (compareto & 0xff000000)) |
                 max((pixel & 0x00ff0000), (compareto & 0x00ff0000)) |
                 max((pixel & 0x0000ff00), (compareto & 0x0000ff00)) |
                 max((pixel & 0x000000ff), (compareto & 0x000000ff))
                );

would seem to be functionally identical to

pixel = __vmaxu4 (pixel, compareto);

The various byte-parallel device function intrinsics in CUDA were added specifically with image processing in mind. Only a few of them are hardware accelerated at this point, but they are all quite efficient. If your code is completely limited by memory bandwidth it is unlikely to matter, but I thought I would point these out just in case.
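
If it is useful, a drop-in version of the whole kernel using the intrinsic would look roughly like this (just a sketch following your original indexing, not compiled or tested):

    __global__ void max_subpixel_vmax(unsigned char* baseImage, unsigned char* images, imageInfo* imageInfo) {
        int width = blockDim.x;
        int height = gridDim.x;
        size_t offset = ((blockIdx.x) * width + threadIdx.x) * 4;
        int imageSize = imageInfo->imageSize;
        if (offset < (width * height) * 4) {
            unsigned int pixel = *(reinterpret_cast<unsigned int*>(baseImage + offset));
            size_t upperLimit = static_cast<size_t>(imageInfo->numImages) * imageSize;
            for (size_t vecOffset = offset; vecOffset < upperLimit; vecOffset += imageSize) {
                unsigned int compareto = *(reinterpret_cast<unsigned int*>(images + vecOffset));
                pixel = __vmaxu4(pixel, compareto);   // per-byte unsigned max, 4 channels at once
            }
            *reinterpret_cast<unsigned int*>(baseImage + offset) = pixel;
        }
    }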