NCU Error for Vectorized Memory Access

I’m writing a kernel to convert an image to grayscale. In my main function, I load a JPG image, reorganize its contents so that instead of interleaved rgbrgbrgb...rgb it is planar rrr...rggg...gbbb...b, copy it to GPU memory, run the kernel, copy the result back to CPU memory, restore the original layout, and write it out. I am trying to use vectorized memory access in my kernel, but it is not working.

When I profile this kernel inside the NCU (Nsight Compute) app, I get the following error:

Code inside main function

// GPU SoA Improved //
    int width, height, channels;
    unsigned char *image = stbi_load("image.jpg", &width, &height, &channels, 0);
    printf("width: %d, height: %d, channels: %d\n", width, height, channels);

    unsigned char *reorganizedImage = (unsigned char *)malloc(width * height * channels * sizeof(unsigned char));
    for (int counter = 0; counter < width * height; counter++) {
        reorganizedImage[0 * width * height + counter] = image[counter * channels];
        reorganizedImage[1 * width * height + counter] = image[counter * channels + 1];
        reorganizedImage[2 * width * height + counter] = image[counter * channels + 2];
    }

    unsigned char *h_image_out;
    unsigned char *d_image_out;

    cudaHostAlloc((void **)&h_image_out, width * height * channels, cudaHostAllocDefault);
    memcpy(h_image_out, reorganizedImage, width * height * channels);
    cudaMalloc((void **)&d_image_out, width * height * channels);
    cudaMemcpy(d_image_out, h_image_out, width * height * channels, cudaMemcpyHostToDevice);

    double start = tick();
    dim3 blockDimensions = {512};
    dim3 gridDimensions = {(width * height + blockDimensions.x - 1) / (blockDimensions.x * 4)};
    applyGrayscaleFilterSoAImproved<<<gridDimensions, blockDimensions>>>((uchar4 *)d_image_out, width, height);
    double end = tick();
    printf("Time taken by GPU SoA Improved: %f\n", end - start);

    cudaMemcpy(h_image_out, d_image_out, width * height * channels, cudaMemcpyDeviceToHost);
    memcpy(reorganizedImage, h_image_out, width * height * channels);

    for (int counter = 0; counter < width * height; counter++) {
        image[counter * channels] = reorganizedImage[0 * width * height + counter];
        image[counter * channels + 1] = reorganizedImage[1 * width * height + counter];
        image[counter * channels + 2] = reorganizedImage[2 * width * height + counter];
    }


    stbi_write_jpg("image_out_SoAImproved.jpg", width, height, channels, image, 100);

Kernel code

__global__ void applyGrayscaleFilterSoAImproved(uchar4 *image, int width, int height) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index < width * height) {
        uchar4 red = image[0 * width * height + index];
        uchar4 green = image[1 * width * height + index];
        uchar4 blue = image[2 * width * height + index];

        uchar4 gray = make_uchar4(
            (red.x + green.x + blue.x) / 3,
            (red.y + green.y + blue.y) / 3,
            (red.z + green.z + blue.z) / 3,
            (red.w + green.w + blue.w) / 3
        );

        image[0 * width * height + index] = gray;
        image[1 * width * height + index] = gray;
        image[2 * width * height + index] = gray;
    }
}


Your kernel launch appears to be failing. Before attempting to use the profiler, make sure your application runs correctly under compute-sanitizer. Do not attempt to use a profiler until compute-sanitizer indicates no problems with your code.

Thank you for your suggestion, I ran compute-sanitizer and this is what I got:
log.txt.gz (31.2 KB)

I’m not sure why I’m getting so many invalid memory reads. I wonder if recasting d_image_out, the device buffer holding the image data, as uchar4 * has something to do with it. My goal is to fetch pixel data for four pixels in a single memory access operation; when all 32 threads in a warp do that, they fully utilize the 128 bytes of a memory transaction.

I’m not sure why I’m getting so many invalid memory reads.

Sounds like you have some debugging work to do. At any rate, at the risk of being somewhat repetitious, my suggestion would be to not bother with any of the profilers until the application produces the correct answer and also runs clean under compute-sanitizer.

Do as you wish, of course, but in case you are looking for additional suggestions from me:

  1. To debug kernel execution errors reported by compute-sanitizer, I would start with a process like this to localize the error. (It works similarly with cuda-memcheck or compute-sanitizer).

  2. If you want help on a public forum, you may get better responses by providing a short, complete application, rather than bits and pieces.

  3. I always recommend posting files as inline text rather than using the forum file attachment process. Text files, such as error logs or code, are better presented that way.

Again, just my opinions/suggestions, do as you wish.

Thank you so much for your advice and suggestions. I figured out what was wrong. You can see in the kernel code in my original post that I had not adjusted the bounds check to account for the fact that each thread now processes four times as many pixels, so only a quarter as many iterations are needed. Dividing width * height by four took care of all the invalid reads.

I apologize for uploading the log file as an attachment. It was about 2.5 MB uncompressed, and I thought it would be awkward to inline that much text.

In the case of compute-sanitizer or cuda-memcheck output, usually just the first 20 lines or so are sufficient to demonstrate whatever the tool is communicating. An individual fault report from these tools typically occupies a dozen or so lines, and in large output logs you will usually find that variations on these are repeated over and over. It's usually not necessary to have all the variations in view to map out a course of action.