Error when profiling a kernel in Nsight Compute

I am trying the global atomics vs. shared atomics histogram code from the NVIDIA blog post https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

But when I profile with the Nsight Compute CLI, it reports an error for the shared atomics kernel.
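
For context, I'm collecting a single metric, with a command along these lines (the exact flags may have differed, but the metric matches the output below):

ncu --metrics dram__bytes.sum.per_second ./histogram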

==PROF== Connected to process 16078
==PROF== Profiling "histogram_gmem_atomics" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "histogram_smem_atomics" - 1: 0%....50%....100% - 1 pass

==ERROR== LaunchFailed

==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[16078] histogram@127.0.0.1
  histogram_gmem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__bytes.sum.per_second                                                Gbyte/second                          13,98
    ---------------------------------------------------------------------- --------------- ------------------------------

  histogram_smem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__bytes.sum.per_second                                                 byte/second                        (!) nan
    ---------------------------------------------------------------------- --------------- ------------------------------

Why is ncu showing an error here? For reference, my main function looks like this:

#include <cstdlib>
#include <cuda_runtime.h>

// histogram_gmem_atomics and histogram_smem_atomics are the kernels from the blog post above

#define NUM_BINS 480
#define NUM_PARTS 48

struct IN_TYPE
{
    int x;
    int y;
    int z;
};

int main(){
    int height = 480;
    int width = height;

    auto nThread = 16;
    auto nBlock = (height) / nThread;

    IN_TYPE* h_in_image, *d_in_image;
    unsigned int* d_out_image;
    h_in_image = (IN_TYPE *)malloc(height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_in_image, height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_out_image, height*width * sizeof(unsigned int));

    for (int n = 0; n < (height*width); n++)
    {
        h_in_image[n].x = rand()%10;
        h_in_image[n].y = rand()%10;
        h_in_image[n].z = rand()%10;
    }
    cudaMemcpy(d_in_image, h_in_image, height*width * sizeof(IN_TYPE), cudaMemcpyHostToDevice);

    histogram_gmem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();

    // not copying the results back for now

    histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();

}

It’s hard to tell what is causing the issue; it could be a bug in the code itself. If possible, can you confirm that the application executes without error when run outside the profiler, specifically by checking the return values of the cudaDeviceSynchronize() calls after the kernels? You can do this with error handling, or by stepping through with a debugger that will catch the failures. If an error occurs on the device, you may not see it without such checks. Also, what GPU and Nsight Compute version do you have?
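
For example, a minimal error-checking sketch along these lines (the CUDA_CHECK macro is a common pattern, not something from the blog post) would surface a kernel failure right at the offending launch:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Applied to the launches in your main():
histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs

Alternatively, since the profiler output suggests the binary is named histogram, cuda-gdb can stop at the failing API call:

cuda-gdb ./histogram
(cuda-gdb) set cuda api_failures stop
(cuda-gdb) run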