I am trying out the global-atomics vs. shared-atomics histogram code from the NVIDIA blog post https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
When I profile it with the Nsight Compute CLI (ncu), it reports an error for the shared-atomics kernel:
==PROF== Connected to process 16078
==PROF== Profiling "histogram_gmem_atomics" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "histogram_smem_atomics" - 1: 0%....50%....100% - 1 pass
==ERROR== LaunchFailed
==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[16078] histogram@127.0.0.1
histogram_gmem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__bytes.sum.per_second Gbyte/second 13,98
---------------------------------------------------------------------- --------------- ------------------------------
histogram_smem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__bytes.sum.per_second byte/second (!) nan
---------------------------------------------------------------------- --------------- ------------------------------
Why does ncu report an error for this kernel? For reference, my main function looks like this:
    #define NUM_BINS 480
    #define NUM_PARTS 48

    struct IN_TYPE
    {
        int x;
        int y;
        int z;
    };

    int main() {
        int height = 480;
        int width = height;
        auto nThread = 16;
        auto nBlock = (height) / nThread;

        IN_TYPE *h_in_image, *d_in_image;
        unsigned int *d_out_image;

        h_in_image = (IN_TYPE *)malloc(height * width * sizeof(IN_TYPE));
        cudaMalloc(&d_in_image, height * width * sizeof(IN_TYPE));
        cudaMalloc(&d_out_image, height * width * sizeof(unsigned int));

        for (int n = 0; n < (height * width); n++)
        {
            h_in_image[n].x = rand() % 10;
            h_in_image[n].y = rand() % 10;
            h_in_image[n].z = rand() % 10;
        }

        cudaMemcpy(d_in_image, h_in_image, height * width * sizeof(IN_TYPE), cudaMemcpyHostToDevice);

        histogram_gmem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
        cudaDeviceSynchronize();

        // not copying the results back as of now

        histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
        cudaDeviceSynchronize();
    }
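
The main function above does not check any CUDA return codes, so a kernel failure outside the profiler would go unnoticed. In case it helps narrow this down, here is a minimal sketch of the error checking that could be added after each launch. `CUDA_CHECK` and `placeholder_kernel` are hypothetical names used only for illustration and are not part of the blog code; the placeholder kernel stands in for `histogram_smem_atomics`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the blog post): print the failing call
// with file and line, then exit, whenever a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,      \
                         __LINE__, #call, cudaGetErrorString(err_));      \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel (assumption): stands in for histogram_smem_atomics.
__global__ void placeholder_kernel(unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0u;
}

int main()
{
    const int n = 480;               // same count as height in the question
    unsigned int *d_out = nullptr;

    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(unsigned int)));

    placeholder_kernel<<<30, 16>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // reports invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

With the same two checks placed after each of the real launches, a failure in `histogram_smem_atomics` should show up as a CUDA error string from the plain executable as well, which would indicate whether the problem is in the kernel rather than in ncu.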