Shared Memory Debug Errors in Nsight

System Configuration:
Windows 7
CUDA 8.0.61
Driver 376.51

I’ve recently run into an unusual problem debugging simple GPU code in Nsight. A few months ago I wrote a GPU algorithm that ran perfectly well on my 980 Ti. My co-worker recently built the algorithm on his machine with a 1060, but got bounds check errors in CUDA Debug mode. I went back and ran it again on my machine under the same conditions with the 980 Ti, and everything was fine. I then tried a 960, a 1060, and a 1080 in my machine, and ALL of them showed the same bounds check errors. cuda-memcheck reported NO errors on any of the cards.

I’ve created a very simple function that reproduces the behavior. All the code does is declare a block of shared memory, then iterate through it, first reading and then writing each byte. All of the cards I’ve tried list 48k as their maximum shared memory size, so 15360 bytes shouldn’t be a problem. Thanks

An example error that gets reported is:

Summary of access violations:
c:\programdata\nvidia corporation\cuda samples\v8.0\0_simple\cppintegration\ error MemoryChecker: #misaligned=1 #invalidAddress=0

Memory Checker detected 1 access violations.
error = misaligned store (global memory)
gridid = 5
blockIdx = {17,6,0}
threadIdx = {0,0,0}
address = 0x10001000523
accessSize = 1
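
Two details of this report look inconsistent with the kernel below: the store is reported against global memory although the kernel only writes to a __shared__ array, and a 1-byte access (accessSize = 1) is flagged as misaligned. Assuming "misaligned" means the address is not a multiple of the access size, a 1-byte access can never qualify. A minimal host-side sketch of that rule follows; the helper name isNaturallyAligned and the constant names are made up for illustration, and this is not how Nsight itself performs the check.

```cpp
#include <cstdint>

// Hypothetical helper: true when `address` is a multiple of `accessSize`.
// An access size of 0 is rejected as meaningless.
constexpr bool isNaturallyAligned(std::uint64_t address, std::uint64_t accessSize)
{
    return accessSize != 0 && (address % accessSize) == 0;
}

// Values copied from the Memory Checker report above.
constexpr std::uint64_t kReportedAddress    = 0x10001000523ULL;
constexpr std::uint64_t kReportedAccessSize = 1;

// Every address is a multiple of 1, so the reported store is aligned.
static_assert(isNaturallyAligned(kReportedAddress, kReportedAccessSize),
              "a 1-byte access is aligned at any address");

// The same (odd) address would only be misaligned for a wider access.
static_assert(!isNaturallyAligned(kReportedAddress, 4),
              "an odd address is misaligned for a 4-byte access");
```

Both static_asserts pass at compile time, which fits with cuda-memcheck reporting nothing for the same kernel.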

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// helper functions and utilities to work with CUDA
#include <helper_cuda.h>
#include <helper_functions.h>

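__global__ void kernel_big()
{
	const int MEMSIZE = 15360;
	__shared__ unsigned char data[MEMSIZE];

	// Only one thread per block touches the shared array.
	if (threadIdx.x == 0 && threadIdx.y == 0)
	{
		// Read every byte of the shared array.
		for (int i = 0; i < MEMSIZE; i++)
		{
			unsigned char tmp = data[i];
		}

		// Write every byte of the shared array.
		for (int i = 0; i < MEMSIZE; i++)
		{
			data[i] = 42;
		}
	}
}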

//! Entry point for CUDA functionality on host side
extern "C" void
runTest(const int argc, const char **argv)
{
    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    findCudaDevice(argc, (const char **)argv);

    // 24x24 blocks of 16x16 threads
    kernel_big<<<dim3(24, 24, 1), dim3(16, 16, 1)>>>();

    // check if kernel execution generated an error
    getLastCudaError("Kernel execution failed");
}
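
The 48k argument from the post can be checked at compile time. This is a small host-side sketch: kAssumedLimit is the 48 KiB figure quoted in the post, not a value queried from the device (the runtime reports the actual limit in cudaDeviceProp::sharedMemPerBlock), and the helper name fitsInSharedMemory is made up for illustration.

```cpp
#include <cstddef>

// MEMSIZE from kernel_big: one byte per element.
constexpr std::size_t kMemSize = 15360;

// Assumption: 48 KiB of shared memory per block, as quoted in the post.
constexpr std::size_t kAssumedLimit = 48 * 1024;

// Hypothetical helper: true when a static shared-memory request fits the limit.
constexpr bool fitsInSharedMemory(std::size_t requestedBytes, std::size_t limitBytes)
{
    return requestedBytes <= limitBytes;
}

// 15360 bytes is exactly 15 KiB, well under the assumed 48 KiB limit.
static_assert(kMemSize == 15 * 1024, "MEMSIZE is 15 KiB");
static_assert(fitsInSharedMemory(kMemSize, kAssumedLimit),
              "the request fits in the assumed per-block limit");
```

Under that assumption the array uses less than a third of the per-block budget, so the size alone should not trigger a bounds error.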


Confirmed: the error shows up on my 1070 but not on my 980 Ti, and cuda-memcheck doesn’t report it either. For now I’d suggest using cuda-memcheck to check for memory errors. Sorry for the inconvenience.

Thanks for confirming the problem, Harry. Hopefully this will get on the list of bugs to fix, because it makes debugging CUDA code more difficult.

Hi, I don’t see this issue on Nsight 5.3. Could you please check with the latest version?

I’ll give it a try and get back to you. Thanks for testing this out.