Memory Checker detected 12 access violations. error = access violation on load (global memory)

When I use the CUDA debugger to debug my code, there is always an error like this:

Memory Checker detected 12 access violations.
error = access violation on load (global memory)
gridid = 1
blockIdx = {2,0,0}
threadIdx = {3,25,0}
address = 0x03f100d8
accessSize = 4

Sometimes it shows “error = access violation on load (shared memory)”, but I don’t use any shared memory.

However, when I decrease the input data size, no error is reported. Why? Is it possible that too large a data size causes this kind of error?

Here is my kernel function:

__global__ void DefferentialSysMatrixKernel(int* k, int** address, float** length, int* kDiffer, int** addressDiffer, float** lengthDiffer)
{
	int x = threadIdx.x + blockDim.x*blockIdx.x;
	int y = threadIdx.y + blockDim.y*blockIdx.y;

	if ((x < detecterN - 1) && (y < frameN))
	{
		int rayIndex = y*detecterN + x;
		int index = y*(detecterN - 1) + x;

		// Sorted index lists and value lists of two adjacent rays,
		// plus the output lists for their difference
		int *addr1 = address[rayIndex];
		int *addr2 = address[rayIndex + 1];
		int *addr3 = addressDiffer[index];
		float *leng1 = length[rayIndex];
		float *leng2 = length[rayIndex + 1];
		float *leng3 = lengthDiffer[index];
		int num1 = k[rayIndex];       // nonzero count of the first ray
		int num2 = k[rayIndex + 1];   // nonzero count of the second ray

		// Merge the two sorted sparse vectors, emitting leng2 - leng1
		int t1 = 0;
		int t2 = 0;
		int t3 = 0;
		while ((t1 < num1) && (t2 < num2))
		{
			if (addr1[t1] == addr2[t2])
			{
				// Index present in both vectors: store the difference
				addr3[t3] = addr1[t1];
				leng3[t3] = leng2[t2] - leng1[t1];
				t1++;
				t2++;
				t3++;
			}
			else if (addr1[t1] < addr2[t2])
			{
				// Index only in the first vector
				addr3[t3] = addr1[t1];
				leng3[t3] = -leng1[t1];
				t1++;
				t3++;
			}
			else
			{
				// Index only in the second vector
				addr3[t3] = addr2[t2];
				leng3[t3] = leng2[t2];
				t2++;
				t3++;
			}
		}
		// Drain whichever input still has entries left
		while (t1 < num1)
		{
			addr3[t3] = addr1[t1];
			leng3[t3] = -leng1[t1];
			t1++;
			t3++;
		}
		while (t2 < num2)
		{
			addr3[t3] = addr2[t2];
			leng3[t3] = leng2[t2];
			t2++;
			t3++;
		}
		kDiffer[index] = t3;   // nonzero count of the difference vector
	}
}

It is a little complicated. I use it to calculate the differences between a set of sparse vectors. detecterN and frameN are macro definitions set according to the input data size.
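For example (a made-up two-ray case, just to show the layout the kernel assumes: each ray i has k[i] nonzeros, address[i] holds its sorted nonzero indices, and length[i] holds the matching values):

// Hypothetical two-ray example of the input layout.
// Ray 0: nonzeros at indices {1, 4} with values {0.5f, 2.0f}
// Ray 1: nonzeros at indices {1, 7} with values {1.5f, 3.0f}
int k[2] = { 2, 2 };
int addr0[] = { 1, 4 };    float leng0[] = { 0.5f, 2.0f };
int addr1[] = { 1, 7 };    float leng1[] = { 1.5f, 3.0f };
int* address[2] = { addr0, addr1 };
float* length[2] = { leng0, leng1 };
// The merge produces the difference (ray 1 minus ray 0):
// indices {1, 4, 7}, values {1.0f, -2.0f, 3.0f}, kDiffer = 3.
// Note each output row needs room for up to k[i] + k[i+1] entries.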

I have checked it many times to make sure the array subscripts are not out of bounds.

Can anyone tell me why?

It could mean you’re dereferencing uninitialized pointers in the unified memory space that (by chance) point to shared memory.

I wouldn’t be surprised if you’re not handling the double pointers correctly:

DefferentialSysMatrixKernel(int* k, int** address, float** length,
                                       ^^               ^^

because it’s not trivial to get that right. But it’s impossible to say without seeing the host code.
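For reference, the usual pattern for building a device array of device pointers looks something like this (a minimal sketch with made-up sizes and no error checking, not necessarily what your code does):

#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
	const int n = 4;        // number of rows (made-up size)
	const int rowLen = 16;  // entries per row (made-up size)

	// One flat device allocation holding all rows back to back
	int* d_data = NULL;
	cudaMalloc((void**)&d_data, n * rowLen * sizeof(int));

	// Build the table of device addresses on the host...
	int** h_rows = (int**)malloc(n * sizeof(int*));
	for (int i = 0; i < n; i++)
		h_rows[i] = d_data + i * rowLen;  // device pointers, computed on the host

	// ...then copy the table itself to the device
	int** d_rows = NULL;
	cudaMalloc((void**)&d_rows, n * sizeof(int*));
	cudaMemcpy(d_rows, h_rows, n * sizeof(int*), cudaMemcpyHostToDevice);
	free(h_rows);

	// d_rows can now be passed to a kernel as an int** parameter
	return 0;
}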

With a little bit of effort you can have the cuda-memcheck tool narrow the error down to a specific line of kernel source code:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218
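Roughly, what that answer recommends (my own sketch; the CUDA_CHECK name here is made up) is to build with -lineinfo so cuda-memcheck can attribute errors to kernel source lines, and to check the status of every API call and kernel launch:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA API error.
#define CUDA_CHECK(call) \
	do { \
		cudaError_t err_ = (call); \
		if (err_ != cudaSuccess) { \
			fprintf(stderr, "CUDA error %s at %s:%d\n", \
				cudaGetErrorString(err_), __FILE__, __LINE__); \
			exit(1); \
		} \
	} while (0)

// After a kernel launch, check both the launch and the execution:
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // execution-time errors
//
// Build with line info so cuda-memcheck reports source lines:
//   nvcc -lineinfo -o PCMAP PCMAP.cu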

Thanks for your advice. I tried cuda-memcheck, but it succeeded and reported no errors. Here is the output:

E:\Learning\Bei Hang\lab\statistical iteration reconstruction\code\PCMAP_cuda_modify_2\x64\Debug>cuda-memcheck PCMAP.exe
========= CUDA-MEMCHECK
Read file successfully
System Matrix Loaded successfully!
Weight Loaded successfully!
cpu load time: 0.843s
total iteration time: 117.613min

These output lines are printed by my own code. So does that mean there is no error?

However, I still get the error message when I use CUDA debugging in VS2015. Why?

Also, you can see it takes almost 2 hours under cuda-memcheck, but in VS, when I run the code normally (not in CUDA debug), it finishes in just twenty seconds. What causes this?

Also, I have been wondering whether my double pointers are assigned correctly.

Actually, my input double pointers are assigned like this:

cudaMalloc((void**)&d_k, rayN * sizeof(int));
cudaMalloc((void**)&d_addr, numsum * sizeof(int));
cudaMalloc((void**)&d_leng, numsum * sizeof(float));

cudaMemcpy(d_k, k, rayN * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_addr, address, numsum * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_leng, length, numsum * sizeof(float), cudaMemcpyHostToDevice);

// Host-side table of device pointers into the flat arrays
int** addrLocate = (int**)malloc(rayN * sizeof(int*));
float** lengLocate = (float**)malloc(rayN * sizeof(float*));

t = 0;
for (i = 0; i < rayN; i++)
{
	addrLocate[i] = d_addr + t;   // device address where ray i's indices start
	lengLocate[i] = d_leng + t;   // device address where ray i's values start
	t = t + k[i];                 // advance by ray i's nonzero count
}

int** d_addrLocate = NULL;
float** d_lengLocate = NULL;

// Copy the pointer tables themselves to the device
cudaMalloc((void**)&d_addrLocate, rayN * sizeof(int*));
cudaMalloc((void**)&d_lengLocate, rayN * sizeof(float*));

cudaMemcpy(d_addrLocate, addrLocate, rayN * sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(d_lengLocate, lengLocate, rayN * sizeof(float*), cudaMemcpyHostToDevice);

free(addrLocate);
free(lengLocate);

Then d_addrLocate and d_lengLocate are passed as the kernel parameters int** address and float** length.
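By the way, one quick sanity check on this construction (just a sketch using the same variable names) would be to verify after the loop that the offsets consume exactly numsum entries:

// After the offset loop, t must equal numsum, i.e. the sum of all k[i]
// must match the size of the flat d_addr/d_leng allocations; otherwise
// some addrLocate[i]/lengLocate[i] point past the end. (Needs <assert.h>.)
assert(t == numsum);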

I suspect that when the data size (that is, numsum) is too big, the memory that d_addr and d_leng point to is not contiguous, which might cause the double pointers to point to the wrong places.

However, I don’t get any error message when I use cuda-memcheck. Do you know why?

Thank you a lot anyway :)

If cuda-memcheck is working properly, at the end of your program printout it will print a line something like this:

========= ERROR SUMMARY: 0 errors

I don’t see that in your printout. Either you cut it off when you were posting, or something is wrong.

Yes, cuda-memcheck makes your code run much slower. This is documented in the cuda-memcheck manual.

Ah, yes! I added system("pause") at the end of my code, so the program was actually paused there.
Thank you!