an illegal memory access was encountered

I have search this question but I feel no useful information for me. The following is the code where it got error information:

checkCudaErrors(cudaMemcpy(dev_X, X->data, m*n * sizeof(float), cudaMemcpyHostToDevice));
		start2 = clock();
		dim3 Grid(40, m);
		dim3 Block(1024, 1);
		order2_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));
		cudaDeviceSynchronize();
		stop2 = clock();
		float perTime = stop2 - start2;
		totalTime = totalTime + perTime;

		dim3 sumGrid(40, M1);
		dim3 sumBlock(1024, 1);
		int sharedSize = sumBlock.x * sizeof(float);
		sumReduction_kernel << <sumGrid, sumBlock, sharedSize, 0 >> > (dev_Xmean, dev_Xsub, M1, n);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));  //here got an error (an illegal memory access was encountered)
		
		sub1_kernel << <sumGrid, sumBlock >> > (dev_XFinal, dev_Xsub, dev_Xmean, M1, n);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));  //here got an error (an illegal memory access was encountered)

A very very strange place is that when I use a smaller matrix (120×32400) to do a test, there is no error occured and it can output a right result. However, when the size of matrix is 224×40000, then there will be an error and the all following kernel will wrong. Finally, if I change the order2_kernel to a similar kernel, everything will be OK!

The following is the code of order2_kernel:

__global__ void order2_kernel(float *p, float *res, int m, int n, int k)
{
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int back_index = 2*k * n + tid;
	int ford_index = tid;
	int midl_index = k * n + tid;
#pragma unroll
	for (int i = 0; i < (m - 2 * k); i++)
	{
		res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];
		if (res[i * n + tid] < 0) {
			res[i * n + tid] = fabs(res[i * n + tid]);
		}
		back_index += n;
		ford_index += n;
		midl_index += n;
	}
}

In the initialization settings, the m is the rows of every input matrix, and I set 1024 threads in a single block.
So I want to know what is the problem in my code, it’s so weird.

Any suggestion is appreciate,
many many thanks~

I am so sorry to post this twice,but I did cannot understand why this program will collapsed when the data is bigger.
What I have changed is just the launch parameter with the change of data, I also try two verify the data after the order2_kernel with the following code. But it collapsed when cudaMemcpy executed with cudaErrorIllegalAddress. (However, there was no error with the other data). So, hope you can give me some suggestion about this

checkCudaErrors(cudaMemcpy(dev_X, X->data, m*n * sizeof(float), cudaMemcpyHostToDevice));
		start2 = clock();
		dim3 Grid(40, m);
		dim3 Block(1024, 1);
		//bandSub_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		order2_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));
		cudaDeviceSynchronize();
		stop2 = clock();
		float perTime = stop2 - start2;
		totalTime = totalTime + perTime;
		float *h_Xsub = (float*)malloc(sizeof(float)*M1*n);
		checkCudaErrors(cudaMemcpy(h_Xsub, dev_Xsub, M1*n*sizeof(float), cudaMemcpyDeviceToHost));// here collapsed with a big matrix, but no error with a small
		writeFile(h_Xsub, m, n, "D:\\Eg\\error test\\h_Xsub.mat", "Xsub");
1 Like

If you want to debug the illegal address error, a method such as described here:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

is one possibillity

I have not try with cuda-memcheck but enabled CUDA memory checker in Nsight, I got the following error in output:

Summary of access violations:
d:\eg\rx_stream_test\rx_gpu_acc\rx_gpu\rx_gpu.cuh(117): error MemoryChecker: #misaligned=0 #invalidAddress=32

Memory Checker detected 32 access violations.
error = access violation on load (global memory)
gridid = 95
blockIdx = {39,0,0}
threadIdx = {384,0,0}
address = 0x50602e500
accessSize = 4

The line 117 is the following line in order2_kernel:

<b>res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];</b>

I have search some question about this access violations, maybe its similar to the following

https://devtalk.nvidia.com/default/topic/980336/cuda-programming-and-performance/memory-errors-when-writing-to-local-variable-in-kernel/1

But it still unsolved, it’s so weird why it will be wrong when a bigger matrix. I will try to test the single order2_kernel function.

A question is that uf there will be a violation when operate different rows’ elements in parallel?

Any more suggestions is appreciated…

The following is the test program:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <mat.h>
#include "device_launch_parameters.h"

void writeFile(float* D, int rows, int cols, const char *path, const char *var)
{
	MATFile *pw = matOpen(path, "w");
	if (pw == NULL)
	{
		printf("File open failed");
	}
	mxArray *br = mxCreateDoubleMatrix(rows, cols, mxREAL);
	for (int i = 0; i < rows; i++)
	{
		for (int j = 0; j < cols; j++)
		{
			mxGetPr(br)[j * rows + i] = (float)D[i * cols + j];
		}
	}
	int status = matPutVariable(pw, var, br);
	if (status != 0)
	{
		printf("%s\n", "save result failed");
	}
	mxDestroyArray(br);
	matClose(pw);
}

__global__ void order2_kernel(float *p, float *res, int m, int n, int k)
{
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int back_index = 2 * k * n + tid;
	int ford_index = tid;
	int midl_index = k * n + tid;
#pragma unroll
	for (int i = 0; i < (m - 2 * k); i++)
	{
		res[i * n + tid] = p[back_index] + p[ford_index] - p[midl_index];  
		if (res[i * n + tid] < 0) {
			res[i * n + tid] = fabs(res[i * n + tid]);
		}
		back_index += n;
		ford_index += n;
		midl_index += n;
	}
}



void initA(float *p, int m, int n)
{
	for (int i = 0; i < m; i++)
	{
		for (int j = 0; j < n; j++)
		{
			p[i * n + j] = i * n + j;
		}
	}
}

int main()
{
	int m = 169;
	int n = 4096;
	int k = 1;
	size_t size = sizeof(float) * m * n;
	float *data = (float*)malloc(size);
	initA(data, m, n);
	int M = m - 2*k;
	float *dev_data;
	checkCudaErrors(cudaMalloc((void**)&dev_data, size));
	checkCudaErrors(cudaMemcpy(dev_data, data, size, cudaMemcpyHostToDevice));
	float *dev_result;
	checkCudaErrors(cudaMalloc((void**)&dev_result, sizeof(float) * M * n));
	dim3 grid(4, m);
	dim3 block(1024, 1);
	order2_kernel << <grid, block >> > (dev_data, dev_result, m, n, k);
	printf("%s\n", cudaGetErrorString(cudaGetLastError()));
	float *result = (float*)malloc(sizeof(float)*M * n);
	checkCudaErrors(cudaMemcpy(result, dev_result, sizeof(float) * M * n, cudaMemcpyDeviceToHost));
	writeFile(result, M, n, "D:\\Eg\\error test\\test.mat", "T");

	cudaFree(dev_data);
	cudaFree(dev_result);
	free(data);
	free(result);

	getchar();
	return 0;

}

I have test the order2_kernel with above code, there always an error the same as we have mentioned with cuda debugging in nsight. But in most cases, it can run to get a right result without cuda debugging but the number of columns is 40000. So this is why?

Now we may know the root of this problem maybe in the kernel computation? I just want to compute a value, which relate to three different rows. (The front row add the back row and then subtract the middle row, with the corresponding element). So where is the wrong place and any other way?

you’ve only completed the first step of the debug process. Now that you know a fault is occurring on line 117, insert additional code to test each computed index against the relevant limit, to see which index is out-of-bounds. Then work backwards from there, in a similar fashion, to find out why the index is out of bounds, and you will locate the bug in your code. i.e. replace line 117 with something like:

if (i * n + tid > res_max_index) printf("fault1: ....", ...);
else if (back_index > p_max_index) printf("fault2: ...", ...);
else if (ford_index > p_max_index) printf("fault3: ...", ...);
else if (midl_index > p_max_index) printf("fault4: ...", ...);
else
  res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];

Once you see which sort of fault you get, work backwards. Find the code that computes the relevant index, and see (perhaps using in-kernel printf, like above) why the calculation is out-of-range. Yes, it is tedious and requires effort on your part. Welcome to debugging. Or you can use the debug capability built into nsight VSE. Which may also be tedious and requires effort on your part.

1 Like

Anyway, I have found the the bug in my code. I have done many duplication computation with too much launched blocks and then the thread is out of bound. I ever done little thing about cuda debugging, and just compare if the computation result is right. Maybe my understanding to parallel computation is not so clear I think, it’s ashamed.

Finally, many many many thanks to txbob, with my best regard. Your reply indicate a right direction for me.