an illegal memory access was encountered

hlei · November 8, 2017, 6:35am

I have search this question but I feel no useful information for me. The following is the code where it got error information:

checkCudaErrors(cudaMemcpy(dev_X, X->data, m*n * sizeof(float), cudaMemcpyHostToDevice));
		start2 = clock();
		dim3 Grid(40, m);
		dim3 Block(1024, 1);
		order2_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));
		cudaDeviceSynchronize();
		stop2 = clock();
		float perTime = stop2 - start2;
		totalTime = totalTime + perTime;

		dim3 sumGrid(40, M1);
		dim3 sumBlock(1024, 1);
		int sharedSize = sumBlock.x * sizeof(float);
		sumReduction_kernel << <sumGrid, sumBlock, sharedSize, 0 >> > (dev_Xmean, dev_Xsub, M1, n);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));  //here got an error (an illegal memory access was encountered)
		
		sub1_kernel << <sumGrid, sumBlock >> > (dev_XFinal, dev_Xsub, dev_Xmean, M1, n);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));  //here got an error (an illegal memory access was encountered)

A very very strange place is that when I use a smaller matrix (120×32400) to do a test, there is no error occured and it can output a right result. However, when the size of matrix is 224×40000, then there will be an error and the all following kernel will wrong. Finally, if I change the order2_kernel to a similar kernel, everything will be OK!

The following is the code of order2_kernel:

__global__ void order2_kernel(float *p, float *res, int m, int n, int k)
{
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int back_index = 2*k * n + tid;
	int ford_index = tid;
	int midl_index = k * n + tid;
#pragma unroll
	for (int i = 0; i < (m - 2 * k); i++)
	{
		res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];
		if (res[i * n + tid] < 0) {
			res[i * n + tid] = fabs(res[i * n + tid]);
		}
		back_index += n;
		ford_index += n;
		midl_index += n;
	}
}

In the initialization settings, the m is the rows of every input matrix, and I set 1024 threads in a single block.
So I want to know what is the problem in my code, it’s so weird.

Any suggestion is appreciate,
many many thanks~

hlei · November 9, 2017, 6:51am

I am so sorry to post this twice,but I did cannot understand why this program will collapsed when the data is bigger.
What I have changed is just the launch parameter with the change of data, I also try two verify the data after the order2_kernel with the following code. But it collapsed when cudaMemcpy executed with cudaErrorIllegalAddress. (However, there was no error with the other data). So, hope you can give me some suggestion about this

checkCudaErrors(cudaMemcpy(dev_X, X->data, m*n * sizeof(float), cudaMemcpyHostToDevice));
		start2 = clock();
		dim3 Grid(40, m);
		dim3 Block(1024, 1);
		//bandSub_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		order2_kernel << <Grid, Block >> > (dev_X, dev_Xsub, m, n, k);
		printf("%s\n", cudaGetErrorString(cudaGetLastError()));
		cudaDeviceSynchronize();
		stop2 = clock();
		float perTime = stop2 - start2;
		totalTime = totalTime + perTime;
		float *h_Xsub = (float*)malloc(sizeof(float)*M1*n);
		checkCudaErrors(cudaMemcpy(h_Xsub, dev_Xsub, M1*n*sizeof(float), cudaMemcpyDeviceToHost));// here collapsed with a big matrix, but no error with a small
		writeFile(h_Xsub, m, n, "D:\\Eg\\error test\\h_Xsub.mat", "Xsub");

Robert_Crovella · November 9, 2017, 6:58am

If you want to debug the illegal address error, a method such as described here:

[url]cuda - Unspecified launch failure on Memcpy - Stack Overflow

is one possibillity

hlei · November 9, 2017, 7:57am

I have not try with cuda-memcheck but enabled CUDA memory checker in Nsight, I got the following error in output:

Summary of access violations:
d:\eg\rx_stream_test\rx_gpu_acc\rx_gpu\rx_gpu.cuh(117): error MemoryChecker: #misaligned=0 #invalidAddress=32

Memory Checker detected 32 access violations.
error = access violation on load (global memory)
gridid = 95
blockIdx = {39,0,0}
threadIdx = {384,0,0}
address = 0x50602e500
accessSize = 4

The line 117 is the following line in order2_kernel:

<b>res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];</b>

hlei · November 9, 2017, 1:22pm

I have search some question about this access violations, maybe its similar to the following

[url]Memory errors when writing to local variable in kernel - CUDA Programming and Performance - NVIDIA Developer Forums

But it still unsolved, it’s so weird why it will be wrong when a bigger matrix. I will try to test the single order2_kernel function.

A question is that uf there will be a violation when operate different rows’ elements in parallel?

Any more suggestions is appreciated…

hlei · November 9, 2017, 3:03pm

The following is the test program:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <mat.h>
#include "device_launch_parameters.h"

void writeFile(float* D, int rows, int cols, const char *path, const char *var)
{
	MATFile *pw = matOpen(path, "w");
	if (pw == NULL)
	{
		printf("File open failed");
	}
	mxArray *br = mxCreateDoubleMatrix(rows, cols, mxREAL);
	for (int i = 0; i < rows; i++)
	{
		for (int j = 0; j < cols; j++)
		{
			mxGetPr(br)[j * rows + i] = (float)D[i * cols + j];
		}
	}
	int status = matPutVariable(pw, var, br);
	if (status != 0)
	{
		printf("%s\n", "save result failed");
	}
	mxDestroyArray(br);
	matClose(pw);
}

__global__ void order2_kernel(float *p, float *res, int m, int n, int k)
{
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int back_index = 2 * k * n + tid;
	int ford_index = tid;
	int midl_index = k * n + tid;
#pragma unroll
	for (int i = 0; i < (m - 2 * k); i++)
	{
		res[i * n + tid] = p[back_index] + p[ford_index] - p[midl_index];  
		if (res[i * n + tid] < 0) {
			res[i * n + tid] = fabs(res[i * n + tid]);
		}
		back_index += n;
		ford_index += n;
		midl_index += n;
	}
}



void initA(float *p, int m, int n)
{
	for (int i = 0; i < m; i++)
	{
		for (int j = 0; j < n; j++)
		{
			p[i * n + j] = i * n + j;
		}
	}
}

int main()
{
	int m = 169;
	int n = 4096;
	int k = 1;
	size_t size = sizeof(float) * m * n;
	float *data = (float*)malloc(size);
	initA(data, m, n);
	int M = m - 2*k;
	float *dev_data;
	checkCudaErrors(cudaMalloc((void**)&dev_data, size));
	checkCudaErrors(cudaMemcpy(dev_data, data, size, cudaMemcpyHostToDevice));
	float *dev_result;
	checkCudaErrors(cudaMalloc((void**)&dev_result, sizeof(float) * M * n));
	dim3 grid(4, m);
	dim3 block(1024, 1);
	order2_kernel << <grid, block >> > (dev_data, dev_result, m, n, k);
	printf("%s\n", cudaGetErrorString(cudaGetLastError()));
	float *result = (float*)malloc(sizeof(float)*M * n);
	checkCudaErrors(cudaMemcpy(result, dev_result, sizeof(float) * M * n, cudaMemcpyDeviceToHost));
	writeFile(result, M, n, "D:\\Eg\\error test\\test.mat", "T");

	cudaFree(dev_data);
	cudaFree(dev_result);
	free(data);
	free(result);

	getchar();
	return 0;

}

I have test the order2_kernel with above code, there always an error the same as we have mentioned with cuda debugging in nsight. But in most cases, it can run to get a right result without cuda debugging but the number of columns is 40000. So this is why?

Now we may know the root of this problem maybe in the kernel computation? I just want to compute a value, which relate to three different rows. (The front row add the back row and then subtract the middle row, with the corresponding element). So where is the wrong place and any other way?

Robert_Crovella · November 9, 2017, 3:06pm

you’ve only completed the first step of the debug process. Now that you know a fault is occurring on line 117, insert additional code to test each computed index against the relevant limit, to see which index is out-of-bounds. Then work backwards from there, in a similar fashion, to find out why the index is out of bounds, and you will locate the bug in your code. i.e. replace line 117 with something like:

if (i * n + tid > res_max_index) printf("fault1: ....", ...);
else if (back_index > p_max_index) printf("fault2: ...", ...);
else if (ford_index > p_max_index) printf("fault3: ...", ...);
else if (midl_index > p_max_index) printf("fault4: ...", ...);
else
  res[i * n + tid] = p[back_index] + p[ford_index] - 2 * p[midl_index];

Once you see which sort of fault you get, work backwards. Find the code that computes the relevant index, and see (perhaps using in-kernel printf, like above) why the calculation is out-of-range. Yes, it is tedious and requires effort on your part. Welcome to debugging. Or you can use the debug capability built into nsight VSE. Which may also be tedious and requires effort on your part.

hlei · November 10, 2017, 1:46pm

Anyway, I have found the the bug in my code. I have done many duplication computation with too much launched blocks and then the thread is out of bound. I ever done little thing about cuda debugging, and just compare if the computation result is right. Maybe my understanding to parallel computation is not so clear I think， it’s ashamed.

Finally, many many many thanks to txbob, with my best regard. Your reply indicate a right direction for me.

Topic		Replies	Views
an illegal memory access was encountered CUDA Programming and Performance	0	932	November 8, 2017
illegal memory access - any help appreciated CUDA Programming and Performance	5	7130	February 8, 2018
Grid fails with error 700 (Illegal Memory Access), when reaching specific size, without obvious reason CUDA Programming and Performance	5	514	September 13, 2022
Why "an illegal memory access was encountered" when numbre block increase CUDA Programming and Performance cuda , kernel	5	5974	July 13, 2023
Tracking down CUDA illegal memory access CUDA Programming and Performance	1	1305	February 20, 2015
Alternately Encountering "Illegal Memory Access" Legacy PGI Compilers	4	6689	August 28, 2017
For_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered CUDA Programming and Performance	4	3003	May 9, 2021
Tracking Invalid read size and illegal memory access CUDA Programming and Performance	3	7797	May 24, 2016
Indexing inside the range getting llegal memory access was encountered error CUDA Programming and Performance	1	304	April 26, 2022
Memory Checker detected 12 access violations. error = access violation on load (global memory) CUDA Programming and Performance	6	1675	January 21, 2018

an illegal memory access was encountered

Related topics