Why “HW exception on 1 warps”?

I used Nsight to debug my CUDA code. Here is the error message:
CUDA Debugger detected HW exception on 1 warps. First warp:
blockIdx = {23,0,0}
threadIdx = {0,0,0}
Exception = Out of range Address
PC = 0x002c69c8
FunctionRelativePC = _ZN8NRmatrixIdEaSERKS0_+000e88

My code is implemented in a C++ class member function. It runs perfectly on the CPU. On the GPU the code sometimes also runs well and gives a correct result, but more often this error shows up, and Nsight points at different lines of my code on different runs.
I can show some code where this error occurs:

//CASE1:
for (i=0; i< nn; i++) for (j=0; j<mm; j++) v[i*mm+j] = rhs.v[i*mm+j];
//CASE2:
	for (int i = 0; i < size; i++)
		C.v[i] = L*R.v[i];
//CASE3:
	for (int i = i0; i < ie; i++){
		for (int j = j0; j < je;j++)
			v[i*mm + j] = rhs.v[(i-i0)*rhs.mm+j-j0];
	}

In all of the cases, the error occurs on an assignment to v, where v is a pointer to an array of double. It is a member of a class, and its memory is allocated dynamically on the device by calling malloc. rhs (or R) is another object of the same class. I guess the problem may be due to the dynamic memory allocation of v.

The reported error is an out-of-range address.
Print the indices of these assignments inside the loops, [i*mm + j], [(i-i0)*rhs.mm+j-j0], or anything else, and see where it stops.

You can allocate with malloc(), as long as you copy the contents of the host array to a device array and pass that device array to your kernel instead of the host array, as you are probably doing. But without the allocation code I can’t really comment further.

Thank you. v is allocated in the constructor like below:

template <class T> __host__ __device__
NRmatrix<T>::NRmatrix(int n, int m, const T &a) : nn(n), mm(m), ConByRef(false)
{
	v = (T *)malloc(sizeof(T)*n*m);
	for (int i = 0; i < nn; i++) {
		for (int j = 0; j < mm; j++) {
			v[i*mm+j] = a;
		}
	}
}

The error only occurs occasionally. It seems that there is something unstable in this allocation method.

Can you modify a copy of this constructor so that you use cudaMallocManaged in this allocation?
Just to see what happens. I’m not sure an allocation problem would cause memcheck to report an out of range error.

I don’t have a dev environment right now, but you can give a try and let us know.
Make sure you have that famous macro __CUDA_SAFE_CALL (or something similar) and wrap the cudaMallocManaged call with it. If there is an allocation problem, it will be reported explicitly.

EDIT:
I just found this, you may want to get some info from it.
https://devtalk.nvidia.com/default/topic/802257/working-with-cuda-and-class-methods/

The error is of a type that occurs in device code. You cannot use cudaMallocManaged in device code, nor in any function marked __host__ __device__.

Since you are using in-kernel malloc here, you may be running out of heap space. In that event, malloc will return a null pointer, which would be invalid to use.

In these situations, test to see if malloc has returned a null pointer, for error detection purposes.

And read this section of the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations

Thanks for the above observations, txbob. Always learning…

Since my CUDA code has only the minimum necessary sophistication, I never really get to see these specific problems.

Thank you all. I have added some code to check malloc. It does not return a null pointer. I have also used CUDA Warp Watch to debug. Below is the screen print. It seems that Warp Watch can access all the variables and memory locations normally, but there is still an “Out of range Address” exception.

If I use cuda-memcheck, it reports no error. It seems the error occurs only with Nsight. If I enable Nsight’s memory checker, it reports the following:
Memory Checker detected 1 access violations.
error = access violation on load (global memory)

It may possibly be a defect in Nsight VSE.

You could report a bug if you wish. The method to do so is contained in a sticky post at the top of this sub-forum. You would likely be asked for a complete code, not just the snippets you have shown here.

In the general Ctrl+F5 mode, the code cannot run either. I guess the problem can be reproduced with the demo code below.

template <class T>
class BB {
public:
	__host__ __device__ BB() {};
	__host__ __device__ void fun();

	NRmatrix<T> bb; // NRmatrix is another template class with a default constructor
};

template <class T>
__host__ __device__ void BB<T>::fun() {
	// case 1
	bb = NRmatrix<T>(arguments); // the access error occurs in the assignment operator of NRmatrix
	// case 2
	NRmatrix<T> tmp(arguments);
	bb = tmp; // no error
}

In case 1, a temporary NRmatrix object is created first and then assigned to the class member bb. But it seems this temporary object is unstable in GPU memory: during the assignment to bb it may be corrupted. In case 2, the named object tmp is more stable than the anonymous one in case 1, so it can be assigned to bb.
Actually I prefer case 1, since if case 1 fails, matrix operations like X = (A+B)*C may also fail, and I would have to write
NRmatrix<T> tmp = A+B;
X = tmp*C;