Why does the solution blow up as the problem size increases?

I am trying to run N ordinary least squares (OLS) problems in parallel with CUDA. Each OLS problem is to get beta from Y = X*beta, where Y has size S and X has size S×D. I launch 16 blocks of 64 threads, and I parallelize the code by solving each OLS problem on one thread.
When S=200, D=3, and N=1024, with double precision, my code gives correct results. However, when the problem size increases, say to S=300, D=3, and N=1024, the solutions become very large numbers like -6.27744e+66. But I get a correct solution if I use single precision. This looks like a memory error. The global memory of my GPU is 2 GB, and the N copies of Y and X together are only around 10 MB, so GPU memory should be sufficient. I think the following causes are possible:

(1) I wrote a Matrix class for the device, in which I use new and delete to dynamically allocate global memory on the GPU. Maybe I should use malloc and free instead?
(2) The register usage per block may be exceeded, although I have allocated the main arrays in global memory.
(3) The assignment operator and destructor of my matrix class may be implemented incorrectly, but the host version of the code runs correctly on the CPU, no matter what the problem size.

I am not sure how to debug this. Any suggestions? Thank you in advance.

In the failing case, have you run your code with cuda-memcheck? If so, what result did cuda-memcheck report?

Thank you for your suggestions. I have made a simplified version of my code below. cuda-memcheck locates the error at the assignment C.v[i*A.nn+j]=1.0;
The error message is:
Invalid global write of size 8
at 0x00000288 in e:\…\NRmatrix.cuh:315:NRmatrix trans(NRmatrix const &)
by thread (58,0,0) in block (1,0,0)
Address 0x00000000 is out of bounds

It seems that the constructor fails to allocate memory, but this only happens when the problem size is large.
So is there any solution to this problem?

template <class T>
class NRmatrix {
public:
	bool ConByRef;
	int nn;
	int mm;
	T *v;
	__host__ __device__ NRmatrix(int n, int m);
};

template <class T>
__host__ __device__ NRmatrix<T>::NRmatrix(int n, int m)
	: nn(n), mm(m), ConByRef(false),
	  v(n > 0 && m > 0 ? (T *)malloc(sizeof(T) * n * m) : NULL)
{}

template <typename T> __host__ __device__
NRmatrix<T> trans(const NRmatrix<T>& A) {
	NRmatrix<T> C(A.mm, A.nn);
	for (int i = 0; i < A.mm; i++) {
		for (int j = 0; j < A.nn; j++) {
			C.v[i*A.nn+j] = 1.0; // memcheck locates the error here
		}
	}
	return C;
}

If you are using in-kernel new or malloc, you should test the returned pointer for NULL (note "Address 0x00000000 is out of bounds") before attempting to use it. If it is NULL, the malloc or new failed.

Read here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
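As a minimal sketch of that check (the helper name and error handling here are illustrative, not taken from your code):

```cuda
// Hypothetical helper showing the suggested NULL check for in-kernel malloc.
// When the device heap is exhausted, malloc() returns NULL; writing through
// that pointer is exactly the "Address 0x00000000 is out of bounds" error
// that cuda-memcheck reported.
__device__ double *alloc_checked(size_t count)
{
    double *p = (double *)malloc(count * sizeof(double));
    if (p == NULL) {
        printf("device malloc of %llu bytes failed\n",
               (unsigned long long)(count * sizeof(double)));
        __trap(); // abort the kernel rather than write through NULL
    }
    return p;
}
```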

Thank you, Robert. But why does malloc fail on the device? My GPU memory should be enough.
And what can I do if malloc fails on the device?
I tried the code below, which retries the allocation until it succeeds, but it terminates abnormally when running.
Is there any code example to solve this problem?

template <class T>
__host__ __device__ NRmatrix<T>::NRmatrix(int n, int m)
	: nn(n), mm(m), ConByRef(false), v(NULL) // v must start as NULL, or the loop may never run
{
	while (v == NULL) { // note: this spins forever if the heap stays exhausted
		v = (T *)malloc(sizeof(T) * n * m);
	}
}

I think the problem is described in your link:
“A default heap of eight megabytes is allocated if any program uses malloc() without explicitly specifying the heap size.”
After setting the heap size to 1 GB with the following call (made before launching any kernel):
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024ULL * 1024 * 1024);
the memory error is gone.