I am trying to solve N ordinary least squares (OLS) problems in parallel with CUDA. Each problem finds beta in Y = X*beta, where Y is an S-vector and X is an S×D matrix. I launch 16 blocks of 64 threads and assign one OLS problem to each thread.
With S=200, D=3, and N=1024 in double precision, my code gives correct results. However, when the problem size increases to S=300, D=3, N=1024, the solutions become huge numbers like -6.27744e+066, although single precision still gives correct results. This looks like a memory error. My GPU has 2 GB of global memory, and the N copies of Y and X together are only around 10 MB, so global memory capacity should not be the problem. I suspect the following possibilities:
(1) I wrote a Matrix class for the device that uses new and delete to dynamically allocate global memory on the GPU. Should I use malloc and free instead?
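As far as I understand, device-side new and malloc both draw from the runtime's device heap, which defaults to only 8 MB regardless of total global memory, and device new returns nullptr on failure instead of raising an error. A sketch of how one could raise that limit and guard the allocation (the 64 MB figure and the kernel are just illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void alloc_check(int n) {
    // Device-side new allocates from the device heap (default 8 MB),
    // not from the GPU's full 2 GB; on failure it returns nullptr.
    double* buf = new double[n];
    if (buf == nullptr) {
        printf("thread %d: device heap exhausted\n",
               blockIdx.x * blockDim.x + threadIdx.x);
        return;
    }
    delete[] buf;
}

int main() {
    // Must be called before any kernel that uses device-side new/malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024);

    alloc_check<<<16, 64>>>(300 * 3);
    cudaDeviceSynchronize();
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```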
(2) The kernel may exceed the available registers per block, although I have allocated the main arrays in global memory.
(3) The assignment operator and destructor of my Matrix class may be implemented incorrectly, although the __host__ __device__ version of the code runs correctly on the CPU for any problem size.
I am not sure how to debug this. Any suggestions? Thanks in advance.