CUDA Matrix multiplication breaks for large matrices

I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running Windows Server 2008 R2 Enterprise with an NVIDIA GTX 480. The code works fine for values of “Width” (the matrix width) up to about 2500, finishing in under a second.

[codebox]int size = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;

//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);

//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

//Launch the kernel
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

//Copy the result from Device Memory back to Host Memory
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);[/codebox]

When I set “Width” to 3000 or greater, the screen goes black and then I get the following error:

I looked online and saw that some people have this issue because the watchdog kills a kernel that runs longer than the number of seconds specified in “TdrDelay”. I added TdrDelay as a REG_DWORD under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers and set it to 30 seconds. After 30 seconds, I get the same error. When I set TdrLevel to 0, the machine just freezes: I get no error, but also no response. Am I exceeding memory capacity somewhere? Any help would be greatly appreciated!
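
To put a number on the memory question, here is a minimal host-only C++ sketch (no CUDA calls) of how much device memory the three cudaMalloc buffers need. The names matrix_bytes, total_bytes, fits_on_card and kAssumedCardBytes are hypothetical, and the 1536 MB capacity is an assumption taken from the stock GTX 480 specification, not something measured on this machine.

```cpp
#include <cstddef>

// Assumption: a stock GTX 480 with 1536 MB of device memory.
const std::size_t kAssumedCardBytes =
    static_cast<std::size_t>(1536) * 1024 * 1024;

// Bytes needed for one Width x Width matrix of floats.
// size_t is used instead of int so the product cannot wrap
// for the matrix widths discussed here.
std::size_t matrix_bytes(std::size_t width) {
    return width * width * sizeof(float);
}

// Bytes needed for the three device buffers Md, Nd and Pd together.
std::size_t total_bytes(std::size_t width) {
    return 3 * matrix_bytes(width);
}

// True if the three buffers alone fit within card_bytes.
// This ignores memory used by the display and the driver.
bool fits_on_card(std::size_t width, std::size_t card_bytes) {
    return total_bytes(width) <= card_bytes;
}
```

For Width = 3000, total_bytes gives 108,000,000 bytes (about 103 MB), so under these assumptions the three buffers are far below the card's capacity. The sketch only covers allocation size; it says nothing about how long the kernel runs.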
