CUBLAS_STATUS_MAPPING_ERROR when retrieving result after cublasSgemm

In NVIDIA’s SDK project simpleCUBLAS, two NxN matrices are multiplied.

I changed the size of N to be 4700. I have a GeForce 8600M GT w/ 512 MB of RAM…

When I run the program (simpleCUBLAS), memory for the device matrices A, B, and C is allocated correctly. cublasSgemm(…) is invoked and the status afterwards is CUBLAS_STATUS_SUCCESS.

However, when I attempt to read the result back:

NVIDIA SDK Project - (C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA CUDA SDK\projects\simpleCUBLAS)

/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!! device access error (read C)\n");
    return EXIT_FAILURE;
}

I get an error “!!! device access error (read C)”… When checking the error type, the actual error is CUBLAS_STATUS_MAPPING_ERROR…

I have had this same problem in my own code when attempting to retrieve the result of a matrix multiply via cublasGetVector(…) and cublasGetMatrix(…).

I have also run into this same problem on a GeForce 9650M w/ 1 GB of RAM - however, the matrix size NxN needs to be around 7000x7000 before yielding a CUBLAS_STATUS_MAPPING_ERROR…

How can cublas successfully allocate GPU memory, operate on that memory, but fail to retrieve the results back to the host?
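For reference, a quick size check shows why the allocations themselves go through. This is a minimal sketch in plain C; the function names are made up for illustration, and it ignores whatever memory the display and driver already hold on the card.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed on the device for three n x n single-precision
   matrices (A, B and C), as simpleCUBLAS allocates them. */
size_t sgemm_device_bytes(size_t n)
{
    assert(n > 0);
    return 3u * n * n * sizeof(float);
}

/* Returns 1 if the three matrices fit in device_bytes of memory.
   Assumption: nothing else is using device memory, so this is an
   optimistic check. */
int sgemm_fits(size_t n, size_t device_bytes)
{
    return sgemm_device_bytes(n) <= device_bytes;
}
```

With 4-byte floats, sgemm_device_bytes(4700) is 265,080,000 bytes (about 253 MiB), which fits in 512 MB, and N = 7000 needs 588,000,000 bytes, which fits in 1 GB. So running out of memory is unlikely to be the cause at these sizes.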

I am going to guess that you are hitting the driver watchdog timer limit. The CUBLAS sgemm kernel is being killed (and probably CUBLAS is losing its context) because it is taking too long to complete.
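To put rough numbers on that guess: an n x n sgemm performs about 2*n^3 floating-point operations, so the kernel time can be estimated from a sustained throughput. The sketch below is plain C; the function names are hypothetical, and the throughput and time limit are inputs you have to supply yourself, not measured values for any card in this thread.

```c
#include <assert.h>

/* Floating-point operations for an n x n by n x n sgemm:
   one multiply and one add per inner-product term, i.e. 2*n^3. */
double sgemm_flops(double n)
{
    assert(n > 0.0);
    return 2.0 * n * n * n;
}

/* Estimated kernel time in seconds at an ASSUMED sustained
   rate given in GFLOP/s. */
double sgemm_seconds(double n, double gflops)
{
    assert(gflops > 0.0);
    return sgemm_flops(n) / (gflops * 1e9);
}

/* Returns 1 if the estimated kernel time exceeds the watchdog
   limit (in seconds) that the OS enforces. */
int exceeds_watchdog(double n, double gflops, double limit_seconds)
{
    return sgemm_seconds(n, gflops) > limit_seconds;
}
```

For N = 4700 that is about 2.08e11 operations. At a purely illustrative 20 GFLOP/s the kernel would need roughly 10 seconds, so a limit of a few seconds would be exceeded, while N = 1000 at the same rate would finish in about 0.1 seconds.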

When I run this on Windows Vista, execution always quits after 5 seconds. However, in Windows XP, execution for some matrices takes about 20 seconds…

What you are saying makes sense - in WinXP, when execution fails, it typically takes a smaller amount of time to fail and exit than to complete successfully on a smaller data set.

Is there a way to work around the watchdog timer? Right now, my video card is attached to a display (laptop).

Is the watchdog timer only active when the video card is attached to a display?

How else can I retrieve several hundred MB of contiguous floating-point data using only one device handle?

The watchdog behaviour is OS-specific and I only use Linux (where you can get around it), so I can’t help you with that.

But to be clear, it isn’t the memory copy which is having problems, it is the sgemm call. You can fill or read back the entire device memory of a 1 GB card in a few hundred milliseconds, which is never going to be problematic. But a single big monolithic CUBLAS kernel can take a while and cause problems. The error only shows up at the read-back because that is the first call after the kernel has been killed.
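One possible way to avoid a single long-running kernel is to split the multiply into column panels, so that each call does only a fraction of the work. The sketch below shows the partitioning in plain C with a CPU stand-in for the per-panel multiply; gemm_panel and gemm_in_panels are hypothetical names, and whether this keeps each launch under the watchdog limit on a given card is an assumption that would need testing.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for one GEMM call (hypothetical helper):
   C (m x w) = A (m x k) * B (k x w), all column-major. */
void gemm_panel(int m, int w, int k,
                const float *A, int lda,
                const float *B, int ldb,
                float *C, int ldc)
{
    for (int j = 0; j < w; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            C[i + (size_t)j * ldc] = acc;
        }
    }
}

/* Computes C = A * B for n x n column-major matrices, `panel`
   columns of B and C at a time. Returns the number of panel calls.
   In column-major storage a column panel is contiguous, so on the
   GPU each iteration could be one legacy-API call along the lines of
     cublasSgemm('n', 'n', n, w, n, 1.0f, d_A, n,
                 d_B + j0 * n, n, 0.0f, d_C + j0 * n, n);
   (shown only as a comment; not exercised here). */
int gemm_in_panels(int n, int panel,
                   const float *A, const float *B, float *C)
{
    assert(n > 0 && panel > 0);
    int calls = 0;
    for (int j0 = 0; j0 < n; j0 += panel) {
        int w = (n - j0 < panel) ? (n - j0) : panel;  /* last panel may be narrower */
        gemm_panel(n, w, n, A, n,
                   B + (size_t)j0 * n, n,
                   C + (size_t)j0 * n, n);
        ++calls;
    }
    return calls;
}
```

Each panel costs about 2*n*n*w operations instead of 2*n^3, so a smaller panel width trades more calls for shorter individual kernels. The result is the same as one big multiply because the columns of C are independent of each other.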

From my understanding, yes. You should be able to get a second card (cheap, CUDA capability NOT required) to use for display purposes, which frees up your CUDA card for processing so you don’t have to worry about the watchdog.

I’m not sure what would happen if you physically unplugged your display before running CUDA but I doubt it would fix anything because the card would presumably still be the primary display device?

Thanks for the input… To resolve this issue, I installed Ubuntu 9.10 w/ the CUDA 2.2 drivers… Prior to running my kernel, I invoked /etc/init.d/gdm stop (to turn off X)… This bypasses the whole watchdog issue…

The instructions for getting CUDA working under Linux can be found here (this also works on Ubuntu 8.04, 8.10, and 9.10 w/ the standard desktop install + some dev packages)…


I am having the same problem - could you tell me how to get around it in Linux? I would rather not stop gdm if possible (but I am fine waiting several seconds with an unresponsive X while the calculation is being processed).