CUBLAS_STATUS_MAPPING_ERROR in cublasGetMatrix() after cublasDgemm()

Hello:

I’m trying to use cublasDgemm() for large matrices and I get a CUBLAS_STATUS_MAPPING_ERROR with strange behavior: at first, cublasDgemm() and cublasGetMatrix() apparently run well, but after several calls I get the error when I try to copy the matrix back to main memory.

The hardware and drivers:

GPU: GeForce GTX 550 Ti
GPU Memory: 1535.2 MB
Driver: NVIDIA-Linux-x86_64-310.32.run (downloaded from www.nvidia.com)
CUDA version: 5.0 (also with 4.2)
Operating system: Debian GNU/Linux 64 bits (gcc 4.7.2)

I’m writing some code in order to measure the performance of cublasDgemm() on my GPU, so I execute the function several times for several matrix dimensions (all matrices are square). The dimensions go from 500 to 7500 in steps of 1000.

As cublasDgemm() uses 3 matrices (C = alpha*A*B + beta*C), for dimension 7500 the total amount of memory used in double precision is 3*7500*7500*8/1024/1024 = 1287.5 MB, which is less than the total memory of the GPU.

The computations run OK for dimensions 500 … 6500. The problem appears at dimension 7500, and in a strange way: the computation succeeds the first three times, but on the fourth repetition, when I try to copy matrix C back from the GPU, I get CUBLAS_STATUS_MAPPING_ERROR. The computer then becomes totally blocked and the only solution is to reboot the machine.

I think I free all the memory correctly, so I don’t know where the problem could be.

The code is pasted below. Could anyone with a similar card try to run it and post the results?

Thanks

#include<stdio.h>
#include<stdlib.h>
#include<cuda_runtime.h>
#include<cublas_v2.h>
//dimension limit M=N=K
#define DLIM 7500
//repetitions
#define R 10
//function that calls cublasDgemm
void gpudgemm(int N,double* A,double* B,double* C);
//function that gets error code in CUDA
void cudaerrorcode(cudaError_t status);
//function that gets error code in CUBLAS
void cublaserrorcode(cublasStatus_t status);

int main()
{
    //variable declaration
    int i=0,j=0;
    double* A=NULL;
    double* B=NULL;
    double* C=NULL;
    //memory allocation and initialization
    A = (double*)malloc(DLIM*DLIM*sizeof(double));
    B = (double*)malloc(DLIM*DLIM*sizeof(double));
    C = (double*)malloc(DLIM*DLIM*sizeof(double));
    if((A==NULL)||(B==NULL)||(C==NULL))
    {
        fprintf(stderr,"Error in malloc()\n");
        exit(EXIT_FAILURE);
    }
    for(i=0;i<DLIM*DLIM;i++)
    {
        A[i] = 1.0;
        B[i] = 1.0;
        C[i] = 1.0;
    }
    //computations
    for(i=500;i<=DLIM;i+=1000)
    {
        fprintf(stderr,"M=N=K= %d\n",i);
        for(j=0;j<R;j++)
        {
            fprintf(stderr,"Computation %d ",j);
            fflush(stderr);
            gpudgemm(i,A,B,C);
            fprintf(stderr,"...done\n");
        }
        fprintf(stderr,"\n");
    }
    //memory free
    free(A);
    free(B);
    free(C);
    //end of function
    return 0;
}

void gpudgemm(int N,double* A,double* B,double* C)
{
    //variable declaration
    double ab=1.0;
    double* cA=NULL;
    double* cB=NULL;
    double* cC=NULL;
    cublasHandle_t handle;
    cublasOperation_t TRANSA=CUBLAS_OP_N;
    cublasOperation_t TRANSB=CUBLAS_OP_N;
    cublasStatus_t status1=CUBLAS_STATUS_SUCCESS;
    cublasStatus_t status2=CUBLAS_STATUS_SUCCESS;
    cublasStatus_t status3=CUBLAS_STATUS_SUCCESS;
    cudaError_t Status1=cudaSuccess;
    cudaError_t Status2=cudaSuccess;
    cudaError_t Status3=cudaSuccess;
    //memory allocation and data copy to GPU
    status1 = cublasCreate(&handle);
    if(status1!=CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr,"Error in cublasCreate()\n");
        fprintf(stderr,"CUBLAS error code: ");
        cublaserrorcode(status1);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //cast N to size_t before multiplying so N*N cannot overflow int
    Status1 = cudaMalloc((void**)&cA,(size_t)N*N*sizeof(*A));
    Status2 = cudaMalloc((void**)&cB,(size_t)N*N*sizeof(*B));
    Status3 = cudaMalloc((void**)&cC,(size_t)N*N*sizeof(*C));
    if((Status1!=cudaSuccess)||
       (Status2!=cudaSuccess)||
       (Status3!=cudaSuccess))
    {
        fprintf(stderr,"Error in cudaMalloc()\n");
        fprintf(stderr,"CUDA error code cA: ");
        cudaerrorcode(Status1);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUDA error code cB: ");
        cudaerrorcode(Status2);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUDA error code cC: ");
        cudaerrorcode(Status3);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    status1 = cublasSetMatrix(N,N,(int)sizeof(*A),(const void*)A,N,(void*)cA,N);
    status2 = cublasSetMatrix(N,N,(int)sizeof(*B),(const void*)B,N,(void*)cB,N);
    status3 = cublasSetMatrix(N,N,(int)sizeof(*C),(const void*)C,N,(void*)cC,N);
    if((status1!=CUBLAS_STATUS_SUCCESS)||
       (status2!=CUBLAS_STATUS_SUCCESS)||
       (status3!=CUBLAS_STATUS_SUCCESS))
    {
        fprintf(stderr,"Error in cublasSetMatrix()\n");
        fprintf(stderr,"CUBLAS error code cA: ");
        cublaserrorcode(status1);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUBLAS error code cB: ");
        cublaserrorcode(status2);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUBLAS error code cC: ");
        cublaserrorcode(status3);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //dgemm computation
    status1 = cublasDgemm(handle,TRANSA,TRANSB,N,N,N,&ab,cA,N,cB,N,&ab,cC,N);
    if(status1!=CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr,"Error in cublasDgemm()\n");
        fprintf(stderr,"CUBLAS error code: ");
        cublaserrorcode(status1);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //result recovery
    status1 = cublasGetMatrix(N,N,(int)sizeof(*C),(const void*)cC,N,(void*)C,N);
    if(status1!=CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr,"Error in cublasGetMatrix()\n");
        fprintf(stderr,"CUBLAS error code: ");
        cublaserrorcode(status1);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //memory free
    //cudaFree() returns cudaError_t, so store it in the cudaError_t
    //variables that the check below actually inspects
    Status1 = cudaFree((void*)cA);
    Status2 = cudaFree((void*)cB);
    Status3 = cudaFree((void*)cC);
    if((Status1!=cudaSuccess)||
       (Status2!=cudaSuccess)||
       (Status3!=cudaSuccess))
    {
        fprintf(stderr,"Error in cudaFree()\n");
        fprintf(stderr,"CUDA error code cA: ");
        cudaerrorcode(Status1);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUDA error code cB: ");
        cudaerrorcode(Status2);
        fprintf(stderr,"\n");
        fprintf(stderr,"CUDA error code cC: ");
        cudaerrorcode(Status3);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //destroy cublas instance
    status1 = cublasDestroy(handle);
    if(status1!=CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr,"Error in cublasDestroy()\n");
        fprintf(stderr,"CUBLAS error code: ");
        cublaserrorcode(status1);
        fprintf(stderr,"\n");
        exit(EXIT_FAILURE);
    }
    //end of function
    return;
}

void cudaerrorcode(cudaError_t status)
{
    if(status==cudaSuccess)
    {
        fprintf(stderr,"CUDA SUCCESS");
    }
    else if(status==cudaErrorMemoryAllocation)
    {
        fprintf(stderr,"'cudaErrorMemoryAllocation'");
    }
    else if(status==cudaErrorInvalidDevicePointer)
    {
        fprintf(stderr,"'cudaErrorInvalidDevicePointer'");
    }
    else if(status==cudaErrorInitializationError)
    {
        fprintf(stderr,"'cudaErrorInitializationError'");
    }
    else
    {
        fprintf(stderr,"UNKNOWN CUDA ERROR");
    }
    //end of function
    return;
}

void cublaserrorcode(cublasStatus_t status)
{
    if(status==CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr,"CUBLAS SUCCESS");
    }
    else if(status==CUBLAS_STATUS_NOT_INITIALIZED)
    {
        fprintf(stderr,"'CUBLAS_STATUS_NOT_INITIALIZED'");
    }
    else if(status==CUBLAS_STATUS_ALLOC_FAILED)
    {
        fprintf(stderr,"'CUBLAS_STATUS_ALLOC_FAILED'");
    }
    else if(status==CUBLAS_STATUS_INVALID_VALUE)
    {
        fprintf(stderr,"'CUBLAS_STATUS_INVALID_VALUE'");
    }
    else if(status==CUBLAS_STATUS_ARCH_MISMATCH)
    {
        fprintf(stderr,"'CUBLAS_STATUS_ARCH_MISMATCH'");
    }
    else if(status==CUBLAS_STATUS_MAPPING_ERROR)
    {
        fprintf(stderr,"'CUBLAS_STATUS_MAPPING_ERROR'");
    }
    else if(status==CUBLAS_STATUS_EXECUTION_FAILED)
    {
        fprintf(stderr,"'CUBLAS_STATUS_EXECUTION_FAILED'");
    }
    else if(status==CUBLAS_STATUS_INTERNAL_ERROR)
    {
        fprintf(stderr,"'CUBLAS_STATUS_INTERNAL_ERROR'");
    }
    else
    {
        fprintf(stderr,"UNKNOWN CUBLAS ERROR");
    }
    //end of function
    return;
}

CUBLAS_STATUS_MAPPING_ERROR means the copy from the device to the host failed. CUBLAS functions are usually wrappers around one or more kernel launches. Kernel launches are asynchronous, and errors occurring during kernel execution are therefore reported on the next synchronous operation, which is often a cudaMemcpy() or cudaMemcpy2D(); that is what cublasGetMatrix() maps to.
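As an illustration of that deferred reporting, one can force synchronization right after the GEMM call so the error surfaces at the launch site. This is a debugging-only sketch (the extra synchronization hurts performance); the variable names follow the code posted above:

```c
//debugging sketch only: cudaDeviceSynchronize() waits for the kernel,
//so a failure during kernel execution is reported here instead of at
//the next copy (cublasGetMatrix())
status1 = cublasDgemm(handle,TRANSA,TRANSB,N,N,N,&ab,cA,N,cB,N,&ab,cC,N);
Status1 = cudaDeviceSynchronize();
if(Status1!=cudaSuccess)
{
    fprintf(stderr,"kernel execution failed: %s\n",cudaGetErrorString(Status1));
    exit(EXIT_FAILURE);
}
```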

Given your hardware, I think the DGEMM kernel may be triggering the operating system watchdog timer, which then kills the kernel, causing the subsequent copy to fail. You can check the execution time of the kernels with the CUDA profiler. If you see it approaching 2 seconds or so before the failure, that would be a good indication that my hypothesis is correct. The watchdog timer exists to keep the GUI responsive. Running without a GUI (e.g. by not starting X) allows your app to occupy the GPU with a compute kernel for as long as it pleases.

Here is what you would see for the case of an out-of-memory condition:

M=N=K= 14500
Computation 0 Error in cudaMalloc()
CUDA error code cA: CUDA SUCCESS
CUDA error code cB: CUDA SUCCESS
CUDA error code cC: 'cudaErrorMemoryAllocation'

First of all, thank you very much for your answer

So the error is not in cublasGetMatrix() but in cublasDgemm() or any kernel it launches?

But is the problem in the hardware itself? Or in the NVIDIA driver? Or in CUDA?

Sorry, but I didn’t understand this part. How can I use the profiler? I always run the program without a GUI: I connect to my workstation via ssh and run the program there. When the error occurs, the system is blocked and the only solution is to reboot the machine.
The strangest part of this behavior is that before the error the function runs OK three times.

Is this an execution of the program in your hardware?

Yes, that last snippet is from an execution of your (slightly modified) code on my workstation, to rule out the possibility that your issue is caused by an out-of-memory condition. As you can see, an out-of-memory condition would result in an error message at allocation time.

Based on prior experience with CUBLAS_STATUS_MAPPING_ERROR in apps using CUBLAS, the next working hypothesis is that the GEMM kernel is killed by the watchdog timer. Since this is an asynchronous event, it would cause the next synchronous CUDA API call to report the error. In your code, that would be the CUDA API calls inside cublasGetMatrix().

The watchdog timer is a feature of all operating systems supported by CUDA; its purpose is to ensure that the GUI does not freeze up for extended periods of time. Since the GPU can, at any given time, either service the GUI or run a CUDA kernel, lengthy CUDA kernels can block graphics, causing the watchdog timer to trigger. The time limit is in the 2-5 second range, if I recall correctly.

Since the GTX 550 Ti has relatively low double-precision throughput, I consider the watchdog issue the likely root cause of your problem. To confirm the hypothesis, check the execution time of the DGEMM kernels. To do so, export CUDA_PROFILE=1; this will dump a log showing the kernel execution times. Since your app uses matrices of increasing size, you should see a sequence of increasing kernel execution times. If you find that these times approach 2 seconds for the largest matrices, this would confirm that the problem is the watchdog timer. Don’t forget to disable the profiler with unset CUDA_PROFILE when you are done with the experiment.

The watchdog timer is enabled whenever your machine is running with a GUI, regardless of whether your app uses the GUI. For a Linux system, one typically boots the system to the command prompt, then logs in, then runs startx to start the GUI (graphical desktop) if desired. If you don’t run startx, there will be no GUI running and thus no watchdog timer, allowing you to run CUDA kernels of any length.
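Whether the watchdog applies to a given device can also be queried directly through the CUDA runtime. A minimal sketch (this requires a CUDA-capable machine to run; it queries device 0):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    if(cudaGetDeviceProperties(&prop,0)!=cudaSuccess)
    {
        fprintf(stderr,"cudaGetDeviceProperties() failed\n");
        return 1;
    }
    //kernelExecTimeoutEnabled is 1 when the run-time limit (watchdog)
    //is active for kernels on this device
    printf("%s: watchdog %s\n",prop.name,
           prop.kernelExecTimeoutEnabled?"enabled":"disabled");
    return 0;
}
```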

Hello:

First of all, thank you for the suggestion of CUDA_PROFILE=1.
I’ve performed two tests. In the first one, the program crashed with matrices of dimension 6500, and in the second one at the fourth computation with dimension 7500.

The results in the log file before the crash for M=N=K=7500:

method=[ memcpyHtoD ] gputime=[ 1.920 ] cputime=[ 9.000 ] 
method=[ memcpyHtoD ] gputime=[ 581893.625 ] cputime=[ 580478.000 ] 
method=[ memcpyHtoD ] gputime=[ 581875.938 ] cputime=[ 580461.000 ] 
method=[ memcpyHtoD ] gputime=[ 581893.375 ] cputime=[ 580477.000 ] 
method=[ _Z25magma_lds128_dgemm_kernelILb0ELb0ELi5ELi5ELi3ELi3ELi3EEviiiPKdiS1_iPdiiiS1_S1_ddi ] gputime=[ 15221644.000 ] cputime=[ 15.000 ] occupancy=[ 0.333 ] 
method=[ memcpyDtoH ] gputime=[ 539885.250 ] cputime=[ 15761382.000 ] 
method=[ memcpyHtoD ] gputime=[ 1.920 ] cputime=[ 9.000 ] 
method=[ memcpyHtoD ] gputime=[ 581875.312 ] cputime=[ 580462.000 ] 
method=[ memcpyHtoD ] gputime=[ 581893.500 ] cputime=[ 580477.000 ] 
method=[ memcpyHtoD ] gputime=[ 581878.375 ] cputime=[ 580461.000 ]

The first memcpyHtoD I suppose corresponds to cublasCreate() and the others to cublasSetMatrix(). Apparently, the calls to cudaMalloc() do not appear. Then comes the call to DGEMM and finally the call to cublasGetMatrix() (memcpyDtoH). Then a new iteration with calls to cublasCreate() and cublasSetMatrix(), and finally the crash (CUBLAS_STATUS_MAPPING_ERROR). The last DGEMM call does not appear.

The results in the log file before the crash for M=N=K=6500:

method=[ memcpyHtoD ] gputime=[ 1.920 ] cputime=[ 10.000 ] 
method=[ memcpyHtoD ] gputime=[ 437057.406 ] cputime=[ 435412.000 ] 
method=[ memcpyHtoD ] gputime=[ 437051.906 ] cputime=[ 435403.000 ] 
method=[ memcpyHtoD ] gputime=[ 437063.938 ] cputime=[ 435414.969 ] 
method=[ _Z25magma_lds128_dgemm_kernelILb0ELb0ELi5ELi5ELi3ELi3ELi3EEviiiPKdiS1_iPdiiiS1_S1_ddi ] gputime=[ 9937489.000 ] cputime=[ 16.000 ] occupancy=[ 0.333 ] 
method=[ memcpyDtoH ] gputime=[ 405493.875 ] cputime=[ 10342919.000 ] 
method=[ memcpyHtoD ] gputime=[ 1.888 ] cputime=[ 9.000 ] 
method=[ memcpyHtoD ] gputime=[ 437066.531 ] cpu

In this case the error occurs in the first call to cublasSetMatrix().

And two additional questions:

What are the units of time in gputime and cputime? The numbers seem too high.

I can see that for small matrices the DGEMM kernel called is _Z20fermiDgemm_v3_kernel…, but for big ones it is _Z25magma_lds128_dgemm_kernel… Does CUBLAS internally use the MAGMA DGEMM function? (http://icl.cs.utk.edu/magma/index.html)

Cheers

The timing numbers returned by the profiler are in microseconds (one millionth of a second). I assume this is documented somewhere, but I can’t point you at the relevant document right now.

Based on the log above, the execution time for dimension 6500 is 9.937489 seconds, the execution time for dimension 7500 is 15.221644 seconds. Since execution time increases with the cube of the matrix dimensions, this makes sense, since (7500/6500)**3 = 1.536 and 15.222 / 9.937 = 1.532.

This data would appear to invalidate my “kernel killed by watchdog timer” hypothesis, as very long running kernels complete. As I recall, your code executes 10 repetitions for each matrix size, and if I understand your latest update correctly, the app does not fail on the first instance of DGEMM with dimension 7500, but on a subsequent iteration?

I cannot think of a hypothesis that would jibe with that observation. I would suggest filing a bug, via the registered developer website, attaching your repro app and noting the specifics of your hardware configuration.

Individual CUBLAS functions often map to a variety of different kernels depending on GPU architecture and input arguments, as one can easily see from profiler output. Some of these kernels are based on third-party code. The CUBLAS documentation PDF lists copyright notices for third-party code included in CUBLAS in appendix C. See http://docs.nvidia.com/cuda/pdf/CUDA_CUBLAS_Users_Guide.pdf. The equivalent section of the CUBLAS online documentation is here: http://docs.nvidia.com/cuda/cublas/index.html#appendix-copyrights

Thank you again for your answer:

Correct

Where can I submit the possible bug? I’ve registered at https://developer.nvidia.com/ but I can’t find any “submit bug report” option or similar in the menus.

login at: https://developer.nvidia.com/user/register
click the green link: “CUDA/GPU Computing Registered Developer Program”
click the green link: “The Submit a Bug Form” under the heading “REPORTING AN ISSUE”

Thank you again
