cuBLAS cgemm bug? Running a simple cublasCgemm call under cuda-memcheck crashes

Hi,

lately I started using the cuda-memcheck tool to look for out-of-bounds accesses etc., and I found that this check crashes my application. The problem seems to be in a call to the CUBLAS cgemm routine. The result is calculated correctly and the program runs fine; however, if I run it under cuda-memcheck I get a segmentation fault.

I’m running the program on a Tesla C2050 with CUDA 4.0.

It would be great if somebody could verify this problem.

Cheers,

Eric

The program I used is the following:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <cublas.h>
#include <cuda_runtime_api.h>

// Transposition flags
const char TransA = 'T';  const char TransB = 'N';

// Complex data type
typedef cuComplex Complex;

__inline__ void CUDA_CHECK_ERROR(void)
{
    cudaError_t cuErr = cudaGetLastError();
    if (cuErr != cudaSuccess)
        printf("%s\n", cudaGetErrorString(cuErr));
}

__inline__ void CUBLAS_CHECK_ERROR(void)
{
    cublasStatus_t cubErr = cublasGetError();
    if (cubErr != CUBLAS_STATUS_SUCCESS)
        printf("%i\n", cubErr);
}

void print_matrix (const Complex *A, const int lda, const int N, const int M)
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            printf("%8.2f+%8.2f*I\t", A[j*lda+i].x, A[j*lda+i].y);
        }
        printf("\n");
    }
}

int main(int argc, char* argv[])
{
    const int M  = 2;
    const int ld = 4;
    const int Msize = ld*M;

    const Complex alpha = {.x=1.0, .y=2.0};
    const Complex beta  = {.x=1.0, .y=2.0};

    Complex *A_D, *C_D;    // pointers to DEVICE memory
    Complex *A,  *C;       // pointers to HOST   memory

    cublasInit();    CUBLAS_CHECK_ERROR();

    // Allocate arrays on DEVICE
    cudaMalloc ((void **) &A_D, (size_t)(sizeof(Complex) * Msize));    CUDA_CHECK_ERROR();
    cudaMalloc ((void **) &C_D, (size_t)(sizeof(Complex) * Msize));    CUDA_CHECK_ERROR();

    // Allocate arrays on HOST
    A = (Complex*) malloc((size_t)(sizeof(Complex) * Msize));
    C = (Complex*) malloc((size_t)(sizeof(Complex) * Msize));

    // Initialize host data
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < M; ++j) {
            A[j*ld + i].x = 2.0;   A[j*ld + i].y = 2.0;
            C[j*ld + i].x = 4.0;   C[j*ld + i].y = 4.0;
        }
    }

    printf("\nMatrix A\n");                 print_matrix (A, ld, M, M);
    printf("\nMatrix C before CGEMM\n");    print_matrix (C, ld, M, M);

    // Copy input data from host memory to device memory
    cudaMemcpy (A_D, A, (size_t)(sizeof(Complex)*Msize), cudaMemcpyHostToDevice);    CUDA_CHECK_ERROR();
    cudaMemcpy (C_D, C, (size_t)(sizeof(Complex)*Msize), cudaMemcpyHostToDevice);    CUDA_CHECK_ERROR();

    // Call CGEMM kernel
    printf("\nbefore CGEMM call\n");
    cublasCgemm (TransA, TransB, M, M, M, alpha, A_D, ld, A_D, ld, beta, C_D, ld);    CUBLAS_CHECK_ERROR();
    printf("\nafter  CGEMM call\n");

    // Copy result back to host
    cudaMemcpy (C, C_D, (size_t)(sizeof(Complex)*Msize), cudaMemcpyDeviceToHost);    CUDA_CHECK_ERROR();

    printf("\nMatrix C after CGEMM\n");    print_matrix (C, ld, M, M);

    // Clean up
    cudaFree (C_D);    cudaFree (A_D);
    free (C);          free (A);

    cublasShutdown ();    CUBLAS_CHECK_ERROR();

    return 0;
}

which was compiled with the following Makefile:

CUDA_PATH     = /opt/cuda
CUDA_INC_PATH = $(CUDA_PATH)/include
CUDA_LIB_PATH = $(CUDA_PATH)/lib64

LIBRARY  = -L/usr/lib64  -L$(CUDA_LIB_PATH)
INCLUDE  = -I$(CUDA_INC_PATH)

NVCC      = $(CUDA_PATH)/bin/nvcc
NVCCFLAGS = -g -G -v -arch sm_20  -Xptxas="-v"
NVCCOPS   = -gencode arch=compute_20,code=compute_20 --maxrregcount 63  -Xcompiler -std=gnu99

all:    test

test:   cgemm.cu
	$(NVCC) $(NVCCFLAGS) $(NVCCOPS) $(INCLUDE) $(LIBRARY) -lcublas -lcudart -o test cgemm.cu

clean:
	rm *.o  *.bak  test
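As an aside (not part of the repro): the program uses the legacy cuBLAS interface from cublas.h. The same multiplication can also be expressed with the handle-based v2 interface from cublas_v2.h, where the scalars are passed by pointer and errors come back as return values. A minimal sketch only; cgemm_v2_sketch is a hypothetical helper name, and the device pointers, alpha, beta, M and ld are assumed to be set up exactly as above:

// Sketch: same C = alpha * A^T * A + beta * C through the cuBLAS v2 API.
// This belongs in a translation unit that includes cublas_v2.h instead of cublas.h.
#include <cublas_v2.h>

static cublasStatus_t cgemm_v2_sketch(const cuComplex *alpha, const cuComplex *beta,
                                      const cuComplex *A_D, cuComplex *C_D,
                                      int M, int ld)
{
    cublasHandle_t handle;
    cublasStatus_t stat = cublasCreate(&handle);   // replaces cublasInit()
    if (stat != CUBLAS_STATUS_SUCCESS) return stat;

    // Scalars are passed by pointer in the v2 API; the status is returned directly.
    stat = cublasCgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                       M, M, M,
                       alpha,
                       A_D, ld,
                       A_D, ld,
                       beta,
                       C_D, ld);

    cublasDestroy(handle);                         // replaces cublasShutdown()
    return stat;
}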

Running the posted program gives the correct result. However, running it under cuda-memcheck crashes with a segmentation fault, and a corresponding error line shows up in the kernel messages.
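For reference, the "correct result" can be worked out by hand: with op(A) = A^T, every entry of A equal to 2+2i, every entry of C equal to 4+4i, and alpha = beta = 1+2i, each entry of alpha*A^T*A + beta*C is (1+2i)*16i + (1+2i)*(4+4i) = -36+28i. A small standalone host sketch of that check, using only the cuComplex helpers (no GPU involved):

#include <stdio.h>
#include <cuComplex.h>

// Host-only check of the expected per-entry value (not part of the repro).
// Every A entry is 2+2i, every C entry is 4+4i, alpha = beta = 1+2i, op(A) = A^T.
int main(void)
{
    cuComplex a     = make_cuComplex(2.0f, 2.0f);
    cuComplex c     = make_cuComplex(4.0f, 4.0f);
    cuComplex alpha = make_cuComplex(1.0f, 2.0f);
    cuComplex beta  = make_cuComplex(1.0f, 2.0f);

    // For the 2x2 case, (A^T * A)[i][j] = a*a + a*a = 2*(2+2i)^2 = 16i
    cuComplex ata = cuCaddf(cuCmulf(a, a), cuCmulf(a, a));

    // alpha * (A^T*A)[i][j] + beta * C[i][j]
    cuComplex expect = cuCaddf(cuCmulf(alpha, ata), cuCmulf(beta, c));

    printf("expected entry: %8.2f+%8.2f*I\n", cuCrealf(expect), cuCimagf(expect));  // -36.00+28.00*I
    return 0;
}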

Thanks for bringing this to our attention, and sorry for the inconvenience. Our debugger and driver teams have looked into this and were able to repro the issue with CUDA 4.0 and the r270 driver that goes along with it. According to the information I got from these teams, this issue should be fixed in CUDA 4.1 RC1 and the accompanying r285 driver, so if possible, I would recommend giving that a try.
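As a side note for anyone trying the suggested upgrade: a quick way to confirm which CUDA runtime and driver a build actually sees is the version-query pair in the runtime API. A minimal standalone sketch (the versions are encoded as major*1000 + minor*10, e.g. 4000 for CUDA 4.0 and 4010 for CUDA 4.1):

#include <stdio.h>
#include <cuda_runtime_api.h>

// Print the CUDA runtime and driver versions as reported by the runtime API.
int main(void)
{
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);
    printf("runtime version: %d, driver version: %d\n", runtimeVersion, driverVersion);
    return 0;
}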

Thanks a lot,

this means that yet another item can be removed from our checklist. ;-) At the moment we are debugging some other (curious) problems that we need to understand first before we can change our setup, so it will probably be a while before we can try CUDA 4.1 RC1. We will keep you posted on whether this eventually solves the problem.

Cheers,

Eric