cublasDgemm fails if executed repeatedly

I ran into a major problem using cublasDgemm. I call it many times and found that it completely hangs the display driver after a while. The problem does not seem to be completely deterministic.

The returned CUDA error is: “the launch timed out and was terminated”

My setup is a GTX 280 using CUDA 2.0 Beta 2 on Ubuntu 7.10.

Attached is a small modification of simpleCUBLAS that reproduces the error. Just copy it into the projects folder of the SDK (2.0 Beta 2), compile and execute.

File cublasDgemmSmoke.c

/* This example demonstrates failure of cublasDgemm on repeated execution.
 * In my setup cublasDgemm will reliably fail after about 50 - 100 executions
 * with the message "the launch timed out and was terminated".
 * The problem appears with plain cublas, direct calls to cuda only exist
 * because cublas masks the original error.
 */

/* Includes, system */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Includes, cuda */
#include "cublas.h"
#include "cuda_runtime.h"

/* Matrix size */
#define N  (2000)
/* Number of cublasDgemm invocations */
#define I  (1000)

/* Main */
int main(int argc, char** argv)
{
    cublasStatus status;
    double* h_A;
    double* h_B;
    double* h_C;
    double* d_A = 0;
    double* d_B = 0;
    double* d_C = 0;
    double alpha = 1.0;
    double beta = 0.0;
    int n2 = N * N;
    int i;

    printf( "N = %i, I = %i\n", N, I );

    /* Initialize CUBLAS */
    status = cublasInit();
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    /* Allocate host memory for the matrices */
    h_A = (double*)malloc(n2 * sizeof(h_A[0]));
    if (h_A == 0) {
        fprintf (stderr, "!!!! host memory allocation error (A)\n");
        return EXIT_FAILURE;
    }
    h_B = (double*)malloc(n2 * sizeof(h_B[0]));
    if (h_B == 0) {
        fprintf (stderr, "!!!! host memory allocation error (B)\n");
        return EXIT_FAILURE;
    }
    h_C = (double*)malloc(n2 * sizeof(h_C[0]));
    if (h_C == 0) {
        fprintf (stderr, "!!!! host memory allocation error (C)\n");
        return EXIT_FAILURE;
    }

    /* Fill the matrices with test data */
    for (i = 0; i < n2; i++) {
        h_A[i] = rand() / (double)RAND_MAX;
        h_B[i] = rand() / (double)RAND_MAX;
        h_C[i] = rand() / (double)RAND_MAX;
    }

    /* Allocate device memory for the matrices */
    status = cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (A)\n");
        return EXIT_FAILURE;
    }
    status = cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (B)\n");
        return EXIT_FAILURE;
    }
    status = cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (C)\n");
        return EXIT_FAILURE;
    }

    /* Initialize the device matrices with the host matrices */
    status = cublasSetMatrix(N, N, sizeof(h_A[0]), h_A, N, d_A, N);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write A)\n");
        return EXIT_FAILURE;
    }
    status = cublasSetMatrix(N, N, sizeof(h_B[0]), h_B, N, d_B, N);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write B)\n");
        return EXIT_FAILURE;
    }
    status = cublasSetMatrix(N, N, sizeof(h_C[0]), h_C, N, d_C, N);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write C)\n");
        return EXIT_FAILURE;
    }

    for( i = 0; i < I; ++i ) {
        /* Clear last error (cublasGetError returns and resets it) */
        cublasGetError();

        /* Performs operation using cublas */
        cublasDgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
        status = cublasGetError();
        if (status != CUBLAS_STATUS_SUCCESS) {
            fprintf (stderr, "!!!! kernel execution error.\n");
            return EXIT_FAILURE;
        }

        // make sure Dgemm is finished
        cudaError_t cudaErr = cudaThreadSynchronize();
        if( cudaErr != cudaSuccess ) {
            fprintf( stderr, "Dgemm failed on invocation %i! %s\n", i+1, cudaGetErrorString( cudaErr ) );
        }

        fprintf( stdout, "\r%i of %i iterations done", i+1, I );
        fflush( stdout );
    }

    /* Read the result back */
    status = cublasGetMatrix(N, N, sizeof(h_C[0]), d_C, N, h_C, N);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (read C)\n");
        return EXIT_FAILURE;
    }

    /* Memory clean up */
    status = cublasFree(d_A);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! memory free error (A)\n");
        return EXIT_FAILURE;
    }
    status = cublasFree(d_B);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! memory free error (B)\n");
        return EXIT_FAILURE;
    }
    status = cublasFree(d_C);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! memory free error (C)\n");
        return EXIT_FAILURE;
    }

    /* Shutdown */
    status = cublasShutdown();
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! shutdown error (A)\n");
        return EXIT_FAILURE;
    }

    if (argc <= 1 || strcmp(argv[1], "-noprompt")) {
        printf("\nPress ENTER to exit...\n");
        getchar();
    }

    return EXIT_SUCCESS;
}


File Makefile

# Add source files here
EXECUTABLE := cublasDgemmSmoke

# Cuda source files (compiled with cudacc)

# C/C++ source files (compiled with gcc / c++)
CFILES := cublasDgemmSmoke.c

# Additional libraries needed by the project
USECUBLAS := 1

# Rules and targets
include ../../common/

Does this reproduce if you’re not running X?
Please generate and attach an nvidia-bug-report.log (generated as root while this problem is present).


X is a good point, as I have X running even though I connect to the machine via SSH. Therefore I stopped gdm, which stopped X. However, now the application seems to fail even faster (I never got it to do more than 30 invocations). On the last attempt I killed it after an hour without any sign of progress.

I generated the bug report before killing the application. The txt extension is because the forum wouldn’t let me upload the log.
nvidia_bug_report.txt (192 KB)

I’m attempting to reproduce this failure here, and I’m not getting very far. When I build & run the app, it seemingly runs forever. How long does it normally take to complete one iteration?

Also, have you tested this after rebooting AND without starting X?

That’s the behaviour I saw without X.

I’ll try rebooting without X starting at all next. On the device it should do 10+ iterations in 5 seconds, so the watchdog seems to be the saviour in the error case.
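As a rough cross-check of that rate, the arithmetic can be sketched in C. The helper names dgemm_flops and implied_gflops are made up for illustration, and the count uses the usual 2*n^3 approximation for a square matrix multiply, ignoring the lower-order alpha/beta scaling work.

```c
#include <assert.h>
#include <stdio.h>

/* Approximate floating-point operations in one square DGEMM
 * (C = alpha*A*B + beta*C): n*n output elements, each needing
 * about n multiplies and n adds. Hypothetical helper. */
double dgemm_flops(int n)
{
    assert(n > 0);
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Sustained rate in GFLOP/s implied by `iterations` DGEMM calls
 * of size n completing in `seconds`. Hypothetical helper. */
double implied_gflops(int n, int iterations, double seconds)
{
    assert(iterations > 0 && seconds > 0.0);
    return dgemm_flops(n) * iterations / seconds / 1e9;
}
```

With N = 2000, one call is about 1.6e10 floating-point operations, so 10 iterations in 5 seconds corresponds to roughly 32 GFLOP/s, or about 0.5 seconds per call.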

I removed gdm from the system and ran the application without X ever being run. The result is the same without X. The only difference was that I had to create the devices for the second and third card on the system by hand, as those were not automatically created by the driver for some reason.

I slightly modified the program in my original post to show every iteration, which lets one immediately notice when the program starts to fail. I left it stalled for 30+ minutes, just to make sure.
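A minimal sketch of that kind of per-iteration reporting, in plain C: one full line per iteration, including elapsed time, so a stall shows up as the last line simply not advancing. The function name format_progress and the message format are assumptions for illustration, not what the attached program prints.

```c
#include <assert.h>
#include <stdio.h>

/* Writes a one-line progress message into buf and returns the number
 * of characters snprintf produced (excluding the terminating NUL).
 * Hypothetical helper. */
int format_progress(char *buf, size_t size, int done, int total, double seconds)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "iteration %d of %d done (%.1f s elapsed)",
                    done, total, seconds);
}
```

In the loop one would call it after cudaThreadSynchronize() returns, print the buffer followed by a newline, and call fflush(stdout) so the line appears immediately.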

On a related note, I also experimented with emulation mode. I had it running overnight, but it was still below 10 iterations, so I cannot say whether it would have worked.

You can also move the allocs, frees and memcpies out of the loop, as that won’t change the behaviour and slightly increases the speed.


It may not be related, but this issue exhibits behavior similar to the long-standing bug I have in my system, where many short (~2.0ms) kernel calls in a row cause random “launch timeout” or “unspecified launch failure” errors with X running. Without X, I observe an apparent kernel infinite loop, as theMarix does with the program here (and thus it makes sense why there is a launch timeout with X…).

theMarix: I’d attempt to confirm your problem, but I don’t have any double precision cards :( I notice that you perform allocations and frees inside the big loop. Try moving them outside the loop so that the only thing in the loop is the call to cublasDgemm. That will make the most direct comparison to the similar behavior I have observed. Plus, many on the forums have noted problems with repeated allocation/deallocations on the GPU.

Thanks for the idea, however I already did that and the behaviour is the same. I actually started the investigation from only mallocs and frees, as that’s what I first thought was causing the problem. I had it like this in my original problem as I wanted a quick check of whether it is working at all before moving the matrix to the card all over the program.

Update: I updated the program in the original post to the version without mallocs and frees in the inner loop. Additionally I created another nvidia-bug-report.log having rebooted without X.
nvidia_bug_report.txt (192 KB)

I updated to the 177.67 driver and the non-beta CUDA 2.0 today. That seems to have solved the problem.