CUDA program gives cudaErrorIllegalAddress on GK110 GPUs, but runs on fine on other GPUs

Hi!
I’m having a very weird problem with my program. Essentially I’m doing a matrix multiplication on part of a matrix. The program apparently runs fine on most cards cards but crashes on sm_35 Kepler (=GK110) cards.

The initial program was written in PyCUDA (see here for the pycuda mailing list entry), but I’ve since managed to boil it down to the following minimal example written in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(int argc, char **argv)
{
    cudaError_t status;
    cublasStatus_t status_blas;
    CUresult status_drv;
    float *A = 0;
    float *B = 0;
    float *C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    float *oldA, *oldB, *oldC;
    cublasHandle_t handle;
    int n = 131;
    int m = 2483;
    int k = 3;
    int i;
    CUcontext ctx;
    
    cuInit(0);
    status_drv = cuCtxCreate(&ctx, 0, 0);
    if (status_drv != CUDA_SUCCESS) {
        fprintf(stderr, "!!!! Context creation error: %d\n", status);
        return EXIT_FAILURE;
    }
    status_blas = cublasCreate(&handle);
    if (status_blas != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }
    
    for (i = 0; i < 5; ++i) {
        printf("Iteration %d\n", i);
        if (cudaMalloc((void **)&B, m * k * sizeof(B[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate B)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&C, m * m * sizeof(C[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate C)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&A, n * m * sizeof(A[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate A)\n");
            return EXIT_FAILURE;
        }
        int s = 3;
        float * A_slice = A + 128*m;
        status_blas = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, s,
            &alpha, A_slice, m, B, k, &beta, C, m);
        if (status_blas != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "!!!! kernel execution error.\n");
            return EXIT_FAILURE;
        }
        if (i == 0) {
            oldA = A;
            oldB = B;
            oldC = C;
        } else if (i == 1) {
            status = cudaFree(oldA);
            if (status != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free A, %d)\n", status);
                return EXIT_FAILURE;
            }
            if (cudaFree(oldB) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free B)\n");
                return EXIT_FAILURE;
            }
            if (cudaFree(oldC) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free C)\n");
                return EXIT_FAILURE;
            }
        }
    }
    cublasDestroy(handle);
    cuCtxDestroy(ctx);
    return 0;
}

I only free memory in the 2nd iteration of the for loop to mimic the behavior of the original python program. The program will crash in the 2nd iteration of the for-loop when it tries to free “A”, with cudaFree returning a cudaErrorIllegalAddress error.

Concretely, the was tried on the following cards:

  • NVS 5400M -> no issues
  • GTX560Ti -> no issues
  • Tesla S2050 -> no issues
  • unknown sm_30 card (see stackoverflow thread) -> no issues
  • K40c -> CRASH
  • GTX 780 -> CRASH
  • K20m -> CRASH

I used a number of Linux machines with different distributions, some of them using CUDA 5.5 and some using CUDA 6.0. At least on the machines I have direct control over, all cards were using the 331 nvidia driver series.

There are several things to note here:

  • the order of the malloc calls matters. If I allocate A before B things run fine
  • the numerical constants matter a bit. For some values (e.g. n=30) no crash occurs, for others there is a crash
  • The order of the free/malloc calls matter. If I free the memory in the same iteration where I allocate, everything works just fine

At this point I’m pretty desperate. I don’t see why or where I’m doing anything wrong. If anyone could help me, I’d really appreciate it.

(Stackoverflow thread: http://stackoverflow.com/questions/24671820/cuda-program-gives-cudaerrorillegaladdress-on-sm-35-kepler-gpus-but-runs-on-fin?noredirect=1#comment38252831_24671820)