cuBLAS causing races on Maxwell (code to reproduce it included)

#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(int argc, char** argv)
{
    // Device index can be passed on the command line (defaults to 0).
    int dev = 0;
    if (argc > 1)
        dev = std::atoi(argv[1]);
    std::printf("dev = %d\n", dev);

    cudaStream_t s;
    cublasHandle_t h;

    float *a, *b, *c;
    int n = 4000;

    cudaSetDevice(dev);

    // The buffers are deliberately left uninitialized; only the racecheck
    // report matters here, not the numerical result.
    cudaMalloc(&a, (std::size_t)n * n * sizeof(float));
    cudaMalloc(&b, (std::size_t)n * n * sizeof(float));
    cudaMalloc(&c, (std::size_t)n * n * sizeof(float));

    cudaStreamCreate(&s);
    cublasCreate(&h);

    // Run the SGEMM on a non-default stream.
    cublasSetStream(h, s);
    float alpha = 1, beta = 1;

    // C = alpha * A * B + beta * C, all matrices n x n, column-major.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n, &beta, c, n);

    cudaStreamSynchronize(s);

    cublasDestroy(h);
    cudaStreamDestroy(s);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
}

To reproduce:

g++ -I/usr/local/cuda/include cublas_race.cpp -L/usr/local/cuda/lib64 -lcudart -lcublas
cuda-memcheck --tool racecheck ./a.out 0 # pass the device index of a GTX 980 here

I tried this a few times on several GTX 980 cards and one K40. The GTX 980s usually showed screenfuls of

========= Race reported between Read access at 0x00002570 in maxwell_sgemm_128x128_nn
=========     and Write access at 0x00002548 in maxwell_sgemm_128x128_nn [406 hazards]

but the number of reported hazards varies from run to run, and sometimes it is zero. The K40 never showed any races.

This is on Ubuntu 14.04 with CUDA 6.5.

Questions:

  1. Can you reproduce this? (So that I'm not the only one reporting the issue)
  2. Is this a problem with cuBLAS or racecheck?
  3. Has this been fixed in CUDA 7.0? I couldn't find any cuBLAS release notes online.

Well, there is always someone who is the first to report an issue :-) NVIDIA has staff dedicated to reproducing customer-reported issues. Before you file a bug, though, you would want to try CUDA 7.0 to see whether the issue has already been addressed.

I have a hazy notion that this issue (or one very similar to it) was brought up in this forum before. I am not very confident in my recollection, but it may have been a case where the race condition involves data that does not actually contribute to the final result (which does not mean the issue should not be addressed). Maybe txbob remembers whether this came up before and, if so, what the details were.

The CUDA FAQ says posting in the forum is enough – no need to file bug reports.

FWIW, with the same parameters, the SGEMM results always agree with OpenBLAS for me, so I don't know whether this is a false positive in racecheck, the kind of data race that does not affect the result, or something else.
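
For reference, here is a minimal sketch of how such a comparison might be done. This is illustrative only: it assumes the buffers were initialized with known host data before the GEMM (the repro above leaves them uninitialized), and sgemm_matches_host is just a made-up helper name; link with -lopenblas in addition to the flags above.

#include <cblas.h>        // OpenBLAS CBLAS interface
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Compare the device result d_c against a host reference computed with
// OpenBLAS. h_a, h_b, h_c hold the same column-major data the device
// buffers were initialized with.
bool sgemm_matches_host(const float* d_c,
                        const std::vector<float>& h_a,
                        const std::vector<float>& h_b,
                        std::vector<float> h_c,   // taken by value: reference result is built in place
                        int n, float alpha, float beta,
                        float tol = 1e-3f)
{
    // Host reference: C = alpha*A*B + beta*C, column-major, no transposes,
    // matching the cublasSgemm call above.
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                alpha, h_a.data(), n, h_b.data(), n, beta, h_c.data(), n);

    std::vector<float> gpu((std::size_t)n * n);
    cudaMemcpy(gpu.data(), d_c, gpu.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Element-wise comparison with a mixed relative/absolute tolerance.
    for (std::size_t i = 0; i < gpu.size(); ++i)
        if (std::fabs(gpu[i] - h_c[i]) > tol * std::fabs(h_c[i]) + tol)
            return false;
    return true;
}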

A more complex application I’m working on crashes under racecheck, but not without it. It runs fine with memcheck too.

Those race conditions are harmless and will not affect the output. At the end of the GEMM, some warp-synchronous shared-memory shuffling is done in order to get coalesced writes back to global memory. This means that the region of memory being written to and subsequently read from is the same within a warp. Because the shared-memory instructions are guaranteed to execute in SASS order, there is no need to synchronize them. The racecheck tool is not coded to check that the read and write addresses meet this condition, so the hazard is reported.
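
To illustrate the pattern being described (a hypothetical standalone kernel, not the actual cuBLAS code): each lane writes to shared memory and another lane in the same warp reads that location without an intervening barrier. On these pre-Volta parts the warp executes the instructions in lockstep, so the read sees the completed write, but racecheck still flags a hazard because it does not model the warp-synchronous guarantee.

// Hypothetical illustration only -- NOT the cuBLAS kernel.
// Launch with a single warp, e.g. warp_shuffle_demo<<<1, 32>>>(d_out);
__global__ void warp_shuffle_demo(float* out)
{
    __shared__ float buf[32];
    int lane = threadIdx.x & 31;

    buf[lane] = static_cast<float>(lane);   // each lane writes its own slot
    // no __syncthreads() here: correctness relies on the warp executing
    // the shared-memory write and the following read in lockstep
    out[lane] = buf[(lane + 1) & 31];       // read a slot written by a neighbouring lane
}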

So if anything, this is a bug in racecheck.

I think the FAQ is overly optimistic, to be generous. No idea who wrote the FAQ, but I do not think there is any guarantee that bug reports in the forums will be spotted by NVIDIA.

The CUDA FAQ page has now been edited. I believe it should now consistently present the idea that bugs are to be filed using the bug reporting system available to registered developers.

Reporting bugs using these forums (only) is a less robust method.

Hi Alex - it would be great if you could file the bug here: https://developer.nvidia.com/nvbugs/cuda/add

Thanks!

Thanks for the explanation!