I’m working on a deep-learning-based speech recognition app, and I’m in the process of adding cuDNN as a backend. I got as far as being able to run a full feedforward and backpropagation pass and get correct numbers, except for the last level of backpropagation, which, for some reason, crashes the system. By that I mean no error code is reported, but half of the time I get a “driver stopped responding and had to be restarted” message, and the other half the screen goes dark and the system (Windows) has to be rebooted. I’m on a GeForce GTX 980 Ti with the latest drivers (398.36), CUDA 9.2, cuDNN 7.1.
I’ve managed to isolate the crashing code. Here goes:
void create_2d_tensor(cudnnTensorDescriptor_t& desc, int w, int h)
{
    cudnnStatus_t status;
    status = cudnnCreateTensorDescriptor(&desc);
    if (status != CUDNN_STATUS_SUCCESS)
        printf("cudnnCreateTensorDescriptor fail %d\n", status);
    // Describe a w x h plane as an NCHW tensor: N=1, C=h, H=1, W=w.
    int dims[4], strides[4];
    dims[0] = 1;
    dims[1] = h;
    dims[2] = 1;
    dims[3] = w;
    strides[0] = w * h;
    strides[1] = w;
    strides[2] = w;
    strides[3] = 1;
    status = cudnnSetTensorNdDescriptor(desc, CUDNN_DATA_FLOAT, 4, dims, strides);
    if (status != CUDNN_STATUS_SUCCESS)
        printf("cudnnSetTensorNdDescriptor fail %d\n", status);
}
void crash(cudnnHandle_t handle)
{
    cudaError_t err;
    cudnnStatus_t status;
    int wIn = 100000000;
    cudnnTensorDescriptor_t y0, y1;
    cudnnFilterDescriptor_t fdesc;
    cudnnConvolutionDescriptor_t cdesc;
    create_2d_tensor(y0, wIn, 1);
    cudnnCreateFilterDescriptor(&fdesc);
    // 4 output features, 1 input channel, 1 x 8 kernel
    status = cudnnSetFilter4dDescriptor(fdesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 4, 1, 1, 8);
    cudnnCreateConvolutionDescriptor(&cdesc);
    // no padding, vertical stride 1, horizontal stride 2, no dilation
    status = cudnnSetConvolution2dDescriptor(cdesc, 0, 0, 1, 2, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    int _n, _c, _h, _w;
    cudnnGetConvolution2dForwardOutputDim(cdesc, y0, fdesc, &_n, &_c, &_h, &_w);
    printf("CUDNN reports output %d x %d x %d x %d\n", _n, _c, _h, _w);
    int wOut = _w;
    create_2d_tensor(y1, wOut, 4);
    float* pdx = 0, *py = 0;
    size_t alloc_size0 = (size_t)wIn * sizeof(float);
    size_t alloc_size1 = (size_t)wOut * 4 * sizeof(float);
    err = cudaMalloc(&py, alloc_size0);
    cudaMemset(py, 0, alloc_size0);
    err = cudaMalloc(&pdx, alloc_size1);
    cudaMemset(pdx, 0, alloc_size1);
    float* pw_delta = 0;
    cudaMalloc(&pw_delta, 8 * 4 * sizeof(float));
    cudaMemset(pw_delta, 0, 8 * 4 * sizeof(float));
    float blend[2] = { 1.0f, 0.0f }; // alpha = 1, beta = 0
    status = cudnnConvolutionBackwardFilter(handle, blend,
        y0, py,    // input x
        y1, pdx,   // output gradient dy
        cdesc, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3, 0, 0, blend + 1,
        fdesc,
        pw_delta); // filter gradient dw
    printf("CUDNN status %d\n", status);
    cudaDeviceSynchronize();
    err = cudaGetLastError();
    printf("CUDA error %d\n", err);
}
This is a convolution with kernel width 8, horizontal stride 2, and 4 output features. The crash seems to happen only if wIn is sufficiently large (the threshold on my system is somewhere between 50M and 100M elements), but it is related neither to the availability of graphics memory (at 100M elements, I only allocate about 1.2 GB) nor to TDR (with 50M elements, the code executes in 40 milliseconds).
I tried ALGO_0 as well, and it crashes too.