cudnnConvolutionBackwardFilter crashes the system

I’m working on a deep-learning-based speech recognition app and I’m in the process of adding CuDNN as a backend. I got as far as being able to run a full feedforward and backpropagation pass and get correct numbers, except for the last level of backpropagation, which, for some reason, crashes the system. By that I mean that no error code is reported, but half of the time I get a “driver stopped responding and had to be restarted” message, and the other half the screen goes dark and the system (Windows) has to be rebooted. I’m on an NVIDIA GeForce GTX 980 Ti with the latest drivers (398.36), CUDA 9.2, CuDNN 7.1.

I’ve managed to isolate the crashing code. Here goes:

void create_2d_tensor(cudnnTensorDescriptor_t& desc, int w, int h)
{
	cudnnStatus_t status;
	status = cudnnCreateTensorDescriptor(&desc);
	if (status != CUDNN_STATUS_SUCCESS)
		printf("cudnnCreateTensorDescriptor fail %d\n", status);
	int dims[4], strides[4];

	// Describe a w x h buffer as an NCHW tensor: N=1, C=h, H=1, W=w
	dims[0] = 1;
	dims[1] = h;
	dims[2] = 1;
	dims[3] = w;
	strides[0] = w*h;
	strides[1] = w;
	strides[2] = w;
	strides[3] = 1;
	status = cudnnSetTensorNdDescriptor(desc, CUDNN_DATA_FLOAT, 4, dims, strides);
	if (status != CUDNN_STATUS_SUCCESS)
		printf("cudnnSetTensorNdDescriptor fail %d\n", status);
}

void crash(cudnnHandle_t handle)
{
	cudaError_t err;
	cudnnStatus_t status;

	int wIn = 100000000;

	cudnnTensorDescriptor_t y0, y1;
	cudnnFilterDescriptor_t fdesc;
	cudnnConvolutionDescriptor_t cdesc;

	create_2d_tensor(y0, wIn, 1);

	// The descriptors have to be created before they can be set
	status = cudnnCreateFilterDescriptor(&fdesc);
	status = cudnnCreateConvolutionDescriptor(&cdesc);

	// 4 output features, 1 input channel, 1x8 kernel
	status = cudnnSetFilter4dDescriptor(fdesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 4, 1, 1, 8);
	// No padding, stride 1x2, dilation 1x1
	status = cudnnSetConvolution2dDescriptor(cdesc, 0, 0, 1, 2, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

	int _n, _c, _h, _w;
	cudnnGetConvolution2dForwardOutputDim(cdesc, y0, fdesc, &_n, &_c, &_h, &_w);
	printf("CUDNN reports output %d x %d x %d x %d\n", _n, _c, _h, _w);

	int wOut = _w;

	create_2d_tensor(y1, wOut, 4);

	float *pdx = 0, *py = 0;

	size_t alloc_size0 = (size_t)wIn * sizeof(float);
	size_t alloc_size1 = (size_t)wOut * 4 * sizeof(float);

	err = cudaMalloc(&py, alloc_size0);
	cudaMemset(py, 0, alloc_size0);

	err = cudaMalloc(&pdx, alloc_size1);
	cudaMemset(pdx, 0, alloc_size1);

	float* pw_delta = 0;
	cudaMalloc(&pw_delta, 8 * 4 * sizeof(float));
	cudaMemset(pw_delta, 0, 8 * 4 * sizeof(float));

	float blend[2] = { 1.0f, 0.0f }; // alpha = 1, beta = 0
	status = cudnnConvolutionBackwardFilter(handle, blend,
		y0, py,
		y1, pdx,
		cdesc, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3, 0, 0, // no workspace
		blend + 1,
		fdesc, pw_delta);
	printf("CUDNN status %d\n", status);
	err = cudaGetLastError();
	printf("CUDA error %d\n", err);
}

This is a convolution with kernel width 8, stride 2, and 4 output features. The crash seems to happen only if wIn is sufficiently large (the threshold on my system is somewhere between 50M and 100M elements), but it is related neither to available graphics memory (at 100M elements, I only allocate about 1.2 GB) nor to TDR (with 50M elements, the code executes in 40 milliseconds).

I tried the algorithm ALGO_0 and it crashes too.

Actually it’s weirder than that.

I tried to switch to stride 1, guessing that it might be a poorly covered corner case. Instead of helping, it made things worse: the code started crashing at an even lower input size.

Then I noticed something odd. Up to a certain input size, backprop works just fine, executing in about 40 milliseconds. Past that critical value (which looks to be 33553928, almost, but not exactly, 2^25), it abruptly becomes almost 100 times slower: from 40 milliseconds all the way up to 4 seconds, while still returning correct data! So the crash may be indirectly caused by TDR after all, but only because something goes wrong internally and backprop becomes extremely slow.

The profiler informs me that, in both cases, the kernel that does the job is cudnn::detail::wgrad_alg0_engine<float,int=512,int=6,int=5,int=3,int=3,int=3,bool=1,int=512>. However, below the critical value, it is executed with block size [8,8,1] and grid size [1,1,65535]. And, above the critical value, that becomes [8,8,1] and [1,1,1].

It is clearly a CuDNN bug.