cudnnConvolutionBackwardFilter crashes the system

I’m working on a deep-learning-based speech recognition app and I’m in the process of adding CuDNN as a backend. I got as far as being able to run a full feedforward and backpropagation pass and get correct numbers, except for the last level of backpropagation, which, for some reason, crashes the system. By that I mean that no error code is reported, but half of the time I get a “driver stopped responding and had to be restarted” message, and the other half the screen goes dark and the system (Windows) has to be rebooted. I’m on an NVIDIA GeForce GTX 980 Ti with the latest drivers (398.36), CUDA 9.2, CuDNN 7.1.

I’ve managed to isolate the crashing code. Here goes:

void create_2d_tensor(cudnnTensorDescriptor_t& desc, int w, int h)
{
	cudnnStatus_t status;
	status = cudnnCreateTensorDescriptor(&desc);
	if (status != CUDNN_STATUS_SUCCESS)
		printf("cudnnCreateTensorDescriptor fail %d\n", status);
	int dims[4], strides[4];

	// Describe a w x h buffer as an NCHW tensor: N=1, C=h, H=1, W=w
	dims[0] = 1;
	dims[1] = h;
	dims[2] = 1;
	dims[3] = w;
	strides[0] = w*h;
	strides[1] = w;
	strides[2] = w;
	strides[3] = 1;
	status = cudnnSetTensorNdDescriptor(desc, CUDNN_DATA_FLOAT, 4, dims, strides);
	if (status != CUDNN_STATUS_SUCCESS)
		printf("cudnnSetTensorNdDescriptor fail %d\n", status);
}

void crash(cudnnHandle_t handle)
{
	cudaError_t err;
	cudnnStatus_t status;

	int wIn = 100000000;

	cudnnTensorDescriptor_t y0, y1;
	cudnnFilterDescriptor_t fdesc;
	cudnnConvolutionDescriptor_t cdesc;

	create_2d_tensor(y0, wIn, 1);

	// The descriptors have to be created before they can be set
	status = cudnnCreateFilterDescriptor(&fdesc);
	status = cudnnCreateConvolutionDescriptor(&cdesc);

	// 4 output features, 1 input channel, 1x8 kernel
	status = cudnnSetFilter4dDescriptor(fdesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 4, 1, 1, 8);
	// No padding, stride 1x2, dilation 1x1
	status = cudnnSetConvolution2dDescriptor(cdesc, 0, 0, 1, 2, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

	int _n, _c, _h, _w;
	cudnnGetConvolution2dForwardOutputDim(cdesc, y0, fdesc, &_n, &_c, &_h, &_w);
	printf("CUDNN reports output %d x %d x %d x %d\n", _n, _c, _h, _w);

	int wOut = _w;

	create_2d_tensor(y1, wOut, 4);

	float *pdx = 0, *py = 0;

	size_t alloc_size0 = (size_t)wIn * sizeof(float);
	size_t alloc_size1 = (size_t)wOut * 4 * sizeof(float);

	err = cudaMalloc(&py, alloc_size0);
	cudaMemset(py, 0, alloc_size0);

	err = cudaMalloc(&pdx, alloc_size1);
	cudaMemset(pdx, 0, alloc_size1);

	float* pw_delta = 0;
	cudaMalloc(&pw_delta, 8 * 4 * sizeof(float));
	cudaMemset(pw_delta, 0, 8 * 4 * sizeof(float));

	float blend[2] = { 1.0f, 0.0f }; // alpha = 1, beta = 0
	status = cudnnConvolutionBackwardFilter(handle, blend,
		y0, py,
		y1, pdx,
		cdesc, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3, 0, 0, // no workspace
		blend + 1,
		fdesc, pw_delta);
	printf("CUDNN status %d\n", status);
	err = cudaGetLastError();
	printf("CUDA error %d\n", err);
}

This is a convolution with kernel width 8, stride 2, and 4 output features. The crash seems to happen only if wIn is sufficiently large (the threshold on my system is somewhere between 50M and 100M elements), but it is related neither to available graphics memory (at 100M elements, I only allocate about 1.2 GB) nor to TDR (with 50M elements, the code executes in 40 milliseconds).

I tried the algorithm ALGO_0 and it crashes too.

Actually it’s weirder than that.

I tried to switch to stride 1, guessing that it might be a poorly covered corner case. Instead of helping, it made things worse: the code started crashing at an even lower input size.

Then I noticed something odd. Up to a certain input size, backprop works just fine, executing in about 40 milliseconds. Past that critical value (which looks to be 33553928, almost, but not exactly, 2^25), it abruptly becomes almost 100 times slower: from 40 milliseconds all the way up to 4 seconds, while still returning correct data! So the crash may be indirectly caused by TDR after all, but only because something goes wrong internally and backprop becomes extremely slow.

The profiler informs me that, in both cases, the kernel that does the job is cudnn::detail::wgrad_alg0_engine<float,int=512,int=6,int=5,int=3,int=3,int=3,bool=1,int=512>. However, below the critical value, it is executed with block size [8,8,1] and grid size [1,1,65535]. And, above the critical value, that becomes [8,8,1] and [1,1,1].

It is clearly a CuDNN bug.