Choosing a Convolution Algorithm in cuDNN v2

cuDNN version 2 offers several options for the algorithm used by the forward convolution function; these are described in the cudnnConvolutionFwdAlgo_t enum in cudnn.h. All of these options are available to the user via the same cudnnConvolutionForward interface, which has been updated to take an additional parameter selecting the algorithm.
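For reference, the algorithm choices in cuDNN v2 look approximately like this (paraphrased; the declaration in cudnn.h is authoritative):

    typedef enum
    {
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM         = 0,  /* no workspace needed */
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM = 1,  /* small workspace of precomputed indices */
        CUDNN_CONVOLUTION_FWD_ALGO_GEMM                  = 2,  /* explicit GEMM; workspace holds the lowered input */
        CUDNN_CONVOLUTION_FWD_ALGO_DIRECT                = 3   /* placeholder; not implemented in v2 */
    } cudnnConvolutionFwdAlgo_t;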
Some of these algorithms require an extra “workspace” buffer, which must be allocated by the user and passed to the cudnnConvolutionForward call as an input argument. No restrictions are placed on the contents of the workspace buffer, and it may be passed uninitialized to cuDNN functions.
The size of the necessary workspace depends on both the choice of algorithm and the problem dimensions, and can be determined via the cudnnGetConvolutionForwardWorkspaceSize function. Please note that a size returned by this function is no guarantee that the buffer can actually be allocated at run time; the allocation remains subject to the failure cases of any call to cudaMalloc. Passing a workspace smaller than this value will result in a run-time error code from cudnnConvolutionForward; passing a larger one works just fine.
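The v2 query has roughly this shape (again paraphrased; see cudnn.h for the exact declaration):

    cudnnStatus_t cudnnGetConvolutionForwardWorkspaceSize(
        cudnnHandle_t                       handle,
        const cudnnTensorDescriptor_t       srcDesc,
        const cudnnFilterDescriptor_t       filterDesc,
        const cudnnConvolutionDescriptor_t  convDesc,
        const cudnnTensorDescriptor_t       destDesc,
        cudnnConvolutionFwdAlgo_t           algo,        /* the algorithm you intend to use */
        size_t                             *sizeInBytes  /* minimum workspace required */);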
The most accurate way to determine which algorithm offers the best performance for a given problem size is to try all the choices individually and compare their execution times. However, cuDNN provides a function, cudnnGetConvolutionForwardAlgorithm, that suggests an algorithm based on internal heuristics. This suggestion is non-binding: cudnnConvolutionForward strictly adheres to the algorithm choice passed to it as an input argument, which may or may not be the same as the suggestion made by cudnnGetConvolutionForwardAlgorithm.
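A brute-force timing loop of this kind might look like the sketch below. It is illustrative only: it assumes the handle, descriptors, data pointers, and alpha/beta scaling factors are already set up, it times on the default stream, and a real benchmark would average several runs per algorithm after a warm-up rather than timing a single call.

    /* Try every forward algorithm and keep the fastest one that succeeds. */
    cudnnConvolutionFwdAlgo_t candidates[] = {
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
        CUDNN_CONVOLUTION_FWD_ALGO_GEMM,
        CUDNN_CONVOLUTION_FWD_ALGO_DIRECT
    };
    cudnnConvolutionFwdAlgo_t best = candidates[0];
    float bestMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 4; ++i) {
        size_t size = 0;
        if (cudnnGetConvolutionForwardWorkspaceSize(handle, srcDesc, filterDesc, convDesc,
                                                    destDesc, candidates[i],
                                                    &size) != CUDNN_STATUS_SUCCESS)
            continue;  /* algorithm not available for this configuration */

        void *ws = NULL;
        if (cudaMalloc(&ws, size) != cudaSuccess)
            continue;  /* workspace too large to allocate right now */

        cudaEventRecord(start);
        cudnnStatus_t status = cudnnConvolutionForward(handle, &alpha, srcDesc, srcData,
                                                       filterDesc, filterData, convDesc,
                                                       candidates[i], ws, size,
                                                       &beta, destDesc, destData);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (status == CUDNN_STATUS_SUCCESS && ms < bestMs) {
            bestMs = ms;
            best   = candidates[i];
        }
        cudaFree(ws);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);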
The suggestion can be tailored to memory constraints using the options in the cudnnConvolutionFwdPreference_t enum. CUDNN_CONVOLUTION_FWD_PREFER_FASTEST requests the fastest algorithm irrespective of memory overhead, while CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT lets the user specify a pre-determined memory budget to operate within. Note that the budget here refers only to true memory overhead; it does not cover the memory needed to store the inputs and output.
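Paraphrased from the v2 header, the preference enum and the suggestion function look roughly like this (note the additional CUDNN_CONVOLUTION_FWD_NO_WORKSPACE option, which restricts the suggestion to algorithms that need no workspace at all):

    typedef enum
    {
        CUDNN_CONVOLUTION_FWD_NO_WORKSPACE            = 0,  /* only zero-workspace algorithms */
        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST          = 1,  /* fastest, regardless of overhead */
        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT = 2   /* fastest within the stated budget */
    } cudnnConvolutionFwdPreference_t;

    cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
        cudnnHandle_t                       handle,
        const cudnnTensorDescriptor_t       srcDesc,
        const cudnnFilterDescriptor_t       filterDesc,
        const cudnnConvolutionDescriptor_t  convDesc,
        const cudnnTensorDescriptor_t       destDesc,
        cudnnConvolutionFwdPreference_t     preference,
        size_t                              memoryLimitInbytes,  /* used only with SPECIFY_WORKSPACE_LIMIT */
        cudnnConvolutionFwdAlgo_t          *algo);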
We envision three broad use-cases for this API design; illustrative sketches of each appear after the list below.

  1. The user has no restrictions on memory overhead other than the limits of run-time memory allocation. In this case, the code may involve cuDNN calls in this order:
    a. Determine algorithm choice using cudnnGetConvolutionForwardAlgorithm (… , CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, &algo, …)
    b. Determine necessary workspace size, cudnnGetConvolutionForwardWorkspaceSize(… , algo, &workspaceSize,…)
    c. Allocate workspace, cudaStat = cudaMalloc(&workspace, workspaceSize)
    d. Check for errors – if (cudaStat != cudaSuccess)
    e. Call the convolution function, cudnnConvolutionForward ( … , algo, workspace, workspaceSize, …)
  2. The user is operating within a pre-determined memory overhead limit. The code here may look more like:
    a. Determine and safely allocate workspaceLimit bytes up front.
    b. Determine algorithm choice using cudnnGetConvolutionForwardAlgorithm (… , CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, workspaceLimit, &algo, …)
    c. Call the convolution function, cudnnConvolutionForward ( … , algo, workspace, workspaceLimit, …)
  3. The user has made all memory utilization and algorithm choices in advance. This code would directly call cudnnConvolutionForward ( … , algo, workspace, workspaceSize, …) with the workspace safely pre-allocated and algo determined statically ahead of time.
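The sketches below flesh out these three patterns. They are illustrative rather than authoritative: they assume a cudnnHandle_t handle, descriptors srcDesc, filterDesc, convDesc, and destDesc, and device pointers srcData, filterData, and destData have already been created and populated; alpha and beta are the usual scaling factors (1 and 0 for a plain convolution); and the call signatures are paraphrased from the v2 header. Error checking on the cuDNN calls is elided for brevity.

Use-case 1, the fastest algorithm with an unrestricted workspace:

    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, srcDesc, filterDesc, convDesc, destDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        0 /* memory limit; ignored for PREFER_FASTEST */,
                                        &algo);

    size_t workspaceSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, srcDesc, filterDesc, convDesc, destDesc,
                                            algo, &workspaceSize);

    void *workspace = NULL;
    cudaError_t cudaStat = cudaMalloc(&workspace, workspaceSize);
    if (cudaStat != cudaSuccess) {
        /* the queried size could not be allocated at run time; fall back or report */
    }

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, srcDesc, srcData, filterDesc, filterData,
                            convDesc, algo, workspace, workspaceSize,
                            &beta, destDesc, destData);

Use-case 2, the fastest algorithm within a fixed memory budget:

    const size_t workspaceLimit = 8 * 1024 * 1024;  /* illustrative 8 MiB budget */
    void *workspace = NULL;
    if (cudaMalloc(&workspace, workspaceLimit) != cudaSuccess) {
        /* even the budget itself could not be allocated */
    }

    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, srcDesc, filterDesc, convDesc, destDesc,
                                        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
                                        workspaceLimit, &algo);

    cudnnConvolutionForward(handle, &alpha, srcDesc, srcData, filterDesc, filterData,
                            convDesc, algo, workspace, workspaceLimit,
                            &beta, destDesc, destData);

Use-case 3, everything decided in advance: a single call, with algo hard-coded (the implicit precomputed GEMM variant here is purely an example) and workspace and workspaceSize prepared elsewhere:

    const cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    cudnnConvolutionForward(handle, &alpha, srcDesc, srcData, filterDesc, filterData,
                            convDesc, algo, workspace, workspaceSize,
                            &beta, destDesc, destData);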