I have an observation about cudnnGetConvolutionForwardAlgorithm and would like to suggest a change to it.
cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
cudnnHandle_t handle, //input
const cudnnTensorDescriptor_t xDesc, //input
const cudnnFilterDescriptor_t wDesc, //input
const cudnnConvolutionDescriptor_t convDesc, //input
const cudnnTensorDescriptor_t yDesc, //input
cudnnConvolutionFwdPreference_t preference, //input (why do we need this?)
size_t memoryLimitInBytes, //input
cudnnConvolutionFwdAlgo_t *algo) //output
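For context, here is roughly how the current API gets used (a sketch only; the handle and descriptors are assumed to be created and configured elsewhere, error checking omitted, and the 64 MB cap is just an example value). The point is that the caller has to pass both the preference enum and a byte limit, even though the limit is only honored for one of the enum values:

#include <cudnn.h>

cudnnConvolutionFwdAlgo_t pick_fwd_algo(cudnnHandle_t handle,
                                        cudnnTensorDescriptor_t xDesc,
                                        cudnnFilterDescriptor_t wDesc,
                                        cudnnConvolutionDescriptor_t convDesc,
                                        cudnnTensorDescriptor_t yDesc)
{
    cudnnConvolutionFwdAlgo_t algo;
    size_t memoryLimitInBytes = 64 * 1024 * 1024; // example cap of 64 MB
    // memoryLimitInBytes only matters with CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT;
    // with CUDNN_CONVOLUTION_FWD_NO_WORKSPACE or CUDNN_CONVOLUTION_FWD_PREFER_FASTEST
    // it is ignored, which is what makes the extra preference parameter feel redundant.
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
                                        memoryLimitInBytes, &algo);
    return algo;
}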
I propose something different, more like this:
cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
cudnnHandle_t handle, //input
const cudnnTensorDescriptor_t xDesc, //input
const cudnnFilterDescriptor_t wDesc, //input
const cudnnConvolutionDescriptor_t convDesc, //input
const cudnnTensorDescriptor_t yDesc, //input
size_t *memoryinbytes, //input/output
cudnnConvolutionFwdAlgo_t *algo) //output
It is pretty much the same as before, but memoryinbytes does the work of both old parameters. If you initialize the value it points to with zero, the function knows you prefer no workspace. If you initialize it with something greater than zero, it finds the best algorithm that fits within that amount, and it can then overwrite the value with how much workspace that algorithm actually needs. If you want the fastest algorithm with no memory constraint at all, just pass NULL (with no pointer to write back into, the required workspace would be queried afterwards with cudnnGetConvolutionForwardWorkspaceSize, as it is today).
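To make the calling convention concrete, here is a hypothetical sketch of the three cases described above. Nothing here exists in cuDNN; the prototype is declared locally with a _proposed suffix purely so it does not clash with the real declaration in cudnn.h, and the handle and descriptors are assumed to be set up elsewhere.

#include <stddef.h>
#include <cudnn.h>

// Hypothetical prototype for the proposal above (not part of cuDNN; the
// _proposed suffix only avoids a clash with the existing declaration).
cudnnStatus_t cudnnGetConvolutionForwardAlgorithm_proposed(
    cudnnHandle_t handle,
    const cudnnTensorDescriptor_t xDesc,
    const cudnnFilterDescriptor_t wDesc,
    const cudnnConvolutionDescriptor_t convDesc,
    const cudnnTensorDescriptor_t yDesc,
    size_t *memoryinbytes,            // input/output
    cudnnConvolutionFwdAlgo_t *algo); // output

void pick_fwd_algo_examples(cudnnHandle_t handle,
                            cudnnTensorDescriptor_t xDesc,
                            cudnnFilterDescriptor_t wDesc,
                            cudnnConvolutionDescriptor_t convDesc,
                            cudnnTensorDescriptor_t yDesc)
{
    cudnnConvolutionFwdAlgo_t algo;
    size_t mem;

    // Case 1: initialized to zero -> "I prefer no workspace at all".
    mem = 0;
    cudnnGetConvolutionForwardAlgorithm_proposed(handle, xDesc, wDesc, convDesc, yDesc,
                                                 &mem, &algo);

    // Case 2: initialized to a limit -> best algorithm that fits in 64 MB;
    // on return, mem holds the workspace that algorithm actually needs.
    mem = 64 * 1024 * 1024;
    cudnnGetConvolutionForwardAlgorithm_proposed(handle, xDesc, wDesc, convDesc, yDesc,
                                                 &mem, &algo);

    // Case 3: NULL -> fastest algorithm, no memory constraint.
    cudnnGetConvolutionForwardAlgorithm_proposed(handle, xDesc, wDesc, convDesc, yDesc,
                                                 NULL, &algo);
}

The single in/out pointer covers everything the preference enum plus memoryLimitInBytes covers today, so the three "preference" modes fall out of one argument instead of two.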