When creating cuDNN backend descriptors, one has to pass concrete values for parameters like the blending constants ("alpha" and "beta"). This means that if only those blending constants change, one has to rebuild the whole operation graph. In my case, I implement SGD by setting the "alpha" blending constant of the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR to minus the learning rate, which applies the gradient update directly to the kernel weights. However, the learning rate does not stay constant, so even though all the operations in my cuDNN backend graph stay the same, I cannot reuse the graph.
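For context, this is roughly how the blending constant gets baked in today. This is only a hedged fragment, not a complete program: `op` is assumed to be a backward-filter operation descriptor created earlier, and the attribute/type names are the ones the cuDNN 8 backend API defines for this operation:

```cpp
// Fragment: `op` is assumed to be an already-created
// CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR.
double alpha = -learning_rate;  // copied by value at graph-build time
double beta  = 1.0;             // accumulate into the existing weights
cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
                         CUDNN_TYPE_DOUBLE, 1, &alpha);
cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
                         CUDNN_TYPE_DOUBLE, 1, &beta);
// Once the enclosing operation graph is finalized, changing `alpha`
// on the host has no effect: the value was captured, not the address.
```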
Am I missing a way to implement this without an extra operation in the current cuDNN 8 API? If not, perhaps the following functionality could be added:
I would suggest adding a way to pass pointers to device memory for the blending constants, which are only dereferenced in the actual kernel (so the kernel receives the pointers as arguments, not the values). That way, one could simply reuse the cuDNN backend graph and just change the value behind the pointer after each training step.
The lack of cuDNN support for indirection pointers causes other problems, too. For example, cuDNN 8 is now said to be fully compatible with the CUDA graph API. One might use this when building an inference graph that mixes operations not all provided by cuDNN: record cuDNN calls for the convolutional layers and, say, cuBLAS calls for the fully connected layers. However, whenever cuDNN calls are captured into a CUDA graph, all parameters are captured as constants (not only with the backend API but also with the frontend API). Thus, when the first layer of an inference graph is a cuDNN call, it is impossible to change the input memory pointer afterwards, so one cannot load multiple input batches onto the GPU in parallel to hide the memory transfer latency: every batch would have to use the same memory. This problem arises whenever a convolution is the first layer, which is quite common for CNNs. CUDA normally offers cudaGraphExecKernelNodeSetParams to change only the kernel parameters of an instantiated graph, but users have no access to which kernels cuDNN actually launches; the kernel nodes are "black boxes" to the user.
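A hedged fragment of what such a capture looks like (descriptors, algorithm choice, and allocations are assumed to exist already; this is the standard stream-capture pattern, not a complete program):

```cpp
// Fragment: capturing a frontend cuDNN call into a CUDA graph.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudnnConvolutionForward(handle, &alpha, xDesc, d_x, wDesc, d_w, convDesc,
                        algo, workspace, workspaceSize, &beta, yDesc, d_y);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
// d_x is now frozen into graphExec: every later cudaGraphLaunch() reads
// from that address, and since the kernel nodes cuDNN recorded are opaque,
// cudaGraphExecKernelNodeSetParams() cannot be used to repoint them.
```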
The same fix would apply here: instead of requiring users to pass a concrete pointer for a tensor's memory, let them pass an indirection pointer (a pointer to that pointer) which is only dereferenced in the kernel itself. That way, one could simply change the input memory between launches, whereas currently it is impossible to use cuDNN in CUDA graphs if the input memory ever changes.
If I missed a way to implement any of this with the current API, please let me know.