My aim is to run multiple cuDNN ConvForward calls in parallel. Each call works on independent data, and creating a separate cuDNN handle for each call is not a problem.
So far I have wrapped all of the cuDNN calls in a class. How do I run each object of that class on a different CUDA stream? Or is there a smarter way to parallelize ConvForward than wrapping it in a class, creating objects of that class, and assigning each one to its own stream?
Any discussion leading to a solution is much appreciated.