cuDNN calls with CUDA streams

My aim is to run multiple cuDNN ConvForward calls in parallel. Each call works on independent data, and creating separate cuDNN handles is not a problem.

So far I have wrapped all of the cuDNN calls in a class. How do I associate a specific class object with a different CUDA stream? Or is there a smarter way to parallelize ConvForward than wrapping it in a class, creating one object per call, and passing each object to its own stream?
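For reference, here is a minimal sketch of the stream-per-call pattern I have in mind. It assumes the tensor/filter/convolution descriptors, device buffers, algorithm choice, and workspace have already been set up elsewhere (the `ConvTask` struct below is just a hypothetical container for that per-object state, not a cuDNN type):

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>

// Hypothetical bundle of per-call state; all fields are assumed to be
// fully initialized before launchParallelConvs() is called.
struct ConvTask {
    cudnnTensorDescriptor_t      xDesc, yDesc;
    cudnnFilterDescriptor_t      wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnConvolutionFwdAlgo_t    algo;
    const float *x, *w;          // device pointers
    float       *y;              // device pointer
    void        *workspace;      // device workspace for the chosen algo
    size_t       workspaceBytes;
};

void launchParallelConvs(ConvTask *tasks, int n) {
    cudaStream_t  *streams = new cudaStream_t[n];
    cudnnHandle_t *handles = new cudnnHandle_t[n];

    for (int i = 0; i < n; ++i) {
        cudaStreamCreate(&streams[i]);
        cudnnCreate(&handles[i]);
        // Bind each handle to its own stream; subsequent cuDNN calls on
        // this handle are enqueued on that stream.
        cudnnSetStream(handles[i], streams[i]);
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < n; ++i) {
        // Each call returns once the work is enqueued; the GPU may
        // overlap the streams if resources allow.
        cudnnConvolutionForward(handles[i], &alpha,
                                tasks[i].xDesc, tasks[i].x,
                                tasks[i].wDesc, tasks[i].w,
                                tasks[i].convDesc, tasks[i].algo,
                                tasks[i].workspace, tasks[i].workspaceBytes,
                                &beta, tasks[i].yDesc, tasks[i].y);
    }

    for (int i = 0; i < n; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudnnDestroy(handles[i]);
        cudaStreamDestroy(streams[i]);
    }
    delete[] streams;
    delete[] handles;
}
```

Note that whether the convolutions actually overlap depends on the GPU: if a single ConvForward already saturates the SMs, the streams will effectively serialize.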

Any discussion leading to a solution would be much appreciated.

Update: The cuDNN documentation states that cuDNN function calls can be parallelized using CUDA streams and also by using multiple host threads. Is there any documentation that explains this in more detail?
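For the multi-host-thread variant, my understanding (based on cuDNN being thread-safe as long as threads do not share a handle concurrently) is that each thread should own its own handle and stream. A minimal sketch, with descriptor/buffer setup and the actual `cudnnConvolutionForward` arguments elided since they depend on your class:

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Each worker owns its handle and stream, so no handle is ever shared
// between threads.
void worker(int id /* plus this object's descriptors and buffers */) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudnnSetStream(handle, stream);

    // ... call cudnnConvolutionForward(handle, ...) with this object's
    //     descriptors, data pointers, algorithm, and workspace ...

    cudaStreamSynchronize(stream);
    cudnnDestroy(handle);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto &t : threads)
        t.join();
    return 0;
}
```

The same caveat as with streams applies: host threads only help overlap launch overhead and independent GPU work; they do not make a single large convolution run faster.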