Hello,
I am currently using CUDA 10.2 to optimise a closed-loop deep learning library. While doing this, I have noticed that launching a new kernel for every layer adds noticeable launch overhead.
I am thinking of using a CUDA graph to reduce the per-layer launch overhead, but I cannot find any working examples.
Does anyone have a working example of a graph implementation on CUDA 10.2?
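For context, this is roughly the stream-capture pattern I have pieced together from the CUDA runtime API documentation. It is only a minimal sketch under assumptions: `scaleLayer` is a hypothetical stand-in for my real layer kernels, the sizes and layer count are arbitrary, and I do not know whether this is the recommended way to structure it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns 1 from the enclosing function if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(err_));                    \
            return 1;                                                  \
        }                                                              \
    } while (0)

// Hypothetical placeholder for one network layer: scales every element.
__global__ void scaleLayer(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Captures numLayers kernel launches into one graph and replays it twice.
// Returns 0 if the result matches the expected value, 1 otherwise.
int runGraphDemo() {
    const int n = 1 << 16;
    const int numLayers = 4;   // arbitrary, for illustration only
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Capture needs a non-default stream.
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // Record the per-layer launches instead of executing them.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int layer = 0; layer < numLayers; ++layer) {
        scaleLayer<<<blocks, threads, 0, stream>>>(dev, n, 2.0f);
    }
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Five-argument form as declared in the CUDA 10.x runtime headers.
    cudaGraphExec_t graphExec;
    CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));

    // Each cudaGraphLaunch submits all captured layers with one call.
    for (int iter = 0; iter < 2; ++iter) {
        CUDA_CHECK(cudaGraphLaunch(graphExec, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaMemcpy(host.data(), dev, n * sizeof(float),
                          cudaMemcpyDeviceToHost));

    // 2 replays x 4 layers, each doubling the value: 1.0 * 2^8 = 256.0
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 256.0f) ++mismatches;
    }

    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(dev));
    return mismatches == 0 ? 0 : 1;
}

int main() {
    int status = runGraphDemo();
    std::printf(status == 0 ? "graph replay OK\n" : "graph replay FAILED\n");
    return status;
}
```

The idea is that the graph is instantiated once and then relaunched every loop iteration, so the per-layer launch cost should be paid only at capture time. In my real code the kernel arguments change between iterations, and I am unsure how best to handle that.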
Thanks in advance.
Luca