I want to improve memory access efficiency when training a network. Is there an API that supports operator fusion to reduce data transfer?
There is limited fused-ops support in cuDNN 7.6.x.
Please refer to the link below for more details:
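For anyone reading along who wants to see why fusion reduces data transfer, here is a minimal CPU-side sketch in plain C++. It is not the cuDNN fused-ops API; the function names `bias_relu_unfused` and `bias_relu_fused` are made up for illustration, and a bias add followed by ReLU is just an assumed example of two ops worth fusing.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused (hypothetical helper): two separate passes over the data.
// The intermediate result is written to memory after the bias add
// and read back again for the ReLU.
std::vector<float> bias_relu_unfused(const std::vector<float>& x, float bias) {
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        tmp[i] = x[i] + bias;             // pass 1: bias add
    }
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::max(tmp[i], 0.0f);    // pass 2: ReLU
    }
    return y;
}

// Fused (hypothetical helper): one pass, no intermediate buffer.
// Each element is read once and written once.
std::vector<float> bias_relu_fused(const std::vector<float>& x, float bias) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::max(x[i] + bias, 0.0f);
    }
    return y;
}
```

Both functions return the same values; the fused one just skips the extra write and read of `tmp`. On a GPU the same idea applies to global memory traffic between kernels, which is roughly what a fused op is meant to save.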
Hi, thanks for your answer. I have figured it out.
One more question: do you have a code example for a nonlinear network (e.g. ResNet, GoogLeNet) written in C++ or CUDA from scratch? I am looking into how to implement nonlinear blocks. Thanks.