PyTorch JIT also claims to optimize CUDA kernels by fusing smaller ones into larger ones. Has anybody done a comparison of the throughput gains from PyTorch → JIT versus PyTorch → TensorRT?
I have the same question.
Does anyone know?
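Not a full answer, but here is a minimal sketch of the eager-vs-JIT half of the comparison. The model, sizes, and timing loop are illustrative assumptions, not numbers from anywhere; on CPU you mostly measure scripting overhead, and the kernel-fusion benefit the JIT advertises only shows up for chains of pointwise ops on CUDA.

```python
# Micro-benchmark sketch: eager PyTorch vs. torch.jit.script on the same model.
# SmallMLP and all sizes are hypothetical, chosen only for illustration.
import time
import torch

class SmallMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(256, 256)
        self.fc2 = torch.nn.Linear(256, 256)

    def forward(self, x):
        # Chains of pointwise ops (relu, add) are the pattern the JIT fuser
        # can merge into a single CUDA kernel.
        return torch.relu(self.fc2(torch.relu(self.fc1(x)))) + x

def bench(model, x, iters=100):
    """Average seconds per forward pass after a warm-up phase."""
    with torch.no_grad():
        for _ in range(10):  # warm-up; lets the JIT profile and fuse
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        return (time.perf_counter() - start) / iters

eager = SmallMLP().eval()
scripted = torch.jit.script(eager)  # TorchScript version of the same model
x = torch.randn(64, 256)

t_eager = bench(eager, x)
t_jit = bench(scripted, x)
print(f"eager: {t_eager * 1e6:.1f} us/iter, scripted: {t_jit * 1e6:.1f} us/iter")
```

For the TensorRT side of the comparison you would export the same model (e.g. via ONNX or Torch-TensorRT) and time it with the same warm-up-then-measure loop, so the two speedups are measured identically.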