NVIDIA previously hosted a three-part webinar series on CUDA and TensorRT for DRIVE AGX. Recordings are now available.
In the first webinar, we introduce CUDA cores, threads, blocks, grids, and streams, along with the TensorRT workflow. We also cover CUDA memory management and TensorRT optimization, and show how to deploy optimized deep learning networks on NVIDIA DRIVE AGX using the TensorRT samples.
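As a rough illustration of how threads, blocks, and a grid relate, the sketch below emulates a one-dimensional vector-add launch in plain C++ on the CPU. It is not taken from the webinar and is not CUDA code: LaunchConfig, make_launch_config, vector_add_thread, and emulate_vector_add are hypothetical names chosen for this example. The index arithmetic mirrors the usual CUDA pattern blockIdx.x * blockDim.x + threadIdx.x, and the nested loops stand in for threads that a GPU would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1D CUDA launch configuration.
struct LaunchConfig {
    unsigned gridDim;   // number of blocks in the grid
    unsigned blockDim;  // number of threads per block
};

// Round the block count up so that every element is covered by a thread.
LaunchConfig make_launch_config(std::size_t n, unsigned threads_per_block) {
    assert(threads_per_block > 0);
    unsigned blocks =
        static_cast<unsigned>((n + threads_per_block - 1) / threads_per_block);
    return {blocks, threads_per_block};
}

// The work a single thread would do in a vector-add kernel.
void vector_add_thread(unsigned blockIdx, unsigned threadIdx,
                       const LaunchConfig& cfg,
                       const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& out) {
    // Global index: same arithmetic as blockIdx.x * blockDim.x + threadIdx.x.
    std::size_t i = static_cast<std::size_t>(blockIdx) * cfg.blockDim + threadIdx;
    // Bounds check: the last block may contain threads past the end of the data.
    if (i < out.size()) {
        out[i] = a[i] + b[i];
    }
}

// Visit every (block, thread) pair sequentially; a GPU would run them in parallel.
std::vector<float> emulate_vector_add(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      unsigned threads_per_block) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    LaunchConfig cfg = make_launch_config(a.size(), threads_per_block);
    for (unsigned block = 0; block < cfg.gridDim; ++block) {
        for (unsigned thread = 0; thread < cfg.blockDim; ++thread) {
            vector_add_thread(block, thread, cfg, a, b, out);
        }
    }
    return out;
}
```

With five elements and four threads per block, make_launch_config yields a grid of two blocks. The second block has three idle threads, which is why the bounds check is needed.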
The second installment of this webinar series explains how to extend TensorRT with custom operations by running custom layers through the TensorRT plugin interface. For the fastest implementation, custom layers should be written as CUDA kernels so that they run on the same GPU as the optimized engine. We then cover TensorRT plugins and show, with a sample application, how to adapt a CUDA kernel as part of a TensorRT plugin for DNN model optimization.
The third installment covers concurrency. Running multiple GPU inference tasks concurrently can improve performance compared with running them serially. As a real-world use case, we implement a multi-network inference pipeline for object detection and lane segmentation. In building this application, we show how to achieve kernel concurrency using multiple CUDA streams and CUDA Graphs. We then show how to profile the application with NVIDIA Nsight Systems and measure the performance gains from concurrency.