Description
I found that a kernel launched from a TensorRT plugin runs much more slowly than the same kernel launched directly.
To check this clearly, I tested the kernel launch with and without the TensorRT plugin, using a simple matrix-multiplication kernel (AB = C) with identical grid/block dimensions.
I found that the for-loop inside the kernel is where the performance drops.
Is this a bug? The kernel is shown below, followed by a sketch of the host-side launch.
// Naive GEMM: each thread computes one element of C (m x n); A is (m, k), B is (k, n).
__global__
void matmulNaive(float const* A, float const* B, float* C, int const m, int const k, int const n)
{
    unsigned int row = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int col = blockDim.x * blockIdx.x + threadIdx.x;
    float sum = 0.f;
    if (row < m && col < n) {
        // Inner product over the shared dimension k.
        for (int i = 0; i < k; i++) {
            sum += A[k * row + i] * B[n * i + col];
        }
        C[n * row + col] = sum;
    }
}
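For reference, the standalone case is launched and timed roughly like the sketch below. This is only a minimal sketch of the pattern; the buffer handling and the cudaEvent-based timing here are my assumptions, and the attached matmul.cu is the authoritative version.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int const m = 1024, k = 1024, n = 1024;
    float *A, *B, *C;
    cudaMalloc((void**)&A, sizeof(float) * m * k);
    cudaMalloc((void**)&B, sizeof(float) * k * n);
    cudaMalloc((void**)&C, sizeof(float) * m * n);

    dim3 block(32, 32);                                  // (32, 32) threads per block
    dim3 grid((n + block.x - 1) / block.x,
              (m + block.y - 1) / block.y);              // (32, 32) blocks for 1024 x 1024

    // Time a single launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    matmulNaive<<<grid, block>>>(A, B, C, m, k, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("matmulNaive: %.3f ms\n", ms);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}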
Test conditions are below:
- matrix A: (1024, 1024)
- matrix B: (1024, 1024)
- Block Dimensions: (32, 32)
- Grid Dimensions: (32, 32)
- In TensorRT, the network consists of only the matmul plugin (two inputs for matrices A and B, and one plugin layer that performs the multiplication); a sketch of the plugin's enqueue() is shown after this list.
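The plugin launches the same kernel from enqueue() on the stream TensorRT passes in. The code below is only a sketch of that pattern, assuming 2-D inputs (m, k) and (k, n), that matmulNaive is visible in the same translation unit, and that matmul_dynamic.h declares a MatmulPlugin class derived from IPluginV2DynamicExt; the class name and dimension handling are illustrative, and the attached matmul_dynamic.cu is the authoritative implementation.

#include <NvInfer.h>
#include <cuda_runtime.h>
#include "matmul_dynamic.h"   // assumed to declare MatmulPlugin (IPluginV2DynamicExt)

int32_t MatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc,
                              nvinfer1::PluginTensorDesc const* outputDesc,
                              void const* const* inputs, void* const* outputs,
                              void* workspace, cudaStream_t stream) noexcept
{
    // Runtime shapes are available in inputDesc (assuming 2-D tensors here).
    int const m = inputDesc[0].dims.d[0];   // A is (m, k)
    int const k = inputDesc[0].dims.d[1];
    int const n = inputDesc[1].dims.d[1];   // B is (k, n)

    float const* A = static_cast<float const*>(inputs[0]);
    float const* B = static_cast<float const*>(inputs[1]);
    float* C = static_cast<float*>(outputs[0]);

    dim3 block(32, 32);
    dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    // Launch on the stream TensorRT provides, not the default stream.
    matmulNaive<<<grid, block, 0, stream>>>(A, B, C, m, k, n);
    return cudaGetLastError() == cudaSuccess ? 0 : -1;
}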
Profiling results are below:
- explicit kernel launch elapsed time: 1.359 ms
- kernel launch through TensorRT plugin elapsed time: 13.668 ms
- explicit kernel launch
matmul.cu (5.0 KB)
matmulNaive(const float *, const float *, float *, int, int, int)
Begins: 3.5009s
Ends: 3.50226s (+1.359 ms)
grid: <<<32, 32, 1>>>
block: <<<32, 32, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 36
Local Memory Per Thread: 0 bytes
Local Memory Total: 60,162,048 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 66.6667 %
Launched from thread: 663353
Latency: 143.964 μs
Correlation ID: 120
Stream: Stream 13
- kernel launch through TensorRT plugin
matmul_dynamic.h (3.0 KB)
matmul_dynamic.cu (7.4 KB)
matmulNaive(const float *, const float *, float *, int, int, int)
Begins: 1.24576s
Ends: 1.25943s (+13.668 ms)
grid: <<<32, 32, 1>>>
block: <<<32, 32, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 20
Local Memory Per Thread: 0 bytes
Local Memory Total: 115,867,648 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 66.6667 %
Launched from thread: 644714
Latency: 260.701 μs
Correlation ID: 3005
Stream: Stream 48
Environment
TensorRT Version: 8.4.1.5 (I also tested all newer versions 8.5.1.7, 8.5.2.2, 8.5.3.1, and 8.6.0.12; the result is the same)
GPU Type: RTX 3080
Nvidia Driver Version: 525.85.12
CUDA Version: CUDA 11.8
CUDNN Version: cuDNN 8.6.0
Operating System + Version: Ubuntu 20.04.5 LTS
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
matmul.cu (5.0 KB)
matmul_dynamic.h (3.0 KB)
matmul_dynamic.cu (7.4 KB)
Steps To Reproduce
Refer to the attached files. A rough sketch of the build and profiling commands for the standalone case is below.
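For the standalone case, something like the following builds and profiles the kernel (the exact flags are assumptions and may differ from what I used; -arch=sm_86 matches the RTX 3080). For the plugin case, matmul_dynamic.h / matmul_dynamic.cu are built into the TensorRT application as in the attached files, and the resulting binary is profiled with nsys in the same way.

nvcc -O2 -arch=sm_86 matmul.cu -o matmul
nsys profile --stats=true ./matmul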