Kernel launched from a TensorRT plugin is very slow

Description

I found that a kernel launched from a TensorRT plugin runs very slowly.
To isolate the problem, I tested the kernel launch with and without TensorRT (as a plugin), using a simple matrix-multiplication kernel (AB = C) with the same grid/block dimensions.

I also found that the for-loop in the kernel is where the performance drop occurs.
Is this a bug?

__global__
void matmulNaive(float const* A, float const* B, float* C, int const m, int const k, int const n)
{
    unsigned int row = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int col = blockDim.x * blockIdx.x + threadIdx.x;

    float sum = 0.f;
    if (row < m && col < n) {
        for (int i = 0; i < k; i++) {
            sum += A[k * row + i] * B[n * i + col];
        }
        C[n * row + col] = sum;
    }
}
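
For the standalone measurement, the host-side launch looks roughly like this (a sketch only; the actual code is in the attached matmul.cu, and data initialization/verification is omitted):

#include <cuda_runtime.h>

int main()
{
    int const m = 1024, k = 1024, n = 1024;

    // Device buffers for A (m x k), B (k x n), C (m x n).
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);
    // (fill dA and dB with input data here)

    dim3 const block(32, 32);
    dim3 const grid((n + block.x - 1) / block.x,
                    (m + block.y - 1) / block.y);  // (32, 32) for 1024x1024

    matmulNaive<<<grid, block>>>(dA, dB, dC, m, k, n);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}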

Test conditions are below:

  • matrix A: (1024, 1024)
  • matrix B: (1024, 1024)
  • Block Dimensions: (32, 32)
  • Grid Dimensions: (32, 32)
  • In TensorRT, the network consists of only the matmul plugin (two inputs for matrices A and B, and one plugin layer for the matrix multiplication; see the sketch below)
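
A sketch of how such a single-plugin network could be assembled (assumption only; the attached files contain the real code, and "network"/"matmulPlugin" are illustrative names for an INetworkDefinition* and a plugin instance):

// Two explicit network inputs feed a single plugin layer.
nvinfer1::ITensor* a = network->addInput("A", nvinfer1::DataType::kFLOAT,
                                         nvinfer1::Dims2{1024, 1024});
nvinfer1::ITensor* b = network->addInput("B", nvinfer1::DataType::kFLOAT,
                                         nvinfer1::Dims2{1024, 1024});
nvinfer1::ITensor* pluginInputs[] = {a, b};
nvinfer1::IPluginV2Layer* layer = network->addPluginV2(pluginInputs, 2, matmulPlugin);
network->markOutput(*layer->getOutput(0));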

Profiling results are below:

  • explicit kernel launch elapsed time: 1.359 ms
  • kernel launch through TensorRT plugin: 13.668 ms
  1. explicit kernel launch
    matmul.cu (5.0 KB)

[Nsight Systems screenshot; kernel details transcribed below]

matmulNaive(const float *, const float *, float *, int, int, int)
Begins: 3.5009s
Ends: 3.50226s (+1.359 ms)
grid:  <<<32, 32, 1>>>
block: <<<32, 32, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 36
Local Memory Per Thread: 0 bytes
Local Memory Total: 60,162,048 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 66.6667 %
Launched from thread: 663353
Latency: 143.964 μs
Correlation ID: 120
Stream: Stream 13
  2. kernel launch through TensorRT plugin
    matmul_dynamic.h (3.0 KB)
    matmul_dynamic.cu (7.4 KB)

[Nsight Systems screenshot; kernel details transcribed below]

matmulNaive(const float *, const float *, float *, int, int, int)
Begins: 1.24576s
Ends: 1.25943s (+13.668 ms)
grid:  <<<32, 32, 1>>>
block: <<<32, 32, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 20
Local Memory Per Thread: 0 bytes
Local Memory Total: 115,867,648 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 66.6667 %
Launched from thread: 644714
Latency: 260.701 μs
Correlation ID: 3005
Stream: Stream 48
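
For context, a minimal sketch of the plugin-side launch, assuming the attached matmul_dynamic.cu follows the usual IPluginV2DynamicExt pattern ("MatmulPlugin" is an assumed class name):

#include <NvInfer.h>
#include <cuda_runtime.h>

int32_t MatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc,
                              nvinfer1::PluginTensorDesc const* /*outputDesc*/,
                              void const* const* inputs, void* const* outputs,
                              void* /*workspace*/, cudaStream_t stream) noexcept
{
    int const m = inputDesc[0].dims.d[0];  // rows of A
    int const k = inputDesc[0].dims.d[1];  // cols of A / rows of B
    int const n = inputDesc[1].dims.d[1];  // cols of B

    dim3 const block(32, 32);
    dim3 const grid((n + block.x - 1) / block.x,
                    (m + block.y - 1) / block.y);  // (32, 32) for 1024x1024

    // Launch on the stream TensorRT provides, not the default stream.
    matmulNaive<<<grid, block, 0, stream>>>(
        static_cast<float const*>(inputs[0]),
        static_cast<float const*>(inputs[1]),
        static_cast<float*>(outputs[0]), m, k, n);

    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}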

Environment

TensorRT Version: 8.4.1.5 (I also tested all newer versions, with the same result: 8.5.1.7, 8.5.2.2, 8.5.3.1, 8.6.0.12)
GPU Type: RTX 3080
Nvidia Driver Version: 525.85.12
CUDA Version: CUDA 11.8
CUDNN Version: cuDNN 8.6.0
Operating System + Version: Ubuntu 20.04.5 LTS
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

matmul.cu (5.0 KB)
matmul_dynamic.h (3.0 KB)
matmul_dynamic.cu (7.4 KB)

Steps To Reproduce

Refer to the attached files.

Hi,

We request you to share the model, script, profiler, and performance output, if not shared already, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
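
For example (an illustrative invocation; the model and plugin library names are placeholders):

trtexec --onnx=model.onnx --plugins=libmatmul_plugin.so --dumpProfile --separateProfileRun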

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.

Thanks!

matmul-kernel-test.nsys-rep (351.4 KB)
matmul-plugin-test.nsys-rep (406.4 KB)

I've attached the profiling output files (Nsight Systems version 2022.4.2).

This problem was caused by incorrect usage of my CMake script. (resolved)
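
For anyone who finds this later: the exact CMake mistake isn't detailed above, but one common misuse that produces this symptom is compiling the plugin's CUDA code without optimizations (e.g., an unset CMAKE_BUILD_TYPE, or nvcc's -G device-debug flag); the differing registers-per-thread in the two profiles (36 vs. 20) is also consistent with different compile flags. A hypothetical sketch of such a fix in the plugin's CMakeLists.txt:

# Hypothetical: make sure the plugin's CUDA code is built optimized.
set(CMAKE_BUILD_TYPE Release)   # or pass -DCMAKE_BUILD_TYPE=Release
# and ensure no device-debug flags (-G, -g) end up in CMAKE_CUDA_FLAGS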

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.