Preparing a Faster R-CNN model trained with TAO for deployment in DeepStream

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson Orin Nano
• DeepStream Version: 6.4
• JetPack Version (valid for Jetson only): 6.0
I am following the discussion here to deploy Faster R-CNN in DeepStream.

I downloaded TensorRT and built it as follows:

git clone -b 23.08 https://github.com/nvidia/TensorRT
cd TensorRT/
git submodule update --init --recursive
export TRT_SOURCE=`pwd`
cd $TRT_SOURCE
mkdir -p build && cd build

Then

/usr/local/bin/cmake .. -DGPU_ARCHS=72  -DTRT_LIB_DIR=/usr/lib/aarch64-linux-gnu/ -DCMAKE_C_COMPILER=/usr/bin/gcc -DTRT_BIN_DIR=`pwd`/out

Is -DGPU_ARCHS=72 correct here? I ask because no architecture is listed for Orin.
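From what I can find, Orin-family GPUs report compute capability 8.7 (SM 87), while 72 corresponds to Xavier, so I suspect the configure step should instead look like this (same library and compiler paths as above; I have not yet confirmed that this value is what the build expects):

```shell
# GPU_ARCHS=87 targets Orin (compute capability 8.7); 72 is Xavier.
/usr/local/bin/cmake .. \
  -DGPU_ARCHS=87 \
  -DTRT_LIB_DIR=/usr/lib/aarch64-linux-gnu/ \
  -DCMAKE_C_COMPILER=/usr/bin/gcc \
  -DTRT_BIN_DIR=`pwd`/out
```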
When I run the command

make nvinfer_plugin -j$(nproc)

I get the errors below. How can I solve them?

[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/sortScoresPerClass.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/sortScoresPerImage.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/voxelGeneratorKernels.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/bertQKVToContextPlugin/qkvToContext.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/bertQKVToContextPlugin/zeroPadding2d.cu.o
[ 97%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormKernel.cu.o
/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=float, TPB=128, VPT=4]" at line 230
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=float]" at line 591

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=float, TPB=384, VPT=4]" at line 236
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=float]" at line 591

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=half, TPB=128, VPT=8]" at line 230
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=half]" at line 564

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=half, TPB=384, VPT=8]" at line 236
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=half]" at line 564

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=128, VPT=1]" at line 307
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=128, VPT=4]" at line 314
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=384, VPT=1]" at line 324
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=384, VPT=4]" at line 331
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=128, VPT=2]" at line 307
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=128, VPT=8]" at line 314
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=384, VPT=2]" at line 324
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=384, VPT=8]" at line 331
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu(228): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              threadData = pairSum(threadData, kvp<T>(rldval, rldval * val));
                           ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253

/usr/local/cuda-12.2/include/cub/warp/specializations/warp_reduce_shfl.cuh(360): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              output = reduction_op(input, temp);
                       ^
          detected during:
            instantiation of "_T cub::CUB_200200_720_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 388
            instantiation of "_T cub::CUB_200200_720_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int, cub::CUB_200200_720_NS::Int2Type<0>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 403
            instantiation of "void cub::CUB_200200_720_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(T &, ReductionOp, int, cub::CUB_200200_720_NS::Int2Type<STEP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum, STEP=0]" at line 449
            instantiation of "T cub::CUB_200200_720_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceImpl(cub::CUB_200200_720_NS::Int2Type<1>, T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 530
            instantiation of "T cub::CUB_200200_720_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::Reduce<ALL_LANES_VALID,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ALL_LANES_VALID=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 204 of /usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=2]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=2]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=4]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=2]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=5]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=4]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=2]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=6]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=5]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=4]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=2]" at line 121
            [ 2 instantiation contexts not shown ]
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_720_NS::KeyValuePair<float, float>, cub::CUB_200200_720_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_720_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=7]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=6]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=5]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=4]" at line 121
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_720_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            [ 3 instantiation contexts not shown ]
            instantiation of "T cub::CUB_200200_720_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_720_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_720_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_720_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_720_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

9 errors detected in the compilation of "/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu".
make[3]: *** [plugin/CMakeFiles/nvinfer_plugin.dir/build.make:2806: plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormKernel.cu.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make[2]: *** [CMakeFiles/Makefile2:330: plugin/CMakeFiles/nvinfer_plugin.dir/all] Error 2
make[2]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make[1]: *** [CMakeFiles/Makefile2:342: plugin/CMakeFiles/nvinfer_plugin.dir/rule] Error 2
make[1]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make: *** [Makefile:225: nvinfer_plugin] Error 2
atic@ubuntu:/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build$

The Orin Nano GPU architecture is 8.7, so I ran cmake again as:

/usr/local/bin/cmake .. -DGPU_ARCHS=87 -DTRT_LIB_DIR=/usr/lib/aarch64-linux-gnu/ -DCMAKE_C_COMPILER=/usr/bin/gcc -DTRT_BIN_DIR=`pwd`/out

But I still get the same errors.
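For reference, this is my understanding of how -DGPU_ARCHS relates to the Jetson compute capability (the value is just the capability with the dot dropped, so 72 targets Xavier, not Orin). The module names in the dictionary are illustrative, not an exhaustive list:

```python
# Compute capability per Jetson family (illustrative subset).
# -DGPU_ARCHS is this value with the dot removed, e.g. 8.7 -> 87.
JETSON_COMPUTE_CAPABILITY = {
    "Jetson Nano": "5.3",
    "Jetson TX2": "6.2",
    "Jetson Xavier NX / AGX Xavier": "7.2",
    "Jetson Orin Nano / Orin NX / AGX Orin": "8.7",
}

def gpu_archs(compute_capability: str) -> int:
    """Convert a compute capability string to the -DGPU_ARCHS integer."""
    return int(compute_capability.replace(".", ""))

print(gpu_archs(JETSON_COMPUTE_CAPABILITY["Jetson Orin Nano / Orin NX / AGX Orin"]))  # 87
```

So GPU_ARCHS=87 is the right value for Orin Nano, and GPU_ARCHS=72 would have built Xavier kernels.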

[ 82%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/bboxDeltas2Proposals.cu.o
[ 82%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/common.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/cropAndResizeKernel.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/decodeBBoxes.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/decodeBbox3DKernels.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/detectionForward.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/extractFgScores.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/gatherTopDetections.cu.o
[ 85%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/generateAnchors.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/gridAnchorLayer.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/lReLU.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/maskRCNNKernels.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/nmsLayer.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/normalizeLayer.cu.o
[ 88%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/permuteData.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/pillarScatterKernels.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/priorBoxLayer.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/proposalKernel.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/proposalsForward.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/regionForward.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/reorgForward.cu.o
[ 91%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/roiPooling.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/rproiInferenceFused.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/sortScoresPerClass.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/sortScoresPerImage.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/common/kernels/voxelGeneratorKernels.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/bertQKVToContextPlugin/qkvToContext.cu.o
[ 94%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/bertQKVToContextPlugin/zeroPadding2d.cu.o
[ 97%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormKernel.cu.o
/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=float, TPB=128, VPT=4]" at line 230
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=float]" at line 591

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=float, TPB=384, VPT=4]" at line 236
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=float]" at line 591

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=half, TPB=128, VPT=8]" at line 230
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=half]" at line 564

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(137): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::softmax<T,TPB,VPT>(float, const T *, T *) [with T=half, TPB=384, VPT=8]" at line 236
            instantiation of "int32_t nvinfer1::plugin::bert::computeScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const T *, T *) [with T=half]" at line 564

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=128, VPT=1]" at line 307
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=128, VPT=4]" at line 314
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=384, VPT=1]" at line 324
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=float, TPB=384, VPT=4]" at line 331
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=float]" at line 587

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=128, VPT=2]" at line 307
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=128, VPT=8]" at line 314
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=384, VPT=2]" at line 324
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/bertQKVToContextPlugin/qkvToContext.cu(55): warning #20054-D: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
      __attribute__((shared)) SMem tmp;
                                   ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::maskedSoftmax<T,TPB,VPT>(float, const T *, T *, const int *) [with T=half, TPB=384, VPT=8]" at line 331
            instantiation of "int32_t nvinfer1::plugin::bert::computeMaskedScaledSoftmax(cudaStream_t, int32_t, int32_t, int32_t, float, const int32_t *, const T *, T *) [with T=half]" at line 560

/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu(228): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              threadData = pairSum(threadData, kvp<T>(rldval, rldval * val));
                           ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253

/usr/local/cuda-12.2/include/cub/warp/specializations/warp_reduce_shfl.cuh(360): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              output = reduction_op(input, temp);
                       ^
          detected during:
            instantiation of "_T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 388
            instantiation of "_T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int, cub::CUB_200200_870_NS::Int2Type<0>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 403
            instantiation of "void cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(T &, ReductionOp, int, cub::CUB_200200_870_NS::Int2Type<STEP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum, STEP=0]" at line 449
            instantiation of "T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceImpl(cub::CUB_200200_870_NS::Int2Type<1>, T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 530
            instantiation of "T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::Reduce<ALL_LANES_VALID,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ALL_LANES_VALID=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 204 of /usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_870_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

[ 97%] Building CUDA object plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu.o
/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_870_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_870_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 233 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernel<T,TPB>(int, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 247 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNorm(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 253 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu

[... log trimmed: the same error — no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list, at block_reduce_warp_reductions.cuh(119) — repeats for SUCCESSOR_WARP=2 through SUCCESSOR_WARP=7, each with an identical instantiation trace ending in nvinfer1::plugin::bert::embSkipLayerNorm ...]

9 errors detected in the compilation of "/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormKernel.cu".
make[3]: *** [plugin/CMakeFiles/nvinfer_plugin.dir/build.make:2806: plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormKernel.cu.o] Error 1
make[3]: *** Waiting for unfinished jobs....
/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu(98): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              threadData = pairSum(threadData, kvp<T>(rldval, rldval * val));
                           ^
          detected during:
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernelHFace<T,TPB>(int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 117
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNormHFace(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 121

/usr/local/cuda-12.2/include/cub/warp/specializations/warp_reduce_shfl.cuh(360): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              output = reduction_op(input, temp);
                       ^
          detected during:
            instantiation of "_T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 388
            instantiation of "_T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(_T, ReductionOp, int, int, cub::CUB_200200_870_NS::Int2Type<0>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, _T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 403
            instantiation of "void cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceStep(T &, ReductionOp, int, cub::CUB_200200_870_NS::Int2Type<STEP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum, STEP=0]" at line 449
            instantiation of "T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::ReduceImpl(cub::CUB_200200_870_NS::Int2Type<1>, T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 530
            instantiation of "T cub::CUB_200200_870_NS::WarpReduceShfl<T, LOGICAL_WARP_THREADS, LEGACY_PTX_ARCH>::Reduce<ALL_LANES_VALID,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, LOGICAL_WARP_THREADS=32, LEGACY_PTX_ARCH=0, ALL_LANES_VALID=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 204 of /usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_870_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 103 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernelHFace<T,TPB>(int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 117 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNormHFace(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 121 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^
          detected during:
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_870_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum, SUCCESSOR_WARP=1]" at line 156
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp>(ReductionOp, T, int) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 207
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_870_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 103 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernelHFace<T,TPB>(int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 117 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNormHFace(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 121 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu

/usr/local/cuda-12.2/include/cub/block/specializations/block_reduce_warp_reductions.cuh(119): error: no instance of function template "cuda::std::__4::plus<void>::operator()" matches the argument list
            argument types are: (cub::CUB_200200_870_NS::KeyValuePair<float, float>, cub::CUB_200200_870_NS::KeyValuePair<float, float>)
            object type is: cub::CUB_200200_870_NS::Sum
              warp_aggregate = reduction_op(warp_aggregate, addend);
                               ^

[... the same error at block_reduce_warp_reductions.cuh(119) repeats six more times, once for each SUCCESSOR_WARP value from 2 through 7, each with a near-identical instantiation trace ending in embSkipLayerNormHFace in embLayerNormVarSeqlenKernelHFace.cu; full traces omitted for brevity ...]
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_870_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum, SUCCESSOR_WARP=5]" at line 121
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_870_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum, SUCCESSOR_WARP=4]" at line 121
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::ApplyWarpAggregates<FULL_TILE,ReductionOp,SUCCESSOR_WARP>(ReductionOp, T, int, cub::CUB_200200_870_NS::Int2Type<SUCCESSOR_WARP>) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum, SUCCESSOR_WARP=3]" at line 121
            [ 3 instantiation contexts not shown ]
            instantiation of "T cub::CUB_200200_870_NS::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce<FULL_TILE,ReductionOp>(T, int, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, FULL_TILE=true, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 354 of /usr/local/cuda-12.2/include/cub/block/block_reduce.cuh
            instantiation of "T cub::CUB_200200_870_NS::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, LEGACY_PTX_ARCH>::Reduce(T, ReductionOp) [with T=cub::CUB_200200_870_NS::KeyValuePair<float, float>, BLOCK_DIM_X=256, ALGORITHM=cub::CUB_200200_870_NS::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, LEGACY_PTX_ARCH=0, ReductionOp=cub::CUB_200200_870_NS::Sum]" at line 257 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/common/common.cuh
            instantiation of "void layerNorm<T,R,P,TPB>(const kvp<R> &, int32_t, int32_t, const P *, const P *, T *) [with T=float, R=float, P=float, TPB=256]" at line 103 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "void nvinfer1::plugin::bert::embLayerNormKernelHFace<T,TPB>(int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float, TPB=256U]" at line 117 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu
            instantiation of "int32_t nvinfer1::plugin::bert::embSkipLayerNormHFace(cudaStream_t, int32_t, int32_t, int32_t, const int32_t *, const int32_t *, const int32_t *, const float *, const float *, const T *, const T *, const T *, int32_t, int32_t, T *) [with T=float]" at line 121 of /opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu

9 errors detected in the compilation of "/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/plugin/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu".
make[3]: *** [plugin/CMakeFiles/nvinfer_plugin.dir/build.make:2819: plugin/CMakeFiles/nvinfer_plugin.dir/embLayerNormPlugin/embLayerNormVarSeqlenKernelHFace.cu.o] Error 1
make[3]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make[2]: *** [CMakeFiles/Makefile2:330: plugin/CMakeFiles/nvinfer_plugin.dir/all] Error 2
make[2]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make[1]: *** [CMakeFiles/Makefile2:342: plugin/CMakeFiles/nvinfer_plugin.dir/rule] Error 2
make[1]: Leaving directory '/opt/nvidia/deepstream/deepstream-6.4/sources/TensorRT/build'
make: *** [Makefile:225: nvinfer_plugin] Error 2
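As a side note on the `-DGPU_ARCHS` question: Orin's compute capability is 8.7 (the `CUB_200200_870_NS` namespace in the log shows the build is already emitting SM 8.7 code), while 72 corresponds to Xavier. A minimal sketch of a configure line for Orin, assuming the TensorRT OSS 23.08 tree checked out in the steps above (paths are illustrative):

```shell
# Configure TensorRT OSS for Jetson Orin (compute capability 8.7; 72 is Xavier)
cd $TRT_SOURCE/build
/usr/local/bin/cmake .. \
    -DGPU_ARCHS=87 \
    -DTRT_LIB_DIR=/usr/lib/aarch64-linux-gnu/ \
    -DTRT_BIN_DIR=`pwd`/out \
    -DCMAKE_C_COMPILER=/usr/bin/gcc
make nvinfer_plugin -j$(nproc)
```

Note that the `cub`/`libcu++` template errors above can also come from a mismatch between the OSS branch and the CUDA/TensorRT versions shipped with your JetPack, so fixing the architecture flag alone may not be sufficient.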

Why do you need to compile and install TensorRT?


In fact, on the Jetson platform TensorRT is part of JetPack.
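Before rebuilding anything from source, it may help to confirm which TensorRT version JetPack already installed. A quick check, assuming a standard JetPack install (these are generic diagnostic commands, not from the original thread):

```shell
# List the TensorRT packages installed by JetPack
dpkg -l | grep -i tensorrt
# Or query the runtime version through the Python bindings
python3 -c "import tensorrt; print(tensorrt.__version__)"
```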

In addition, for FasterRCNN you can refer to this repo.

Is it possible to install DeepStream 6.3 on a Jetson Orin Nano?
I get these errors.

To run FasterRCNN on Jetson AGX Xavier, I followed the instructions here to use deepstream-app.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Yes, of course.

If you want to run on DS-6.3, please switch to the release/tao4.0_ds6.3ga branch.
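A sketch of switching to that branch, assuming the repo in question is NVIDIA's deepstream_tao_apps (the repo URL is an assumption; the branch name is taken from the reply above):

```shell
# Clone the TAO apps repo and switch to the DS-6.3 release branch
git clone https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps.git
cd deepstream_tao_apps
git checkout release/tao4.0_ds6.3ga
```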

In addition, I recommend that you read the README directly in the link above.

These instructions seem to be provided by TAO, not DeepStream; I am not sure they match DS-6.3.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.