Hi,

Attached is a tar file showing this issue.

Run it like this: `reset; ./a.out 1 64 512 512 64 7 7 1 1 0 1 0`

It runs a 7x7 convolution via TRT and a simple test CUDA kernel.

There are 4 modes. To switch between them, change the `test_type` variable in concurrentTest.cu (line 297), then recompile and rerun for each mode.

These are the numbers I get on my Xavier:

**EConvolutionOnly:**

**Total host : [7270.08 ms]**

| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 86.46% | 6.31876s | 500 | 12.638ms | 12.503ms | 16.393ms | `trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1` |
| | 7.27% | 531.61ms | 500 | 1.0632ms | 1.0351ms | 1.6980ms | `void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)` |
| | 6.27% | 458.11ms | 500 | 916.23us | 893.64us | 1.5226ms | `void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)` |

**ECUDAOnly:**

**Total host : [6625.44 ms]**

| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 100.00% | 6.61508s | 20 | 330.75ms | 330.36ms | 332.04ms | `kernel(float*, int)` |
| API calls | 99.98% | 6.61412s | 4 | 1.65353s | 20.928us | 6.61399s | `cudaDeviceSynchronize` |

**EConvolutionFollowedByCUDA:**

**Total host : [13989 ms]**

| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 47.34% | 6.63279s | 20 | 331.64ms | 330.48ms | 337.87ms | `kernel(float*, int)` |
| | 45.48% | 6.37232s | 500 | 12.745ms | 12.500ms | 20.314ms | `trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1` |
| | 3.84% | 538.37ms | 500 | 1.0767ms | 1.0350ms | 2.0958ms | `void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)` |
| | 3.34% | 468.66ms | 500 | 937.32us | 902.31us | 1.9592ms | `void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)` |

**EConvolutionAndCUDAConcurrently:**

**Total host : [14023.5 ms]**

| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 47.71% | 6.61691s | 20 | 330.85ms | 330.41ms | 332.79ms | `kernel(float*, int)` |
| | 45.20% | 6.26890s | 500 | 12.538ms | 12.504ms | 12.583ms | `trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1` |
| | 3.80% | 526.53ms | 500 | 1.0531ms | 1.0293ms | 1.1323ms | `void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)` |
| | 3.29% | 456.95ms | 500 | 913.89us | 896.17us | 967.62us | `void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)` |

concurrentTest_tar.txt (30 KB)

Thanks

Eyal