Hi,
Attached is a tar file that reproduces the issue.
Run it like this: reset; ./a.out 1 64 512 512 64 7 7 1 1 0 1 0
It runs a 7x7 convolution via TensorRT (TRT) and a simple test CUDA kernel.
There are 4 modes. To switch between them, change the test_type variable in concurrentTest.cu (line 297), then recompile and run again.
These are the numbers I get on my Xavier:
EConvolutionOnly:
Total host : [7270.08 ms]
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   86.46%  6.31876s    500  12.638ms  12.503ms  16.393ms  trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
                   7.27%  531.61ms    500  1.0632ms  1.0351ms  1.6980ms  void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
                   6.27%  458.11ms    500  916.23us  893.64us  1.5226ms  void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
ECUDAOnly:
Total host : [6625.44 ms]
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:  100.00%  6.61508s     20  330.75ms  330.36ms  332.04ms  kernel(float*, int)
     API calls:   99.98%  6.61412s      4  1.65353s  20.928us  6.61399s  cudaDeviceSynchronize
EConvolutionFollowedByCUDA:
Total host : [13989 ms]
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   47.34%  6.63279s     20  331.64ms  330.48ms  337.87ms  kernel(float*, int)
                  45.48%  6.37232s    500  12.745ms  12.500ms  20.314ms  trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
                   3.84%  538.37ms    500  1.0767ms  1.0350ms  2.0958ms  void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
                   3.34%  468.66ms    500  937.32us  902.31us  1.9592ms  void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
EConvolutionAndCUDAConcurrently:
Total host : [14023.5 ms]
GPU activities:   47.71%  6.61691s     20  330.85ms  330.41ms  332.79ms  kernel(float*, int)
                  45.20%  6.26890s    500  12.538ms  12.504ms  12.583ms  trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_medium_nhwc_tn_v1
                   3.80%  526.53ms    500  1.0531ms  1.0293ms  1.1323ms  void cuInt8::nchwTonhwc<float, int=32, int=32, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
                   3.29%  456.95ms    500  913.89us  896.17us  967.62us  void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
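To make the comparison between the four host totals explicit, here is a small C++ sketch of the arithmetic. It is not part of the attached test: the names (serialMs, idealMs, overlapFraction) are illustrative only, and it assumes that perfect overlap would bring the total down to the longer of the two standalone runs, which ignores any contention for SMs or memory bandwidth.

```cpp
// Host totals in ms, copied from the four runs above.
constexpr double kConvOnlyMs   = 7270.08;  // EConvolutionOnly
constexpr double kCudaOnlyMs   = 6625.44;  // ECUDAOnly
constexpr double kSequentialMs = 13989.0;  // EConvolutionFollowedByCUDA
constexpr double kConcurrentMs = 14023.5;  // EConvolutionAndCUDAConcurrently

// Time expected if the two workloads run strictly one after the other.
constexpr double serialMs(double a, double b) { return a + b; }

// Assumed best case: full overlap, so the total equals the longer workload.
constexpr double idealMs(double a, double b) { return a > b ? a : b; }

// Fraction of the theoretically hideable time that was actually hidden.
// 1.0 means perfect overlap; 0.0 or below means no overlap at all.
constexpr double overlapFraction(double a, double b, double measured) {
    return (serialMs(a, b) - measured) / (serialMs(a, b) - idealMs(a, b));
}

// Compile-time checks on the posted numbers: neither the sequential nor the
// concurrent run beats the sum of the two standalone runs.
static_assert(overlapFraction(kConvOnlyMs, kCudaOnlyMs, kSequentialMs) <= 0.0,
              "sequential run shows no overlap, as expected");
static_assert(overlapFraction(kConvOnlyMs, kCudaOnlyMs, kConcurrentMs) <= 0.0,
              "concurrent run shows no overlap either");
```

With these numbers the two standalone runs add up to 13895.52 ms, while the concurrent mode took 14023.5 ms, so the overlap fraction comes out slightly negative. In other words, the concurrent mode is no faster than running the convolution and the kernel back to back.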
concurrentTest_tar.txt (30 KB)
Thanks
Eyal