CUDA Error Running nvDecInfer_detection

All,

I am trying to install and run the DeepStream SDK on a GTX 1080 Ti GPU.
The decPerf and nvDecInfer_classification sample applications seem to run fine, but when I try to run
nvDecInfer_detection, I get the error below. Am I missing anything?

[DEBUG][21:34:52] Device ID for display [0]: GeForce GTX 1080 Ti
[DEBUG][21:34:52] Device ID for inference [0]: GeForce GTX 1080 Ti
[DEBUG][21:34:52] Video channels: 4
[ERROR][21:34:52] Warning: No mean files.
[DEBUG][21:34:52] GUI enabled.
[DEBUG][21:34:52] Endless Loop: 0
[DEBUG][21:34:52] Device name: GeForce GTX 1080 Ti
[DEBUG][21:34:53] Use INT8 data type.
Cuda error in file src/implicit_gemm.cu at line 1214: invalid configuration argument
customWinogradConvActLayer.cpp (301) - Cuda Error in execute: 9
customWinogradConvActLayer.cpp (301) - Cuda Error in execute: 9
sample_detection: src/nvInferLite.cpp:444: void NvInferLite::caffeToTensorRTModel(const char*, const char*, nvcaffeparser1::ICaffeParser*, size_t): Assertion `pEngine_' failed.
./run.sh: line 43: 28496 Aborted (core dumped) …/bin/sample_detection -devID_display={DISPLAY_GPU} -devID_infer={INFER_GPU} -nChannels={CHANNELS} -fileList={FILE_LIST} -deployFile={DEPLOY} -modelFile={MODEL} -labelFile={LABEL} -int8=1 -calibrationTableFile={CALIBRATION} -tileWidth={TILE_WIDTH} -tileHeight={TILE_HEIGHT} -tilesInRow=${TILES_IN_ROW} -fullscreen=0 -gui=1 -endlessLoop=0

Hi,

CUDA error 9 indicates invalid configuration:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038

cudaErrorInvalidConfiguration = 9
This indicates that a kernel launch is requesting resources that can never be satisfied by the 
current device. Requesting more shared memory per block than the device supports will trigger this 
error, as will requesting too many threads or blocks. See cudaDeviceProp for more device limitations.
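
As a rough illustration of the rule quoted above, the sketch below checks a launch configuration against per-device limits before any kernel is launched. The limits are the published values for compute capability 6.1 (the GTX 1080 Ti) and are hard-coded here as an assumption; on a real system they should be read from cudaDeviceProp. The names DeviceLimits, SM61_LIMITS and launch_problems are hypothetical helpers, not part of CUDA or TensorRT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceLimits:
    max_threads_per_block: int
    max_block_dim: tuple          # (x, y, z)
    max_grid_dim: tuple           # (x, y, z)
    max_shared_mem_per_block: int  # bytes


# Assumption: limits for compute capability 6.1, hard-coded for illustration.
SM61_LIMITS = DeviceLimits(
    max_threads_per_block=1024,
    max_block_dim=(1024, 1024, 64),
    max_grid_dim=(2**31 - 1, 65535, 65535),
    max_shared_mem_per_block=48 * 1024,
)


def launch_problems(grid, block, shared_mem_bytes=0, limits=SM61_LIMITS):
    """Return a list of reasons why a launch would be rejected (empty if none)."""
    problems = []
    # Each block dimension must be at least 1 and within its per-axis cap.
    for axis, size, cap in zip("xyz", block, limits.max_block_dim):
        if not 1 <= size <= cap:
            problems.append(f"block.{axis}={size} outside 1..{cap}")
    # Same check for each grid dimension.
    for axis, size, cap in zip("xyz", grid, limits.max_grid_dim):
        if not 1 <= size <= cap:
            problems.append(f"grid.{axis}={size} outside 1..{cap}")
    # The product of the block dimensions has its own cap.
    threads = block[0] * block[1] * block[2]
    if threads > limits.max_threads_per_block:
        problems.append(f"{threads} threads per block exceeds {limits.max_threads_per_block}")
    # Shared memory requested per block must fit the per-block limit.
    if shared_mem_bytes > limits.max_shared_mem_per_block:
        problems.append(
            f"{shared_mem_bytes} bytes of shared memory exceeds {limits.max_shared_mem_per_block}"
        )
    return problems


if __name__ == "__main__":
    print(launch_problems((196, 1, 1), (256, 1, 1)))
    print(launch_problems((1, 1, 1), (2048, 1, 1)))
```

A configuration such as 196 blocks of 256 threads passes every check, while a 2048-thread block is rejected both for its x dimension and for its total thread count. This only models the documented limits; it does not explain which limit the failing TensorRT kernel in this thread actually hit.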

Could you monitor the GPU memory usage with nvidia-smi?

Thanks.
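
For anyone who wants to log memory usage rather than read the nvidia-smi table by eye, here is a minimal sketch that uses nvidia-smi's CSV query mode and parses the result. The helper names parse_memory_csv and query_gpu_memory are hypothetical, and the sketch assumes nvidia-smi is on the PATH and that GPU names contain no commas.

```python
import subprocess

# nvidia-smi's machine-readable query mode: one CSV line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'index, name, used, total' CSV lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append(
            {
                "index": int(index),
                "name": name,
                "used_mib": int(used),
                "total_mib": int(total),
            }
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

Calling query_gpu_memory() in a loop with a short sleep gives roughly the same view as running nvidia-smi under watch, but in a form that is easy to write to a log file while the sample application starts up.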

Mon May 21 08:56:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108…    Off  | 00000000:65:00.0  On |                  N/A |
| 23%   38C    P5    13W / 250W |     62MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1268      G   /usr/lib/xorg/Xorg                            59MiB |
+-----------------------------------------------------------------------------+

Hi,

This is the usage while the process starts up, just before it crashes with the assertion failure.
Every 1.0s: nvidia-smi Mon May 21 08:58:41 2018

Mon May 21 08:58:41 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108…    Off  | 00000000:65:00.0  On |                  N/A |
| 23%   42C    P2    56W / 250W |    296MiB / 11177MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1268      G   /usr/lib/xorg/Xorg                            59MiB |
|    0     30513      C   …/bin/sample_detection                       225MiB |
+-----------------------------------------------------------------------------+

Thanks

Hi,

Removing -int8=1 from run.sh bypasses this error. Does that give any clues?

…/bin/sample_detection -devID_display={DISPLAY_GPU} \
    -devID_infer={INFER_GPU} \
    -nChannels={CHANNELS} \
    -fileList={FILE_LIST} \
    -deployFile={DEPLOY} \
    -modelFile={MODEL} \
    -labelFile={LABEL} \
    -int8=1 \
    -calibrationTableFile={CALIBRATION} \
    -tileWidth={TILE_WIDTH} \
    -tileHeight={TILE_HEIGHT} \
    -tilesInRow=${TILES_IN_ROW} \
    -fullscreen=0 \
    -gui=1 \
    -endlessLoop=0

Thanks

Hi,

Could you share your CUDA, cuDNN, TensorRT, and DeepStream versions with us?

For DeepStream 1.5, please make sure to install the recommended packages for compatibility.

2.1 >> SYSTEM REQUIREMENTS
► Ubuntu 16.04 LTS (with GCC 5.4)
► NVIDIA Display Driver R384
► NVIDIA VideoSDK 8.0
► NVIDIA CUDA® 9.0
► cuDNN 7 & TensorRT 3.0

Thanks.
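
A quick way to compare an installation against the list above is a small prefix check, sketched below. The RECOMMENDED table is transcribed from that list; the helper names matches and mismatches are hypothetical, and the installed versions in the usage example are the ones that appear elsewhere in this thread (driver 396.26 from nvidia-smi, cuDNN 7.0.5 from mnistCUDNN, TensorRT 4.0.0.3 as reported), with CUDA written as "9.0" by assumption. The comparison is deliberately strict, so it also flags newer releases that may still work.

```python
# Versions taken from the DeepStream 1.5 system requirements quoted above.
RECOMMENDED = {
    "driver": "384",
    "cuda": "9.0",
    "cudnn": "7",
    "tensorrt": "3.0",
}


def matches(installed, required):
    """True if the installed version starts with the required components."""
    inst = installed.split(".")
    req = required.split(".")
    return inst[: len(req)] == req


def mismatches(installed):
    """Map each component that differs to (installed version, required version)."""
    report = {}
    for name, required in RECOMMENDED.items():
        found = installed.get(name)
        if found is None or not matches(found, required):
            report[name] = (found, required)
    return report


if __name__ == "__main__":
    # Assumption: versions as they appear in this thread.
    installed = {
        "driver": "396.26",
        "cuda": "9.0",
        "cudnn": "7.0.5",
        "tensorrt": "4.0.0.3",
    }
    for name, (found, required) in mismatches(installed).items():
        print(f"{name}: installed {found}, recommended {required}")
```

With the versions from this thread, the check reports the driver and TensorRT as differing from the recommended set, while TensorRT 3.0.4 would pass.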

Hi,

Ubuntu - DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
cuDNN - 7
CUDA - 9
NVIDIA driver - 384
DeepStream - 1.5
TensorRT - 4.0.0.3

I also tried TensorRT 3.0.4 and got the same error.

Thanks

Hi,

Could you try a CUDA sample to help us narrow down the issue?
Thanks.

Hi,

Is there a specific sample program whose output you are looking for?

Thanks
Pramod

Hi,

Could you test a sample that contains CUDA kernel code?
For example, vectorAdd, located at 'samples/0_Simple/vectorAdd'.

Thanks.

CUDA sample outputs:

./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

./vectorAddDrv

Using Device 0: "GeForce GTX 1080 Ti" with Compute 6.1 capability
findModulePath found file at <./vectorAdd_kernel64.ptx>
initCUDA loading module: <./vectorAdd_kernel64.ptx>
PTX JIT log:

Result = PASS

Hi,

Based on the log, the CUDA toolkit and display driver work correctly in your environment.
Could you test the cuDNN functionality as well?

cp -r /usr/src/cudnn_samples_v7/ .
cd cudnn_samples_v7/mnistCUDNN/
make
./mnistCUDNN

Thanks.

./mnistCUDNN
cudnnGetVersion() : 7005 , CUDNN_VERSION from cudnn.h : 7005 (7.0.5)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28 Capabilities 6.1, SmClock 1582.0 Mhz, MemSize (Mb) 11171, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation …
Testing cudnnGetConvolutionForwardAlgorithm …
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm …
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027648 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.036864 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.036864 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.070240 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.096064 time requiring 203008 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation …
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation …
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation …
Testing cudnnGetConvolutionForwardAlgorithm …
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm …
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.020480 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.023552 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.036864 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.068576 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.094080 time requiring 203008 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation …
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation …
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

Hi,

It looks like the error comes from the TensorRT package.
Could you try to reinstall TensorRT from our website?
https://developer.nvidia.com/tensorrt

To be compatible with CUDA 9.0, please remember to choose the TensorRT 3.0.4 DEB local repo packages for Ubuntu 16.04 and CUDA 9.0.
Thanks, and please let us know the results.