Why is _process_yolo_output in the TensorRT sample code so slow? It takes 0.3 seconds to execute

I use the TensorRT sample code to run inference with yolov3_tiny. A single inference takes about 0.03 s, so pure inference is close to 25 FPS, but why does _process_yolo_output in data_processing.py take 0.3 s per call?

How can I speed this up? My input image size is only 416.

I have one more question: the GPU usage of the Jetson Nano does not seem stable. When I check with tegrastats I see “EMC_FREQ 0% GR3D_FREQ 99%”, but GR3D_FREQ is sometimes 99%, sometimes 13%, sometimes 0%. Is this normal?
When I run my yolov3_tiny, my memory is full too.

How do I fix this? Thank you.

Check out my implementation. “yolov3-tiny-416 (FP16)” runs at 14.2 FPS on my Jetson Nano.

Hi, your project looks great!
I want to ask: is the 14.2 FPS pure inference, or inference plus post-processing?

14.2 FPS is for all of “preprocessing” + “TensorRT inferencing” + “postprocessing” of “yolov3-tiny-416 (FP16)”
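
If you want to see where the time goes on your own setup, here is a minimal timing sketch; the preprocess/infer/postprocess names below are placeholders for your own pipeline functions, not something from my repo:

import time

def timed(label, fn, *args):
    """Run fn(*args), print how long it took in milliseconds, and return the result."""
    t0 = time.time()
    result = fn(*args)
    print("%s: %.1f ms" % (label, (time.time() - t0) * 1000.0))
    return result

# Example usage (replace with your own functions):
# img = timed("preprocessing", preprocess, frame)
# outputs = timed("TensorRT inference", infer, img)
# boxes = timed("postprocessing", postprocess, outputs)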

Great! I will try it, and then make a comparison once it works. Why is the official sample so slow?

I rewrote these 2 functions in NVIDIA’s original yolov3_onnx sample.

import math
import numpy as np

def sigmoid(value):
    """Return the sigmoid of the input."""
    return 1.0 / (1.0 + math.exp(-value))

def exponential(value):
    """Return the exponential of the input."""
    return math.exp(value)

# Vectorized calculation of the above two functions
# (np.vectorize still calls the Python functions element by element):
sigmoid_v = np.vectorize(sigmoid)
exponential_v = np.vectorize(exponential)

to:

def sigmoid_v(array):
    """Apply the sigmoid element-wise using numpy ufuncs."""
    return np.reciprocal(np.exp(-array) + 1.0)

def exponential_v(array):
    """Apply the exponential element-wise using numpy ufuncs."""
    return np.exp(array)

In short, I just used numpy to vectorize the computations.
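
If you want to verify the speed-up yourself, here is a rough benchmark sketch; the array size is just an example, not the real output shape of yolov3-tiny:

import math
import timeit
import numpy as np

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

sigmoid_loop = np.vectorize(sigmoid)            # still a Python-level loop per element

def sigmoid_ufunc(array):
    return np.reciprocal(np.exp(-array) + 1.0)  # single vectorized numpy expression

data = np.random.rand(13 * 13 * 255).astype(np.float32)  # example size only
print("np.vectorize:", timeit.timeit(lambda: sigmoid_loop(data), number=10))
print("numpy ufuncs: ", timeit.timeit(lambda: sigmoid_ufunc(data), number=10))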

Hi, I’m running your project. It looks great! Switching from math to numpy has greatly improved the FPS!

But now I have another problem. The Nano seems to be running out of memory. When I run the code, it reports the following errors:
"[TensorRT] ERROR:… /rtSafe/ safecontext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

ERROR: [TensorRT]… /rtSafe/ safecontext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: STD ::exception|2020-02-28 12:02:39

The 2020-02-28 12:02:48. 281436: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 533.62MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 442000: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 134.44MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 496768: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 134.20MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 548342: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 137.62MiB. But may mean that there could be performance gains if more memory were available."

Do you have any suggestions? How do I solve this?

I mean, can a TensorRT model not be loaded together with another model?
I also have a tracking network.
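
In case it helps anyone hitting the same thing: those warnings come from TensorFlow's BFC allocator, which by default tries to reserve most of the GPU memory and then competes with the TensorRT engine for the Nano's shared 4 GB. One common workaround is to cap how much memory TensorFlow is allowed to grab; a minimal sketch, assuming TF 1.x as the log format suggests:

import tensorflow as tf

# Let TensorFlow grow its GPU allocation on demand instead of
# grabbing (almost) all of the Nano's shared memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap the fraction of memory TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.3
sess = tf.Session(config=config)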

For this reason I don’t use TRT. Using darknet and yolov3-tiny-prn 416x416 (Python library) I’m able to reach 18-20 FPS… So why should I use TRT? -_-

The 18-20 FPS is not calculated from the inference time alone, but from the whole pipeline, so:

start_time = time.time()

1. reading frame from RTSP
2. doing all my stuff
3. resizing the frame
4. converting the frame to the right color format
5. inference

end_time = time.time()
fps = 1 / (end_time - start_time)  -> 18-20 fps

Removing my application stuff, I could increase the speed further.
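
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the loop described above; the detect() call is a placeholder for whatever darknet/TensorRT wrapper you use, and the RTSP URL is only an example:

import time
import cv2

cap = cv2.VideoCapture("rtsp://user:pass@camera-ip/stream")  # example URL

while True:
    start_time = time.time()

    ok, frame = cap.read()                           # 1. reading frame from RTSP
    if not ok:
        break
    # ... do all my stuff ...                        # 2. application logic
    frame = cv2.resize(frame, (416, 416))            # 3. resizing the frame
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # 4. converting to the right color format
    # detections = detect(frame)                     # 5. inference (placeholder)

    end_time = time.time()
    print("fps: %.1f" % (1.0 / (end_time - start_time)))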

Could you tell me how to do this? Can you share the link?

https://github.com/AlexeyAB/darknet

Use YoloV3-tiny-prn

Config
https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3-tiny-prn.cfg

Pre-trained model:
https://drive.google.com/file/d/18yYZWyKbo4XSDVyztmsEcF9B_6bxrhUY/view?usp=sharing

This looks great!
I would like to ask: how can I migrate this into my own project?
Can you give me some ideas?

Well, it is quite simple. You need to compile it with these settings in the Makefile:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0

Uncomment this line:

# For Jetson TX1, Tegra X1, DRIVE CX, DRIVE PX - uncomment:
ARCH= -gencode arch=compute_53,code=[sm_53,compute_53]

and change:

NVCC=nvcc

to:

NVCC=/usr/local/cuda/bin/nvcc

compile it with:

make -j4

and you are ready to go, you will have:

libdarknet.so
darknet.py

everything necessary to use yolov3-tiny-prn from Python.
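
As a starting point from Python, something like the sketch below should work; note that the function names exposed by darknet.py have changed between versions of the repo (load_net/load_meta/detect here follow the older API), and all paths are examples you need to point at your own files:

import darknet  # the darknet.py that sits next to libdarknet.so

# Example paths; replace with your own cfg/weights/data files.
net = darknet.load_net(b"cfg/yolov3-tiny-prn.cfg",
                       b"yolov3-tiny-prn.weights", 0)
meta = darknet.load_meta(b"cfg/coco.data")

detections = darknet.detect(net, meta, b"data/dog.jpg")
for label, confidence, bbox in detections:
    print(label, confidence, bbox)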