Why is _process_yolo_output in the TensorRT sample code so slow? It takes 0.3 seconds to execute

I use the TensorRT sample code to run inference with yolov3_tiny. A single inference takes about 0.03 s, so pure inference is close to 25 FPS, but why does _process_yolo_output in data_processing.py take 0.3 s per call?

How can I speed this up? My input image size is only 416.

I have one more question: the GPU usage of the Jetson Nano does not seem stable. When I check with tegrastats I see “EMC_FREQ 0% GR3D_FREQ 99%”, but GR3D_FREQ is sometimes 99%, sometimes 13%, sometimes 0%. Is this normal?
When I run my yolov3_tiny, my memory is full too.

How do I fix this? Thank you.

Check out my implementation. “yolov3-tiny-416 (FP16)” runs at 14.2 FPS on my Jetson Nano.

Hi, your project looks great!
I want to ask: is the 14.2 FPS pure inference, or inference plus post-processing?

14.2 FPS is for all of “preprocessing” + “TensorRT inferencing” + “postprocessing” of “yolov3-tiny-416 (FP16)”
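
If you want to see where the time goes on your own setup, here is a minimal timing sketch; the preprocess/infer/postprocess names below are placeholders for your own pipeline functions, not something from my repo:

import time

def timed(label, fn, *args):
    """Run fn(*args), print how long it took in milliseconds, and return the result."""
    t0 = time.time()
    result = fn(*args)
    print("%s: %.1f ms" % (label, (time.time() - t0) * 1000.0))
    return result

# Example usage (replace with your own functions):
# img = timed("preprocessing", preprocess, frame)
# outputs = timed("TensorRT inference", infer, img)
# boxes = timed("postprocessing", postprocess, outputs)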

Great! I will try it, and then make a comparison once it works. Why is the official sample so slow?

I rewrote these 2 functions in NVIDIA’s original yolov3_onnx sample.

import math
import numpy as np

def sigmoid(value):
    """Return the sigmoid of the input."""
    return 1.0 / (1.0 + math.exp(-value))

def exponential(value):
    """Return the exponential of the input."""
    return math.exp(value)

# Vectorized calculation of the above two functions
# (np.vectorize still calls the Python functions element by element):
sigmoid_v = np.vectorize(sigmoid)
exponential_v = np.vectorize(exponential)

to:

def sigmoid_v(array):
    """Apply the sigmoid element-wise using numpy ufuncs."""
    return np.reciprocal(np.exp(-array) + 1.0)

def exponential_v(array):
    """Apply the exponential element-wise using numpy ufuncs."""
    return np.exp(array)

In short, I just used numpy to vectorize the computations.
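
If you want to verify the speed-up yourself, here is a rough benchmark sketch; the array size is just an example, not the real output shape of yolov3-tiny:

import math
import timeit
import numpy as np

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

sigmoid_loop = np.vectorize(sigmoid)            # still a Python-level loop per element

def sigmoid_ufunc(array):
    return np.reciprocal(np.exp(-array) + 1.0)  # single vectorized numpy expression

data = np.random.rand(13 * 13 * 255).astype(np.float32)  # example size only
print("np.vectorize:", timeit.timeit(lambda: sigmoid_loop(data), number=10))
print("numpy ufuncs: ", timeit.timeit(lambda: sigmoid_ufunc(data), number=10))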

Hi, I’m running your project. It looks great! Switching from math to numpy has greatly improved the FPS!

But now I have another problem. The Nano seems to be running out of memory. When I run the code, it reports the following errors:
"[TensorRT] ERROR:… /rtSafe/ safecontext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

ERROR: [TensorRT]… /rtSafe/ safecontext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: STD ::exception|2020-02-28 12:02:39

The 2020-02-28 12:02:48. 281436: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 533.62MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 442000: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 134.44MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 496768: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 134.20MiB. But may mean that there could be performance gains if more memory were available.

The 2020-02-28 12:02:48. 548342: W tensorflow/core/common_runtime/bfc_allocator. Cc :211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 137.62MiB. But may mean that there could be performance gains if more memory were available."

Do you have any suggestions? How do I solve this?

I mean, can a TensorRT model not be loaded together with another model?
I also have a tracking network.
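
In case it helps anyone hitting the same thing: those warnings come from TensorFlow's BFC allocator, which by default tries to reserve most of the GPU memory and then competes with the TensorRT engine for the Nano's shared 4 GB. One common workaround is to cap how much memory TensorFlow is allowed to grab; a minimal sketch, assuming TF 1.x as the log format suggests:

import tensorflow as tf

# Let TensorFlow grow its GPU allocation on demand instead of
# grabbing (almost) all of the Nano's shared memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap the fraction of memory TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.3
sess = tf.Session(config=config)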

For this reason I don’t use TRT. Using darknet and yolov3-tiny-prn 416x416 (Python library) I’m able to reach 18-20 FPS… So why should I use TRT? -_-

The 18-20 FPS is not calculated from the inference time alone, but from the whole pipeline, so:

start_time = time.time()

1. reading frame from RTSP
2. doing all my stuff
3. resizing the frame
4. converting the frame to the right color format
5. inference

end_time = time.time()
fps = 1 / (end_time - start_time)  -> 18-20 fps

Removing my application stuff, I could increase the speed further.
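
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the loop described above; the detect() call is a placeholder for whatever darknet/TensorRT wrapper you use, and the RTSP URL is only an example:

import time
import cv2

cap = cv2.VideoCapture("rtsp://user:pass@camera-ip/stream")  # example URL

while True:
    start_time = time.time()

    ok, frame = cap.read()                           # 1. reading frame from RTSP
    if not ok:
        break
    # ... do all my stuff ...                        # 2. application logic
    frame = cv2.resize(frame, (416, 416))            # 3. resizing the frame
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # 4. converting to the right color format
    # detections = detect(frame)                     # 5. inference (placeholder)

    end_time = time.time()
    print("fps: %.1f" % (1.0 / (end_time - start_time)))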

Could you tell me how to do this? Can you share the link?

https://github.com/AlexeyAB/darknet

Use YoloV3-tiny-prn

Config
https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3-tiny-prn.cfg

Pre-trained model:
https://drive.google.com/file/d/18yYZWyKbo4XSDVyztmsEcF9B_6bxrhUY/view?usp=sharing

This looks great!
I would like to ask: how can I migrate this into my own project?
Can you give me some ideas?

Well, it is quite simple. You need to compile it with these settings in the Makefile:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0

Uncomment this line:

# For Jetson TX1, Tegra X1, DRIVE CX, DRIVE PX - uncomment:
ARCH= -gencode arch=compute_53,code=[sm_53,compute_53]

and change:

NVCC=nvcc

to:

NVCC=/usr/local/cuda/bin/nvcc

compile it with:

make -j4

and you are ready to go, you will have:

libdarknet.so
darknet.py

everything necessary to use yolov3-tiny-prn from Python.
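
As a starting point from Python, something like the sketch below should work; note that the function names exposed by darknet.py have changed between versions of the repo (load_net/load_meta/detect here follow the older API), and all paths are examples you need to point at your own files:

import darknet  # the darknet.py that sits next to libdarknet.so

# Example paths; replace with your own cfg/weights/data files.
net = darknet.load_net(b"cfg/yolov3-tiny-prn.cfg",
                       b"yolov3-tiny-prn.weights", 0)
meta = darknet.load_meta(b"cfg/coco.data")

detections = darknet.detect(net, meta, b"data/dog.jpg")
for label, confidence, bbox in detections:
    print(label, confidence, bbox)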