100FPS with Jetson-inference and SSD Mobilenet v2

Hi everyone,

I’m running SSD-Mobilenet-v2 with jetson-inference using the my-detection.py script from the examples folder. I’m getting around 100 FPS, but the benchmarks published on the web report something closer to 800 FPS with SSD-Mobilenet-v1. Is it because I’m using the Python script? How can I run SSD-Mobilenet-v1 with detectnet?

Thanks!

Hi,

Please make sure you have maximized the device performance first.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Then you can reproduce the Xavier NX performance with the script here:

Thanks.

Note that the official benchmarks use INT8 precision, GPU + 2x DLAs, and batching, whereas jetson-inference uses FP16, batch size 1, on the GPU only.

If you want to run SSD-Mobilenet-v1, first download it using the Model Downloader tool. Then launch the detectnet program with the --network=ssd-mobilenet-v1 flag.
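For example, a minimal Python sketch along the lines of my-detection.py (assuming a build recent enough to have the videoSource/videoOutput utilities; adjust the input URI to your camera):

import jetson.inference
import jetson.utils

# load SSD-Mobilenet-v1 instead of the default ssd-mobilenet-v2
net = jetson.inference.detectNet("ssd-mobilenet-v1", threshold=0.5)

camera = jetson.utils.videoSource("csi://0")       # or "/dev/video0", a video file, RTSP, ...
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()
    detections = net.Detect(img)
    display.Render(img)
    display.SetStatus("ssd-mobilenet-v1 | {:.0f} FPS".format(net.GetNetworkFPS()))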

Thank you this is exactly what I needed to know!

Do you know if I can use GPU + 2x DLA + batching with the jetson-inference lib (by changing some parameters)?

The jetson-inference Python API doesn’t expose that, and I haven’t tested DLA on object detection in jetson-inference. There is, however, this app from jetson-inference that runs GPU + 2x DLA on an image classification model from C++:

Thank you @dusty_nv, I’ve tried jetson-benchmarks and I could get around 900 FPS!
But I can’t figure out how to set the input frames and get the output of the inference. So how can I run SSD-Mobilenet-v1 using GPU + DLA and get the inference results (detections and coordinates)? Also, can I use .tf models with DLA and GPU, or only ONNX?

You can use DLA as long as the layers are supported on DLA (see the compatibility matrix). If you have layers that are unsupported on DLA, you can enable GPU fallback so that those layers run on the GPU instead.

There are SSD samples that you can find under /usr/src/tensorrt/samples/ and /usr/src/tensorrt/samples/python that can show you the input/output data format of SSD.
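For illustration, this is roughly how DLA placement with GPU fallback is selected through the TensorRT Python API (just a sketch, assuming TensorRT 7 or newer; jetson-inference does the equivalent internally in C++):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

# place supported layers on DLA core 0 and let the rest fall back to the GPU
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

# DLA only runs FP16 or INT8 (and INT8 additionally needs a proper calibrator)
config.set_flag(trt.BuilderFlag.FP16)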

Good, thank you @dusty_nv .

How can I activate DLA with jetson-inference and change the precision?
=> I tried net = jetson.inference.detectNet("pednet", threshold=0.5, device="DLA_0", precision="FP16") with no success… (Exception: jetson.inference -- detectNet.__init()__ failed to parse args tuple)

I could finally use the DLA by changing the device in the file detectNet.h (device=DEVICE_DLA_0) and recompiling jetson-inference, and this is what I get:

[TRT] native precisions detected for DLA_0: FP32, FP16, INT8
[TRT] selecting fastest native precision for DLA_0: INT8
[TRT] attempting to open engine cache file /usr/local/bin/networks/ped-100/snapshot_iter_70800.caffemodel.1.1.DLA_0.INT8.engine
[TRT] cache file not found, profiling network model on device DLA_0
[TRT] device DLA_0, loading /usr/local/bin/networks/ped-100/deploy.prototxt /usr/local/bin/networks/ped-100/snapshot_iter_70800.caffemodel
[TRT] retrieved Output tensor “coverage”: 1x32x64
[TRT] retrieved Output tensor “bboxes”: 4x32x64
[TRT] device DLA_0, configuring CUDA engine
[TRT] retrieved Input tensor “data”: 3x512x1024
[TRT] warning: device DLA_0 using INT8 precision with RANDOM calibration
[TRT] device DLA_0, building FP16: OFF
[TRT] device DLA_0, building INT8: ON
[TRT] device DLA_0, building CUDA engine (this may take a few minutes the first time a network is loaded)
[TRT] Network built for DLA requires kENTROPY_CALIBRATION_2 calibrator.
[TRT] Network validation failed.
[TRT] device DLA_0, failed to build CUDA engine
[TRT] device DLA_0, failed to load networks/ped-100/snapshot_iter_70800.caffemodel

Any idea how I can make it work?

I haven’t tried or tested the object detection models from jetson-inference on DLA, so they aren’t guaranteed to work.

However, have you tried FP16 first? Also you may want to enable GPU fallback.

YES, it worked with FP16 (I haven’t checked the results in the output, but TensorRT seems to be running without any trouble), thanks @dusty_nv!

Now I’m trying to modify the jetson-inference lib to be able to launch 3 processes (GPU + DLA_0 + DLA_1) in parallel, just by adding parameters (device and precision) to jetson.inference.detectNet (I’m using Python, btw). Has anyone already done that? => I mainly need to modify PyDetectNet_Init in PyDetectNet.cpp, located in the python/bindings dir, right?

Yes, you could modify the bindings to add your own argument parsing from Python.

Another thing you will want to do is call net->CreateStream() in PyDetectNet_Init() after the detectNet object is created, like is done here:

I’ve changed the PyDetectNet_Init function, but I get an error message (/PyDetectNet.cpp:536:3: error: ‘precisionType’ is not a member of ‘tensorNet’):

static int PyDetectNet_Init( PyDetectNet_Object* self, PyObject *args, PyObject *kwds )
{
	printf(LOG_PY_INFERENCE "PyDetectNet_Init()\n");
	
	// parse arguments
	PyObject* argList     = NULL;
	const char* network   = "ssd-mobilenet-v2";
	float threshold       = DETECTNET_DEFAULT_THRESHOLD;
	const char* precision = "FP16";
	const char* device    = "GPU";

	static char* kwlist[] = {"network", "argv", "threshold", "precision", "device", NULL};

	if( !PyArg_ParseTupleAndKeywords(args, kwds, "|sOf", kwlist, &network, &argList, &threshold, &precision, &device))
	{
		PyErr_SetString(PyExc_Exception, LOG_PY_INFERENCE "detectNet.__init()__ failed to parse args tuple");
		return -1;
	}
    
	// determine whether to use argv or built-in network
	if( argList != NULL && PyList_Check(argList) && PyList_Size(argList) > 0 )
	{
		printf(LOG_PY_INFERENCE "detectNet loading network using argv command line params\n");

		// parse the python list into char**
		const size_t argc = PyList_Size(argList);

		if( argc == 0 )
		{
			PyErr_SetString(PyExc_Exception, LOG_PY_INFERENCE "detectNet.__init()__ argv list was empty");
			return -1;
		}

		char** argv = (char**)malloc(sizeof(char*) * argc);

		if( !argv )
		{
			PyErr_SetString(PyExc_MemoryError, LOG_PY_INFERENCE "detectNet.__init()__ failed to malloc memory for argv list");
			return -1;
		}

		for( size_t n=0; n < argc; n++ )
		{
			PyObject* item = PyList_GetItem(argList, n);
			
			if( !PyArg_Parse(item, "s", &argv[n]) )
			{
				PyErr_SetString(PyExc_Exception, LOG_PY_INFERENCE "detectNet.__init()__ failed to parse argv list");
				return -1;
			}

			printf(LOG_PY_INFERENCE "detectNet.__init__() argv[%zu] = '%s'\n", n, argv[n]);
		}

		// load the network using (argc, argv)
		self->net = detectNet::Create(argc, argv);

		// free the arguments array
		free(argv);
	}
	else
	{
		printf(LOG_PY_INFERENCE "detectNet loading build-in network '%s'\n", network);
		
		// parse the selected built-in network
		detectNet::NetworkType networkType = detectNet::NetworkTypeFromStr(network);

		tensorNet::precisionType precision = tensorNet::precisionTypeFromStr(precision);
		tensorNet::deviceType device = tensorNet::deviceTypeFromStr(device);

		
		if( networkType == detectNet::CUSTOM )
		{
			PyErr_SetString(PyExc_Exception, LOG_PY_INFERENCE "detectNet invalid built-in network was requested");
			printf(LOG_PY_INFERENCE "detectNet invalid built-in network was requested ('%s')\n", network);
			return -1;
		}
		
		// load the built-in network
		self->net = detectNet::Create(networkType, threshold, precisionType, deviceType);
	}

	// confirm the network loaded
	if( !self->net )
	{
		PyErr_SetString(PyExc_Exception, LOG_PY_INFERENCE "detectNet failed to load network");
		printf(LOG_PY_INFERENCE "detectNet failed to load built-in network '%s'\n", network);
		return -1;
	}

	self->base.net = self->net;
	return 0;
}

Any idea how to solve it?
Also, I was thinking of using multiprocessing to run the 3 inferencers (GPU + 2x DLA) instead of net->CreateStream(), what do you think?

precisionType isn’t actually defined inside the tensorNet class, it is global, so you want to remove the tensorNet:: prefixes here:

tensorNet::precisionType precision = tensorNet::precisionTypeFromStr(precision);
tensorNet::deviceType device = tensorNet::deviceTypeFromStr(device);

// should be something like ->
precisionType precisionLevel = precisionTypeFromStr(precision);
deviceType deviceID = deviceTypeFromStr(device);

Note that the enum variables also need names different from the const char* precision/device arguments you parsed above (you can’t re-declare them with a different type in the same scope), and those enum values are what you then pass into detectNet::Create().

It may indeed be easier to do multiprocessing, as long as you don’t need to share the GPU data between processes without copying.

@dusty_nv thanks for your help. After a lot of time spent understanding and testing, it works now :)

I get around 36 FPS with pednet (512x1024 frames), which is equivalent to around 200 FPS at 300x300 (~19 MPx/s). Not bad, but far from the 850 FPS I got with SSD-Mobilenet-v1 in jetson-benchmarks! It seems the GPU is capable of 28 FPS (14.7 MPx/s) and the DLAs of about ~4 FPS each (2 MPx/s, when all are running together).
My configuration:
GPU: pednet INT8
DLA_0: pednet FP16
DLA_1: pednet FP16
The script is in Python 3 and uses multiprocessing, cropping the big camera frames into multiple 512x1024 thumbnails that are sent to the inferencers via Python queues (simplified sketch below). What do you think of that result?
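A simplified sketch of that layout (the device= and precision= keyword arguments come from my modified bindings, so this won’t run on the stock library, and I’ve left out the cropping and post-processing):

import multiprocessing as mp
import jetson.inference
import jetson.utils

def inference_worker(device, precision, in_queue, out_queue):
    # device= and precision= are the custom kwargs added in my modified bindings
    net = jetson.inference.detectNet("pednet", threshold=0.5,
                                     device=device, precision=precision)
    while True:
        crop = in_queue.get()                    # 512x1024 thumbnail as a numpy array
        if crop is None:                         # poison pill -> stop the worker
            break
        img = jetson.utils.cudaFromNumpy(crop)   # copy the crop into GPU memory
        detections = net.Detect(img)
        out_queue.put([(d.ClassID, d.Confidence,
                        d.Left, d.Top, d.Right, d.Bottom) for d in detections])

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    # one process per inference engine: GPU (INT8) + the two DLAs (FP16)
    configs = [("GPU", "INT8"), ("DLA_0", "FP16"), ("DLA_1", "FP16")]
    workers = [mp.Process(target=inference_worker, args=(dev, prec, in_q, out_q))
               for dev, prec in configs]
    for w in workers:
        w.start()
    # ... the main process crops the camera frames, feeds in_q and drains out_q ...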

The pednet model is based on the older, slower DetectNet architecture. It is not SSD-Mobilenet.

I could get around 90 FPS (8.1 MPx/s) with SSD-Mobilenet on the GPU, not more (compared to 25 FPS with pednet => 13.1 MPx/s). I’m still trying to improve the results. Do you know if there is a caffemodel of SSD-Mobilenet trained for pedestrians only (and with a higher input size)?

Hi @Pelepicier, sorry for the delay - the SSD-Mobilenet models used in jetson-inference aren’t caffemodels; they are TensorFlow (UFF) and ONNX (PyTorch). The ONNX models are the ones that are re-trainable in the tutorial, so you could re-train one only on pedestrians and check the performance after an epoch.

You can also use the production peoplenet model with DeepStream: https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet