Tiny YOLOv4 TensorRT - too many resources requested for launch on 4GB Nano

paul55 · December 17, 2020, 1:45pm

I’m trying to run a TensorRT Tiny YOLOv4 model on a 4GB Nano development board.

I’m running the Nano headless, so it should have enough memory (details in the jtop output below). The model inference also works fine on my MacBook CPU.

But whenever I try to run the inference script on the Nano (available here), I get the following “too many resources requested” error:

2020-12-16 20:01:04.688877: F tensorflow/core/kernels/resize_bilinear_op_gpu.cu.cc:493] Non-OK-status: GpuLaunchKernel(kernel, config.block_count, config.thread_per_block, 0, d.stream(), config.virtual_thread_count, images.data(), height_scale, width_scale, batch, in_height, in_width, channels, out_height, out_width, output.data()) status: Internal: too many resources requested for launch
Fatal Python error: Aborted

Thread 0x0000007f876e6010 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60 in quick_execute
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550 in call
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924 in _call_flat
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 106 in _call_flat
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1722 in _call_with_flat_signature
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1673 in _call_impl
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1655 in __call__
  File "detect_video.py", line 92 in main
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251 in _run_main
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303 in run
  File "detect_video.py", line 124 in <module>
Aborted (core dumped)

I’ve read elsewhere (e.g. here) that this error might relate to CUDA’s maximum number of threads per block being too large. According to deviceQuery (see below) it’s set to 1024. But I’m not sure

a) whether this is the problem, or

b) how to go about reducing the max threads per block (e.g. to 512) if it is the problem.

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156682240 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

I’m very much a beginner with Python/Tensorflow/Jetson, so would really appreciate some help to get my inference running.

AastaLLL · December 18, 2020, 3:01am

Hi,

“too many resources requested for launch” indicates that a kernel launch is requesting resources that can never be satisfied by the current device.
Requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks.
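
As a rough illustration of one way this limit can be hit, the deviceQuery output in the first post reports 32768 registers per block and a maximum of 1024 threads per block. The sketch below assumes a simplified model in which a launch needs registers_per_thread × threads_per_block registers (real hardware allocates registers in coarser units, so the true limit can be somewhat tighter). The function names and the figure of 40 registers per thread are hypothetical, chosen only to show the arithmetic.

```python
# Simplified model of the per-block register budget on the Nano (Tegra X1).
# Both limits are taken from the deviceQuery output in the first post.
REGISTERS_PER_BLOCK = 32768
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32


def launch_fits(registers_per_thread: int, threads_per_block: int) -> bool:
    """Return True if a block stays within the thread and register limits.

    Assumption: registers needed = registers_per_thread * threads_per_block.
    Real GPUs round register allocation up, so this is an optimistic bound.
    """
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return False
    return registers_per_thread * threads_per_block <= REGISTERS_PER_BLOCK


def max_threads_for(registers_per_thread: int) -> int:
    """Largest block size (a multiple of the warp size) that fits the budget."""
    limit = min(MAX_THREADS_PER_BLOCK, REGISTERS_PER_BLOCK // registers_per_thread)
    return (limit // WARP_SIZE) * WARP_SIZE


# Hypothetical kernel that needs 40 registers per thread.
print(launch_fits(40, 1024))  # 40 * 1024 = 40960 registers, over the budget
print(launch_fits(40, 512))   # 40 * 512 = 20480 registers, within the budget
print(max_threads_for(40))    # 32768 // 40 = 819, rounded down to a warp multiple
```

Under this model, a kernel that uses more than 32 registers per thread cannot be launched with the full 1024 threads per block, which is why a smaller block size is a common workaround.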

Assuming you are using TensorFlow v1.15.x, the relevant source is shown below:

github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/resize_bilinear_op_gpu.cu.cc#L388

        std::is_same<float, T>::value) {
      int num_threads_per_pixel =
          std::min(max_num_threads_per_pixel, channels / channels_per_thread);
      config = GetGpuLaunchConfig(
          out_height * out_width * num_threads_per_pixel, d);
      config.virtual_thread_count = num_threads_per_pixel;
      kernel = ResizeBilinearKernel_faster<T>;
    }
  }

  TF_CHECK_OK(
      GpuLaunchKernel(kernel, config.block_count, config.thread_per_block, 0,
                      d.stream(), config.virtual_thread_count, images.data(),
                      height_scale, width_scale, batch, in_height, in_width,
                      channels, out_height, out_width, output.data()));
  }
};

// Partial specialization of ResizeBilinearGrad functor for a GPUDevice.
template <typename T>
struct ResizeBilinearGrad<GPUDevice, T> {

It seems you can change the number of blocks and threads by adjusting the config.block_count and config.thread_per_block values.
We have a similar patch for PyTorch below for your reference:
https://gist.github.com/dusty-nv/ce51796085178e1f38e3c6a1663a93a1#file-pytorch-1-7-jetpack-4-4-1-patch
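
To make the idea concrete: a launch configuration is essentially a pair (block count, threads per block) that together cover all the work items. The sketch below is not TensorFlow's GetGpuLaunchConfig. It is a hypothetical stand-in (the names LaunchConfig and make_launch_config, the 512-thread cap, and the 416 x 416 output size are all assumptions) that shows how capping the block size trades fewer threads per block for more blocks.

```python
from dataclasses import dataclass


@dataclass
class LaunchConfig:
    """Hypothetical stand-in for a GPU launch configuration."""
    block_count: int
    thread_per_block: int


def make_launch_config(work_items: int, max_threads: int = 512) -> LaunchConfig:
    """Cover `work_items` elements with blocks of at most `max_threads` threads."""
    if work_items <= 0 or max_threads <= 0:
        raise ValueError("work_items and max_threads must be positive")
    threads = min(work_items, max_threads)
    # Ceiling division so that block_count * threads >= work_items.
    blocks = (work_items + threads - 1) // threads
    return LaunchConfig(block_count=blocks, thread_per_block=threads)


# Example: one thread per output pixel for an assumed 416 x 416 resize output.
full = make_launch_config(416 * 416, max_threads=1024)
capped = make_launch_config(416 * 416, max_threads=512)
print(full)
print(capped)
```

Judging by the quoted source, these values are computed inside the C++ kernel code, so changing them would likely mean patching and rebuilding TensorFlow rather than setting an option from Python.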

Thanks.

paul55 · December 18, 2020, 10:38am

Thanks very much!

Apologies, but I’m still not clear on how exactly I could adjust the config.block_count and config.thread_per_block values. Neither Googling ‘how to set “config.block_count” tensorflow’ nor searching ‘config.block_count’ on tensorflow.org revealed anything useful.

I’m using TensorFlow 2.3.0, so I assume this is the relevant source file for that version.

But beyond that, I’m not sure what to do next.

AastaLLL · December 29, 2020, 5:28am

Hi,

Sorry for the late update.

May I know which TensorFlow package you installed?
Did you use our official release listed here?

The variable can be set via the YOLOv4 script or via a pre-defined value at build time.
In general, our release should already set the pre-defined value based on the Jetson hardware resources.
So if you installed our release, please check how to set the configuration within the YOLOv4 script directly.

Thanks.

paul55 · December 29, 2020, 8:56am

Thanks - no problem.

I’m using your official release. It was installed using:

sudo pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 tensorflow==2.3.0+nv20.09

I found the Nvidia container number here.

Given the error I’m getting, it seems the release didn’t update based on the available Jetson hardware resource. Could you explain how to set the variables manually at build time please?

AastaLLL · December 30, 2020, 6:08am

Hi,

In general, we test it on Xavier (NX), TX2, and Nano.
So the Jetson-based package should already adjust the resources for the platform accordingly.

Let us try the script in our environment first.
We will get back to you once we make any progress.

Thanks.

paul55 · December 30, 2020, 8:45am

Great - thanks

paul55 · January 8, 2021, 11:12am

Adding the following lines to the top of the detect script (before TensorFlow imports) resolves the issue:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

Thanks to @paul.damsa for the solution!

AastaLLL · January 14, 2021, 7:17am

Good to know this!
Thanks for the feedback.

waleed5461 · March 10, 2021, 7:07am

But this command disables the GPU.

565463192 · July 1, 2021, 4:02am

Excuse me, I don’t understand. If you use the above command, isn’t it just using the CPU to run the program (on the Jetson Nano)? @paul55
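
The concern raised in the last two replies can be reasoned through directly. CUDA_VISIBLE_DEVICES is a comma-separated list of device indices, and an invalid index hides itself and every entry after it. The Nano has a single GPU at index 0 (deviceQuery reports NumDevs = 1), so a value of '1' leaves no GPU visible to the process. The sketch below models only the integer-index form of the variable (not GPU UUIDs), and the function name visible_devices is hypothetical.

```python
from typing import List, Optional


def visible_devices(env_value: Optional[str], num_devices: int) -> List[int]:
    """Model which physical GPU indices stay visible to a CUDA process.

    Simplified: handles only integer indices, not GPU UUIDs.
    """
    if env_value is None:
        # Variable unset: every device is visible.
        return list(range(num_devices))
    visible = []
    for token in env_value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            break  # Not an integer: treated as invalid in this model.
        if index < 0 or index >= num_devices:
            break  # Invalid index: it and everything after it are hidden.
        visible.append(index)
    return visible


# The Nano reports NumDevs = 1, so its only GPU has index 0.
print(visible_devices(None, 1))  # variable unset
print(visible_devices("0", 1))   # GPU 0 explicitly selected
print(visible_devices("1", 1))   # the workaround posted above
```

On the device itself, calling tf.config.list_physical_devices('GPU') after the environment variable is set should show whether TensorFlow still sees a GPU. An empty list would mean inference is running on the CPU, which would also explain why the kernel launch error disappears.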