NVCaffe - can't load GPU on inference

Hello,

I’m trying to run a sample inference with NVCaffe on a Drive PX 2 with the dGPU, but for some reason I can’t push the GPU to work efficiently. The load stays at 0% (also when running on the iGPU). The GPU is definitely doing something, since running the same code on the CPU is even slower, and nvprof shows GPU activity. When I run sample TensorFlow code I can easily get 99% GPU utilization.

Initially I thought it was the overhead of transferring data to the GPU, so in the sample below I push a random array to the network only once and run forward() in a loop, yet it still doesn’t saturate the GPU. What’s wrong?

Model, for example from:

wget https://raw.githubusercontent.com/C-Aniruddh/realtime_object_recognition/master/MobileNetSSD_deploy.prototxt.txt
wget https://github.com/C-Aniruddh/realtime_object_recognition/raw/master/MobileNetSSD_deploy.caffemodel

Test code:

import numpy as np
import time
import caffe
from imutils.video import FPS

caffe.set_mode_gpu()
caffe.set_device(0)

print("Loading model...")

net = caffe.Net('MobileNetSSD_deploy.prototxt.txt', 'MobileNetSSD_deploy.caffemodel', caffe.TEST)

# fill the input blob once, outside the loop, to exclude host-to-device transfer overhead
blob = np.random.rand(1, 3, 300, 300)

net.blobs['data'].data[...] = blob

frame_count = 1
fps = FPS().start()

print("Starting inference...")
while frame_count <= 100:
    print(frame_count)
    frame_count += 1

    # the input blob was filled once above; only forward() runs in the loop
    detections = net.forward()

    # update the FPS counter
    fps.update()

fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
print("Done")

Log:

python3 caffe_dpx.py

...

99
100
[INFO] elapsed time: 13.70
[INFO] approx. FPS: 7.30
Done

My setup: NVCaffe built from https://github.com/NVIDIA/caffe.git

>>> import caffe
>>> caffe.__version__
'0.17.2'

Makefile.config for NVCaffe:

## Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
# cuDNN version 6 or higher is required.
USE_CUDNN := 1

# NCCL acceleration switch (uncomment to build with NCCL)
# See https://github.com/NVIDIA/nccl
# USE_NCCL := 1

# Builds tests with 16 bit float support in addition to 32 and 64 bit.
# TEST_FP16 := 1

# uncomment to disable IO dependencies and corresponding data layers
# USE_OPENCV := 0
# USE_LEVELDB := 0
# USE_LMDB := 0

# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
# CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
CUDA_ARCH :=    -gencode arch=compute_50,code=sm_50 \
                -gencode arch=compute_52,code=sm_52 \
                -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_61,code=compute_61 \
                -gencode arch=compute_62,code=sm_62 \
                -gencode arch=compute_62,code=compute_62

# BLAS choice:
# atlas for ATLAS
# mkl for MKL
# open for OpenBlas - default, see https://github.com/xianyi/OpenBLAS
# BLAS := open
BLAS := atlas
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# BLAS_INCLUDE := /opt/OpenBLAS/include/
# BLAS_LIB := /opt/OpenBLAS/lib/

# Homebrew puts openblas in a directory that is not on the standard search path
# BLAS_INCLUDE := $(shell brew --prefix openblas)/include
# BLAS_LIB := $(shell brew --prefix openblas)/lib

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
# PYTHON_INCLUDE := /usr/include/python2.7 \
#               /usr/lib/python2.7/dist-packages/numpy/core/include
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
                # $(ANACONDA_HOME)/include/python2.7 \
                # $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \

# Uncomment to use Python 3 (default is Python 2)
PYTHON_LIBRARIES := boost_python-py35 python3.5m
PYTHON_INCLUDE := /usr/include/python3.5m \
                 /usr/local/lib/python3.5/dist-packages/numpy/core/include

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib

# Homebrew installs numpy in a non standard path (keg only)
# PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
# PYTHON_LIB += $(shell brew --prefix numpy)/lib

# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/aarch64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial

# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
# INCLUDE_DIRS += $(shell brew --prefix)/include
# LIBRARY_DIRS += $(shell brew --prefix)/lib

# Uncomment to use `pkg-config` to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
# USE_PKG_CONFIG := 1

BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to https://github.com/BVLC/caffe/issues/171
# DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @

# shared object suffix name to differentiate branches
LIBRARY_NAME_SUFFIX := -nv

GPU info:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "DRIVE PX 2 AutoChauffeur"
  CUDA Driver Version / Runtime Version          10.0 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 3840 MBytes (4026466304 bytes)
  ( 9) Multiprocessors, (128) CUDA Cores/MP:     1152 CUDA Cores
  GPU Max Clock rate:                            1290 MHz (1.29 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          10.0 / 9.2
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 6389 MBytes (6699651072 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from DRIVE PX 2 AutoChauffeur (GPU0) -> NVIDIA Tegra X2 (GPU1) : No
> Peer access from NVIDIA Tegra X2 (GPU1) -> DRIVE PX 2 AutoChauffeur (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 9.2, NumDevs = 2
Result = PASS

Hi Dariusz,

please count the theoretical FLOPs of the net to get a correct estimate of the ALU utilization. Due to strength-reduced convolutions you might otherwise get a distorted result.

  • Fabian

Thank you Fabian for the tip! A rough back-of-envelope along those lines is sketched below. But optimal GPU utilization aside, I’d love to see any load at all on the iGPU; it’s weird that it stays at 0%. I can see some load even with Caffe2 models, but never with Caffe/NVCaffe.
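
For what it’s worth, here is the estimate; the FLOPs-per-frame figure for MobileNet-SSD at 300×300 is an assumption based on commonly quoted numbers, not a measurement:

# back-of-envelope ALU utilization; the per-frame FLOPs figure is assumed
flops_per_frame = 2.4e9            # ~1.2 GMACs = ~2.4 GFLOPs per 300x300 frame (assumption)
fps = 7.3                          # measured above
achieved = flops_per_frame * fps   # ~17.5 GFLOP/s

# dGPU peak FP32: 1152 CUDA cores * 1.29 GHz * 2 FLOPs per FMA (see deviceQuery above)
peak = 1152 * 1.29e9 * 2           # ~2.97 TFLOP/s

print("utilization: {:.2%}".format(achieved / peak))   # roughly 0.6%

Even with a large error margin on the per-frame FLOPs, that is well under 1% of what the hardware can do.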

Hi Dariusz,

I need to replicate your case on my bench and will let you know more information as soon as I have a solution.

  • Fabian

Thank you in advance!
If you need more info, just ask here. And here’s how I built NVCaffe:

# Install dependencies
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-dev libhdf5-serial-dev protobuf-compiler
sudo apt-get install --no-install-recommends libboost-all-dev
sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev
sudo apt-get install libatlas-base-dev libopenblas-dev

# Install libturbojpeg and fix the link, according to https://github.com/NVIDIA/caffe/
sudo apt-get install libturbojpeg
sudo ln -sr /usr/lib/aarch64-linux-gnu/libturbojpeg.so.0.1.0 /usr/lib/aarch64-linux-gnu/libturbojpeg.so

# Clone the sources
git clone https://github.com/NVIDIA/caffe.git
cd caffe
git checkout v0.17.2

# Configure the build
# Get the Makefile.config pasted in my first post above
cp Makefile.config.example Makefile.config
vi Makefile.config

# Build Caffe and the tests
make -j6 all
make -j6 test
make runtest

# Build and install leveldb
cd ..
mkdir leveldb
cd leveldb/
wget https://pypi.python.org/packages/03/98/1521e7274cfbcc678e9640e242a62cbcd18743f9c5761179da165c940eac/leveldb-0.20.tar.gz
tar xvzf leveldb-0.20.tar.gz
cd leveldb-0.20/
python3 setup.py build
sudo python3 setup.py install

# Install remaining dependencies and build pycaffe
cd ../../caffe/
pkgs=`sed 's/[>=<].*$//' python/requirements.txt`
for pkg in $pkgs; do sudo pip3 install $pkg; done
make -j6 pycaffe

# Add pycaffe to PYTHONPATH
export PYTHONPATH=/home/nvidia/caffe/python
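
As a quick sanity check of the resulting pycaffe build (expected output per the version info above):

import caffe
print(caffe.__version__)   # expect '0.17.2'
caffe.set_mode_gpu()       # should not raise if the CUDA build is usable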

Hi Dariusz,

did you compile OpenCV as I have written here [url]https://devtalk.nvidia.com/default/topic/1044512/general/opencv-unable-to-stop-the-stream-inappropriate-ioctl-for-device/?offset=2#5299842[/url]?

  • Fabian

I compiled OpenCV (3.4.4) natively on the device; I’m posting the steps here. But in the sample code from my first post in this thread I’m not using OpenCV at all, just a random numpy array. Does it matter then?

# Install dependencies
sudo apt-get install cmake \
    pkg-config \
    python3-dev \
    python3-numpy \
    python3-py \
    python3-pytest \
    python-dev \
    python-numpy \
    python-py \
    python-pytest \
    ffmpeg \
    libboost-all-dev \
    libjpeg-dev \
    libpng-dev \
    libtiff-dev \
    libavcodec-dev \
    libavformat-dev \
    libswscale-dev \
    libv4l-dev \
    v4l-utils \
    libxvidcore-dev \
    libx264-dev \
    libx265-dev \
    libvpx-dev \
    libgtk-3-dev \
    libatlas-base-dev \
    libgstreamer1.0-dev \
    libgstreamer-plugins-base1.0-dev \
    libdc1394-22-dev \
    libavresample-dev \
    gfortran
	
# Fetch sources and configure the build
git clone https://github.com/opencv/opencv.git
cd opencv/
git checkout 3.4.4
 
cd ..
git clone https://github.com/opencv/opencv_contrib.git
cd opencv_contrib
git checkout 3.4.4
 
cd ../opencv
mkdir build
cd build/

cmake -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D BUILD_NEW_PYTHON_SUPPORT=ON \
    -D BUILD_EXAMPLES=ON \
    -D INSTALL_PYTHON_EXAMPLES=ON \
    -D INSTALL_C_EXAMPLES=ON \
    -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules \
    -D WITH_CUDA=ON \
    -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
    -D CUDA_ARCH_BIN="6.1 6.2 7.2 7.5" \
    -D CUDA_ARCH_PTX="" \
    -D ENABLE_FAST_MATH=ON \
    -D CUDA_FAST_MATH=ON \
    -D WITH_FFMPEG=ON \
    -D WITH_CUBLAS=ON \
    -D WITH_LIBV4L=ON \
    -D WITH_GTK=ON \
    -D WITH_GSTREAMER=ON \
    -D WITH_GSTREAMER_0_10=OFF \
    -D WITH_TBB=ON \
    ../
	
# Execute the build
time make -j6
sudo make install

# Configure ldconfig
sudo sh -c 'echo "/usr/local/lib" > /etc/ld.so.conf.d/opencv.conf'
sudo ldconfig
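
A quick check (sketch) that the build actually picked up CUDA; the exact wording of the build-info lines may differ between OpenCV versions:

import cv2

print(cv2.__version__)   # expect '3.4.4'
# list the CUDA-related lines from the build information
print([line for line in cv2.getBuildInformation().splitlines() if "CUDA" in line])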

Hi Dariusz,

I am still investigating and will come back to you as soon as I have an update on that.

  • Fabian

Thank you for your time and keeping me updated!

Hi Dariusz,

just a quick vital sign: I expect to have an update Monday/Tuesday next week.

  • Fabian

Great, I really appreciate that you keep me updated, thank you!

Hi Dariusz,

when following your guide, I get the following during make runtest:

*** Error in `.build_release/tools/caffe': free(): invalid pointer: 0x0000000000446110 ***
*** Aborted at 1548775285 (unix time) try "date -d @1548775285" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e90000242c) received by PID 9260 (TID 0x7f7cea5000) from PID 9260; stack trace: ***
    @       0x7fa021e6c0 ([vdso]+0x6bf)
    @       0x7f9ddf6528 gsignal
Makefile:560: recipe for target 'runtest' failed
make: *** [runtest] Aborted (core dumped)

Did this error occur in your build as well? It looks like there are some x86 intrinsics on arm64. How did you resolve this?

  • Fabian

Yes, I was hit by this as well. I fixed it by installing libtcmalloc-minimal4 and making sure it’s preloaded before running the code.
You can export LD_PRELOAD temporarily as below, or configure it permanently, e.g. in /etc/profile.d/ld_preload.sh:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
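
To verify the preload actually took effect before caffe is imported, a minimal check (sketch) is:

import os

# without tcmalloc in LD_PRELOAD, the free(): invalid pointer abort may reappear
assert "libtcmalloc" in os.environ.get("LD_PRELOAD", ""), "tcmalloc not preloaded"

import caffe   # safe to import once the allocator is in place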

Hi,

thanks for providing more information on how to get rid of this bug. Still, I am quite worried about why this requires the tcmalloc libs to be installed, since any other memory allocator should do fine as well. The information on how to free an object should be hidden from the user…

Anyhow, let me come back to your issue. On my bench it works flawlessly, so my question is: how exactly do you measure the GPU utilization?

Fabian

I measured GPU utilization with

sudo tegrastats

after forcing the model to run on the iGPU (I know that tegrastats doesn’t show the load on the dGPU).
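
For reference, selecting the iGPU from pycaffe is just a matter of the device index (per the deviceQuery listing above, device 1 is the integrated Tegra X2 GPU):

import caffe

caffe.set_mode_gpu()
caffe.set_device(1)   # device 1 = integrated Tegra X2 GPU on this board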

Hi,

as already written here [url]https://devtalk.nvidia.com/default/topic/1044159/general/always-get-0-of-gpu-usage-from-tegrastats-in-sdk-5-0-5-13-px2/[/url], the tegrastats utility does not work in our SDK/PDK version.

Therefore, please use nvprof to measure the performance and utilization.

Fabian

Yes, I know that. That’s why I’m saying I used the iGPU, where tegrastats does work. If you double-check my first post in this thread, you will notice my comment:
When I run sample TensorFlow code I can easily get 99% GPU utilization.

We recommend using nvprof to measure both iGPU and dGPU utilization.

The fact that tegrastats shows you 99% iGPU or dGPU utilization can have arbitrary causes, so do not take those numbers for granted.

What makes me wonder is why you would run TensorFlow instead of TensorRT on the DPX2. The DPX2 is not meant for training on the platform, and if you want to use any custom layers, just insert them as custom plugins into TRT.

Fabian

I can try checking with nvprof, but it would be weird for tegrastats to show load with TensorFlow but not with NVCaffe when both run on the same iGPU.

The reason I’m using pure TensorFlow models on the DPX2 is that I experience extremely long loading times with TensorRT-optimized ones there, as I described at https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/ Once the model is loaded, inference is indeed faster than with the pure TensorFlow model, but I can’t wait 10 minutes on each execution of the program… Anyway, that’s to be discussed in that other thread.