Hello,
I’m trying to run a sample inference with NVCaffe on Drive PX 2 with dGPU, but for some reason I can’t push the GPU to work efficiently. The load stays on 0% (also when running on iGPU). GPU is definitely doing something as running the same on CPU is even slower, also nvperf shows GPU activities. When I run sample TensorFlow code I can easily get 99% GPU utilization.
Initially I though it’s an overhead on transferring data to GPU, so in the sample below I’m pushing a random array only once, and run forward() in a loop, still not saturating the GPU. What’s wrong?
Model, for example from:
wget https://raw.githubusercontent.com/C-Aniruddh/realtime_object_recognition/master/MobileNetSSD_deploy.prototxt.txt
wget https://github.com/C-Aniruddh/realtime_object_recognition/raw/master/MobileNetSSD_deploy.caffemodel
Test code:
import numpy as np
import time
import caffe
from imutils.video import FPS
caffe.set_mode_gpu()
caffe.set_device(0)
print("Loading model...")
net = caffe.Net('MobileNetSSD_deploy.prototxt.txt', 'MobileNetSSD_deploy.caffemodel', caffe.TEST)
blob = np.random.rand(1,3,300,300)
net.blobs['data'].data[...] = blob
frame_count = 1
fps = FPS().start()
print("Starting inference...")
while frame_count<=100:
print(frame_count)
frame_count = frame_count + 1
detections = net.forward()
# update the FPS counter
fps.update()
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
print("Done")
Log:
python3 caffe_dpx.py
...
99
100
[INFO] elapsed time: 13.70
[INFO] approx. FPS: 7.30
Done
My setup - nvcaffe built from https://github.com/NVIDIA/caffe.git
>>> import caffe
>>> caffe.__version__
'0.17.2'
Makefile.config for NVCaffe:
## Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!
# cuDNN acceleration switch (uncomment to build with cuDNN).
# cuDNN version 6 or higher is required.
USE_CUDNN := 1
# NCCL acceleration switch (uncomment to build with NCCL)
# See https://github.com/NVIDIA/nccl
# USE_NCCL := 1
# Builds tests with 16 bit float support in addition to 32 and 64 bit.
# TEST_FP16 := 1
# uncomment to disable IO dependencies and corresponding data layers
# USE_OPENCV := 0
# USE_LEVELDB := 0
# USE_LMDB := 0
# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3
# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
# CUSTOM_CXX := g++
# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr
# CUDA architecture setting: going with all of them.
CUDA_ARCH := -gencode arch=compute_50,code=sm_50 \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61 \
-gencode arch=compute_62,code=sm_62 \
-gencode arch=compute_62,code=compute_62
# BLAS choice:
# atlas for ATLAS
# mkl for MKL
# open for OpenBlas - default, see https://github.com/xianyi/OpenBLAS
# BLAS := open
BLAS := atlas
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# BLAS_INCLUDE := /opt/OpenBLAS/include/
# BLAS_LIB := /opt/OpenBLAS/lib/
# Homebrew puts openblas in a directory that is not on the standard search path
# BLAS_INCLUDE := $(shell brew --prefix openblas)/include
# BLAS_LIB := $(shell brew --prefix openblas)/lib
# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app
# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
# PYTHON_INCLUDE := /usr/include/python2.7 \
# /usr/lib/python2.7/dist-packages/numpy/core/include
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
# $(ANACONDA_HOME)/include/python2.7 \
# $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \
# Uncomment to use Python 3 (default is Python 2)
PYTHON_LIBRARIES := boost_python-py35 python3.5m
PYTHON_INCLUDE := /usr/include/python3.5m \
/usr/local/lib/python3.5/dist-packages/numpy/core/include
# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib
# Homebrew installs numpy in a non standard path (keg only)
# PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
# PYTHON_LIB += $(shell brew --prefix numpy)/lib
# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/aarch64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial
# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
# INCLUDE_DIRS += $(shell brew --prefix)/include
# LIBRARY_DIRS += $(shell brew --prefix)/lib
# Uncomment to use `pkg-config` to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
# USE_PKG_CONFIG := 1
BUILD_DIR := build
DISTRIBUTE_DIR := distribute
# Uncomment for debugging. Does not work on OSX due to https://github.com/BVLC/caffe/issues/171
# DEBUG := 1
# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0
# enable pretty build (comment to see full commands)
Q ?= @
# shared object suffix name to differentiate branches
LIBRARY_NAME_SUFFIX := -nv
GPU info:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "DRIVE PX 2 AutoChauffeur"
CUDA Driver Version / Runtime Version 10.0 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 3840 MBytes (4026466304 bytes)
( 9) Multiprocessors, (128) CUDA Cores/MP: 1152 CUDA Cores
GPU Max Clock rate: 1290 MHz (1.29 GHz)
Memory Clock rate: 3003 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 10.0 / 9.2
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6389 MBytes (6699651072 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from DRIVE PX 2 AutoChauffeur (GPU0) -> NVIDIA Tegra X2 (GPU1) : No
> Peer access from NVIDIA Tegra X2 (GPU1) -> DRIVE PX 2 AutoChauffeur (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 9.2, NumDevs = 2
Result = PASS