Are context creation or other CUDA operations blocking?

I want to run inference on one engine concurrently from multiple C++ threads. It seems that context creation or other CUDA operations are blocking, but I am not sure. The time consumption almost doubles when I double the number of threads.

The code is as below:

// Create an execution context and a CUDA stream.
IExecutionContext* context = engine->createExecutionContext();
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// Copy the input to the device, run inference, and copy the output back, all on the same stream.
CHECK(cudaMemcpyAsync(buffers[inputIndex], inputArray, BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float), cudaMemcpyHostToDevice, stream));
context->enqueue(BATCH_SIZE, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(outputArray, buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
// Wait for all work queued on the stream to finish.
CHECK(cudaStreamSynchronize(stream));

Can you point me to documentation on how to design multithreaded code for concurrent inference?

Moving to Cuda Performance for support coverage.

Move to TRT

Hello,

To help us debug this issue, can you share a small repro that demonstrates the performance degradation when more threads are launched? I don't see thread launch/management code in your sample; please include it as well.

regards,
NVIDIA Enterprise Support

@NVES sorry for the late reply. I uploaded a small repro at https://github.com/IvyGongoogle/test-trt, please check it. You can modify NUM_THREADS on line 5 of main.cpp.

Hoping for your reply.

Hello,

I’m getting the following compile errors
root@f35f5b0dbe73:/mnt/test-trt# make

g++ -c main.cpp -g -Wall -std=c++11 -O2 -I./include -I/usr/local/cuda/include -I/usr/local/include/opencv4
main.cpp:9:13: error: 'string' is not a member of 'cv'
 std::vector<cv::string> imgsName;
             ^
main.cpp:9:13: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:9:13: error: 'string' is not a member of 'cv'
 std::vector<cv::string> imgsName;
             ^
main.cpp:9:13: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:9:23: error: template argument 1 is invalid
 std::vector<cv::string> imgsName;
                       ^
main.cpp:9:23: error: template argument 2 is invalid
main.cpp:11:85: error: 'string' is not a member of 'cv'
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                     ^
main.cpp:11:85: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:11:85: error: 'string' is not a member of 'cv'
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                     ^
main.cpp:11:85: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:11:95: error: template argument 1 is invalid
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                               ^
main.cpp:11:95: error: template argument 2 is invalid
main.cpp: In function 'void parseImgDir(const string&, std::vector<cv::Mat>&, int&)':
main.cpp:28:42: error: request for member 'push_back' in 'imgsName', which is of non-class type 'int'
                                 imgsName.push_back(imgName);
                                          ^
main.cpp: In function 'void thread_task(const char*, int, int)':
main.cpp:54:65: error: invalid types 'int[int]' for array subscript
           trtNetPtr->inference(imgs[imgIndex], imgsName[imgIndex], time_preprocess, time_swithContext, time_pureInfer, time_destroy);
                                                                 ^
Makefile:16: recipe for target 'main.o' failed
make: *** [main.o] Error 1
root@f35f5b0dbe73:/mnt/test-trt#

@NVES please modify

std::vector<cv::string> imgsName;

to

std::vector<std::string> imgsName;

on line 9 of main.cpp. You can also get this change from the latest repo at https://github.com/IvyGongoogle/test-trt.
Thanks.

I'm getting the following error. I recommend building your application in a Docker container to isolate any dependency issues.

root@3fbbd400a988:/mnt/test-trt# make
g++ -c main.cpp -g -Wall -std=c++11 -O2 -I./include -I/usr/local/cuda/include -I/usr/local/include/opencv4
main.cpp:11:85: error: 'string' is not a member of 'cv'
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                     ^
main.cpp:11:85: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:11:85: error: 'string' is not a member of 'cv'
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                     ^
main.cpp:11:85: note: suggested alternatives:
In file included from /usr/include/c++/5/string:39:0,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from ./include/tensorrtNet.hpp:3,
                 from main.cpp:3:
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
   typedef basic_string<char>    string;
                                 ^
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
/usr/include/c++/5/bits/stringfwd.h:74:33: note:   'std::__cxx11::string'
main.cpp:11:95: error: template argument 1 is invalid
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
                                                                                               ^
main.cpp:11:95: error: template argument 2 is invalid
main.cpp: In function 'void parseImgDir(const string&, std::vector<cv::Mat>&, int&)':
main.cpp:28:42: error: request for member 'push_back' in 'imgsName', which is of non-class type 'int'
                                 imgsName.push_back(imgName);
                                          ^
main.cpp: In function 'int main(int, char**)':
main.cpp:76:44: error: invalid initialization of reference of type 'int&' from expression of type 'std::vector<std::__cxx11::basic_string<char> >'
         parseImgDir(argv[2], imgs, imgsName);
                                            ^
main.cpp:11:6: note: in passing argument 3 of 'void parseImgDir(const string&, std::vector<cv::Mat>&, int&)'
 void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){
      ^
Makefile:16: recipe for target 'main.o' failed
make: *** [main.o] Error 1
root@3fbbd400a988:/mnt/test-trt#

@NVES sorry, it is only a namespace error. Please change

void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<cv::string>& imgsName){

to

void parseImgDir(const std::string& imgDir, std::vector<cv::Mat>& imgs, std::vector<std::string>& imgsName){

on line 11 of main.cpp as well.

Thanks.

I got past that and am now seeing:

root@92cbe228f45f:/mnt/test-trt# make
g++ -c main.cpp -g -Wall -std=c++11 -O2 -I./include -I/usr/local/cuda/include -I/usr/local/include/opencv4
g++ -c tensorrtNet.cpp -g -Wall -std=c++11 -O2 -I./include -I/usr/local/cuda/include -I/usr/local/include/opencv4
g++ -o main main.o tensorrtNet.o -g -Wall -std=c++11 -O2 -I./include -I/usr/local/cuda/include -I/usr/local/include/opencv4 -L./lib -L/search/odin/gongzhenting/local/gcc-6.1.0/lib64 -L/usr/local/cuda-10.0/targets/x86_64-linux/lib/ -L/usr/local/lib -lcudnn -lcublas -lcudart -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_imgcodecs -lpthread -lnvinfer -lnvparsers
root@92cbe228f45f:/mnt/test-trt# sh run.sh
TestData/1_(49).jpg
ERROR: ERROR: ERROR: UFFParser: Invalid UFF file, cannot be opened
UFFParser: Invalid UFF file, cannot be opened
ERROR: UFFParser: Invalid UFF file, cannot be opened
UFFParser: Invalid UFF file, cannot be opened
ERROR: UFFParser: Invalid UFF file, cannot be opened
ERROR: ERROR: tensorrtNet: Fail to parseERROR: ERROR: tensorrtNet: Fail to parse
tensorrtNet: Fail to parseWhoops, Unable to create engine with INPUT_H x INPUT_W:
1376 x 800
tensorrtNet: Fail to parseERROR:
Whoops, Unable to create engine with INPUT_H x INPUT_W: tensorrtNet: Fail to parse
Whoops, Unable to create engine with INPUT_H x INPUT_W: Whoops, Unable to create engine with INPUT_H x INPUT_W: 1376 x 800

1376Whoops, Unable to create engine with INPUT_H x INPUT_W:  x 1376 x 800
1376 x 800
800
Segmentation fault (core dumped)

@NVES, the error above is caused by an incomplete east.uff file, which is what you get if you fetch my repo with a plain git clone. I uploaded the large east.uff file with git-lfs, so please install git-lfs (sudo yum install git-lfs if your system is CentOS) and then use "git lfs clone https://github.com/IvyGongoogle/test-trt.git" to get the complete repo with the correct east.uff.

Thanks.

Looking forward to your reply.

Hello,

Per engineering,

  1. engine->createExecutionContext(), context->destroy(), cudaStreamCreate(), and cudaStreamDestroy() should NOT be run in parallel with context->enqueue() in other threads, since they contain blocking CUDA API calls (such as cudaMemcpy(), mainly for cuDNN/cuBLAS initialization and destruction). After commenting out these function calls, we got a >30% perf improvement for context->enqueue() with 5 and 10 threads.

  2. After doing (1), there is still some perf gap between 2 and 5 threads. This is because when multiple threads are all launching kernels or calling cudaMemcpyAsync() at the same time, each cudaLaunchKernel() call takes much longer, probably because of limitations in the CUDA driver (see: https://devtalk.nvidia.com/default/topic/1025463/overlapping-kernel-computing-with-stream-per-cpu-thread-slow-kernel-launches-/ )

  3. Attached are the nvprof results after the blocking functions were removed. They show many kernels running in parallel and no blocking CUDA API calls. We encourage you to use nvprof if you'd like a quick check for any apparent perf issues.

Based on these points, this is not a TRT bug.
T5_pure.zip (56.6 MB)
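
Point (1) above suggests a three-phase layout: create every per-thread context and stream serially before any worker starts, let each worker reuse its own state for all of its images, and destroy everything only after all workers have been joined. Below is a minimal sketch of that layout, not the code from the repro. WorkerState, create_states, worker_loop and run_workers are hypothetical names, and the TensorRT/CUDA calls are replaced by placeholders (marked in the comments) so that the sketch compiles without a GPU.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-thread TensorRT/CUDA state. In real code
// this would hold an IExecutionContext*, a cudaStream_t and device buffers.
struct WorkerState {
    int id = 0;
    int inferences_done = 0;
};

// Phase 1 (serial): in real code, call engine->createExecutionContext() and
// cudaStreamCreate() here, before any thread has started calling enqueue().
std::vector<WorkerState> create_states(int num_threads) {
    assert(num_threads > 0);
    std::vector<WorkerState> states(num_threads);
    for (int i = 0; i < num_threads; ++i) states[i].id = i;
    return states;
}

// Phase 2 (parallel): each thread reuses its own state for every image.
// In real code the loop body would be cudaMemcpyAsync / enqueue /
// cudaMemcpyAsync / cudaStreamSynchronize on the thread's own stream.
void worker_loop(WorkerState& state, int num_images) {
    for (int i = 0; i < num_images; ++i) {
        ++state.inferences_done;  // placeholder for one inference
    }
}

// Runs all three phases and returns the total number of inferences performed.
int run_workers(int num_threads, int images_per_thread) {
    std::vector<WorkerState> states = create_states(num_threads);

    std::vector<std::thread> threads;
    threads.reserve(states.size());
    for (WorkerState& s : states) {
        threads.emplace_back(worker_loop, std::ref(s), images_per_thread);
    }
    for (std::thread& t : threads) t.join();

    // Phase 3 (serial): in real code, call context->destroy() and
    // cudaStreamDestroy() here, after every worker has been joined.
    int total = 0;
    for (const WorkerState& s : states) total += s.inferences_done;
    return total;
}
```

With this layout, a call such as run_workers(5, 100) keeps the setup and teardown calls out of the window in which other threads are enqueuing work. It does not address the slower kernel launches described in point (2).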

@NVES thanks for your reply. But how do I open your T5_pure.nvprof? Do I have to import it with nvvp?

@NVES I ran nvprof ./main ./TestData/ and got this output:

==7692== Profiling application: ./main ./TestData/
==7692== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   29.44%  15.8338s      3000  5.2779ms  249.38us  17.354ms  sgemm_128x128x8_NN_vec
                   19.76%  10.6255s      3050  3.4838ms  1.0880us  81.641ms  [CUDA memcpy HtoD]
                   13.04%  7.01351s     18000  389.64us  179.43us  1.5858ms  trt_maxwell_scudnn_128x128_relu_interior_nn_v1
                   10.48%  5.63651s     12000  469.71us  110.47us  977.97us  trtwell_scudnn_128x128_relu_interior_nn
                    6.17%  3.31926s      8000  414.91us  363.59us  2.0500ms  trt_maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148n_nt_v1
                    4.15%  2.23465s      7000  319.24us  110.85us  1.4515ms  trtwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt
....

It seems that engine->createExecutionContext(), context->destroy(), cudaStreamCreate(), and cudaStreamDestroy() do not take much time.
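
Note that the rows shown above are GPU activities (kernels and memory copies), so the host-side cost of a setup call would not appear among them. One way to cross-check is to time the calls on the host. The following is a minimal sketch using std::chrono; time_ms and fake_setup_call are hypothetical names, and fake_setup_call only sleeps as a stand-in for a call such as engine->createExecutionContext().

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Returns the wall-clock time, in milliseconds, that `fn` takes on the host.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for a setup call such as createExecutionContext().
// It only sleeps, so the sketch runs without TensorRT or a GPU.
void fake_setup_call() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

In the real program, wrapping the setup and teardown calls in time_ms inside each thread, and comparing the results across thread counts, would show whether those calls get slower when other threads are enqueuing work at the same time.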

Please refer to this blog post: https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/