CUDA runtime API first call very slow on Xavier with JetPack 4.6

We updated our Xavier to JetPack 4.6, and now the first CUDA C++ runtime API call that touches GPU memory (e.g. cudaMalloc, cudaMemGetInfo) is extremely slow: approximately 8 minutes.
After the process finishes, the next run takes just as long.

Any advice for solving this issue?



We don’t expect cudaMalloc to take such a long time.
Would you mind sharing a sample so we can check it in our environment?

Also, please note that you can maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks



size_t free_mem = 0;
size_t total_mem = 0;
cudaError_t error_id = cudaMemGetInfo(&free_mem, &total_mem);  // TAG-1

After this we call a method to allocate GPU memory:

void* safeCudaMalloc(size_t memSize) {
  void* deviceMem;
  CHECK_CUDA_STATUS(cudaMalloc(&deviceMem, memSize));  // TAG-2
  CHECK_NOTNULL(deviceMem) << "Out of memory";
  return deviceMem;
}

Just simple code.
The TAG-1 line (cudaMemGetInfo) is the slow one; the code at TAG-2 then runs at normal speed.
If we comment out the TAG-1 line, the cudaMalloc at TAG-2 becomes the slow call instead.
We tried the performance-maximizing commands; the result was the same.

Previously, we used the same code in a JetPack 4.2 environment with no problems.



We cannot reproduce this issue with our Xavier + JetPack 4.6.
main.cpp (366 Bytes)

$ time ./test

real    0m0.090s
user    0m0.024s
sys     0m0.044s

Could you also try the sample in your environment?

We double-checked and found an extra shared library (tensorflow_cc) in our build config file.

For the main.cpp you provided, if we use clang to compile it with the cudart and tensorflow_cc libraries, the process gets stuck:

clang main.cpp -lcudart -ltensorflow_cc

The clang version is 12.0.1.
The TensorFlow version is 2.6.1.

If we use g++ or nvcc to compile main.cpp with the same cudart and tensorflow_cc libraries, it works normally.


We also tried clang version 8.0, but it was the same.
Any suggestions?
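One way to see where the stuck binary is spending its time is to trace shared-library loading with the dynamic loader, or to attach a debugger to the hung process. A sketch, assuming a glibc-based system; /bin/true and the PID 1234 are placeholders for the actual stuck binary and its process ID:

```shell
# LD_DEBUG=libs makes the glibc loader report which shared libraries are
# searched and loaded, and in what order, before main() runs.
# Substitute the stuck binary for /bin/true:
LD_DEBUG=libs /bin/true 2>&1 | head -5

# If the process is already running and hung, attaching gdb to it and
# printing a backtrace shows which library initializer it is stuck in
# (1234 is a hypothetical PID):
#   sudo gdb -p 1234 -batch -ex bt
```

If the backtrace lands inside tensorflow_cc's static initializers, that would confirm the hang happens before any user code executes.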


Thanks for this information.

We are not sure whether there is any difference in behavior between compiling with g++ and clang.
If the TensorFlow library is preloaded, it takes some time to finish initializing.

We are looking into this in depth and will share more information with you later.


It looks like this issue comes from the TensorFlow library.

We tried the sample with clang, adding only the required header and library paths.
The binary works fine without issue.

$ clang main.cpp -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/ -lcudart -o test
$ time ./test 
real	0m0.110s
user	0m0.028s
sys	0m0.036s

May I know how you set up the TensorFlow C++ library?
Did you build it from source on the Jetson?


We set up the TensorFlow C++ library following this web page.

Could you also compile with the -ltensorflow_cc flag, to see whether the problem is on our side or not?



There is already an internal team and a bug filed for this issue.
Please check the status in the bug system directly.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.