CUDA runtime API first call very slow on Xavier with JetPack 4.6

346365459 · January 13, 2022, 8:36am

Hi,
We updated our Xavier to JetPack 4.6, but the first call to any CUDA C++ runtime API that accesses GPU memory (e.g. cudaMalloc, cudaMemGetInfo) is extremely slow: approximately 8 minutes.
After the process finishes, the next run takes just as long.

Any advice for solving this issue?

Thanks!
Bo

AastaLLL · January 13, 2022, 8:53am

Hi

We don’t expect cudaMalloc to take such a long time.
Would you mind sharing a sample so we can check it in our environment?

Also, please note that you can maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

346365459 · January 13, 2022, 9:23am

Hi,

size_t free_mem = 0;
size_t total_mem = 0;
cudaSetDevice(0);
cudaError_t error_id = cudaMemGetInfo(&free_mem, &total_mem);  // TAG-1

After this we call a method to allocate GPU memory:

void* safeCudaMalloc(size_t memSize) {
  void* deviceMem = nullptr;
  CHECK_CUDA_STATUS(cudaMalloc(&deviceMem, memSize));  // TAG-2
  CHECK_NOTNULL(deviceMem) << "Out of memory";
  return deviceMem;
}

It is just simple code.
The TAG-1 line (cudaMemGetInfo) is slow, while the TAG-2 line runs at normal speed.
If we comment out the TAG-1 line, the cudaMalloc call (TAG-2) becomes the slow one instead.
We tried the performance-maximizing commands, but the result was the same.
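
The pattern above (whichever CUDA runtime call comes first is the slow one) suggests a one-time initialization cost rather than a problem with a specific API. A minimal timing sketch for checking this is shown below. The helper names timeCallMs and fakeRuntimeCall are hypothetical and do not come from the original code; fakeRuntimeCall only simulates a lazily initialized call with a sleep so the sketch builds without CUDA. In a real test its body would be replaced with the actual call, e.g. cudaMemGetInfo.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <utility>

// Runs `fn` once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double timeCallMs(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for a CUDA runtime call (assumption, not real CUDA):
// the first invocation pays a one-time setup cost, simulated with a sleep;
// later invocations return immediately.
// In a real test, replace the body with e.g. cudaMemGetInfo(&free, &total).
inline void fakeRuntimeCall() {
  static bool initialized = false;
  if (!initialized) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    initialized = true;
  }
}
```

Timing two consecutive calls, for example timeCallMs(fakeRuntimeCall) twice, and comparing the results shows whether the delay belongs to the first call only. If it does, the time is being spent in initialization and not in the memory query itself.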

Previously, we used the same code in Jetpack 4.2 environment with no problems.

Thanks.
Bo

AastaLLL · January 18, 2022, 5:55am

Hi,

We cannot reproduce this issue on our Xavier with JetPack 4.6.
main.cpp (366 Bytes)

$ time ./test

real    0m0.090s
user    0m0.024s
sys     0m0.044s

Could you also try the sample in your environment?
Thanks.

346365459 · January 18, 2022, 10:50am

Hi,
We double-checked and found an extra shared library (tensorflow_cc) in our build config file.

With the main.cpp you provided, if we use ‘clang’ to compile and link it against both the cudart and tensorflow_cc libraries, the process hangs:

clang main.cpp -lcudart -ltensorflow_cc

The clang version is 12.0.1.
The TensorFlow version is 2.6.1.

If we use ‘g++’ or ‘nvcc’ to compile the same main.cpp with the cudart and tensorflow_cc libraries, it runs normally.
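
Since only the clang build is affected, one way to compare it with the g++ build is to list the shared objects each binary actually has loaded. Below is a minimal sketch, assuming Linux and the dl_iterate_phdr function from <link.h>. The helper names collectObject and loadedObjects are hypothetical. This only shows whether the two builds load different libraries (for example a different C++ runtime); it does not by itself explain the delay.

```cpp
#include <cassert>
#include <cstddef>
#include <link.h>   // dl_iterate_phdr (Linux-specific)
#include <string>
#include <vector>

// Callback invoked once per loaded object; appends its path to the vector
// passed through `data`. The main executable is typically reported with an
// empty name.
static int collectObject(struct dl_phdr_info* info, std::size_t /*size*/,
                         void* data) {
  auto* names = static_cast<std::vector<std::string>*>(data);
  names->emplace_back(info->dlpi_name ? info->dlpi_name : "");
  return 0;  // returning 0 tells dl_iterate_phdr to keep iterating
}

// Returns the paths of all objects currently loaded into this process.
std::vector<std::string> loadedObjects() {
  std::vector<std::string> names;
  dl_iterate_phdr(collectObject, &names);
  return names;
}
```

Printing the result of loadedObjects() at the start of main in both builds and diffing the two lists would show any difference in what gets loaded.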

Thanks.
Bo

346365459 · January 21, 2022, 1:56am

Hi,
We also tried clang version 8.0, but it was the same.
Any suggestions?

Best,
Bo

AastaLLL · January 21, 2022, 6:21am

Thanks for this information.

We are not sure whether the behavior differs between compiling with g++ and clang.
If the TensorFlow library is preloaded, it takes some time to finish.

We are checking this in depth and will share more information with you later.
Thanks.

AastaLLL · January 24, 2022, 8:13am

Hi,

It looks like this issue comes from the TensorFlow library.

We tried the sample with clang, adding only the required include path and link flags.
The binary works fine.

$ clang main.cpp -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/ -lcudart -o test
$ time ./test

real	0m0.110s
user	0m0.028s
sys	0m0.036s

May I know how you set up the TensorFlow C++ library?
Did you build it from source on the Jetson?

Thanks.

981504100 · January 24, 2022, 9:15am

Hi,
We set up the TensorFlow C++ library following this web page.

Could you also compile with the -ltensorflow_cc argument, to see whether the problem is specific to our setup?

Thanks.

AastaLLL · January 27, 2022, 2:30am

Hi,

An internal team is already tracking this issue as a bug.
Please check the status in the bug system directly.

Thanks.

system · February 23, 2022, 5:41am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.