CUDA runtime API first call very slow on Xavier with JetPack 4.6

Hi,
We updated our Xavier to JetPack 4.6, and now the first CUDA C++ runtime API call that touches GPU memory (e.g. cudaMalloc, cudaMemGetInfo) is extremely slow: approximately 8 minutes.
After the process finishes, the next run takes just as long.

Any advice for solving this issue?

Thanks!
Bo

Hi

We don’t expect cudaMalloc to take such a long time.
Would you mind sharing a sample so we can check it in our environment?

Also, please note that you can maximize device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hi,

size_t free_mem = 0;
size_t total_mem = 0;
cudaSetDevice(0);
cudaError_t error_id = cudaMemGetInfo(&free_mem, &total_mem);  // TAG-1

After this we call a method to allocate GPU memory:

void* safeCudaMalloc(size_t memSize) {
  void* deviceMem;
  CHECK_CUDA_STATUS(cudaMalloc(&deviceMem, memSize));  // TAG-2
  CHECK_NOTNULL(deviceMem) << "Out of memory";
  return deviceMem;
}
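For reference, one way to narrow this down is to time the first runtime API call (which triggers one-time CUDA initialization) separately from the second one. This is a minimal sketch I put together, not the code from this thread; file name and timing scaffolding are my own:

```cuda
// first_call_timing.cu — sketch: measure the first CUDA runtime call
// (which pays the one-time initialization cost) vs. the second call.
// Build with: nvcc first_call_timing.cu -o first_call_timing
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Milliseconds elapsed since `start`.
static double elapsed_ms(std::chrono::steady_clock::time_point start) {
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
  size_t free_mem = 0, total_mem = 0;

  auto t0 = std::chrono::steady_clock::now();
  // First runtime call: absorbs context/driver initialization time.
  cudaError_t err = cudaMemGetInfo(&free_mem, &total_mem);
  printf("first  cudaMemGetInfo: %.2f ms (status %d)\n", elapsed_ms(t0), (int)err);

  auto t1 = std::chrono::steady_clock::now();
  // Second call: should return almost immediately once initialized.
  err = cudaMemGetInfo(&free_mem, &total_mem);
  printf("second cudaMemGetInfo: %.2f ms (status %d)\n", elapsed_ms(t1), (int)err);
  return 0;
}
```

If the first call is the only slow one regardless of which API it is, the cost is in initialization rather than in cudaMemGetInfo or cudaMalloc themselves.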

Just simple code.
The slow call is the TAG-1 line (cudaMemGetInfo); the cudaMalloc at TAG-2 then runs at normal speed.
If we comment out the TAG-1 line, the cudaMalloc (TAG-2) becomes the slow call instead.
We tried the performance-maximizing commands, but the result was the same.

Previously, the same code ran in a JetPack 4.2 environment with no problems.

Thanks.
Bo

Hi,

We cannot reproduce this issue with our Xavier + JetPack 4.6.
main.cpp (366 Bytes)

$ time ./test

real    0m0.090s
user    0m0.024s
sys     0m0.044s

Could you also try the sample in your environment?
Thanks.

Hi,
We double-checked and found an extra shared library (tensorflow_cc) in our build config file.

For the main.cpp you provided: if we use clang to compile it against the cudart and tensorflow_cc libraries, the process gets stuck:

clang main.cpp -lcudart -ltensorflow_cc

The clang version is 12.0.1.
The TensorFlow version is 2.6.1.

If we use g++ or nvcc to compile main.cpp with the cudart and tensorflow_cc libraries, it behaves normally.
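For comparison, these are the three compile invocations side by side, as described above (library search paths omitted, matching the original command; adjust for your setup):

```shell
# Same source, same libraries; only the compiler driver differs.
clang main.cpp -lcudart -ltensorflow_cc   # resulting binary gets stuck on the first CUDA call
g++   main.cpp -lcudart -ltensorflow_cc   # works normally
nvcc  main.cpp -ltensorflow_cc            # works normally (nvcc links cudart itself)
```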

Thanks.
Bo

Hi,
We also tried clang version 8.0, but the result was the same.
Any suggestions?

Best,
Bo

Thanks for this information.

We are not sure whether there is any behavioral difference between compiling with g++ and with clang.
If the TensorFlow library is preloaded, it takes some time to finish loading.
We are looking into this in depth and will share more information with you later.
Thanks.