Can't skip JIT for libtorch on Orin

Running a tiny libtorch demo inside the container nvcr.io/nvidia/l4t-ml:r35.2.1-py3 on Jetson Orin spends about 50s on JIT compilation.
However, I expected all the libtorch libraries in this container to contain sm_87 device code, and Jetson Orin's compute capability is 8.7, so no JIT should be needed.

Is my expectation incorrect? And how can I find which libraries lack sm_87 code?
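For reference, a quick standalone check (separate from the demo below, and my own addition rather than something from the container docs) that prints the compute capability the device reports; on Orin this should print 8.7:

#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop{};
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorName(err));
    return 1;
  }
  std::printf("compute capability: %d.%d\n", prop.major, prop.minor);  // expect 8.7 on Orin
  return 0;
}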

Demo code:

#include <torch/torch.h>
#include <cuda_runtime_api.h>

#include <cstdlib>
#include <iostream>

#define CUDA_CHECK(call)                                                                    \
  {                                                                                         \
    cudaError_t _e = (call);                                                                \
    if (_e != cudaSuccess) {                                                                \
      std::cerr << "CUDA Runtime failure: '" << cudaGetErrorName(_e) << "' at " << __FILE__ \
                << ":" << __LINE__ << std::endl;                                            \
      std::abort();                                                                         \
    }                                                                                       \
  }

int main() {
  // Host buffer (larger than the {7, 1, 28, 28} tensor actually read below).
  float data[64 * 28 * 28] = {};
  // The .to(torch::kCUDA) is the first CUDA call, so context creation, module
  // loading, and any PTX JIT all happen here.
  torch::Tensor input_tensor =
      torch::from_blob(data, {7, 1, 28, 28}, torch::kFloat).to(torch::kCUDA);
  CUDA_CHECK(cudaPeekAtLastError());
}
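For reference, here is a variant of the demo (illustrative only, not what was originally run) that times CUDA context creation separately from the tensor transfer; if most of the ~50s shows up in the cudaFree(nullptr) call, the time is presumably going into module loading / PTX JIT at context initialization rather than into the copy itself:

#include <torch/torch.h>
#include <cuda_runtime_api.h>

#include <chrono>
#include <iostream>

int main() {
  using clock = std::chrono::steady_clock;

  // Force CUDA primary context creation; with eager module loading this is
  // where the embedded fatbinaries are loaded (and PTX-only kernels JIT-ed).
  auto t0 = clock::now();
  cudaFree(nullptr);
  auto t1 = clock::now();

  float data[7 * 1 * 28 * 28] = {};
  torch::Tensor input =
      torch::from_blob(data, {7, 1, 28, 28}, torch::kFloat).to(torch::kCUDA);
  auto t2 = clock::now();

  auto ms = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
  };
  std::cout << "context init: " << ms(t0, t1) << " ms, "
            << "to(kCUDA): " << ms(t1, t2) << " ms\n";
  return 0;
}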

Additional information:

  1. When run with CUDA_DISABLE_PTX_JIT=1, the program crashes with cudaErrorJitCompilationDisabled, so it does indeed need JIT.
  2. With the image nvcr.io/nvidia/pytorch:22.12 on an x86_64 machine with an RTX 3080, the demo works fine and ~/.nv/ComputeCache/ stays empty, so the problem seems specific to Jetson Orin.
  3. I tried to find the problematic library by running cuobjdump -lelf libtorch_cuda.so; it lists 1224 cubin entries, 284 of which are sm_87, but I don't know how to analyze this further, so I gave up.

Hi,

Thanks for the feedback on this issue.

We need to reproduce it internally and check with our internal team.
We will keep you updated.

Thanks.

Hi,

Could you share the Makefile or CMake file with us as well?
Thanks.

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(torch_demo)

find_package(Torch REQUIRED)

add_executable(demo demo.cc)
target_link_libraries(demo "${TORCH_LIBRARIES}")
set_property(TARGET demo PROPERTY CXX_STANDARD 14)

The complete environment and steps are in GitHub - noaxp/torch_demo

Thanks!

Hi,

Thanks for the help.

We can reproduce the same behavior in our environment and need our internal team to check it further.
We will share more information with you later.

Thanks.

Hi,

Our internal team has shared some info on this issue.

There are a number of CUTLASS transformer kernels that are not natively built (no sm_87 cubin) in the l4t Jetson container.
So the CUDA context loads them via the PTX JIT compiler, which causes the ~50s load time.
In the regular PyTorch container, which does a more comprehensive build, those CUTLASS transformer kernels are already natively built, so the JIT compiler doesn't have to handle them.

CUDA 11.7 and above introduce lazy module and function loading, which means modules and functions are loaded only when they are first called.
So regardless of whether the CUTLASS transformer kernels are natively built, CUDA won't bother loading them unless you actually call into such a kernel.
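If you want to check whether lazy loading is active in a given environment, a small sketch like the one below reports the mode via the driver API's cuModuleGetLoadingMode (available from CUDA 11.7, so it will not build against the older toolkit in the current l4t container); on 11.7+ lazy loading can also be opted into with the environment variable CUDA_MODULE_LOADING=LAZY. This is a sketch of my own, not part of the container or the demo:

#include <cuda.h>    // CUDA driver API; link with -lcuda
#include <cstdio>

int main() {
  cuInit(0);  // initialize the driver API

  CUmoduleLoadingMode mode;
  CUresult res = cuModuleGetLoadingMode(&mode);  // available since CUDA 11.7
  if (res != CUDA_SUCCESS) {
    std::fprintf(stderr, "cuModuleGetLoadingMode failed (%d)\n", static_cast<int>(res));
    return 1;
  }
  std::printf("module loading mode: %s\n",
              mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
  return 0;
}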

A future l4t container release will ship an updated CUDA that includes lazy module and function loading, so this ~50s loading behavior should no longer be present.

Thanks.

