Can't skip JIT for libtorch on Orin

Running a tiny libtorch demo inside the container nvcr.io/nvidia/l4t-ml:r35.2.1-py3 on Jetson Orin spends about 50s on JIT compilation.
However, I expected all the libtorch libraries in this container to contain sm_87 device code, and Jetson Orin's compute capability is 8.7, so no JIT should be needed.

Is my expectation incorrect? And how can I find which libraries lack sm_87 code?
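For reference, a quick standalone check (separate from the demo below, and my own addition rather than something from the container docs) that prints the compute capability the device reports; on Orin this should print 8.7:

#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop{};
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorName(err));
    return 1;
  }
  std::printf("compute capability: %d.%d\n", prop.major, prop.minor);  // expect 8.7 on Orin
  return 0;
}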

Demo code:

#include <torch/torch.h>
#include <cuda_runtime_api.h>

#include <cstdlib>
#include <iostream>

#define CUDA_CHECK(call)                                                                    \
  {                                                                                         \
    cudaError_t _e = (call);                                                                \
    if (_e != cudaSuccess) {                                                                \
      std::cerr << "CUDA Runtime failure: '" << cudaGetErrorName(_e) << "' at " << __FILE__ \
                << ":" << __LINE__ << std::endl;                                            \
      std::abort();                                                                         \
    }                                                                                       \
  }

int main() {
  // Host buffer (larger than the {7, 1, 28, 28} tensor actually read below).
  float data[64 * 28 * 28] = {};
  // The .to(torch::kCUDA) is the first CUDA call, so context creation, module
  // loading, and any PTX JIT all happen here.
  torch::Tensor input_tensor =
      torch::from_blob(data, {7, 1, 28, 28}, torch::kFloat).to(torch::kCUDA);
  CUDA_CHECK(cudaPeekAtLastError());
}
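For reference, here is a variant of the demo (illustrative only, not what was originally run) that times CUDA context creation separately from the tensor transfer; if most of the ~50s shows up in the cudaFree(nullptr) call, the time is presumably going into module loading / PTX JIT at context initialization rather than into the copy itself:

#include <torch/torch.h>
#include <cuda_runtime_api.h>

#include <chrono>
#include <iostream>

int main() {
  using clock = std::chrono::steady_clock;

  // Force CUDA primary context creation; with eager module loading this is
  // where the embedded fatbinaries are loaded (and PTX-only kernels JIT-ed).
  auto t0 = clock::now();
  cudaFree(nullptr);
  auto t1 = clock::now();

  float data[7 * 1 * 28 * 28] = {};
  torch::Tensor input =
      torch::from_blob(data, {7, 1, 28, 28}, torch::kFloat).to(torch::kCUDA);
  auto t2 = clock::now();

  auto ms = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
  };
  std::cout << "context init: " << ms(t0, t1) << " ms, "
            << "to(kCUDA): " << ms(t1, t2) << " ms\n";
  return 0;
}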

Additional information:

  1. When run with CUDA_DISABLE_PTX_JIT=1, the program crashes with cudaErrorJitCompilationDisabled, so it does indeed need JIT.
  2. With the image nvcr.io/nvidia/pytorch:22.12 on an x86_64 machine with an RTX 3080, the demo works fine and ~/.nv/ComputeCache/ stays empty, so the problem seems specific to Jetson Orin.
  3. I tried to find the problematic library by running cuobjdump -lelf libtorch_cuda.so; it lists 1224 cubin entries, 284 of which are sm_87, but I don't know how to analyze this further, so I gave up.

Hi,

Thanks for the feedback on this issue.

We need to reproduce it internally and check with our internal team.
We will keep you updated.

Thanks.

Hi,

Could you share the Makefile or CMake file with us as well?
Thanks.

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(torch_demo)

find_package(Torch REQUIRED)

add_executable(demo demo.cc)
target_link_libraries(demo "${TORCH_LIBRARIES}")
set_property(TARGET demo PROPERTY CXX_STANDARD 14)

The complete environment and steps are in GitHub - noaxp/torch_demo

Thanks!

Hi,

Thanks for the help.

We can reproduce the same behavior in our environment and need our internal team to check it further.
We will share more information with you later.

Thanks.

Hi,

Our internal team has shared some info on this issue.

There are a number of CUTLASS transformer kernels that are not natively built (no sm_87 cubin) in the l4t Jetson container.
So the CUDA context loads them via the PTX JIT compiler, which causes the ~50s load time.
In the regular PyTorch container, which does a more comprehensive build, those CUTLASS transformer kernels are already natively built, so the JIT compiler doesn't have to handle them.

CUDA 11.7 and above introduce lazy module and function loading, which means modules and functions are loaded only when they are first called.
So regardless of whether the CUTLASS transformer kernels are natively built, CUDA won't bother loading them unless you actually call into such a kernel.
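If you want to check whether lazy loading is active in a given environment, a small sketch like the one below reports the mode via the driver API's cuModuleGetLoadingMode (available from CUDA 11.7, so it will not build against the older toolkit in the current l4t container); on 11.7+ lazy loading can also be opted into with the environment variable CUDA_MODULE_LOADING=LAZY. This is a sketch of my own, not part of the container or the demo:

#include <cuda.h>    // CUDA driver API; link with -lcuda
#include <cstdio>

int main() {
  cuInit(0);  // initialize the driver API

  CUmoduleLoadingMode mode;
  CUresult res = cuModuleGetLoadingMode(&mode);  // available since CUDA 11.7
  if (res != CUDA_SUCCESS) {
    std::fprintf(stderr, "cuModuleGetLoadingMode failed (%d)\n", static_cast<int>(res));
    return 1;
  }
  std::printf("module loading mode: %s\n",
              mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
  return 0;
}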

A future l4t container release will ship an updated CUDA that includes lazy module and function loading, so this ~50s loading behavior should no longer be present.

Thanks.

