Running a tiny libtorch demo inside the container nvcr.io/nvidia/l4t-ml:r35.2.1-py3 on Jetson Orin spends about 50 s on JIT compilation.
But I expect all the libtorch libraries in this container to ship sm_87 device code, and Jetson Orin's compute capability is 8.7, so JIT should not be needed.
Is my expectation incorrect? And how can I find which libraries lack sm_87 code?
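For reference, here is a minimal sketch (plain CUDA runtime API, nothing libtorch-specific) that confirms what compute capability the device actually reports; on Jetson Orin it is expected to print 8.7:

#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorName(err));
        return 1;
    }
    // Jetson Orin is expected to report 8.7, i.e. sm_87.
    std::printf("Device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}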
Demo code:
#include <torch/torch.h>
#include <cuda_runtime_api.h>

// Abort with the CUDA error name if the checked runtime call failed.
#define CUDA_CHECK(call)                                                            \
  do {                                                                              \
    cudaError_t _e = (call);                                                        \
    if (_e != cudaSuccess) {                                                        \
      LOG(FATAL) << "CUDA Runtime failure: '#" << cudaGetErrorName(_e) << "' at "   \
                 << __FILE__ << ":" << __LINE__;                                    \
    }                                                                               \
  } while (0)

int main() {
  float data[64 * 28 * 28] = {};
  // Wrap the host buffer and copy it to the GPU; this first CUDA use is where the ~50 s is spent.
  torch::Tensor input_tensor =
      torch::from_blob(data, {7, 1, 28, 28}, torch::kFloat).to(torch::kCUDA);
  CUDA_CHECK(cudaPeekAtLastError());
}
Additional information:
- When run with CUDA_DISABLE_PTX_JIT=1, the program crashes with cudaErrorJitCompilationDisabled, so it does indeed need JIT.
- With the image nvcr.io/nvidia/pytorch:22.12 on an x64 machine with an RTX 3080, the demo works fine and ~/.nv/ComputeCache/ stays empty, so the problem may be specific to Jetson Orin.
- I tried to find the problematic library by running cuobjdump -lelf libtorch_cuda.so: it lists 1224 cubin entries, 284 of which are sm_87. I didn't know how to analyze this further, so I gave up (a rough counting sketch is below).
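For what it's worth, this is roughly how the cuobjdump -lelf listing could be tallied per architecture instead of reading it by eye. It is only a sketch: the libtorch_cuda.so path is just an example and cuobjdump is assumed to be on PATH.

#include <cstdio>
#include <map>
#include <string>

int main() {
    // Example path; adjust to wherever libtorch_cuda.so lives in the container.
    const std::string lib = "/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so";
    const std::string cmd = "cuobjdump -lelf " + lib;

    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) {
        std::perror("popen");
        return 1;
    }

    // Count cubin entries per sm_XX architecture tag in the listing.
    std::map<std::string, int> counts;
    char line[1024];
    while (fgets(line, sizeof(line), pipe)) {
        std::string s(line);
        std::string::size_type pos = s.find("sm_");
        if (pos != std::string::npos) {
            counts[s.substr(pos, 5)]++;  // e.g. "sm_72", "sm_87"
        }
    }
    pclose(pipe);

    for (const auto& kv : counts) {
        std::printf("%s: %d cubins\n", kv.first.c_str(), kv.second);
    }
    return 0;
}

Note that -lelf only lists the embedded cubins; PTX entries are listed separately by cuobjdump -lptx.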