Slow startup using JIT'd PTX with libtorch

I am trying to run an application that uses libtorch built against the CUDA 10.2 runtime. I built libtorch with PTX by specifying 6.1;7.0;7.5+PTX as the arch list. If I either run on a GPU that isn’t in this list (e.g. Ampere) or run on a GPU that is in the list but with CUDA_FORCE_PTX_JIT=1 set, the application takes 25 minutes to start. Here’s a simple reproducer:

#include <torch/script.h>

#include <iostream>

int main(int argc, const char* argv[]) {
  const c10::Device device = at::kCUDA;
  std::cout << "torch::randn(1, torch::TensorOptions(device))\n";
  torch::randn(1, torch::TensorOptions(device));
  std::cout << "Done!\n";
}

This application takes 25 minutes when I force JIT. Is this expected?
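
To narrow down where the time goes, here is a timed variant of the reproducer (just a sketch; the chrono instrumentation is the only change I made):

#include <torch/script.h>

#include <chrono>
#include <iostream>

int main() {
  const c10::Device device = at::kCUDA;
  const auto start = std::chrono::steady_clock::now();
  // First CUDA op: triggers context creation and module loading, and,
  // when only PTX is embedded for this GPU, JIT compilation.
  torch::randn(1, torch::TensorOptions(device));
  const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << "first CUDA op took " << elapsed.count() << " s\n";
}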

The second issue I don’t understand is that it takes 25 minutes every time I run it. I would have expected the JIT-compiled kernels to be cached in ~/.nv/ComputeCache, but from what I can see with strace while watching that directory, a single file is constantly being overwritten there, so caching appears to be broken.

The machine I’m running this test on has an RTX 2070 in it, and I’m running driver 460.91.03. The driver reports CUDA version 11.2, but as I mentioned previously the application is built against the 10.2 runtime. The CUDA documentation states that “The CUDA driver maintains backward compatibility to continue support of applications built on older toolkits.”, which led me to believe this should work.
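
For reference, the driver/runtime combination can be confirmed with the standard CUDA runtime API (a minimal sketch):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int driverVersion = 0;
  int runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // e.g. 11020 for an 11.2 driver
  cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 10020 for the 10.2 runtime
  std::printf("driver supports CUDA %d, built against runtime %d\n",
              driverVersion, runtimeVersion);
  return 0;
}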

A couple of generic remarks:

(1) 25 minutes compiling PTX to machine code sounds like a lot. The most common cause of lengthy compile times is voluminous source code. It sounds like you are JIT-compiling an entire (largish?) library, causing a lot of code to be compiled. That is rarely a good idea. A less common cause of lengthy compile times is hitting a sub-optimal corner case in one or several optimization passes. Often this is accompanied by significant growth in memory usage by ptxas. If you have evidence of this happening, it’s probably worthwhile to file a bug with NVIDIA.

(2) This is speculative: The amount of cache storage provided for JIT compilation may be limited, and when very bulky code is being JIT compiled, there may be cache thrashing, negating the benefits of the cache.

Have you considered building the library off-line? Standard best practice is to build libraries as fat binaries that embed machine code for each GPU architecture to be supported, augmented by embedded PTX for the latest virtual architecture to attempt to future-proof the library at the cost of JIT compilation.
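
For illustration, the pattern looks something like this (a toy sketch; the kernel is a placeholder and the compile line is hypothetical, so adjust the architecture list to whatever you need to support):

// fatbin_demo.cu -- placeholder kernel, only to illustrate a fat-binary build.
// Hypothetical compile line: embed machine code for each real architecture to
// be supported, plus PTX for the newest virtual architecture as a
// forward-compatibility fallback:
//
//   nvcc -gencode arch=compute_61,code=sm_61 \
//        -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_75,code=sm_75 \
//        -gencode arch=compute_75,code=compute_75 \
//        -o fatbin_demo fatbin_demo.cu

#include <cstdio>

__global__ void demo_kernel() { printf("running on the GPU\n"); }

int main() {
  demo_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}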

Unfortunately I’m restricted to using CUDA 10.2, and I’d like to use an Ampere GPU, which isn’t supported by 10.2, leaving me with JIT compilation. I don’t understand why the entire library would have to be JIT compiled to launch (what I assume is) a single kernel from torch::randn. I also tried setting CUDA_CACHE_MAXSIZE=4294967296, which didn’t change the way the cache behaved.
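
One way to rule out the environment simply not reaching the process is to dump the cache-related variables from inside the application (a sketch; the names are the documented CUDA JIT cache environment variables):

#include <cstdio>
#include <cstdlib>

int main() {
  // CUDA JIT cache related environment variables, as documented by NVIDIA.
  const char* vars[] = {"CUDA_CACHE_DISABLE", "CUDA_CACHE_PATH",
                        "CUDA_CACHE_MAXSIZE", "CUDA_FORCE_PTX_JIT"};
  for (const char* name : vars) {
    const char* value = std::getenv(name);
    std::printf("%s=%s\n", name, value ? value : "(unset)");
  }
  return 0;
}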

  1. You are forcing JIT.
  2. There are many libraries, with many kernels in them, linked to pytorch. All of that gets JIT-ed.
  3. The env var you set causes the JIT cache to be ignored, see here

so this all looks like expected behavior to me.

You might very well observe somewhat different behavior if you actually used ampere and actually did not set that env var. That env var is for test purposes, and the intended test is somewhat different than what you are testing here. The specific test is: “is this application JIT-able?”. The test you are trying to use it for is “how will a future GPU behave?” Those two tests are slightly different.

Thanks for the response, Robert. I should have clarified: I have tested it on a 3070 without the environment variable set and get the same behavior; it takes 25 minutes to run the code above.

I understand that there will be overhead for JIT, but it seems wrong that it takes 25 minutes to generate a random number just because I called the method in a library. Does it really have to JIT every kernel in the library just to call a single one? I would expect “just in time” to mean it only compiles kernels as they are called, rather than pre-emptively compiling every kernel. Is there perhaps an issue either with how the library expresses dependencies between kernels or with how the JIT evaluates those dependencies?

I would expect that also, on the first run at least. In pytorch, for example, if you are doing training, I would expect the first training epoch to be quite slow due to this. Subsequent epochs should be much shorter. I’m not sure what the cache can hold, but it might be that after the first run of an application you would also get some benefit from the JIT cache, though not with that env var set.

That certainly seems to be the case, and your datapoint as well as a number of others that I have seen posted on this forum seem to confirm that. I won’t be able to delve into the detailed mechanism.

The behavior is evident. You’re not the first to witness it. I won’t be able to debate your expectations. If you’re dissatisfied, I suggest filing a bug.