My CUDA application has a very long load time in the beginning, due to JIT compilation.
And I have restrictions that don’t allow me to build the GPU binary code with NVCC, and I also cannot depend on a JIT cache.
Is there a way to speed up JIT compilation itself? I noticed that it doesn’t use all the CPU cores. Is there a way to make it use them?
That is called an overconstrained problem in engineering terms. I am curious: What is the particular reason for avoiding use of the JIT cache for this use case?
JIT compilation happens via the ptxas functionality incorporated into the CUDA driver. Pretty much everything that happens in the CUDA driver runs single-threaded, so there is no way to make it use all your CPU cores. The performance is dominated primarily by single-thread CPU performance and secondarily by system memory performance. Use of a CPU with a base frequency >= 3.5 GHz is highly recommended. With the JIT cache I would have said using an NVMe SSD should also help, but whether that makes any difference when the JIT cache is not in use I do not know.
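For reference, in case the restriction on the JIT cache turns out to be negotiable: the driver's compute cache is controlled through documented environment variables. A minimal sketch (paths and sizes are just examples):

```shell
# Keep the JIT cache enabled (the default) and large enough that
# compiled kernels are not evicted between runs.
export CUDA_CACHE_DISABLE=0                      # 1 disables the cache entirely
export CUDA_CACHE_MAXSIZE=1073741824             # cache size in bytes (1 GiB here)
export CUDA_CACHE_PATH="$HOME/.nv/ComputeCache"  # default location on Linux
```

With a warm cache, the expensive PTX-to-SASS step is skipped on subsequent application starts.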
ptxas is an optimizing compiler, and therefore compilation time generally increases with optimization level (0 through 3). I am assuming you want the resulting machine code to be fully optimized. If you are willing to dial down the optimization level, the question is: can the optimization level be controlled for JIT compilation (for offline compilation this is simply a flag nvcc passes to ptxas)? Probably so, but I don’t know how that works.
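It can, at least when loading PTX through the driver API: cuModuleLoadDataEx accepts a CU_JIT_OPTIMIZATION_LEVEL option (valid values 0 through 4, default 4). A minimal sketch, assuming ptx_source holds NUL-terminated PTX text produced elsewhere; error handling is abbreviated:

```cuda
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>

// Load a PTX module with a reduced ptxas optimization level,
// trading some SASS quality for faster JIT compilation.
CUmodule load_ptx_fast(const char *ptx_source)
{
    CUjit_option opts[] = { CU_JIT_OPTIMIZATION_LEVEL };
    int level = 1;  // lower than the default of 4 to shorten JIT time
    void *vals[] = { (void *)(intptr_t)level };

    CUmodule mod = NULL;
    CUresult rc = cuModuleLoadDataEx(&mod, ptx_source, 1, opts, vals);
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "JIT compilation failed: %s\n", msg ? msg : "unknown");
    }
    return mod;
}
```

Whether the runtime API's embedded-PTX path exposes the same knob I don't know; the sketch above assumes you can load your PTX explicitly via the driver API.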
Write your kernels in PTX instead of CUDA C++. That should give you a very nice speedup. Of course, that’s also difficult and inconvenient…
I don’t think that will obviate the need for JIT compilation. The JIT mechanism converts PTX into SASS.
Here is another thought: the execution time of ptxas is also a function of the amount of code that needs to be handled. The LLVM-derived part of the CUDA toolchain normally inlines functions and unrolls loops aggressively. This can lead to very voluminous PTX code: I have encountered files with 60,000 lines of PTX when looking into reports of excessive amounts of time being spent in the ptxas stage of compilation. The use of #pragma unroll 1 may help cut down PTX code size, which may benefit compilation time in exchange for some amount of potential performance loss.
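As an illustration of the pragma (the kernel itself is hypothetical):

```cuda
// Without the pragma, the compiler may fully unroll a loop whose trip
// count it can bound, replicating the body in the emitted PTX.
// '#pragma unroll 1' forces the loop to stay rolled, keeping the PTX
// compact at some potential cost in SASS-level performance.
__global__ void scale_rows(float *data, int rows, int cols, float s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    #pragma unroll 1  // keep this loop rolled to limit PTX size
    for (int c = 0; c < cols; ++c) {
        data[row * cols + c] *= s;
    }
}
```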