Fatbinary best practices

My understanding is that by compiling PTX for the lowest supported arch, we provide the widest possible support, without having to generate SASS for all possible SM architectures (bloating the fatbinary unnecessarily).

For example, if we know that an application supports sm_52 or later and will typically run on, say, sm_52 and sm_70 GPUs, we would generate PTX for compute_52 and SASS for sm_52 and sm_70. If the application were then run on an sm_61 or sm_86 GPU, the driver would JIT the compute_52 PTX and all would be well.
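
For concreteness, that scenario would correspond to something like the following nvcc invocation (a sketch; `app.cu` is a placeholder source file):

```
# SASS for the GPUs we expect to run on, plus compute_52 PTX as a JIT
# fallback for any other arch that is sm_52 or later
nvcc -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_52,code=compute_52 \
     -o app app.cu
```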

However, if we were to generate PTX for the latest arch only (say compute_86), then we would have to explicitly generate SASS for all possible GPU architectures (e.g. sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, etc.) in order for the application to run on all supported GPU architectures, thereby increasing the size of the fatbinary considerably. To avoid including SASS for all possible archs, we could just include SASS for the base archs, e.g. sm_50, sm_60, sm_70 and sm_80. That way, the driver will just select the appropriate base SASS (e.g. if running on an sm_61 GPU, the driver will select the sm_60 SASS). Indeed, this is the approach suggested by the CUDA Programming Guide.
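
A sketch of that base-arch approach (again with placeholder file names):

```
# SASS for the base arch of each supported major generation, plus
# compute_86 PTX so the binary can still be JITted on future archs
nvcc -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=compute_86 \
     -o app app.cu
```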

If taking the latter approach (i.e. only include SASS for the base archs), is sm_60 SASS just as optimized/efficient as sm_61 SASS when running on an sm_61 GPU?

Additionally, are there any further considerations required when dealing with DLTO and fatbinaries containing LTO NVVM IR?

@njuffa @Robert_Crovella @mmurphy1 I would appreciate your thoughts on this :) Particularly on the second-to-last paragraph, given that the hardware specs (e.g. registers per block, amount of shared mem, etc.) can change within a major arch (e.g. sm_60 to sm_61), which I would expect to have an impact on the optimizations the compiler is able to apply when generating SASS.

My $0.02: Generally, you would want to avoid JIT compilation as much as you can, because it can create pretty noticeable overhead. Sometimes it is unavoidable, for example when custom kernels are built dynamically at application run-time based on user input. Some people have built apps with very sophisticated yet fast, dynamically configurable processing pipelines in this way.

For the common use case where all code can be built off-line, my recommendation is to emit SASS for all GPU architectures that need to be supported by an app. Yes, that could be more than half a dozen at any given time, so it could result in an obese fat binary. I have not yet come across a situation where the resulting file size of the executable binary is an issue, which is not to say such scenarios could not exist. In that case I would look into compressing the executable with self-extracting capability. If the off-line build times become bothersome, the time may have come to invest in a faster build machine.

With every new GPU architecture there is the problem that a new toolchain may not be available early enough for the app to include SASS for the latest hardware when that hardware ships. More commonly, the hardware introduction does not line up with the app’s release cycle, or the app needs to be modified for the latest GPU. For this, one needs to future-proof the app by including a single PTX target, in particular for the latest available architecture, whose restrictions are closest to those of the new hardware.

If one were to include PTX for the oldest supported architecture instead, this could unnecessarily restrict the JIT compiler when building for newer architectures, with a likely negative performance impact.

I do not have sufficient experience with DLTO to dispense any particular wisdom related to it.

Thanks for the prompt/detailed reply! I shall start by saying that I agree with your recommendations - the following questions are just for my understanding :)

When you say overhead, do you just mean the time taken to perform the JIT compilation? I would have thought that for most use cases, the JIT compilation cache would make this fairly irrelevant after the first invocation (of course, there may be use cases where the application is only ever run once on a given PC and the JIT compilation time is large compared to the runtime, but I would have thought that would be rare).
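
(For anyone reading along: the JIT cache can be inspected and influenced via environment variables documented in the CUDA Programming Guide; defaults vary by CUDA version and platform, so treat the values below as illustrative.)

```
export CUDA_CACHE_PATH=/path/to/cache   # where JIT-compiled kernels are cached
export CUDA_CACHE_MAXSIZE=1073741824    # cache size limit in bytes (here 1 GiB)
export CUDA_CACHE_DISABLE=1             # turn the cache off, e.g. to measure cold-start JIT cost
export CUDA_FORCE_PTX_JIT=1             # force JIT from PTX even when matching SASS exists
```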

Indeed, the concern is more around the build times rather than binary sizes (although it is always nice to keep these as slim as possible :) ). If we exclude Tegra/Jetson archs, we are still left with 10 archs to compile for currently (sm_50 - sm_90) - even on a fast build machine (which we are lucky enough to have), this still takes more time than compiling for fewer archs.

I guess a compromise would be to include the PTX for both the lowest and the highest arch. That way, you could avoid explicitly including SASS for archs you knew you were unlikely to run on, whilst still maintaining compatibility for them and also without restricting future archs too much.
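
Something along these lines (a sketch with placeholder archs and file names):

```
# SASS for the archs we actually expect, plus PTX at both ends:
# compute_52 keeps older/intermediate archs working, compute_90
# avoids restricting the JIT compiler on future archs
nvcc -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_52,code=compute_52 \
     -gencode arch=compute_90,code=compute_90 \
     -o app app.cu
```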

I think I know the answer already (no), but I guess this is something I will need to test via benchmarking.

Just that. I don’t know what your application looks like. I certainly recall a handful of cases reported in this sub-forum where people complained about lengthy start times for their application, caused by the time the JIT compiler took on their target machine. Part of the problem is that PTX can balloon (through loop unrolling and function inlining) to very lengthy code indeed. I recall one case where the code expanded to more than 100K lines. I also seem to recall some use cases that exceeded the JIT cache size (that may no longer be an issue). Also cases where JITing was just slow because the target system was fairly slow. Also cases where compilation time ballooned due to ptxas applying some optimizations that caused it to run for several minutes.

My philosophy is that it is best to experience such issues in a controlled environment, such as a dedicated build system for offline compilation, supervised by people who are intimately familiar with the app and who maintain build time stats. Yes, in the worst case build time for the CUDA portion of the app could scale linearly with the number of target architectures. But if you think that building is too slow on a system with a 16-core Xeon W-3335 CPU, PCIe gen4 NVMe SSDs, and 256 GB of 8-channel DDR4-3200 system memory, JIT compilation time may be an even bigger annoyance on customers’ slower target systems.

@njuffa gives good guidance here. Regarding whether you need to store separate SASS for minor versions (like sm_61 vs sm_60), it really depends on whether your application would differ. E.g. you could build for both 60 and 61 and see what the differences are. Often there will be no difference in which case you can just use the sm_60 version.
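
For example, something like this (a sketch; `kernels.cu` is a placeholder, and the cuobjdump flags are as I recall them from the CUDA binary utilities documentation):

```
# Build SASS for both archs into one fatbin, then diff the disassembly
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -fatbin -o kernels.fatbin kernels.cu
cuobjdump -sass -arch sm_60 kernels.fatbin > sm60.sass
cuobjdump -sass -arch sm_61 kernels.fatbin > sm61.sass
diff sm60.sass sm61.sass
```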

Regarding the NVVM IR, in general you only want to use that for -dlto linking, not for implicit JIT linking.
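
A minimal DLTO flow, assuming separate compilation (file names are placeholders):

```
# Compile to an LTO intermediate (NVVM IR) rather than final SASS/PTX
nvcc -dc -dlto -arch=sm_70 -o a.o a.cu
nvcc -dc -dlto -arch=sm_70 -o b.o b.cu
# Optimize across translation units at device-link time, emitting sm_70 SASS
nvcc -dlto -arch=sm_70 -o app a.o b.o
```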

@mmurphy1 Thanks! When you say “whether your application would differ” do you mean whether the application uses the __CUDA_ARCH__ macro to perform different code paths depending on the arch? Or do you just mean whether the resulting SASS differs due to compiler optimizations? Or something higher-level, like overall performance of the application?

Based on a quick test of one of our applications, it does indeed show some minor differences in the SASS within a given major arch, i.e. between sm_60 and sm_61 (it looks like a re-ordering of instructions/blocks within one of the kernels), even though we don’t use the __CUDA_ARCH__ macro. The differences probably wouldn’t matter much in practice, but we would obviously need to do some performance testing to be sure.

So essentially there doesn’t seem to be an a priori way of determining whether you can compile SASS for just the base arch of a major version (e.g. sm_60) without negatively impacting performance (it may be fine, it might not). Basically, to be safe, follow the recommendations outlined by njuffa above :)