nvcc cubin for multiple platforms

How can I produce CUBIN for all platforms?

Hi guys,

I’d like to compile a .cu file for all platforms in a single compilation pass, generating a .cubin that contains code for all compute capabilities.

The Fermi Compatibility Guide for CUDA Applications, in item 1.3.1, suggests that this is possible.

Well, after many trials and errors, errors, errors, I realized that although nvcc 3.2 is able to generate CUBIN files for all architectures, that does not mean it can do so in a single run.

The other route could be a .ptx file, but I don’t have any experience with it.

Could you guys advise me what route I should take?

Thanks a lot

A fat binary for as many platforms as desired can be built in one compilation by using one -gencode command-line switch per architecture. Give this a try:

-gencode arch=compute_10,code="sm_10,compute_10"
-gencode arch=compute_11,code="sm_11,compute_11"
-gencode arch=compute_12,code="sm_12,compute_12"
-gencode arch=compute_13,code="sm_13,compute_13"
-gencode arch=compute_20,code="sm_20,compute_20"
-gencode arch=compute_21,code="sm_21,compute_21"
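
For reference, a complete invocation built from those switches might look like this (kernel.cu is a hypothetical file name; the inner quotes need escaping when typed in a shell, and the backslashes are just line continuations):

nvcc -c kernel.cu -o kernel.o \
     -gencode arch=compute_10,code=\"sm_10,compute_10\" \
     -gencode arch=compute_20,code=\"sm_20,compute_20\"

with the remaining -gencode switches added in the same way. In each switch, code=sm_XX embeds the machine code and code=compute_XX embeds the PTX for that architecture.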

It can’t be done. Cubin files can’t be built to contain more than a single architecture per file. If you need to support more than one architecture using cubins, you will need to have selection code in your application and use the driver API to determine which cubin file to load at runtime. This is (sort of) discussed in the Fermi compatibility guide that comes with the toolkit.
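
A minimal sketch of that selection logic with the driver API (the kernel.sm_XX.cubin file names are hypothetical, and cuDeviceComputeCapability is the query call from the CUDA 3.x era):

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUmodule mod;
    int major, minor;
    char name[64];

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* ask device 0 for its compute capability */
    cuDeviceComputeCapability(&major, &minor, dev);

    /* build the name of the matching cubin, e.g. kernel.sm_20.cubin */
    snprintf(name, sizeof(name), "kernel.sm_%d%d.cubin", major, minor);

    /* load the machine code for exactly this architecture */
    if (cuModuleLoad(&mod, name) != CUDA_SUCCESS) {
        fprintf(stderr, "no cubin for sm_%d%d\n", major, minor);
        return 1;
    }

    /* cuModuleGetFunction / cuLaunchKernel as usual from here */
    return 0;
}

A real application would also want to fall back to the nearest lower architecture it shipped a cubin for when there is no exact match.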

Correct, the command-line switches I suggested will generate one .cubin file per architecture (with extensions .sm_10.cubin, .sm_11.cubin, etc.) in a single run of nvcc. These .cubin files can be saved off with --keep. The cubins for all architectures are embedded into a single .o file. Loading of the resulting fat binary and selection of the appropriate machine code at run time are performed automatically by the CUDA runtime, which is just one reason I highly recommend using the runtime.
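
As a concrete illustration (same hypothetical kernel.cu as above):

nvcc -c kernel.cu -o kernel.o --keep \
     -gencode arch=compute_13,code=\"sm_13,compute_13\" \
     -gencode arch=compute_20,code=\"sm_20,compute_20\"

leaves kernel.sm_13.cubin and kernel.sm_20.cubin in the working directory, while kernel.o contains both architectures embedded as a fat binary.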

Of course, it is not always possible to use the runtime. PyCUDA is one of my preferred ways of working with GPUs, and it only supports the driver API, so it can only load pre-compiled kernels from cubin files.