Reducing binary size while using accelerated libs

I am new to CUDA programming. I am writing a program that uses the dense getrf and getrs routines from the cuSOLVER library to solve a system of linear equations of the form Ax=B.

I am linking the required CUDA math libs statically, because the users of the final solution may not have these installed.

The resulting binary is ~150MB in size. I am trying to find a way to reduce this as much as possible.

I initially thought that the code coming from the static libs must be contributing to this size. So I checked the sizes of the sections using “objdump -h”. The size of .text was only ~10MB. However, the size of the .nv_fatbin section was ~117MB.
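For reference, a check along these lines shows the relevant sections (the grep is only there to trim the output; the binary name is the one produced by the compile command further down; objdump prints the sizes in hex):

$ objdump -h lineqcus2_dp | grep -E '\.text|\.nv_fatbin'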

I was already using -arch=sm_86 without the -code option which, according to the nvcc guide, is equivalent to “-arch=compute_86 -code=compute_86,sm_86”. I tried -O5 and -lto. No improvement was seen.

I used the -keep and -keep-dir options to check the size of the nvcc temporary files. The fatbin file generated for my code was only 1032 bytes. The sizes of all the temporary files were as follows (without -O5 and -lto):

$ ls -l | awk '{print $5 "\t" $9}'
1625492 lin_eq_cus2.cpp1.ii
1469701 lin_eq_cus2.cpp4.ii
21 lin_eq_cus2.cudafe1.c
1376525 lin_eq_cus2.cudafe1.cpp
45233 lin_eq_cus2.cudafe1.gpu
3109 lin_eq_cus2.cudafe1.stub.c
1032 lineqcus2_dp_dlink.fatbin
3408 lineqcus2_dp_dlink.fatbin.c
2992 lineqcus2_dp_dlink.o
32 lineqcus2_dp_dlink.reg.c
952 lineqcus2_dp_dlink.sm_86.cubin
31064 lin_eq_cus2.fatbin
84088 lin_eq_cus2.fatbin.c
17440 lin_eq_cus2.ltoir
28 lin_eq_cus2.module_id
49184 lin_eq_cus2.o
28779 lin_eq_cus2.ptx
22224 lin_eq_cus2.sm_86.cubin
$

I used the -Xcompiler -save-temps=cwd option to check the temporary files of gcc and g++. There were two files that were ~2MB in size but the rest were smaller.

I used the -c option and found that the .o file was only ~27KB, which meant the extra size was coming from external sources. I checked the libcusolver, libcublas and libcublasLt static libs with objdump and found that they all contain .nv_fatbin sections.
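A quick way to confirm this is something like the following, which counts the .nv_fatbin sections across all members of an archive (objdump walks every object file inside the archive):

$ objdump -h libcusolver_static.a | grep -c nv_fatbin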

Is the .nv_fatbin section in my binary so large because it contains some of the .nv_fatbin sections from the math libs? Is there a way to reduce the size of my binary?


My compilation command line looks something like this:

nvcc -m64 -arch sm_86 -ccbin <gnu_path> -DDP -I <cuda_includes> -I <math_libs_include> lin_eq_cus2.cu -o lineqcus2_dp -L <cuda_libs_path> -L <math_libs_path> -Xlinker -Bstatic -lcusolver_static -lcublas_static -lculibos -lcudart_static -lcublasLt_static -Xlinker -Bdynamic -ldl -lpthread -lrt

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
$

The version of gcc and g++ is 9.3.1-2.

Thanks
Karthik

You might want to see if nvprune may help.

In addition to what Robert mentioned: what happens when you use -arch=compute_86 -code=sm_86 versus -arch=compute_86 -code=compute_86,sm_86? The former removes the PTX from the final binary, leaving only SASS.
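Applied to the command line you posted, that would look roughly like this (everything else unchanged):

nvcc -m64 -arch=compute_86 -code=sm_86 -ccbin <gnu_path> -DDP ... lin_eq_cus2.cu -o lineqcus2_dp ...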

@Robert_Crovella, @mnicely ,

I ran nvprune on libcusolver_static.a, libcublas_static.a and libcublasLt_static.a. Using the pruned libs reduces the size of my binary to ~44MB (from ~150MB).
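The pruning commands were of this form (the output library names here are just illustrative):

$ nvprune --arch sm_86 libcusolver_static.a -o libcusolver_static_86.a
$ nvprune --arch sm_86 libcublas_static.a -o libcublas_static_86.a
$ nvprune --arch sm_86 libcublasLt_static.a -o libcublasLt_static_86.a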

However, the resulting binary works only with small matrices, up to size 17. Beyond that cusolverDnDgetrf() fails.

Matrix size up to 17 : program works fine.
Matrix size 18 to 309 : cusolverDnDgetrf() fails with status 7
Matrix size 310 to 10000 : cusolverDnDgetrf() fails with status 6

According to cusolver_common.h error status 7 is CUSOLVER_STATUS_INTERNAL_ERROR and error status 6 is CUSOLVER_STATUS_EXECUTION_FAILED.
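For context, here is a simplified sketch of the call sequence that produces these status values (placeholder names, no error handling, and not the exact code from lin_eq_cus2.cu):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    // Factor an n x n matrix that is already resident on the device (column major).
    void factor(int n, double *d_A)
    {
        cusolverDnHandle_t handle;
        cusolverDnCreate(&handle);

        int lwork = 0;
        cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork);

        double *d_work = nullptr;
        int *d_ipiv = nullptr, *d_info = nullptr;
        cudaMalloc(&d_work, sizeof(double) * lwork);
        cudaMalloc(&d_ipiv, sizeof(int) * n);
        cudaMalloc(&d_info, sizeof(int));

        // The status reported above comes from this call:
        // 7 = CUSOLVER_STATUS_INTERNAL_ERROR, 6 = CUSOLVER_STATUS_EXECUTION_FAILED
        cusolverStatus_t status = cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);
        printf("cusolverDnDgetrf status = %d\n", (int)status);

        cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
        cusolverDnDestroy(handle);
    }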

The behaviour is the same after making Matt’s suggested option changes. In the final product, we plan to use matrix sizes 1-1000, 2000, 3000, 4000, …, 10000.

I did not use all the matrix sizes in the given ranges above. I tried just a few to see where the behaviour switch happens.

The original large binary does not have this failure.

Thanks
Karthik

I’m assuming you pruned everything except sm_86

It’s possible that not all the functions you need are implemented in sm_86. It’s possible that some of the functionality is implemented in sm_80 (which is binary compatible on an sm_86 GPU).

Try pruning to include both sm_80 and sm_86.

Another option would be to prune for sm_80 and sm_86 and compute_86

A final option might be to just prune for compute_86.

I’m assuming you pruned everything except sm_86

I used “-arch sm_86”, which, according to the nvprune manual, should prune everything except sm_86.

Try pruning to include both sm_80 and sm_86.

Another option would be to prune for sm_80 and sm_86 and compute_86

Even though the online nvprune manual indicates that multiple architectures can be passed to the -arch option, the tool fails:

$ nvprune --arch sm_80,sm_86 libcusolver_static.a -o libcusolver_static_c86.a
nvprune fatal   : Unsupported gpu architecture 'sm_80,sm_86'
$
$
$ nvprune --arch sm_80,sm_86,compute_86 libcusolver_static.a -o libcusolver_static_c86.a
nvprune fatal   : Unsupported gpu architecture 'sm_80,sm_86,compute_86'

I am using CUDA version 11.4 at present. I tried nvprune from 11.5. Same results.

A final option might be to just prune for compute_86.

There were no fatal errors, but the following warning was given for libcublas_static.a and libcublasLt_static.a:

nvprune warning : No device code that matched architecture, so stripped out all device code

The pruned libs were generated, but the program fails with:

cuSOLVER initialization failed with error code 7

It’s possible that not all the functions you need are implemented in sm_86. It’s possible that some of the functionality is implemented in sm_80 (which is binary compatible on an sm_86 GPU).

I used “objdump -t” on the binary and looked for lines with the “df” flag, which marks file names. There were files whose names contained “compute_86”, “compute_61” or “compute_75”, but there were many files with no such string in their names.
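The check was roughly along these lines (plus a look at the full list for file names without any compute_ string):

$ objdump -t lineqcus2_dp | grep ' df ' | grep -o 'compute_[0-9]*' | sort | uniq -c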

Thanks
Karthik

That is because you have specified the command incorrectly. The manual states:

-gencode
This option is same format as nvcc --generate-code option, and provides a way to specify multiple architectures which should remain in the object or library. Only the ‘code’ values are used as targets to match. Allowed keywords for this option: ‘arch’,‘code’.

Notice that the description applies to the -gencode option, not the -arch option you are using. The option uses the same syntax as nvcc. So you would do something like:

nvprune -gencode arch=compute_86,code=sm_86 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=compute_86 libcusolver_static.a ...
         ^selects cc8.6 SASS                 ^selects cc8.0 SASS                 ^selects cc8.6 PTX

Not all libraries are necessarily built the same way; you may need to do some inspection with cuobjdump -ptx and cuobjdump -sass to see which architectures are in which libraries. Based on that you may need to try combinations of selections to see what will work for your application. For example, I observed the warning you state for libcublasLt_static.a, but if I select cc8.0 PTX instead of cc8.6 PTX, it runs without that warning.
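For a quick summary of which architectures a given library carries, something along these lines should be enough (the full dumps are large, so the grep keeps only the architecture headers):

$ cuobjdump -sass /usr/local/cuda/lib64/libcublasLt_static.a | grep '^arch' | sort | uniq -c
$ cuobjdump -ptx /usr/local/cuda/lib64/libcublasLt_static.a | grep '^arch' | sort | uniq -c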

FWIW I did the following:

  1. copy the getrf/getrs sample code from here.

  2. change m from 3 to 32 (this seems to be within one of the ranges where you were having trouble)

  3. extract 3 pruned libraries as follows:

    nvprune -gencode arch=compute_70,code=sm_70 /usr/local/cuda/lib64/libcusolver_static.a -o libcus_70.a
    nvprune -gencode arch=compute_70,code=sm_70 /usr/local/cuda/lib64/libcublas_static.a -o libcub_70.a
    nvprune -gencode arch=compute_70,code=sm_70 /usr/local/cuda/lib64/libcublasLt_static.a -o libcLt_70.a
    
  4. compile like so:

    nvcc t1.cu -o t1 -L. -Xlinker -Bstatic -lcus_70 -lcub_70 -lculibos -lcudart_static -lcLt_70 -Xlinker -Bdynamic -ldl -lpthread -lrt
    
  5. and it ran without any errors on my V100 (the numerical output was bogus, of course, but there were no errors reported from CUDA or the cusolver library calls). The original binary was in the 150MB range, and the pruned binary was in the 47MB range.

Robert,

I tried your suggestion.

nvprune -gencode arch=compute_86,code=sm_86 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=compute_86 ...

My binary size comes down to 70MB with the pruned libs. I tested the code with matrix sizes 1 to 1000 (with increment 1) and sizes 2000 to 10000 (with increment 1000). I did not see any failures.


So, with the pruned libs, nvcc is able to pull code out of the math libs in such a way that the binary size stays around 70MB. These functions/files are present in the unpruned libs too. So why is nvcc not able to get these from the unpruned libs and keep the binary size small, especially when the -arch and -code options are used?

I understand that nvcc options like -arch, -code, -gencode and -lto operate on the user’s source code being compiled. Is there an option that can make nvcc link in only a specific architecture’s code from inside a library? If nvcc provided such a feature, then we would not need to prune the libs.

Thanks
Karthik

It sounds like you’re asking for prune functionality to be built into nvcc. I’m not aware of any such nvcc option that would accomplish what you have done here.

If you’d like to see a change in behavior in CUDA, my suggestion would be to file a bug. I don’t know of any specific reasons why prune functionality could not be built into nvcc, although there may be reasons I haven’t thought of. As you’ve already discovered, it may not be a simple matter of matching the user’s specified arch switches.

I have filed the following bug:

https://developer.nvidia.com/nvidia_bug/3485469

I was wondering why the math libs are single fat libs. Instead, there could have been multiple versions, perhaps one per virtual architecture (e.g. libcusolver_compute80.a, libcusolver_compute70.a, etc.). Inside each lib, there could be multiple versions of each API, perhaps one per real architecture (e.g. getrf_sm8_6(), getrf_sm8_7(), etc.). All of this could be hidden from the user, who would simply call getrf() and link with -lcusolver. Based on the user’s choice, nvcc could change the API and lib names to select the correct lib and API versions. That would probably keep the binary size smaller.

I am sure all of these things were considered by the nvcc team. There must be some reason for the present approach.