[SOLVED] Trying to remove PTX from a shared library leaves it there, and kernels are not executed

saulocpp · January 16, 2019, 5:19pm

I am writing a shared library to be dynamically loaded by a program (as a plugin) and am trying to statically link cuFFT into the lib.
The cuFFT code is being compiled with the parameters indicated in the documentation for static linkage:

nvcc -O3 -Xcompiler -fPIC -lcufft_static -lculibos cufft_code.cu -c -o cufft_code.o

The shared library is compiled as follows:

nvcc -O3 some_code.o cufft_code.o -Xcompiler -fPIC -ldl --shared -o my_lib.so

Everything compiles without errors, although my_lib.so is only 700KB and I was expecting at least 100MB due to the static linking of cuFFT. But OK, let’s try to use it in a program:

void *lib_handle = dlopen("my_lib.so", RTLD_NOW | RTLD_GLOBAL);
if(lib_handle == NULL)
    {
    cout << "Unable to load my_lib.so: " << dlerror() << endl;
    return -1;
    }

Then dlerror() returns “undefined symbol: cufftDestroy”.
There is definitely something wrong/missing in my compilation process. Do you guys have any idea?

saulocpp · January 16, 2019, 6:02pm

Well, generating the shared library also required the static linking arguments:

nvcc -O3 some_code.o cufft_code.o -lcufft_static -lculibos -Xcompiler -fPIC -ldl --shared -o my_lib.so

With that fixed, my_lib.so is now “properly” sized at 120MB and the library opens without error.
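
The -l options only take effect at the link step; a compile with -c links nothing, so they are not applied there. One way to confirm that cuFFT really ended up inside the library is to look for cuFFT symbols that are still undefined. A minimal sketch, assuming GNU nm is available; the helper name undefined_cufft_symbols is made up for illustration:

```shell
#!/usr/bin/env bash
# undefined_cufft_symbols LIB
# Prints the dynamic symbols in LIB whose names start with "cufft"
# and that are still undefined (i.e. must be resolved at load time).
undefined_cufft_symbols() {
    nm -D --undefined-only "$1" | awk '$NF ~ /^cufft/ { print $NF }'
}

# Example:
# undefined_cufft_symbols my_lib.so
```

If cuFFT was linked in statically, this should print nothing. Any name it does print (such as cufftDestroy) still has to be found by the dynamic loader when dlopen() is called.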

saulocpp · January 18, 2019, 5:53pm

I will hijack my own thread to ask something else related to the compilation process.
In order to remove the PTX code, following the many SO threads showing how to do it, I added this to my Makefile:

CUDA_ARCHS_1	= -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37
CUDA_ARCHS_2	= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52
CUDA_ARCHS_3	= -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61
CUDA_ARCHS_4	= -gencode arch=compute_70,code=sm_70
CUDA_ARCHS	= $(CUDA_ARCHS_1) $(CUDA_ARCHS_2) $(CUDA_ARCHS_3) $(CUDA_ARCHS_4)

Then my linking line is:

nvcc $(CUDA_ARCHS) -O3 some_code.o cufft_code.o -lcufft_static -lculibos -Xcompiler -fPIC -ldl --shared -o my_lib.so

When I do a cuobjdump --list-ptx my_lib.so, it outputs 93 lines in the form:

PTX file   93: libGeneSeis.93.sm_70.ptx

Then I run a test program that calls functions of this lib, and my kernels are not executed (though the cuFFT calls are). The kernels work fine if I don’t use the -gencode pairs above.
Can you guys explain what could be wrong here, considering that the PTX codes shouldn’t be there and that my kernels are not executed in this configuration?

njuffa · January 18, 2019, 6:06pm

I just did a quick experiment with your -gencode switches on a single source file, and I don’t see any PTX embedded in the resulting object file.

This suggests that your object files were built with different switches than the one you are showing, and that you would want to inspect your build log to see what switches are actually being passed to nvcc when the object files are created.

Any third-party object files or libraries you link into your binary could likewise contain PTX code.

saulocpp · January 18, 2019, 6:20pm

Njuffa, sorry that I didn’t mention it. I also used these options on a test program (the sample code generated when one creates an Nsight project), and I don’t see any PTX code in the resulting executable either.

One of my objects is compiled with:

nvcc -O3 -Xcompiler -fPIC some_code.cu -c -o some_code.o

And another object, that contains cuFFT calls, is compiled with:

nvcc -O3 -Xcompiler -fPIC -lcufft_static -lculibos cufft_code.cu -c -o cufft_code.o

The linking line has the -gencode options and the -ldl --shared switches, as seen above. I haven’t seen it stated anywhere that PTX can’t be removed when a shared lib is created, so I assume there is a mistake somewhere in these compilation commands.

Robert_Crovella · January 18, 2019, 6:21pm

The cufft library itself has PTX embedded in it. If you statically link to cufft, any such binary will have PTX in it.

Of course I don’t think any of that explains why a code is or is not working.

If you strip out PTX, then your code won’t run except on the specific architectures you enumerated. So one possible explanation for “not working” is that the architecture you are running on is not one of the ones you enumerated.

If your kernels are not being executed, using proper CUDA error checking and/or running your code with cuda-memcheck will likely help to pinpoint the issue. For example, if you were running into the issue that I described, you would get a message along the lines of “no binary for GPU”.

Failure to use proper CUDA error checking and also failure to test your code with cuda-memcheck, but writing a request here for help, is a waste of your time (and others’) in my opinion.

njuffa · January 18, 2019, 6:25pm

The only way PTX code can get into a binary (executable or shared library) is if it is present in the object files that were linked together when creating the binary. This suggests you need to examine the object files for PTX content, then determine why they contain PTX.

Maybe try a make clobber to nuke all old object files that may still be lying around, and double check the switches actually used by nvcc to build the object files.
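
That check can be scripted. A rough sketch, assuming cuobjdump from the CUDA toolkit is on the PATH and prints one “PTX file” line per embedded image, as in the output quoted earlier in the thread; report_ptx is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# report_ptx FILE...
# For each object file or library, prints how many PTX images it embeds.
report_ptx() {
    local f count
    for f in "$@"; do
        # Count the "PTX file ..." lines; grep -c prints 0 when there are none.
        count=$(cuobjdump --list-ptx "$f" 2>/dev/null | grep -c '^PTX file' || true)
        echo "$f: $count PTX image(s)"
    done
}

# Example:
# report_ptx some_code.o cufft_code.o my_lib.so
```

Running it on each input of the link step shows which file the PTX comes from, rather than only that the final library contains some.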

tera · January 18, 2019, 6:30pm

You should be able to use nvprune to remove PTX from the static cuFFT library that you are linking in.
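
A possible nvprune invocation, wrapped in a small helper. This is only a sketch: prune_to_sass is a hypothetical name, the paths in the usage comment are placeholders, and whether the pruned cuFFT library links and runs cleanly with a given toolkit version is untested here. Since the -gencode specification names only a real architecture (code=sm_XX), no PTX should remain in the output:

```shell
#!/usr/bin/env bash
# prune_to_sass ARCH INPUT OUTPUT
# Writes a copy of INPUT that keeps only machine code for sm_ARCH.
prune_to_sass() {
    local arch=$1 input=$2 output=$3
    nvprune -gencode "arch=compute_${arch},code=sm_${arch}" "$input" -o "$output"
}

# Example (paths are placeholders, adjust to the local toolkit install):
# prune_to_sass 70 /usr/local/cuda/lib64/libcufft_static.a libcufft_static_sm70.a
```

The pruned archive would then be given to the link step by file name in place of -lcufft_static.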

njuffa · January 18, 2019, 6:40pm

Expanding on Robert Crovella’s point in #6: It is good practice to generate fat binaries that contain binary code for all architectures that the application is intended to run on, plus PTX code for the latest architecture among those. This ensures the binary remains functional on future architectures (there may be negative performance impact from JIT compilation for those architectures).

saulocpp · January 18, 2019, 6:46pm

Then what you guys say about cuFFT (or any other object with PTX) statically linked bringing the PTX into the lib explains all.
I follow tera’s signature at heart and have all the error-checking around my code, and (un)fortunately nothing is shown. And to make things worse, the second time I ran the program the kernels executed correctly…

Tera, I actually got this PTX removal advice from one of your/Njuffa’s answers in SO, and saw nvprune being mentioned. I will give it a try. But as long as it is cuFFT PTX there, if it is fine for NVidia, it is fine for me. :)

njuffa · January 18, 2019, 6:50pm

Presumably you are referring to this question on SO:

How to remove all PTX from compiled CUDA to prevent Intellectual Property leaks - Stack Overflow

saulocpp · January 18, 2019, 7:03pm

This one, yes.

tera · January 18, 2019, 7:15pm

So, have you considered Njuffa’s comment there about removal of PTX not preventing property leaks?

saulocpp · January 18, 2019, 7:46pm

That was not the first thread where I saw him making this comment.
There are genius programmers writing their CUDA stuff out there and possibly not worried about IP being broken. Why would I be the one concerned…

I have seen a few publications on this subject, all of them come to the same conclusion, summarized by Njuffa’s comment.

njuffa · January 18, 2019, 8:07pm

It is good to see confirmed that I am (at least sometimes) consistent in my statements :-)

saulocpp · January 19, 2019, 9:59am

Njuffa, since you are here: what would be the difference between making PTX for CC 3.0 and PTX for CC 7.0 available for future architectures? In the post, your example uses CC 5.2.

njuffa · January 19, 2019, 10:09am

All necessary information can be found in the documentation. Really :-)

To embed PTX, you simply specify a virtual architecture instead of a physical architecture, e.g.:

-gencode arch=compute_70,code=compute_70

I can’t think of any reason to embed more than one PTX version in a fat binary. The point of including PTX is to make the code future proof as new hardware architectures reach the market that are not binary compatible with earlier GPU architectures. For that reason we want PTX for the latest supported architecture in the fat binary.
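
As a sketch of that recommendation: the helper below (gencode_flags, a made-up name) builds the -gencode options for a list of compute capabilities, emitting machine code for each one and PTX only for the last. The flags have to be passed when the .cu files are compiled into objects, since that is the step where device code is generated.

```shell
#!/usr/bin/env bash
# gencode_flags ARCH...
# Prints -gencode options: machine code (SASS) for every listed architecture,
# plus PTX for the last (newest) one only. List architectures oldest first.
gencode_flags() {
    [ "$#" -gt 0 ] || return 1
    local arch last flags=""
    for arch in "$@"; do
        flags="$flags -gencode arch=compute_${arch},code=sm_${arch}"
        last=$arch
    done
    # Virtual architecture as the code target embeds PTX
    flags="$flags -gencode arch=compute_${last},code=compute_${last}"
    echo "${flags# }"
}

# Example: compile an object with SASS for 6.0, 6.1, 7.0 and PTX for 7.0
# nvcc $(gencode_flags 60 61 70) -O3 -Xcompiler -fPIC some_code.cu -c -o some_code.o
```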
