[SOLVED] Trying to remove PTX from a shared library leaves it there, and kernels are not executed

I am writing a shared library to be dynamically loaded by a program (as a plugin) and am trying to statically link cuFFT into the lib.
The cuFFT code is being compiled with the parameters indicated in the documentation for static linkage:

nvcc -O3 -Xcompiler -fPIC -lcufft_static -lculibos cufft_code.cu -c -o cufft_code.o

The shared library is compiled as follows:

nvcc -O3 some_code.o cufft_code.o -Xcompiler -fPIC -ldl --shared -o my_lib.so

Everything compiles without errors, but my_lib.so is only 700 KB, whereas I was expecting at least 100 MB due to the static linking of cuFFT. But OK, let’s try to use it in a program:

void *lib_handle = dlopen("my_lib.so", RTLD_NOW | RTLD_GLOBAL);
if(lib_handle == NULL)
    {
    cout << "Unable to load my_lib.so: " << dlerror() << endl;
    return -1;
    }

Then dlerror() returns “undefined symbol: cufftDestroy”.
There is definitely something wrong/missing in my compilation process. Do you guys have any idea?

Well, generating the shared library also required the static linking arguments:

nvcc -O3 some_code.o cufft_code.o -lcufft_static -lculibos -Xcompiler -fPIC -ldl --shared -o my_lib.so

That fixed it: my_lib.so is now “properly” sized at 120 MB and the library is opened without error.

I will hijack my own thread to ask something else related to the compilation process.
In order to remove the PTX code, following the many SO threads showing how to do it, I added this to my Makefile:

CUDA_ARCHS_1	= -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37
CUDA_ARCHS_2	= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52
CUDA_ARCHS_3	= -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61
CUDA_ARCHS_4	= -gencode arch=compute_70,code=sm_70
CUDA_ARCHS	= $(CUDA_ARCHS_1) $(CUDA_ARCHS_2) $(CUDA_ARCHS_3) $(CUDA_ARCHS_4)

Then my linking line is:

nvcc $(CUDA_ARCHS) -O3 some_code.o cufft_code.o -lcufft_static -lculibos -Xcompiler -fPIC -ldl --shared -o my_lib.so

When I do a cuobjdump --list-ptx my_lib.so, it outputs 93 lines in the form:

PTX file   93: libGeneSeis.93.sm_70.ptx

Then I run a test program that calls functions of this lib, and my kernels are not executed (though the cuFFT calls are). The kernels work fine if I don’t use the -gencode pairs above.
Can you guys explain what could be wrong here, considering that the PTX codes shouldn’t be there and that my kernels are not executed in this configuration?

I just did a quick experiment with your -gencode switches on a single source file, and I don’t see any PTX embedded in the resulting object file.

This suggests that your object files were built with different switches than the one you are showing, and that you would want to inspect your build log to see what switches are actually being passed to nvcc when the object files are created.

Any third-party object files or libraries you link into your binary could likewise contain PTX code.

Njuffa, sorry that I didn’t mention it. I also used these options on a test program (the sample code generated when one creates an Nsight project), and I don’t see any PTX code in the resulting executable either.

One of my objects is compiled with:

nvcc -O3 -Xcompiler -fPIC some_code.cu -c -o some_code.o

And another object, that contains cuFFT calls, is compiled with:

nvcc -O3 -Xcompiler -fPIC -lcufft_static -lculibos cufft_code.cu -c -o cufft_code.o

The linking line has the -gencode options and the -ldl --shared switches, as seen above. I haven’t seen it stated anywhere that PTX can’t be removed when a shared lib is created, so I assume there is a mistake somewhere in these compilation commands.

The cuFFT library itself has PTX embedded in it. If you statically link to cuFFT, any such binary will have PTX in it.

Of course I don’t think any of that explains why a code is or is not working.

If you strip out PTX, then your code won’t run except on the specific architectures you enumerated. So one possible explanation for “not working” is that the architecture you are running on is not one of the ones you enumerated.

If your kernels are not being executed, using proper CUDA error checking and/or running your code with cuda-memcheck will likely help to pinpoint the issue. For example, if you were running into the issue that I described, you would get a message along the lines of “no binary for GPU”.

Failure to use proper CUDA error checking and failure to test your code with cuda-memcheck, while writing a request for help here, is a waste of your time (and others’) in my opinion.

The only way PTX code can get into a binary (executable or shared library) is if it is present in the object files that were linked together when creating the binary. This suggests you need to examine the object files for PTX content, then determine why they contain PTX.

Maybe try a make clobber to nuke all old object files that may still be lying around, and double check the switches actually used by nvcc to build the object files.
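
A quick way to do that check is sketched below: a small shell helper that runs cuobjdump --list-ptx on each file and counts the lines starting with "PTX file", the form shown in the listing earlier in this thread. The name count_ptx and the CUOBJDUMP override variable are made up for this sketch, and it assumes the cuobjdump output format matches that listing.

```shell
# Hypothetical helper: report how many PTX entries each object file or
# library contains. CUOBJDUMP can be overridden; it defaults to the
# toolkit's cuobjdump found on PATH.
CUOBJDUMP=${CUOBJDUMP:-cuobjdump}

count_ptx() {
    for f in "$@"; do
        # Assumed line format: "PTX file    1: name.1.sm_70.ptx".
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$("$CUOBJDUMP" --list-ptx "$f" 2>/dev/null | grep -c '^PTX file' || true)
        printf '%s: %s PTX entries\n' "$f" "$n"
    done
}
```

For example, count_ptx some_code.o cufft_code.o my_lib.so prints one line per file, which shows which input brings the PTX along. Keep in mind that, for ordinary (non-relocatable) compilation, the -gencode switches take effect when the .cu files are compiled to objects, so they need to be on the nvcc -c lines and not only on the link line.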

You should be able to use nvprune to remove PTX from the static cuFFT library that you are linking in.

Expanding on Robert Crovella’s point in #6: It is good practice to generate fat binaries that contain binary code for all architectures that the application is intended to run on, plus PTX code for the latest architecture among those. This ensures the binary remains functional on future architectures (there may be negative performance impact from JIT compilation for those architectures).

Then what you guys say about cuFFT (or any other object with PTX) statically linked bringing the PTX into the lib explains all.
I take tera’s signature to heart and have error checking throughout my code, and (un)fortunately nothing is reported. And to make things worse, the second time I ran the program the kernels executed correctly…

Tera, I actually got this PTX removal advice from one of your/Njuffa’s answers on SO, and saw nvprune being mentioned. I will give it a try. But as long as it is only cuFFT’s PTX in there, if it is fine for NVIDIA, it is fine for me. :)

Presumably you are referring to this question on SO:

https://stackoverflow.com/questions/41874155/how-to-remove-all-ptx-from-compiled-cuda-to-prevent-intellectual-property-leaks

This one, yes.

So, have you considered Njuffa’s comment there about removal of PTX not preventing intellectual property leaks?

That was not the first thread where I saw him making this comment.
There are genius programmers writing their CUDA stuff out there and possibly not worried about IP being broken. Why would I be the one concerned…

I have seen a few publications on this subject, and all of them come to the same conclusion, summarized by Njuffa’s comment.

It is good to see confirmed that I am (at least sometimes) consistent in my statements :-)

Njuffa, since you are here: what difference would it make for future architectures whether the PTX made available is for CC 3.0 or for CC 7.0? In the post, you use CC 5.2 as the example.

All necessary information can be found in the documentation. Really :-)

To embed PTX, you simply specify a virtual architecture instead of a physical architecture, e.g.

-gencode arch=compute_70,code=compute_70

I can’t think of any reason to embed more than one PTX version in a fat binary. The point of including PTX is to make the code future-proof as new hardware architectures reach the market that are not binary compatible with earlier GPU architectures. For that reason we want PTX for the latest supported architecture in the fat binary.
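
The recommendation above (binary code for every target architecture, plus PTX only for the newest one) can be sketched as a small shell helper that assembles the -gencode switches. The function name gencode_flags is hypothetical; the flag syntax is the one already used in this thread, and the sketch assumes purely numeric architecture numbers such as 60 or 70.

```shell
# Hypothetical helper: emit nvcc -gencode switches that produce binary
# code (SASS) for every listed architecture, plus PTX for the highest
# one only. Arguments are numeric compute capabilities without the dot.
gencode_flags() {
    if [ "$#" -eq 0 ]; then
        echo "usage: gencode_flags ARCH [ARCH...]" >&2
        return 1
    fi
    flags=""
    latest=0
    for cc in "$@"; do
        # Physical architecture: binary code for this GPU generation.
        flags="$flags -gencode arch=compute_$cc,code=sm_$cc"
        if [ "$cc" -gt "$latest" ]; then
            latest=$cc
        fi
    done
    # Virtual architecture: PTX for the newest target, for JIT on future GPUs.
    flags="$flags -gencode arch=compute_$latest,code=compute_$latest"
    printf '%s\n' "${flags# }"
}
```

A possible use, with the file names from earlier in the thread, would be nvcc $(gencode_flags 30 35 37 50 52 60 61 70) -O3 -Xcompiler -fPIC some_code.cu -c -o some_code.o, i.e. passing the switches when the object files are compiled.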