Compilation flags help

Hi,

I have been using my access to a P100 to collect some metrics on how well our code executes. I ran into a runtime issue on the P100 related to the optimization flag ‘-Xptxas -O2/-O3’. When I compile with -O2 or higher and execute the program, I get a bus error on the very first ‘cudaMemcpyFromSymbol’. The error message is ‘*** Break *** segmentation violation’, and cuda-gdb’s stack trace is as follows:

#6  0x00002aaac19e3ab8 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#7  0x00002aaac1c4ef6a in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#8  0x00002aaac1c4efa9 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#9  0x00002aaac18b4260 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#10 0x00002aaac18bc03b in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#11 0x00002aaac1e9967d in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#12 0x00002aaac18bf3f4 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#13 0x00002aaac18c0ac8 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#14 0x00002aaac18b705c in __cuda_CallJitEntryPoint () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-ptxjitcompiler.so.361.93.02
#15 0x00002aaab2878602 in fatBinaryCtl_Compile () from /cm/local/apps/cuda/libs/current/lib64/libnvidia-fatbinaryloader.so.361.93.02
#16 0x00002aaab2031a62 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#17 0x00002aaab2032593 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#18 0x00002aaab1f8adce in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#19 0x00002aaab1f8b0b0 in ?? () from /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
#20 0x00002aaaaaf27a5d in ?? () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#21 0x00002aaaaaf1be60 in ?? () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#22 0x00002aaaaaf26cc6 in ?? () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#23 0x00002aaaaaf2b401 in ?? () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#24 0x00002aaaaaf14caa in ?? () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#25 0x00002aaaaaf35711 in cudaMemcpyFromSymbol () from /cm/extra/apps/cuda80/toolkit/8.0.44/lib64/libcudart.so.8.0
#26 0x000000000046216c in DalitzVetoPdf::DalitzVetoPdf(std::string, Variable*, Variable*, Variable*, Variable*, Variable*, Variable*, std::vector<VetoInfo*, std::allocator<VetoInfo*> >) ()
#27 0x000000000042ccf9 in makeKzeroVeto() ()
#28 0x0000000000445a82 in makeOverallSignal() ()
#29 0x0000000000458bca in runCanonicalFit(char*, bool) ()
#30 0x0000000000429336 in main ()
===========================================================

Something is going wrong inside ‘fatBinaryCtl_Compile’, I suspect when it JIT-compiles the embedded SM_20 PTX for the SM_60 device (?). I have tried enabling and disabling every per-phase option of the PTX optimizer, but none of them appear to make a difference. The only option I cannot change is ‘-abi=yes’; switching that off is not something I can easily do right now. I am unsure whether the issue above is a code problem or a compiler problem. Using SM_60+ as the architecture produces the same error as above.

I have created a standalone reproducer in an attempt to isolate the problem, but it does not trigger the crash; either it lacks the complexity of the full code, or the cause is something completely unrelated.

Does anyone have suggestions or advice on how to debug this kind of issue further? Is ‘-O2’ composed of finer-grained internal flags I could pass individually to pinpoint the specific optimization related to this crash?

Thank you very much in advance,
-brad

As far as I am aware, there isn’t any fine-grained control of PTXAS optimizations. The most likely cause of the issue you are seeing is a bug in your code: a violation of the CUDA programming model, invocation of undefined C++ behavior, memory corruption, etc., that is exposed at higher optimization levels. A second, less likely scenario is a bug in the PTXAS optimizer (there are some of those in every release of CUDA). Does cuda-memcheck complain about anything?
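If you have not run it yet, the basic invocations look like this (the binary name here is just a placeholder; add whatever arguments your program normally takes):

# check for out-of-bounds and misaligned accesses
cuda-memcheck ./myapp
# racecheck can catch shared-memory races that only surface at higher optimization levels
cuda-memcheck --tool racecheck ./myapp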

It looks like you are using JIT compilation, which means the PTXAS component is coming from the driver rather than from the standalone compiler. Are you using the latest CUDA driver? I would advocate not relying on JIT compilation unless you absolutely have to. There are indications that there are subtle differences between the PTXAS components of the driver and the standalone compiler, with the latter probably getting more usage and thus being more robust. It is also harder to look at the generated code when JIT is in use.

Instead of using JIT compilation, build a fat binary with canned machine code (SASS) for all required GPU architectures.
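For example, something along these lines (the architecture list is illustrative; pick the ones you actually deploy on):

nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_60,code=compute_60 \
     -o myapp myapp.cu

The last -gencode additionally embeds PTX so the binary remains forward compatible with future architectures via JIT, while every current target runs canned SASS.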

Hi njuffa,

Thank you for the response! Here are answers to your questions:

It still crashes; cuda-memcheck reports 0 errors.

The driver currently installed is 361.93.02, which is almost the newest version; I can look into requesting the newest driver, 361.93.03.

I am now compiling specifically for the P100 with:

nvcc -O3 --gpu-architecture=compute_60 --gpu-code=sm_60

I now receive the compilation error:

nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)

Following a previous post on this topic that you replied to (https://devtalk.nvidia.com/default/topic/808186/nvcc-error-ptxas-died-due-to-signal-11-invalid-memory-reference-/?offset=4), changing to ‘-Xptxas -O1’ does fix the error.

The code lives in ~20 files that are cat’ed into a single ‘CUDAglob’ file, which is then compiled by nvcc (the build was developed for CUDA 4.0 and has not been updated yet). The files contain both CUDA and C++ classes, and I suspect this could be part of the problem. Roughly, the build boils down to the sketch below.
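(File and target names here are illustrative, not the actual ones:)

# concatenate the ~20 mixed CUDA/C++ sources into one translation unit
cat src/*.cu src/*.cc > CUDAglob.cu
# compile the single glob file for the P100
nvcc -O3 --gpu-architecture=compute_60 --gpu-code=sm_60 -o fit CUDAglob.cu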

A segfault inside the compiler is never an acceptable response. It may happen due to an issue with the source code, but that should result in a proper error message, not the compiler blowing up.

I would suggest reporting this as a bug to NVIDIA right away. The bug reporting form is linked from the CUDA registered developer website. To allow for speedy resolution, make sure you submit the simplest possible repro code with the bug report.

For now, I suggest you continue to use -Xptxas -O1 as a workaround. The compiler team might be able to suggest a more targeted workaround once they have determined the root cause, but in my experience that rarely happens.
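Concretely, keep the host-side optimization level and dial back only the PTX optimizer, e.g. adapting the compile line you posted above (untested on my end):

nvcc -O3 -Xptxas -O1 --gpu-architecture=compute_60 --gpu-code=sm_60 -o fit CUDAglob.cu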

I have reported the bug, ID: 1838084. Thanks njuffa!

I looked at the bug report you filed. Without a reproducer, it’s not likely to make much forward progress.

As I pointed out, repro code is essential to the bug resolution process. The first action in handling a bug report is to try and independently reproduce it, using the code and other information provided by the reporter of the issue.

It is possible that the issue is identified as PEBKAC at that stage (not likely here), or that it cannot be reproduced in-house (e.g. due to missing software components, build instructions, or configuration data), in which case there will be iterations with the reporter until it can be reproduced. Only then does the process of root-cause determination start.

This is why repro codes should be self-contained, and as small and as easy to set up as possible.

My bad, apparently the code in question is available on GitHub.

In that case, let’s hope that the compiler team can have a look soon. Segfaults in tools never look good from the customer perspective. Therefore: valgrind early, valgrind often :-)
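If you want to give the compiler team a head start, one way to isolate the crash (a sketch; intermediate file names vary somewhat between toolkit versions) is to keep the intermediate PTX and run the standalone ptxas on it directly:

# keep intermediate files, including the generated .ptx
nvcc -keep -O3 --gpu-architecture=compute_60 --gpu-code=sm_60 -c CUDAglob.cu
# re-run the standalone assembler on the PTX; this step should reproduce the segfault
ptxas -arch=sm_60 -O3 CUDAglob.ptx -o CUDAglob.cubin
# a host-side valgrind trace of the ptxas crash makes useful material for the bug report
valgrind ptxas -arch=sm_60 -O3 CUDAglob.ptx -o CUDAglob.cubin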