Undefined behavior in device code

Hi,

I am experiencing a very weird issue. To briefly describe my setup: from a ray generation program I am (through some function calls) making a trace call to a two-level AS with one level of instancing and a custom primitive. In the intersection program, I use the instance index of the hit primitive as an index to an array of structs. I then call a function and pass the found struct along. In this first function I pass a variable to a second function. In this second function, I check whether the value of this variable is still the same using a home-made assertion macro. The macro works by printing that an error has occurred and causing an interrupt by dereferencing a null pointer to terminate the program. The problem now is that the assertion fails. This should not be the case, since no modifications are made to this value between passing it and asserting it.
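For reference, a simplified sketch of what such an assertion macro looks like (not my exact code):

#define ASSERT(cond)                                              \
    do {                                                          \
        if (!(cond)) {                                            \
            printf("assertion failed: %s (%s:%d)\n",              \
                   #cond, __FILE__, __LINE__);                    \
            *reinterpret_cast<volatile int *>(0) = 0; /* trap */  \
        }                                                         \
    } while (0)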

The interesting part is that whether the assertion in the second function fails or succeeds depends on factors that should not influence the outcome. For example, adding a single print statement in the first function, or forcing the first function to not inline (with __noinline__), results in the assertion succeeding. Otherwise, the assertion fails and the program crashes. My initial guess was that data is written out of bounds somewhere (possibly overwriting program code), but over the past two days I could not identify such an issue.

My problem looks very similar to this post. There, the suggestion is made that the cause of the poster’s issue may be a compiler bug. A possible solution that is offered is to disable optimization and see if the issue persists. I am using the CMake sample framework from the SDK, and have tried doing this by inserting the following two lines into the top-level CMakeLists.txt at line 210:

list(APPEND CUDA_NVCC_FLAGS -Xptxas -O0)
list(APPEND CUDA_NVRTC_FLAGS -Xptxas -O0)

However, this did not seem to make a difference. I also made sure I cleaned the OptiX cache and reconfigured and regenerated the CMake project from scratch. My first question is: is this the correct way to disable optimization of device code compilation? I am not sure, since the whole CMake setup of the SDK samples is quite involved.

My second question is: what (else) could cause such an issue? Like I said, I have not been able to find occurrences of memory being written out of bounds, but are there any other problems that could cause something like this? I understand that my explanation of the situation is not great, but I cannot reliably reduce the code to a reproducible example (removing code makes the crash disappear). I am also not at liberty to share the entire code (not that doing so would help, because the relevant part is quite large).

By the way, this is on OptiX 7.3, CUDA 11.1, and on an RTX 2070 GPU.

Hey @nolmoonen ,

Very hard to say what might be happening based on the description. You said you’re indexing into a struct based on primitive ID, right? So you do expect for multiple threads in a warp to access the same struct at the same time?

Is the variable you’re setting / testing in the struct, or separate? Does the variable live in local or global memory, or exist as a payload or attribute register? Is the variable part of an array, or is it a single standalone named variable? How do you know whether it’s changed - are you saving the value somewhere else, or hard-coding the test value?

Another thing you might try is marking the variable volatile to force it not to be cached.

One possible way to try to build a minimal reproducer for this without having to share your code is to modify one of the OptiX SDK samples using the very same setup.

If you’re using the cmake setup in the OptiX SDK, then to make a debug build without optimizations, you should set CMAKE_BUILD_TYPE to “Debug”, and then make sure your OptiX code uses OPTIX_COMPILE_OPTIMIZATION_LEVEL_0 and OPTIX_COMPILE_DEBUG_LEVEL_FULL. OptiX validation mode should be enabled once in a while to see if it uncovers anything you didn’t already know, but it should generally be disabled while you’re reproducing and tracking down the error.
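For reference, a minimal sketch of those module options (the field names are from the OptiX 7.3 API, the variable name is just an example):

OptixModuleCompileOptions module_compile_options = {};
// no optimization and full debug info for the device modules
module_compile_options.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;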


David.

My first question is: is this the correct way to disable optimization of device code compilation?

The ptxas settings will not have any effect. The PTX assembler is not involved at all when compiling CUDA source code to PTX source code.

In the intersection program, I use the instance index of the hit primitive as an index to an array of structs. I then call a function and pass the found struct along. In this first function I pass a variable to a second function. In this second function, I check whether the value of this variable is still the same using a home-made assertion macro.

What do you mean by passing the found struct along?
You have an index to access that struct. Isn’t that index enough to access that struct elsewhere? Then you wouldn’t even need an attribute inside the custom intersection program, because the instance index of the hit primitive is available in all hit record functions (IS, AH, CH).

A little code with that structure, the intersection’s optixReportIntersection arguments (attributes) and some pseudo code of the argument passing between the functions would be helpful.

The macro works by printing that an error has occurred and causing an interrupt by dereferencing a null pointer to terminate the program.

There is a much cleaner way to implement a custom assertion.
You should have an exception program running in debug builds which reports the exception code:
https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#exceptions
Something like this:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/exception.cu

You can then implement your own optixAssert(bool condition) which uses one of the optixThrowException() overloads to throw a user defined exception code and potentially more details you can then get with the optixGetExceptionDetail_*() functions inside the exception program.

For that to work you’d need to enable the user exceptions inside OptixExceptionFlags of the OptixPipelineCompileOptions.
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/src/Device.cpp#L570

With that method, OptiX will cleanly handle these exceptions without getting the CUDA context into an unrecoverable state.
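A minimal sketch of that approach (assuming exception code 0 is reserved for asserts and user exceptions are enabled in the pipeline options):

// device side
__forceinline__ __device__ void optixAssert(bool condition)
{
    if (!condition)
        optixThrowException(0, optixGetLaunchIndex().x); // code 0 = assert
}

// exception program, reports whatever was thrown
extern "C" __global__ void __exception__all()
{
    const int code = optixGetExceptionCode();
    if (code == 0) // our user-defined assert code
        printf("optixAssert failed, launch index x = %u\n",
               optixGetExceptionDetail_0());
}

// host side: enable user exceptions in OptixPipelineCompileOptions
// pipeline_compile_options.exceptionFlags =
//     OPTIX_EXCEPTION_FLAG_USER | OPTIX_EXCEPTION_FLAG_DEBUG;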

I never use __noinline__ in my OptiX programs. OptiX prefers to inline code for better performance.
Except for domain programs (RG, EX, IS, AH, CH, MS), which must be extern "C" __global__, and callables (DC, CC), which must be extern "C" __device__, I define every other device function as __forceinline__ __device__ to have the compiler inline it. I do not use plain __inline__ because that is just a hint, and the compiler may decide not to inline code if the number of arguments or the body of the function gets bigger.
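A short sketch of that convention (hypothetical function names):

// domain program: extern "C" __global__
extern "C" __global__ void __miss__radiance()
{
    // set the miss color on the per-ray payload, etc.
}

// callable: extern "C" __device__
extern "C" __device__ float __direct_callable__attenuation(const float d)
{
    return 1.0f / (1.0f + d * d);
}

// every other device function: __forceinline__ __device__
__forceinline__ __device__ float sqr(const float x)
{
    return x * x;
}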

Hi, thank you both for your replies. To answer your questions/remarks:

I should have clarified: I do not write to this struct, I only read its values. I do expect multiple threads in a warp to access the same struct, but this should not be a problem as no writes happen. The variable I am setting and testing is separate from the struct I find with the primitive ID. It lives in local memory and is part of another struct. I am hard-coding the test value. See the bottom of this post for a code example. I also believe using volatile should not help as the variable is not written to.

I am currently trying to do this but have had no success so far. I understand this would be the best way to proceed.

With my initial problem description, I use CMAKE_BUILD_TYPE “Release” and the assertion fails. Validation mode is off, the optLevel is default, and the debugLevel is line info.

When I switch the CMAKE_BUILD_TYPE to “Debug” and all other settings remain the same, the assertion still fails. When I set the optLevel to OPTIX_COMPILE_OPTIMIZATION_LEVEL_0 and debugLevel to OPTIX_COMPILE_DEBUG_LEVEL_FULL, the assertion succeeds and the program runs without issue. Level 0 is the only optimization level for which it works; with level 1 it already fails.

And to Detlef’s remarks:

Okay I see, I removed this and disabled optimization via the OptixModuleCompileOptions as stated above.

I realize that the initial description was too vague, see below for a more detailed setup.

Thank you for the suggestion, but I believe my current setup is sufficient to show the problem.

I understand this, and I also use __forceinline__ __device__ for all my functions; I only used __noinline__ to see if it made a difference regarding the assertion, which it did.

Now for a better indication of the setup:

// other_function.h
// this function has multiple other parameters, not all data is packed into the 
// struct called 'helper'
__forceinline__ __device__ float other_function(helper* d)
{
    // code generating a float value based on all parameters
    ...
    
    // this assertion fails
    ASSERT(d->val == 1.f);

    ...

    return 0.f; // this is not 0, but the generated value
}

// function.h
__forceinline__ __device__ void eval(prd *p, const foo *f)
{
    helper d;

    // large amount of code generating data based on p and f
    ...

    // for testing, I manually override this value
    d.val = 1.f;
    float t = other_function(&d);
    // do something based on t
    ...
}


// intersection.cu
extern "C" __global__ void __intersection__test()
{
    const unsigned int idx = optixGetInstanceIndex();
    // params.foos is an aos in global memory
    const foo *f = &params.foos[idx];
    
    // actual intersection test happens here based on f and world ray
    // if not intersecting, this function returns
    ...

    // pointer to prd is obtained from ray payload
    prd *p = getPRD();
    eval(p, f);
}

To connect back to the original issue, forcing eval to not inline or adding printf(""); somewhere in eval causes the assertion to succeed. Furthermore, changing the optimization level to 0 also causes the assertion to succeed.

Yup, that sounds like a compiler bug.

I assume you’re still on 471.41, so there is no newer driver to try.

There is not much to analyze without a complete and minimal reproducer in the failing state for such cases.

(It would have been interesting to see where the optixReportIntersection is in that intersection program.)

Hi,

I have been able to narrow the problem down to a reproducer that is as minimal as possible. To reproduce this issue, create a file called tmp.h in the optixPathTracer directory with the contents: pastebin. Prepend the optixPathTracer.h file with: pastebin. Then add the following line to the end of the Params struct in optixPathTracer.h:

s3 *d_s3is;

Now, add the following lines to the top of the launchSubframe method of optixPathTracer.cpp:

CUDA_CHECK(cudaMalloc(
        reinterpret_cast<void **>(&state.params.d_s3is),
        sizeof(s3)));
s3 s3i0;
s3i0.v6.v2 = false;
s3i0.v6.v3 = true;
CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void *>(state.params.d_s3is),
        &s3i0,
        sizeof(s3), cudaMemcpyHostToDevice));

and the following line to the end of the same method:

CUDA_CHECK(cudaFree(reinterpret_cast<void *>(state.params.d_s3is)));

Finally, add the line:

eval();

to the top of the ray generation program (__raygen__rg) in optixPathTracer.cu. In the same file, replace

#include <optix.h>

#include "optixPathTracer.h"
#include "random.h"

#include <sutil/vec_math.h>
#include <cuda/helpers.h>

extern "C" {
__constant__ Params params;
}

by:

#include <optix.h>

#include "tmp.h"
#include "random.h"

#include <sutil/vec_math.h>
#include <cuda/helpers.h>

(note that the params are removed; I define these in tmp.h). If you now run optixPathTracer (I have done so with a CMake “Release” build), the result should be a failed assertion and a resulting crash. If you un-comment the assertion on line 56 of tmp.h, the program should run fine. I have annotated the problem with comments as best I could. If these instructions for reproducing are unclear (or anything else about the problem is), please let me know. I would have uploaded the full sample on pastebin, but I was not sure if distributing the sample code is allowed.

To also respond to your remarks:

Yes I am. I thought this might be a CUDA-related issue, but from what I understand, OptiX 7.3 only works with CUDA 11.1, so I cannot upgrade the CUDA version.

There is no optixReportIntersection in the original intersection program. I just directly perform calculations in the intersection program. That pipeline has no any- or closest-hit programs.

I cannot access those pastebin links due to “This site cannot be reached.” errors. They might be blocked by our IT.
Please attach the files (e.g. as *.zip) directly to the forum itself with the “Upload” button inside the editor toolbar (the icon with the up arrow).

I cannot upgrade the CUDA version.

Although OptiX 7.3.0 is built with CUDA 11.1, the programming interface is PTX source, so any PTX that OptiX can parse will work. It’s just that it sometimes cannot parse unexpected PTX code constructs from newer compilers.
Since the OptiX 7 core implementation lives inside the display driver, newer driver versions might have solved that. (Not sure, I’m still using CUDA 11.1.) There is also CUDA 11.4 already, and the CUDA driver in 471.41 already supports that version.

You could give it a try, but it can also be that the input PTX code doesn’t have much to do with it. The input PTX code gets translated and rewritten to the actual kernel, and the CUDA driver will translate and optimize the kernel to the final microcode. That means there are multiple compilation and code generation steps involved, and any of them could be responsible for what you’ve hit.

You can have arbitrarily many CUDA Toolkits installed on your development machine and select one with the CUDA_PATH environment variable. Do not install the display drivers from the CUDA toolkits. Recently explained here:
https://forums.developer.nvidia.com/t/invalidaddressspace-when-using-pointer-from-continuation-callable-parameters/184951/4
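For example (assuming the default install location on Windows):

set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4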

There is no optixReportIntersection in the original intersection program. I just directly perform calculations in the intersection program. That pipeline has no any- or closest-hit programs.

Ok, I didn’t realize that. I was expecting more standard behavior.

Okay understandable, see attachment:
optixPathTracer.zip (15.3 KB)

I will try upgrading the CUDA Toolkit version to 11.4 and see if that helps, but I am also interested in whether you are able to reproduce this issue.

I don’t get an assert with your code unmodified. Should I if it repros? I’m on CUDA toolkit 11.1, and driver 471.41 on Windows.

Have you tried changing the bools s0.v2 and s0.v3 to ints? I’ve seen cases in the past where bool packing and alignment did not compile correctly. Given that, and the dangers of accidentally mixing bools with std::vector, I generally try to avoid bool in device code for now.

Also, have you tried reordering your structs? Given that they’re not ordered from largest to smallest, perhaps there’s a size or padding mismatch between host and device. See if playing with the order has any effect on whether you can reproduce the issue.


David.

BTW, if the size or padding of bools is the problem, then I think it’s possible that sizeof(s3) could be different on host and device. You might try printing it out on both sides during your repro to check.

You’d want to protect against printing it in every device thread, so do something like this in eval():

    uint3 idx = optixGetLaunchIndex();
    if( idx.x == 0 && idx.y == 0 )
    {
        printf( "DEVICE sizeof(s3)=%llu\n", sizeof( s3 ) );
    }
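And the host-side counterpart (on MSVC x64, %zu prints a size_t):

    printf( "HOST   sizeof(s3)=%zu\n", sizeof( s3 ) );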

I get these sizes on both ends:
s0 : 8 bytes
s1 : 36 bytes
s2 : 104 bytes
s3 : 144 bytes

There are 2 cases of implicit padding in your setup:

s0 has 6 bytes of data and 2 bytes of padding
s3 has 140 bytes of data and 4 bytes of padding: 2 bytes at the end of s3 itself and 2 bytes in the included s0.

If the compiler is getting confused about padding, then it’s in one of those two places. If you use ints instead of bools, it will eliminate the implicit s0 padding. If you put an explicit pad value in s3 after the unsigned chars, or if you move the unsigned chars to the end, or if you use unsigned ints, any of those will eliminate the other implicit padding in s3.
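For example, a sketch of the explicit-padding variant of s0 (same 8-byte size, member names as in your reproducer):

struct s0
{
    float v1;
    bool  v2;
    bool  v3;
    unsigned char pad[2]; // make the implicit tail padding explicit
};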

If you change those and it fixes the error, then that probably confirms this is a compiler bug having to do with either miscalculated padding or relating to bools. If not, then we’ll still need to figure out how to reproduce on our end with your sample.


David.

Yes, you should get an assert with the unmodified code (a print in the console indicating that the assertion has failed and an error in the kernel). I’m on the same CUDA and driver version, and I am also on Windows.

Changing bools s0.v2 and s0.v3 to ints does indeed make the problem go away.

Changing s0.v2 and s0.v3 to chars does not make the problem go away. Nor does it go away when I use chars for both variables and add a 2-char padding at the end of the struct, or a 3-char padding after both variables.

Sorry, what do you mean by this? Do you mean that std::vector<bool> is “space-efficient” and may lead to a differently sized memory allocation than sizeof(bool) * arr_size?

Only rearranging s3 so that the members are ordered from largest to smallest does not make a difference. Nor does only rearranging s2, or rearranging both s2 and s3.

I too get the following sizes on both device and host:
s0 : 8 bytes
s1 : 36 bytes
s2 : 104 bytes
s3 : 144 bytes

Using unsigned ints for v3 and v4 in s3 (and leaving s0 unmodified, so using bools there) has no effect on whether the problem occurs. Using unsigned ints for v3 and v4 in s3 and arranging the members of s2 and s3 from largest to smallest does not make a difference.

Altogether, I find this very confusing. The only thing that seems to help is changing the bools in s0 to ints. However, there is no difference in the sizeof the structs between device and host. I also checked whether sizeof(void *) is the same on both device and host (it is, both 8).

Furthermore, I find it weird that this issue does not occur for you. Maybe the difference is caused by different host compilers? I am using MSVC 19.29.30037.0.

Hi, I installed CUDA 11.4 and this seems to alleviate the issue. Like you suggested, I did not install the display drivers for CUDA 11.4 (disabling everything under “Driver components” in the network installer). When I run the code provided above through the sample framework, it automatically selects 11.4, with which it works fine (as it should). When I disable 11.4 and use 11.1 again, the program crashes as I have described previously.

I am glad that this has solved the problem. I hope that I can safely assume that this issue was fixed in the newest CUDA version (and that the issue is not caused by something unrelated).

As a final question: should I avoid bools in device code in all locations? Or does your recommendation only apply to bools in structs?

Maybe this difference is caused by different host compilers? I am using MSVC 19.29.30037.0

That is possible, I’m on version 19.28.29336 for x64. But I’d speculate it’s more likely either a real case of undefined behavior (like a legitimate mismatch between host and device layouts), or a compiler bug on our side. The last time I saw issues with bools it was a problem with the OptiX compiler specifically, not CUDA in general, but this issue could be entirely different.

Changing s0.v2 and s0.v3 to chars does not make the problem go away. Nor does it go away when I use chars for both variables and add a 2-char padding at the end of the struct, or a 3-char padding after both variables.

This is interesting, especially that char doesn’t work. (Even unsigned char?) It sounds like the padding mismatch is not the issue, so it’s something else specific to bools and/or bytes, but maybe not the size.

There are a couple of other things you could try to triage, if you still want to get to the bottom of the issue. One is to try a short integer type, e.g., uint16_t. The other is to pad the memory around your bools with sentinels, or just examine the entire struct including any padding, and then inspect all of that memory (bools and sentinels/padding) on the device side when it’s failing, to see exactly which bits are getting touched. Your assert check reads specific bits, but it does not detect whether nearby bits are changing when they shouldn’t.
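A rough sketch of the sentinel idea (hypothetical struct and names):

struct s0_guarded
{
    unsigned char guard0[4]; // fill with 0xAA on the host
    float v1;
    bool  v2;
    bool  v3;
    unsigned char guard1[4]; // fill with 0x55 on the host
};

// device side: g is an s0_guarded copied from the host;
// dump every byte, guards and padding included
const unsigned char *p = reinterpret_cast<const unsigned char *>(&g);
for (size_t i = 0; i < sizeof(s0_guarded); ++i) printf("%02x ", p[i]);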

Do you mean that std::vector<bool> is “space-efficient” and may lead to a differently sized memory allocation than sizeof(bool) * arr_size ?

Yes, exactly - unlike with std::vector<int> (or any other element type), you can’t treat a std::vector<bool> as an array of bool, and if you copy one of these from the host, the device side is likely to mismatch.
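A quick illustration of the difference (sketch):

std::vector<int> vi(8, 1);
const int *pi = vi.data(); // fine: contiguous array of int

std::vector<bool> vb(8, true);
// no vb.data(): the elements are packed bits, not an array of bool,
// so there is nothing you can cudaMemcpy to the device directly.
std::vector<unsigned char> bytes(vb.begin(), vb.end()); // copy out first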

As a final question: should I avoid bools in device code in all locations? Or does your recommendation only apply to bools in structs?

Without knowing what the root cause is, I would say just be careful / suspicious with any bools that need to exist on both host and device. Maybe the best way to summarize is any bools that are touched by code generated by more than 1 compiler. You can always safely use bools in pure device code, even in structs, any time the lifetime of the bool is limited to GPU code and doesn’t get copied to or from the host and used by the CPU.


David.

To clarify, I’m also on x64. And no, even unsigned char does not work. short and unsigned short do work. (I could not manage to get the standard fixed-width types such as uint16_t to work in OptiX device code.)

To investigate further, I printed the memory contents of the s0 struct using:

auto *p = reinterpret_cast<unsigned char *>(v);
for (int i = 0; i < sizeof(s0); i++) printf("%02x ", p[i]);

where v holds a pointer to an instance of the struct. The first print is on the host, printing the contents of params.d_s3is->v6:

host                      : 00 00 00 00 00 01 00 00

Which is as expected. I assume the first four bytes hold the float (which is undefined at this point). v2 is probably represented by the fifth byte and v3 by the sixth byte. This still makes sense, as v2 is initialized on the host as false and v3 as true. The second print is on the device at the top of the eval method, printing the contents of params.d_s3is->v6:

device (before assignment): 00 00 00 00 00 01 00 00

which still makes sense. The third print is after the assignment, the assignment being (the same as in my previous post, included for completeness):

s0i1.v1 = (s3i0->v6.v3 && s3i0->v6.v2) ? 0.f : 1.f * fabsf(-1.f);

where s0i1 is initialized as params.d_s3is->v6. So, based on the contents of params.d_s3is->v6, we would expect s0i1.v1 to not be 00 00 00 00 after this assignment. However, the print of s0i1 shows:

device (after  assignment): 00 00 00 00 00 01 00 00

indicating that s0i1.v1 is unchanged. Then, at the top of the loop I print the contents of curr. In the first iteration of the loop, curr points to s0i0; in the second iteration it points to s0i1. The print statements read:

loop                      : 00 00 80 3f 00 00 00 00
loop                      : 00 00 00 00 00 01 00 00

after which the assertion fails (so in the second iteration of the loop). To be complete, the assertion is:

GPU_ASSERT(curr->v1 != 0.f || (curr->v3 && curr->v2));

I hope this information helps. When I have the time, I’ll try to investigate further. Also, thanks for the clarification on host and device bools.

Thanks for digging in further! That’s the analysis I was hoping for, but I’m stymied as to what is going on. I would love to repro this on my end so we can file a bug report and pin down the root cause without asking you to debug more.

Maybe the assignment is being compiled out for some reason… maybe the effects of the assignment are overlapping, or are happening outside your print window… I was suspecting some kind of size or layout mismatch between host and device, because that’s what I remember from the last time I saw an issue with bools, but all your evidence seems to be pointing to something else, I think.

I’m glad the new CUDA version works, and that you also have an alternate workaround using shorts or ints. So don’t worry too much about investigating more. The bug might be fixed already, but just in case there’s still a latent bug that needs fixing, if you do happen to figure out anything else, I’m interested in hearing about it. I will poke at it a little more on my end this week and post if I see anything else.


David.

Did you try to change the alignment in struct s3? Maybe I’m wrong, but I think some elements are not properly aligned in there:

struct s3 {                // offsets in hex
    float3 v0;             // @0000
    unsigned int v1;       // @000C  ok, uint1 aligned on a multiple of 4
    float3 v2;             // @0010  ok, float3 aligned on a multiple of 4
    unsigned char v3;      // @001C  ok, uchar1 aligned on a multiple of 1
    unsigned char v4;      // @001D  ok, uchar1 aligned on a multiple of 1

    // possible solution: insert 2 more chars here to get a valid alignment of a multiple of 8

    // s2 v5; ===>
    // class s2 {
    union {
        const void *v0;    // @001E  !!! 64-bit pointer (ulonglong1) not aligned on a multiple of 8 !
        const void *v1;
    } v0;

    // s1 v1; ===>
    // struct s1 {
    float3 v0;             // @0026  float3 here not aligned on a multiple of 4 !!!!
    float3 v1;             // @0032  and all following float3, float and int are not properly aligned
    float3 v2;             // @003E
    // };

    float3 v2;             // @004A
    float3 v3;             // @0056
    int v4;                // @0062
    float3 v5;             // @0066
    float v6;              // @0072
    float3 v7;             // @0076
    float v8;              // @0082
    // };

    // s0 v6; ===>
    // struct s0 {
    float v1;              // @0086
    bool v2;               // @008A
    bool v3;               // @008B
    // };
                           // @008C = 140
};

For the alignment rules, see: Programming Guide :: CUDA Toolkit Documentation

nolmoonen wrote:
[…]Using unsigned ints for v3 and v4 in s3 (and leaving s0 unmodified, so using bools there) has no effect on whether the problem occurs. Using unsigned ints for v3 and v4 in s3 and arranging the members of s2 and s3 from largest to smallest does not make a difference.[…]

Changing them to uints/ints also would not fulfill the alignment rules:
0x001C + 8 = 0x0024 (= 36), which is a multiple of 4, but not a multiple of 8!
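A sketch of the suggested fix from the layout above (hypothetical, inserting explicit padding so the pointer union starts at offset 0x20):

struct s3 {
    float3        v0;      // @0000
    unsigned int  v1;      // @000C
    float3        v2;      // @0010
    unsigned char v3;      // @001C
    unsigned char v4;      // @001D
    unsigned char pad[2];  // @001E  explicit padding
    s2            v5;      // @0020  pointer union now on a multiple of 8
    s0            v6;
};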