'ptxas' died due to signal 11 error!

Good Evening,

So, I’m trying to compile some code and, for some odd reason, using a printf() statement in my kernel causes the compiler to die. I googled my error and apparently this is indicative of a compiler bug, which I find interesting. I’m not sure if it is or not.

But I’m getting this output from nvcc :

make
nvcc -O3 -lstdc++ -rdc=true -gencode arch=compute_50,code=sm_50 -c tetra.cu
nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
nvcc error   : 'ptxas' core dumped
Makefile:18: recipe for target 'tetra.o' failed
make: *** [tetra.o] Error 139

And this is the kernel that is causing it all :

typedef float real;

const int tpb = 256; // threads per block
const int bpg = 512; // blocks per grid

struct point
{
    // Use a union as hybrid storage

    union
    {
        struct
        {
            real x, y, z;
        };

        real p[3];
    };

    __host__ __device__
    point(real a, real b, real c) : x(a), y(b), z(c) { };

    __host__ __device__
    void print(void) const
    {
        printf("(%.00f, %.00f, %.00f)\n", x, y, z);
    };
};

struct tetrahedron
{
    int v[4]; // list of vertices in point buffer

    __host__ __device__
    tetrahedron(int v0, int v1, int v2, int v3)
    {
        v[0] = v0; v[1] = v1; v[2] = v2; v[3] = v3;
    };
};

__global__
void fracture(const int n,
              const unsigned char *location_code,
                    tetrahedron *tetrahedra,
              const int *tet_index,
              const int *insertion_marked,
              const int num_tetrahedra,
              const int *pt_index)
{
    const int thread_num = threadIdx.x + blockIdx.x * blockDim.x;

    for (int tid = thread_num; tid < n; tid += blockDim.x * gridDim.x)
    {
        const unsigned char loc = location_code[tid];

        if (loc != 0)
        {
            // To fracture, we need the tetrahedron for its face data

            const tetrahedron t = tetrahedra[tet_index[tid]];

            const int faces[4][3] = { { t.v[3], t.v[2], t.v[1] },
                                      { t.v[0], t.v[2], t.v[3] },
                                      { t.v[0], t.v[3], t.v[1] },
                                      { t.v[0], t.v[1], t.v[2] }
                                    };

            tetrahedron *address = tetrahedra + tet_index[tid];
            int pos = 0;

            for (int i = 0; i < 4; ++i)
            {
                if (loc & (1 << i))
                {
                    const tetrahedron *tmp = 
                        new(address) tetrahedron(faces[i][0],
                                                 faces[i][1],
                                                 faces[i][2],
                                                 pt_index[tid]);

                    address = tetrahedra + num_tetrahedra
                            + insertion_marked[i] + pos;
                    ++pos;

                    // This is the print that causes the compiler to die.

                    //printf("%d, %d, %d, %d\n", tmp->v[0], tmp->v[1], tmp->v[2], tmp->v[3]);
                }
            }
        }
    }
}

Edit :

Apparently, this code will work though :

if (loc & (1 << i))
                {
                    const tetrahedron *tmp = 
                        new(address) tetrahedron(faces[i][0],
                                                 faces[i][1],
                                                 faces[i][2],
                                                 pt_index[tid]);
                    printf("%u\n", (unsigned long ) tmp);
                    address = tetrahedra + num_tetrahedra
                            + insertion_marked[i] + pos;
                    ++pos;
                    
                    //printf("%d, %d, %d, %d\n", tmp->v[0], tmp->v[1], tmp->v[2], tmp->v[3]);
                }

Output :
30932992
30933008
30933024

Can you provide a code that is complete? Something that I can copy, paste, compile, and see the issue, without having to add anything or change anything? I tried to compile what you have shown, and I get this:

t590.cu(78): error: function "operator new" cannot be called with the given argument list
argument types are: (unsigned long, tetrahedron *)
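For anyone reproducing this from the snippet alone: an error of this shape is what nvcc typically reports when placement new is used without including <new>, which the full project presumably pulls in through its headers. The construct from the kernel can be sketched in host-only C++ as below. This is a minimal sketch, not the project's code: construct_at is a hypothetical helper, and the struct is trimmed of its __host__ __device__ qualifiers so that it builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <new>      // declares placement operator new(std::size_t, void*)

struct tetrahedron
{
    int v[4]; // list of vertices in point buffer

    tetrahedron(int v0, int v1, int v2, int v3)
    {
        v[0] = v0; v[1] = v1; v[2] = v2; v[3] = v3;
    }
};

// Hypothetical helper (not in the original project): constructs a
// tetrahedron in already-allocated storage, as the kernel does with
// new(address) tetrahedron(...), and returns the resulting pointer.
inline const tetrahedron *construct_at(tetrahedron *address,
                                       int v0, int v1, int v2, int v3)
{
    const tetrahedron *tmp = new(address) tetrahedron(v0, v1, v2, v3);

    // Placement new allocates nothing; it returns the address it was given.
    assert(tmp == address);
    return tmp;
}
```

Because placement new hands back the same address it receives, tmp in the kernel is just another name for address at that point, so reading tmp->v[0] is reading the object that was just constructed in the tetrahedra buffer.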

Alright, the project can be found on github here : https://github.com/LeonineKing1199/regulus_v1.5

I think this should build with just the “make” command in Linux.

Problem kernel is in tetra.cu at the top.

I downloaded GDelShewchukDevice.h, structures.h, predicates.h, and tetra.cu

then I ran:

nvcc -O3 -lstdc++ -rdc=true -gencode arch=compute_50,code=sm_50 -c tetra.cu

and it compiled fine, with no errors or warnings. CUDA 6.5, Fedora 20

Which CUDA version are you using? And on what OS?

OK I got it. Have to uncomment the printf.

Oh yeah, sorry; I didn’t ship it broken. I wanted to make sure I would only release something that really could compile successfully XD

I have no idea why that printf() kills the compiler. I can print other stuff in that same kernel, I just can’t seem to read from the address. But that should be a run-time error, not a compiler one.

Thanks for reporting the issue. A side effect of commenting out the printf is that, in an optimized build, some of the code associated with tmp becomes dead and is removed. I believe this is the reason for the change in behavior when you comment/uncomment that printf. However, that in no way explains why ptxas dies.

Separately, I’ve observed that if I compile with compute_35,sm_35 instead of compute_50,sm_50, then the compile is successful.

I’ve reported a bug with NVIDIA. If I learn anything relevant I will report back.

Hey, thank you!

Also, I would laugh so hard but do you think it might be because the optimizer trims the code away and then I’m trying to print it out?

Like, the code should only appear when I’m printing (or storing tmp, because all I’m really after is the constructor’s effects and not a copy of where all this is happening), so what if the optimizer was trimming the storage while still trying to print it at the same time? I can print other variables. Like, I can print tmp itself.

But I think if I don’t use tmp directly, the compiler trims tmp off and then crashes when it wants to read from an invalid address. That I’m reading tmp->v, not tmp itself, would be the key distinction.

I’m not following your logic. The point I was trying to make (which I don’t think is a very important one) is as follows:

The variable tmp is local in scope to the body of this if statement:

if (loc & (1 << i))

When the printf is in the body of that if statement, then the tetrahedron object pointed to by tmp is actually used, by the printf. In this instance, I’m not suggesting there is any dead-code removal, and we do observe the ptxas crash.

When the printf is not in the if statement body, then there is no code anywhere that depends on tmp. In this case, the allocation of tmp and creation of a tetrahedron object at this point is dead code, and can safely be removed by the compiler. And we observe a difference in behavior of ptxas. I’m suggesting that the code that gets removed is somehow (in some perhaps indirect way) the issue (when it is present - crash, when it is not present - no crash), and not likely the actual printf statement itself. This last bit is just speculation.

I’m not suggesting I understand your code fully, but a general rule of thumb is that ptxas should never crash. That, combined with the fact that the compiler/ptxas seems to be happy with sm_35 generation instead of sm_50 generation, suggests to me a ptxas/compiler bug (as opposed to some defect in your code).

I don’t have any further insight at the moment. You are likely a much smarter programmer than I am, and I’m certain that the compiler engineers at NVIDIA are much smarter than I am. If I learn anything relevant, I’ll report back.

But I was though!

I am most likely wrong though. It was just a fun thought. I have no idea why the compiler dies doing it.

I think maybe you misread. I am saying “we do observe the ptxas crash”.

With the printf: no dead code removal, crash is observed
Without the printf: dead code removal, no crash

As a rule of thumb, whenever a compiler component segfaults [as PTXAS did here], or produces an error message that mentions an “internal compiler error”, it is a good idea to go ahead and file a bug via the form on the registered developer website (provided you are using the latest released tool chain).

While such errors may ultimately be caused by invalid or otherwise problematic source code being compiled, abnormal terminations should be brought to the attention of the compiler team, as broken source code should cause a proper error message to be emitted, not crash the compiler.

But my code is perfect…

(This is a joke post, btw)

Even perfect code can trigger compiler bugs that manifest as segfaults :-)

Yes, I understand you’re joking. My point is, it does not matter how perfect or how broken a given piece of CUDA code is, if the compiler segfaults or reports an internal compiler error, this is not an acceptable response by the compiler, and such instances should be reported as bugs to NVIDIA.

Any updates from the NVIDIA people?

Is there also a link to the filed bug?

The bug is not an external one. You’re welcome to file your own bug, in which case you’ll have some trackability of your own. The way these things go, I really don’t find out much until the next release (CUDA 7) is imminent. If you need an immediate workaround, I would suggest compiling for sm_35. If you specify -arch=sm_35 instead of the more elaborate forms of target specification, the binary also embeds compute_35 PTX that the driver can JIT-compile, so your code will run on any Maxwell GPU as well.

The bug I filed is 1562955 if you want to refer to it in correspondence. But you won’t be able to access it directly yourself using that number.

Eh, I’ll just trust my code without testing…

I know, it sounds crazy but believe me, I’ll find out really, really quickly if my data is bad. Plus, I still have the use of assert()'s so that’s all I really need. Printing data is just a relic of me being bad at debuggers.