'ptxas' died due to signal 11 error!

MutantJohn · October 8, 2014, 2:57am

Good Evening,

So, I’m trying to compile some code and, for some odd reason, using a printf() statement in my kernel causes the compiler to die. I googled my error and apparently this is indicative of a compiler bug, which I find interesting. I’m not sure if it is or not.

But I’m getting this output from nvcc :

make
nvcc -O3 -lstdc++ -rdc=true -gencode arch=compute_50,code=sm_50 -c tetra.cu
nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
nvcc error   : 'ptxas' core dumped
Makefile:18: recipe for target 'tetra.o' failed
make: *** [tetra.o] Error 139

And this is the kernel that is causing it all :

typedef float real;

const int tpb = 256; // threads per block
const int bpg = 512; // blocks per grid

struct point
{
    // Use a union as hybrid storage

    union
    {
        struct
        {
            real x, y, z;
        };

        real p[3];
    };

    __host__ __device__
    point(real a, real b, real c) : x(a), y(b), z(c) { };

    __host__ __device__
    void print(void) const
    {
        printf("(%.00f, %.00f, %.00f)\n", x, y, z);
    };
};

struct tetrahedron
{
    int v[4]; // list of vertices in point buffer

    __host__ __device__
    tetrahedron(int v0, int v1, int v2, int v3)
    {
        v[0] = v0; v[1] = v1; v[2] = v2; v[3] = v3;
    };
};

__global__
void fracture(const int n,
              const unsigned char *location_code,
                    tetrahedron *tetrahedra,
              const int *tet_index,
              const int *insertion_marked,
              const int num_tetrahedra,
              const int *pt_index)
{
    const int thread_num = threadIdx.x + blockIdx.x * blockDim.x;

    for (int tid = thread_num; tid < n; tid += blockDim.x * gridDim.x)
    {
        const unsigned char loc = location_code[tid];

        if (loc != 0)
        {
            // To fracture, we need the tetrahedron for its face data

            const tetrahedron t = tetrahedra[tet_index[tid]];

            const int faces[4][3] = { { t.v[3], t.v[2], t.v[1] },
                                      { t.v[0], t.v[2], t.v[3] },
                                      { t.v[0], t.v[3], t.v[1] },
                                      { t.v[0], t.v[1], t.v[2] }
                                    };

            tetrahedron *address = tetrahedra + tet_index[tid];
            int pos = 0;

            for (int i = 0; i < 4; ++i)
            {
                if (loc & (1 << i))
                {
                    const tetrahedron *tmp = 
                        new(address) tetrahedron(faces[i][0],
                                                 faces[i][1],
                                                 faces[i][2],
                                                 pt_index[tid]);

                    address = tetrahedra + num_tetrahedra
                            + insertion_marked[i] + pos;
                    ++pos;

                    // This is the print that causes the compiler to die.

                    //printf("%d, %d, %d, %d\n", tmp->v[0], tmp->v[1], tmp->v[2], tmp->v[3]);
                }
            }
        }
    }
}

Edit :

Apparently, this code will work though :

if (loc & (1 << i))
                {
                    const tetrahedron *tmp = 
                        new(address) tetrahedron(faces[i][0],
                                                 faces[i][1],
                                                 faces[i][2],
                                                 pt_index[tid]);
                    printf("%lu\n", (unsigned long) tmp);
                    address = tetrahedra + num_tetrahedra
                            + insertion_marked[i] + pos;
                    ++pos;
                    
                    //printf("%d, %d, %d, %d\n", tmp->v[0], tmp->v[1], tmp->v[2], tmp->v[3]);
                }

Output :
30932992
30933008
30933024

Robert_Crovella · October 8, 2014, 3:37am

Can you provide code that is complete? Something that I can copy, paste, compile, and see the issue, without having to add anything or change anything? I tried to compile what you have shown, and I get this:

t590.cu(78): error: function "operator new" cannot be called with the given argument list
argument types are: (unsigned long, tetrahedron *)
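
An error like this most likely means the snippet as posted does not include <new>, the standard header that declares placement new (operator new(std::size_t, void*)). Below is a minimal host-only sketch of the same construct-in-place pattern. It assumes plain C++ so it can run without a GPU, and the helper name construct_at is hypothetical, not from the project. In a .cu file the constructor and helper would also carry __host__ __device__.

```cpp
#include <cassert>
#include <new>      // declares placement new: operator new(std::size_t, void*)

// Host-only stand-in for the thread's struct.
struct tetrahedron
{
    int v[4]; // list of vertices in point buffer

    tetrahedron(int v0, int v1, int v2, int v3)
    {
        v[0] = v0; v[1] = v1; v[2] = v2; v[3] = v3;
    }
};

// Hypothetical helper: constructs a tetrahedron in caller-provided storage
// and returns the pointer that placement new yields (same address as input).
inline tetrahedron *construct_at(void *address, int v0, int v1, int v2, int v3)
{
    assert(address != nullptr);
    return new (address) tetrahedron(v0, v1, v2, v3);
}
```

Without the #include <new> line, the new (address) expression has no matching operator new overload, which is the kind of failure the message above describes.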

MutantJohn · October 8, 2014, 5:30am

Alright, the project can be found on github here : GitHub - cmazakas/regulus_v1.5: Degenerate Delaunay in CUDA

I think this should build with just the “make” command in Linux.

Problem kernel is in tetra.cu at the top.

Robert_Crovella · October 8, 2014, 7:57am

I downloaded GDelShewchukDevice.h, structures.h, predicates.h, and tetra.cu

then I ran:

nvcc -O3 -lstdc++ -rdc=true -gencode arch=compute_50,code=sm_50 -c tetra.cu

and it compiled fine, with no errors or warnings. CUDA 6.5, Fedora 20

Which CUDA version are you using? And on what OS?

Robert_Crovella · October 8, 2014, 1:33pm

OK I got it. Have to uncomment the printf.

MutantJohn · October 8, 2014, 4:13pm

Oh yeah, sorry; I didn’t ship it broken. I wanted to make sure I would only release something that really could compile successfully XD

I have no idea why that printf() kills the compiler. I can print other stuff in that same kernel, I just can’t seem to read from the address. But that should be a run-time error, not a compiler one.

Robert_Crovella · October 8, 2014, 4:20pm

Thanks for reporting the issue. A side effect of commenting out the printf is (when optimized) some dead code removal around the stuff associated with tmp. I believe this is the reason for the change in behavior when you comment/uncomment that printf. However that in no way explains why the ptxas dies.

Separately, I’ve observed that if I compile with compute_35,sm_35 instead of compute_50,sm_50, then the compile is successful.

I’ve reported a bug with NVIDIA. If I learn anything relevant I will report back.

MutantJohn · October 8, 2014, 4:59pm

Hey, thank you!

Also, I would laugh so hard but do you think it might be because the optimizer trims the code away and then I’m trying to print it out?

Like, the code should only appear when I’m printing (or storage of tmp because all I’m really after is the constructor’s effects and not a copy of where all this is happening) so what if the optimizer was trimming the storage while still trying to print it at the same time? I can print other variables. Like, I can print tmp itself.

But I think if I don’t make a direct call to tmp, the compiler trims tmp off and then crashes when it wants to read from an invalid address. The key distinction would be that I’m calling tmp->v, not tmp itself.

Robert_Crovella · October 8, 2014, 5:18pm

I’m not following your logic. The point I was trying to make (which I don’t think is a very important one) is as follows:

The variable tmp is local in scope to the body of this if statement:

if (loc & (1 << i))

When the printf is in the body of that if statement, then the tetrahedron object pointed to by tmp is actually used, by the printf. In this instance, I’m not suggesting there is any dead-code removal, and we do observe the ptxas crash.

When the printf is not in the if statement body, then there is no code anywhere that depends on tmp. In this case, the allocation of tmp and creation of a tetrahedron object at this point is dead code, and can safely be removed by the compiler. And we observe a difference in behavior of ptxas. I’m suggesting that the code that gets removed is somehow (in some perhaps indirect way) the issue (when it is present - crash, when it is not present - no crash), and not likely the actual printf statement itself. This last bit is just speculation.

I’m not suggesting I understand your code fully, but a general rule of thumb is that ptxas should never crash. That, combined with the fact that the compiler/ptxas seems to be happy with sm_35 generation instead of sm_50 generation, suggests to me a ptxas/compiler bug (as opposed to some defect in your code).

I don’t have any further insight at the moment. You are likely a much smarter programmer than I am, and I’m certain that the compiler engineers at NVIDIA are much smarter than I am. If I learn anything relevant, I’ll report back.

MutantJohn · October 8, 2014, 5:56pm

But I was though!

I am most likely wrong though. It was just a fun thought. I have no idea why the compiler dies doing it.

Robert_Crovella · October 8, 2014, 6:40pm

I think maybe you misread. I am saying “we do observe the ptxas crash”.

With the printf: no dead code removal, crash is observed
Without the printf: dead code removal, no crash

njuffa · October 8, 2014, 7:55pm

As a rule of thumb, whenever a compiler component segfaults [as PTXAS did here], or produces an error message that mentions an “internal compiler error”, it is a good idea to go ahead and file a bug via the form on the registered developer website (provided you are using the latest released tool chain).

While such errors may ultimately be caused by invalid or otherwise problematic source code being compiled, abnormal terminations should be brought to the attention of the compiler team, as broken source code should cause a proper error message to be emitted, not crash the compiler.

MutantJohn · October 8, 2014, 8:41pm

But my code is perfect…

(This is a joke post, btw)

njuffa · October 8, 2014, 8:51pm

Even perfect code can trigger compiler bugs that manifest as segfaults :-)

Yes, I understand you’re joking. My point is that it does not matter how perfect or how broken a given piece of CUDA code is: if the compiler segfaults or reports an internal compiler error, this is not an acceptable response by the compiler, and such instances should be reported as bugs to NVIDIA.

MutantJohn · October 9, 2014, 9:36pm

Any updates from the NVIDIA people?

Is there also a link to the filed bug?

Robert_Crovella · October 9, 2014, 9:58pm

The bug is not an external one. You’re welcome to file your own bug, in which case you’ll have some trackability of your own. The way these things go, I really don’t find out much until the next release (CUDA 7) is imminent. If you need an immediate workaround, I would suggest compiling for sm_35. If you specify -arch=sm_35 instead of the more elaborate forms of target specification, your code will also run on any Maxwell GPU, since that form also embeds PTX that the driver can JIT-compile for newer architectures.

The bug I filed is 1562955 if you want to refer to it in correspondence. But you won’t be able to access it directly yourself using that number.

MutantJohn · October 9, 2014, 9:58pm

Eh, I’ll just trust my code without testing…

I know, it sounds crazy but believe me, I’ll find out really, really quickly if my data is bad. Plus, I still have the use of assert()'s so that’s all I really need. Printing data is just a relic of me being bad at debuggers.
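
For the assert()-based checking mentioned above, a minimal sketch might look like the following. The helper names vertices_in_range and check_tetrahedron and the num_points parameter are hypothetical, not from the project. It is written as host-only C++ so it runs without a GPU. In a .cu file the functions would need __host__ __device__ qualifiers, and device-side assert() requires compute capability 2.0 or later.

```cpp
#include <cassert>

// Host-only stand-in for the thread's struct (constructor omitted for brevity).
struct tetrahedron
{
    int v[4]; // list of vertices in point buffer
};

// Hypothetical check: every vertex index must fall inside the point buffer,
// i.e. in the half-open range [0, num_points).
inline bool vertices_in_range(const tetrahedron &t, int num_points)
{
    for (int i = 0; i < 4; ++i)
    {
        if (t.v[i] < 0 || t.v[i] >= num_points)
            return false;
    }
    return true;
}

// Aborts (when NDEBUG is not defined) if the tetrahedron refers to a
// point that does not exist.
inline void check_tetrahedron(const tetrahedron &t, int num_points)
{
    assert(vertices_in_range(t, num_points));
}
```

Calling check_tetrahedron right after each tetrahedron is constructed would catch bad indices early without printing anything, at the cost of the check being compiled out when NDEBUG is defined.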
