Crashes apparently related to code size

I am attempting to implement a fairly complicated calculation, involving the inverse of a 5x5 matrix with complex elements. I calculate the determinant of the matrix by hand, with a large number of lines like this:

det += n15*n24*n33*n42*n51;

where nIJ is the (I, J)th element of the matrix. (It’s not my fault; I inherited this code from someone else and am porting it to CUDA.) However, the program crashes if I include too many of these lines. It is not any particular line that causes the problem - if I comment out the first three, I can add three more at the end without causing a crash. So it appears that this crash is related to some sort of memory issue with spill loads or stores. Does anyone have experience with nvcc getting confused, or running out of “spill memory”, if the code uses too many local variables? If so, how did you fix it?

I’ve tried reorganising the code to use an accumulator variable instead of the quintuple multiplication above, so that the line looks like this:

devcomplex<double> acc(0, 0);
acc  = n15;
acc *= n24;
acc *= n33;
acc *= n42;
acc *= n51;
det += acc;

where ‘devcomplex’ is my complex-number type; this allows me to include more lines, but not enough for the whole calculation. Does anyone have advice along these lines for reducing the memory footprint of the calculation?

I’ve reproduced this problem with CUDA 4.2 and 5.0, on C2050 and C2070.

My next approach will be to see if I can outsource the matrix inversion to cuBLAS. Any other ideas are also welcome.

What does NVCC report when the “-Xptxas=-v” option is enabled?

An immense series of mults and adds, like so:

add.f64         %fd289, %fd286, %fd288;
	.loc 34 622 1
	mul.f64         %fd290, %fd174, %fd182;
        .loc 5 2416 3
        div.rn.f64      %fd291, %fd290, %fd155;
        .loc 18 67 1
	add.f64         %fd292, %fd289, %fd291;
	.loc 34 622 1
        mul.f64         %fd293, %fd178, %fd178;
	.loc 5 2416 3
	div.rn.f64      %fd294, %fd293, %fd126;
        .loc 18 67 1
	add.f64         %fd295, %fd294, 0d0000000000000000;
        .loc 34 622 1
	mul.f64         %fd296, %fd179, %fd179;
        .loc 5 2416 3
	div.rn.f64      %fd297, %fd296, %fd134;
        .loc 18 67 1
        add.f64         %fd298, %fd295, %fd297;
	.loc 34 622 1
        mul.f64         %fd299, %fd180, %fd180;

This looks to me like the inlined addition and multiplication functions of the complex type. I must confess I’m a bit discouraged at the thought of hunting through multiple pages of this to search for a problem, unless that 0d000… is significant?

Why not try the “batched solver” package from the registered developer website? That includes optimized code to invert small matrices. The source comes with a BSD license so you should be able to incorporate it into just about any type of project.

The “0d0000000000000000” in the PTX code above is the hexadecimal representation of 0.0. Note that IEEE-754 semantics disallow the transformation x + 0 -> x, so adding 0.0 is not advised.

Of what nature is the “crash”? Is it a compile time issue or a run-time issue? If it is the former, does the compiler emit an error messages, segfault, … ? If it is the latter what status is returned by the kernel launch and surrounding CUDA APIs? Do youe get a time-ount, unspecified launch failure, … ?

The addition of zero seems odd. Perhaps it is the first time something initialised to zero is used, and the compiler just took the static zero instead of a register? But then you have to wonder why it does the addition at all. I find this confusing.

The problem occurs at runtime. I am using the Thrust library, so I’m getting a somewhat un-informative thrust::system_error. I’ll see if I can extract the function from my framework and make a minimal program that shows the error, with raw CUDA kernel launches.

I will try the batched solver, as you suggest.

Sorry, I am not familiar with Thrust. If this problem occurs with the latest shipping CUDA version, i.e. CUDA 5.5, consider filing a bug via the bug reporting form linked from the registered developer website.

As long as your batch sizes are sufficiently large, you should get excellent performance from the batched matrix inverse, as for very small matrices each matrix is inverted by a single thread with all of the intermediate data kept in registers.

Back to this, and somewhat confused. I stripped the offending code out of my framework and put it in a kernel, which no longer crashes. It does however produce strange results when run with many threads. In particular, for one thread it produces sensible output:

i[0][0] : (0.736739, 0.016569) 
det: (1.435338, -2.694346)
Result: 0 0

but for 513 threads it does something strange:

Result: -1.45682e+144 -1.45682e+144

I find this 513 very suspicious, since for 512 it gives the same output as for 1, and 1024 is the maximum number of threads per block. Looks like some issue of shared (ie spill) memory running out, perhaps?

I’ve attached the offending code.
kmat.tar (30 KB)