I’ve been debugging a kernel that seems to be hitting a wall on loop nesting. I am launching blocks of 32 threads (small, but I have substantial shared-memory requirements that preclude larger blocks). The compiled cubin reports lmem = 72, smem = 48, reg = 39, and the kernel fails with an unspecified launch failure. Commenting out a nested loop, which changes the cubin profile to lmem = 48, smem = 48, reg = 38, allows the kernel to execute successfully.
for (j = curLoc + 2; j < len; j++) {
    int k, l;
    float radsum, dsq;
    for (k = 0; k < d_nscatoms[curRes.type]; k++) {
        for (l = 0; l < d_nscatoms[res[j].type]; l++) {
            radsum = d_atomrad[curRes.sc[k].type] + d_atomrad[res[j].sc[l].type];
            dsq = distsq(curRes.sc[k].pos, res[j].sc[l].pos);
            if (dsq <= SQR(radsum))
                steric += SQR(radsum) - dsq;
        }
    }
}
This is the offending section. The values prefixed with d_ are stored as static constant arrays; curRes and res are stored in shared memory (and accessed via a macro). Note that a similar calculation (in terms of fields accessed, though not contained within a for loop) later in the main loop does execute properly. Lastly, I know I am not hitting the register-count limit: an earlier revision of my code used 48 registers and 328 bytes of lmem with no problem, under the same configuration settings. Further investigation shows that commenting out the device function distsq within the nested loop also removes the unspecified launch failure, even though this function is used successfully throughout the rest of the kernel…
So, the main questions are: what is the hard register-count limit, so I can ensure my code stays under it, and is there any workaround for nested loops of this sort causing launch failures?
I am launching blocks of 32 threads (32,1,1) in grids of (in this test) (3,16), where the 3 and 16 depend on the data being processed and are generally a multiple of 16. The number of launched blocks does not affect the unspecified launch failure.
I believe I have identified the root problem: indexing into a constant array directly with a value from either shared or global memory (the d_atomrad[res[j].sc[l].type] line, for example). Additionally, there is an issue with calling a device function on a structure parameter stored in shared or global memory (as opposed to local registers).
Restructuring my code to copy any shared- or global-memory values into local registers before using them in device-function calls, or as indexes into constant memory, resolved the launch failures, although part of the problem may also have been that the constant-index lookup sat in a loop conditional…
Sample of failure-free code:

// Workaround for indexing into constant memory:
// hoist the constant-array lookups out of the loop conditions
int nEnd = d_nscatoms[curRes.type];
for (k = 0; k < nEnd; k++) {
    int nEndTwo = d_nscatoms[shared[j].type];
    for (l = 0; l < nEndTwo; l++) {
        radsum = d_atomrad[curRes.sc[k].type] + d_atomrad[shared[j].sc[l].type];
        // Workaround for the bug in fetching structs to function params:
        // tempfloat1/tempfloat2 are local copies of the two positions
        dsq = distsq(tempfloat1, tempfloat2);
        // Branchless form of: if (dsq <= radsqr) steric += radsqr - dsq;
        radsqr = SQR(radsum);
        steric += (dsq <= radsqr) * (radsqr - dsq);
    }
}
As before, shared is an extern __shared__ Residue array, allocated at about 8 KB per block; curRes is in local registers, and all d_-prefixed arrays are in constant storage.