Thanks to @njuffa 's list of possible problems, as well as thorough explanations, I think I have tracked the issue down to a rounding error in the floating point computation. As noted, in problematic iterations of the for loop, ishift
would become a very large number.
The tipoff was this statement:
It seems to me that i - colcnt * ngy_d could occasionally return a negative result due to rounding in the floating-point computation. A negative result would turn into a large positive value when assigned to an unsigned integer type.
Indeed, ishift
in all cases was very close to 18,446,744,073,709,551,615, or the upper limit of an unsigned long int
on Linux platforms (64 bits). Tracking back from ishift
, the problem began with the calculation of colcnt
and was propagated through to rowcnt
, then yshift
and ishift
. Based on possible values for i
and ngy_d
, colcnt
should never fall below zero. Despite that, I think a casting error from float
to unsigned long int
must be responsible for it going negative, resulting in it wrapping around to a very large integer value.
To remedy this, I changed the colcnt
and rowcnt
calculations to:
unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));
These changes alone fixed the error, but I followed the advice to use unsigned int
and long long int
instead of unsigned long int
. Here is the reworked code:
__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
unsigned long long int n = static_cast<unsigned int>(ngy_d) * static_cast<unsigned int>(ngx_d);
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (unsigned long long int i = index; i < n; i += stride)
{
unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));
// shift function
unsigned int xshift, yshift;
if(colcnt < floor(ngx_d / 2))
{
xshift = colcnt + static_cast<unsigned int>(ceil(ngx_d / 2));
}
else
{
xshift = colcnt - static_cast<unsigned int>(floor(ngx_d / 2));
}
if(rowcnt < floor(ngy_d / 2))
{
yshift = rowcnt + static_cast<unsigned int>(ceil(ngy_d / 2));
}
else
{
yshift = rowcnt - static_cast<unsigned int>(floor(ngy_d / 2));
}
unsigned int ishift = xshift * static_cast<unsigned int>(ngy_d) + yshift;
float realtemp = grdCO_snc_d[i].x;
float imagtemp = grdCO_snc_d[i].y;
grdCO_d[ishift].x = realtemp;
grdCO_d[ishift].y = imagtemp;
realtemp = grd_snc_d[i].x;
imagtemp = grd_snc_d[i].y;
grd_d[ishift].x = realtemp;
grd_d[ishift].y = imagtemp;
}
}
And just to be thorough,
Quick sanity check: You are not compiling this code with -prec-div=false or -use_fast_math, correct?
That’s correct, I do not compile with either of those options.
A quick (but not conclusive) check on a potential code generation issue would be to lower the ptxas (compiler backend) optimization level. The nvcc default is -Xptxas -O3. You can try lowering that to -Xptxas -O2 etc. down to -O0. If the problem goes away as the optimization level is lowered, that could hint at a code generation issue, but just as with an inserted function call it could also just sufficiently modify the instruction sequence to mask some other issue.
I did not try changing the ptxas optimization level, but this is good to know as I continue to debug this code.
Thank you again for all your help. You’ve helped me learn a lot in the process.