Kernel launches fail occasionally

I am having a very weird problem: my CUDA program fails occasionally. The failure typically occurs the first time I run the program after I clean everything and run make again, and the output is all NaNs. If I then run the program a second time with the same command, I get correct results. I am using a Linux system. Has anyone run into such a problem before? Can I get advice on how to fix this bug? Thanks!

Here is more information. The issue is not limited to the first launch. After more testing, I found that it can show up on any launch.

Can you provide any executable code?

__global__ void extrap_pt_sgl_src(const float wavNum, const vec3f* expPt, const int numExpPt,
        const tri_elem* elem, const int numElem, const vec3f* pt, const cuFloatComplex* p,
        const float strength, const vec3f src, cuFloatComplex* p_exp)
{
    /* extrapolation from surface pressure to multiple points in free space
       wavNum: wave number
       expPt: extrapolation points in free space
       p: surface pressure
       src: location of the source
       p_exp: pressure at the extrapolation points
    */
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if(idx < numExpPt) {
        p_exp[idx] = extrapolation_pt(wavNum,expPt[idx],elem,numElem,pt,p,strength,src);
    }
}

__device__ cuFloatComplex extrapolation_pt(const float wavNum, const vec3f x,
        const tri_elem* elem, const int numElem, const vec3f* pt,
        const cuFloatComplex* p, const float strength, const vec3f src)
{
    /* field extrapolation from the surface to a single point in free space
       x: the single point in free space
       elem: pointer to mesh elements
       pt: pointer to mesh nodes and chief points
       p: surface pressure
       strength: intensity of the source
       src: source location
    */
    cuFloatComplex result = ptSrc(wavNum,strength,src,x);
    cuFloatComplex temp;
    vec3f nod[3];
    cuFloatComplex gCoeff[3], hCoeff[3];
    float cCoeff[3];
    for(int i=0;i<numElem;i++) {
        // gather the three nodes of the i-th element
        for(int j=0;j<3;j++) {
            nod[j] = pt[elem[i].nod[j]];
        }
        g_h_c_nsgl(wavNum,x,nod,gCoeff,hCoeff,cCoeff);
        for(int j=0;j<3;j++) {
            temp = cuCdivf(elem[i].bc[2],elem[i].bc[1]);
            temp = cuCmulf(temp,gCoeff[j]);
            result = cuCsubf(result,temp);
            temp = cuCdivf(elem[i].bc[0],elem[i].bc[1]);
            temp = cuCmulf(temp,gCoeff[j]);
            temp = cuCsubf(hCoeff[j],temp);
            temp = cuCmulf(temp,p[elem[i].nod[j]]);
            result = cuCsubf(result,temp);
        }
    }
    return result;
}

I can’t post executable code, because the code base is very large and depends on many basic data structures. I have posted the kernel and the device function above. One possible problem is the loop in the device function: depending on the problem size, it can run for several thousand iterations, which may go against the philosophy that kernels should be small and efficient. Another possible cause is the large number of local variables in the device function. However, these are only guesses, and I cannot find any documentation that confirms them.

Maybe you can pinpoint the error with

cuda-memcheck <executable>

Can I pass command-line options when running under cuda-memcheck?

Yes, you can.

I just checked memory following your advice. It found 0 errors, but the result was still NaN, which is exactly the problem I described. Basically, my code has no memory errors, often generates correct results, and occasionally generates NaNs, all with the same set of input arguments :)

Use all the cuda-memcheck subtools:

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#using-cuda-memcheck

Test with synccheck, initcheck, and racecheck (racecheck only makes sense if your code uses shared memory).

I managed to find where the issue is. The program fails occasionally with this piece of code:

CUSOLVER_CALL(cusolverDnCgeqrf(cusolverH,numNod+NUMCHIEF,numNod,A_d,numNod+NUMCHIEF,
        tau_d,workspace_d,lwork,deviceInfo_d));
CUDA_CALL(cudaMemcpy(&deviceInfo,deviceInfo_d,sizeof(int),cudaMemcpyDeviceToHost));
if(deviceInfo!=0) {
    printf("QR decomposition failed.\n");
    return EXIT_FAILURE;
}
CUDA_CALL(cudaMemcpy(A,A_d,(numNod+NUMCHIEF)*numNod*sizeof(cuFloatComplex),cudaMemcpyDeviceToHost));
HOST_CALL(CheckNanInMat(A,numNod+NUMCHIEF,numNod,numNod+NUMCHIEF));
CUDA_CALL(cudaDeviceSynchronize());

//B = (Q^H)*B
CUSOLVER_CALL(cusolverDnCunmqr(cusolverH,CUBLAS_SIDE_LEFT,CUBLAS_OP_C,numNod+NUMCHIEF,numSrc,
        numNod,A_d,numNod+NUMCHIEF,tau_d,B_d,numNod+NUMCHIEF,workspace_d,lwork,deviceInfo_d));
CUDA_CALL(cudaMemcpy(&deviceInfo,deviceInfo_d,sizeof(int),cudaMemcpyDeviceToHost));
if(deviceInfo!=0) {
    printf("QR decomposition failed.\n");
    return EXIT_FAILURE;
}
CUDA_CALL(cudaMemcpy(B,B_d,(numNod+NUMCHIEF)*numSrc*sizeof(cuFloatComplex),cudaMemcpyDeviceToHost));
HOST_CALL(CheckNanInMat(B,numNod,numSrc,numNod+NUMCHIEF));

Basically, this piece of code solves a linear system using QR decomposition. The problem OCCASIONALLY occurs when Q^H is applied to the right-hand side of the equation (the cusolverDnCunmqr call). The inputs are the same on every run. Could anyone help me analyze the problem? Thanks!
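
The CheckNanInMat helper used above is not shown in the thread. For readers who want to reproduce this kind of check, here is a minimal sketch in plain C of what such a helper might look like. It assumes column-major storage with a leading dimension, which is what the call sites suggest. The names complexf and count_nan_in_mat are hypothetical stand-ins, not the poster's code, and complexf merely mimics the two-float layout of cuFloatComplex so the sketch compiles without the CUDA headers.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for cuFloatComplex (a pair of floats), so that the
   sketch does not depend on the CUDA headers. */
typedef struct { float x, y; } complexf;

/* Hypothetical helper: counts the entries that have a NaN real or imaginary
   part in the leading numRow x numCol block of a column-major matrix whose
   consecutive columns are lda elements apart. Returns 0 if none is found. */
static size_t count_nan_in_mat(const complexf *mat, int numRow, int numCol, int lda)
{
    size_t count = 0;
    for (int col = 0; col < numCol; col++) {
        for (int row = 0; row < numRow; row++) {
            const complexf v = mat[(size_t)col * lda + row];
            if (isnan(v.x) || isnan(v.y)) {
                count++;
            }
        }
    }
    return count;
}
```

Counting the NaN entries instead of stopping at the first one shows whether the whole matrix is corrupted or only part of it. Running such a check on the host copy right after each solver step, as the code above does with A and B, narrows down which call first produces NaNs.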