CUDA kernel corrupts array on Pascal (compute capability 6.1)

I have CUDA code that I converted from double-precision to single-precision computation. I did this by simply changing every double-precision call to its float equivalent, for example:

cufftDoubleComplex -> cufftComplex
cuMadd -> cuMaddf

and so on. The code now compiles without errors and runs to completion, except that it gives me NaN (Not a Number) results.
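For illustration, the substitutions were of this mechanical kind (the variables below are only an example, not lines from the actual code):

    // before (double precision)
    cufftDoubleComplex z = cuCmul(a, b);
    double scale = 1.0 / cfft_size;

    // after (single precision)
    cufftComplex z = cuCmulf(a, b);
    float scale = 1.0f / cfft_size;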

I traced the problem to a kernel call that corrupts an array which is still intact when the .cu files receive it.

Here are the code snippets:

somenice.cu

extern "C" void gpu_fourwf_( ..., float *mytarget, ....){
        ...
        some code here that allocates gpu mem (works fine in double)
        ...

        printf("see if target is ok %.7f", mytarget); //result is ok 0.4

        cuda_return = cudaMemcpy(mytarget,mytarget_gpu,2*(*npwout)*(*ndat)*sizeof(float),cudaMemcpyDeviceToHost);

        ...error checking here

        printf("see if target is ok %.7f", mytarget); //result is NaN

        ...
}
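The elided error checking is essentially a return-code check on cuda_return; a simplified sketch of that pattern (not the literal code), plus an explicit check after the kernel launch (launch errors are asynchronous and only surface at the next API call), would look like this:

    // simplified sketch, not the literal code
    if (cuda_return != cudaSuccess)
        printf("cudaMemcpy error: %s\n", cudaGetErrorString(cuda_return));

    // kernel launch failures are reported asynchronously, so check after the launch too
    cudaError_t sync_err = cudaDeviceSynchronize();
    if (sync_err != cudaSuccess)
        printf("kernel error: %s\n", cudaGetErrorString(sync_err));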

somethingkernel.cu

...more codes here

    __global__ void kernel_sphere_out(float *cfft, float *cg, int *kg_k, int npw, int ndat, int n1, int n2, int n3, float norm){

        int ig, idat, thread_id;
        int i1, i2, i3;

        thread_id = threadIdx.x + blockIdx.x*blockDim.x;
        idat = blockIdx.y;

        for(ig = thread_id; ig < npw; ig += blockDim.x*gridDim.x){
            i1 = kg_k[ig*3];     //kg_k(1,ipw)
            i2 = kg_k[ig*3 + 1]; //kg_k(2,ipw)
            i3 = kg_k[ig*3 + 2]; //kg_k(3,ipw)

            if(i1 < 0)
                i1 += n1;
            if(i2 < 0)
                i2 += n2;
            if(i3 < 0)
                i3 += n3;

            cg[2*(ig + npw*idat)]     = norm * cfft[2*(i1 + n1*(i2 + n2*(i3 + n3*idat)))];
            cg[1 + 2*(ig + npw*idat)] = norm * cfft[1 + 2*(i1 + n1*(i2 + n2*(i3 + n3*idat)))];
        }
    }

extern "C" void kernel_call_initializer(float *cg, float *cfft, int *kg_k, int *npw, int *n1, int *n2, int *n3, int *ndat, cudaStream_t *compute_stream){

        dim3 grid, bloc;
        int cfft_size = (*n1)*(*n2)*(*n3);
        float norme = 1.f/cfft_size;

        kernel_sphere_out<<<grid, bloc, 0, *compute_stream>>>(cfft, cg, kg_k, *npw, *ndat, *n1, *n2, *n3, norme);
}

…more codes here
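One note on the snippet above: grid and bloc are shown without being assigned, so as posted they would keep the dim3 defaults of (1,1,1); the actual launch configuration is presumably set in the full source and just not shown here. A typical setup for this kind of kernel would look roughly like the following sketch (the block size of 256 is only a placeholder, not the real value):

    // illustrative sketch only -- the real configuration is in the full source
    bloc.x = 256;                            // placeholder block size
    grid.x = (*npw + bloc.x - 1) / bloc.x;   // enough blocks in x to cover all npw plane waves
    grid.y = *ndat;                          // one block row per dataset, matching blockIdx.y in the kernel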

I cannot tell exactly what the code does; I am just porting it to single precision. The kernel runs fine in double, so there is nothing wrong with the mechanics of the code. As far as I can tell, I made all the changes necessary for the CUDA code to compute in single precision. The kernel code looks fine to me, but I don't know why this kernel call corrupts the mytarget array.

Are there further changes I need to make in the kernel code for it to work with float operations? Any hints on how to track down the culprit would be appreciated.

System info:
Pascal GTX 1060, 6 GB
Fedora, Linux kernel 4.8
CUDA 8.0

Additional info:
It's a large code base, so I included only the relevant parts. In case more details of the code are needed, the full source is here:

https://github.com/qsnake/abinit/tree/master/src/15_gpu_toolbox
https://github.com/qsnake/abinit/tree/master/src/52_manage_cuda
https://github.com/qsnake/abinit/tree/master/src/incs

Anyone? It seems the corruption only happens after the cudaMemcpy. I checked the kernel output of all variables using printf from within the kernel and nothing gives NaN there; the float kernel gives the same printf output as the double kernel, but the cudaMemcpy of the float results gives a corrupted array. Does anybody have any idea?
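To narrow it down further, one test would be to copy the device data into a fresh scratch buffer instead of mytarget and check the cudaMemcpy return code directly, along these lines (nbytes and scratch are just illustrative names, not from the actual code):

    // debugging sketch only -- names are illustrative
    size_t nbytes = 2 * (*npwout) * (*ndat) * sizeof(float);
    float *scratch = (float*) malloc(nbytes);

    cudaError_t err = cudaMemcpy(scratch, mytarget_gpu, nbytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));

    printf("first values from device: %.7f %.7f\n", scratch[0], scratch[1]);
    free(scratch);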

I would try to track down the origin of the NaNs. Assuming they are the result of mathematical operations, rather than picked up via out-of-bounds access, there are a limited number of operations that can create them, such as +INF + (-INF), INF/INF, 0/0, and 0*INF.
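For example (just a sketch, reusing the variable names from the kernel you posted), the spot where a suspect value is produced can be instrumented with device-side printf:

    // illustrative instrumentation only -- place it wherever a suspect value is computed
    float v = norm * cfft[2*(i1 + n1*(i2 + n2*(i3 + n3*idat)))];
    if (isnan(v) || isinf(v))
        printf("bad value %f at ig=%d idat=%d (i1=%d i2=%d i3=%d)\n", v, ig, idat, i1, i2, i3);
    cg[2*(ig + npw*idat)] = v;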

Note that this could mean you have to trace back further to the origin of the generating INF or 0. Maybe these were created elsewhere by overflow or underflow of the single-precision format, where the wider double-precision format previously used could hold these values without overflow or underflow.
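As a small, self-contained illustration of the range issue (plain host code, nothing specific to your program): the largest finite float is about 3.4e38, so an intermediate value that is harmless in double can overflow to INF in float and then turn into NaN further downstream:

    #include <cstdio>

    int main(void){
        double d = 1.0e30  * 1.0e30;    // 1e60: well within double range (max ~1.8e308)
        float  f = 1.0e30f * 1.0e30f;   // overflows float (max ~3.4e38) -> +INF
        float  n = f - f;               // INF - INF -> NaN
        printf("%g %g %g\n", d, f, n);  // prints: 1e+60 inf nan
        return 0;
    }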

In general, mechanically converting code from double precision to single precision seems iffy to me. Besides numeric range issues, there could be accuracy issues, e.g. failure to converge to a given error bound that was appropriate for double precision but is not suitable for single precision.
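And as an illustration of the accuracy side (again generic host code): float machine epsilon is about 1.2e-7, so a convergence tolerance of, say, 1e-12 that is perfectly reasonable for double simply cannot be resolved in float:

    #include <cstdio>

    int main(void){
        float a = 1.0f;
        float b = a + 1.0e-12f;   // the update is far below float epsilon (~1.19e-7)
        printf("%d\n", a == b);   // prints 1: the increment is lost, so a residual
                                  // test against a 1e-12 bound can never be satisfied
        return 0;
    }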