I have a cuda code that I converted from Double Precision calculation into Single Precision calculation. I did this by just changing all possible double call into float for example:

```
cufftDoubleComplex -> cufftComplex
cuMadd -> cuMaddf
```

and so on, it now compiles well without errors and runs completely fine except that it gives me NaN (Not a Number).

I traced that a “kernel call” corrupts the nice array received by the “cu” files.

Here are the code snippets:

somenice.cu

```
extern "C" void gpu_fourwf_( ..., float *mytarget, ....){
...
some code here that allocates gpu mem (works fine in double)
...
printf("see if target is ok %.7f", mytarget); //result is ok 0.4
cuda_return = cudaMemcpy(mytarget,mytarget_gpu,2*(*npwout)*(*ndat)*sizeof(float),cudaMemcpyDeviceToHost);
...error checking here
printf("see if target is ok %.7f", mytarget); //result is NaN
...
}
```

somethingkernel.cu

```
...more codes here
__global__ void kernel_sphere_out(float *cfft,float *cg, int *kg_k,int npw,int ndat,int n1,int n2,int n3,float norm){
int ig,idat,thread_id;
int i1,i2,i3;
thread_id = threadIdx.x + blockIdx.x*blockDim.x;
idat = blockIdx.y;
for(ig=thread_id; ig<npw; ig+=blockDim.x*gridDim.x){
i1=kg_k[ig*3];//kg_k(1,ipw)
i2=kg_k[ig*3 + 1];//kg_k(2,ipw)
i3=kg_k[ig*3 + 2];//kg_k(3,ipw)
if(i1<0)
i1+=n1;
if(i2<0)
i2+=n2;
if(i3<0)
i3+=n3;
cg[2*(ig + npw*idat)] = norm * cfft[2*(i1 + n1*(i2 + n2*(i3+n3*idat)))] ;
cg[1 + 2*(ig + npw*idat)] = norm * cfft[1+ 2*(i1 + n1*(i2 + n2*(i3+n3*idat)))] ;
}
}
extern "C" void kernel_call_initializer(float *cg,float *cfft,int *kg_k,int *npw,int *n1,int *n2,int *n3,int* ndat,cudaStream_t *compute_stream){
dim3 grid,bloc;
int cfft_size=(*n1)*(*n2)*(*n3);
float norme=1.f/cfft_size;
kernel_sphere_out<<<grid,bloc,0,*compute_stream>>>(cfft,cg,kg_k,*npw,*ndat,*n1,*n2,*n3,norme);
}
```

…more codes here

I could not tell what the code does, I am just porting it into single precision, but kernel runs fine with double so there is nothing wrong with the mechanisms of the code. As far as I can tell, I made all necessary changes to the cuda codes so that it will compute in single precision. The kernel code looks fine to me, however, I dont know why this “kernel” call corrupts the `mytarget`

array.

Are there some changes I need to do further in the kernel code to do float operations? Important hints on further debugging the culprit would be helpful

System Info:

Pascal GTX 1060 6G

Fedora Linux Kernel 4.8

Cuda 8.0

Additional info:

Its a big code so I just included only the relevant parts, somebody may ask for more details of the code so here is it

https://github.com/qsnake/abinit/tree/master/src/15_gpu_toolbox

https://github.com/qsnake/abinit/tree/master/src/52_manage_cuda

https://github.com/qsnake/abinit/tree/master/src/incs