# CUDA Newbie - CUDA Returning 0's

This is probably a dumb beginner question. I’m giving CUDA a whirl. It’s my first time GPU programming.

The GPU is the graphics card for the Linux box I’m working on. NVIDIA Corporation GK106 [GeForce GTX 645 OEM] (rev a1)

I have a test program (below) ginned up from examples from the web. Basically I’m trying to break things.

It’s got a number, N, of blocks and each block steps through M iterations of a simple non-linear iteration.

If N and M are small it works like a champ. If I make them too large the results are not right. The result should be anything but 0 and I’m getting all 0’s out.

There doesn’t seem to be a hard limit on N or M (in this example). When N is 4096 and M is 30000000 I get failure. If I dial back M, then N of 4096 is OK. If I dial back N then M of 30000000 is OK.

You’ll also see that I print an error code. It always returns 0, meaning that everything is OK.

I would love some tips on where to get started understanding this behavior. I’m perfectly happy if there are limitations to what I can do, but I would like to understand how I can tell if a computation was successful or not. If I didn’t know a priori that 0 was a nonsense result, I would like a way to tell that the computation failed. This error code isn’t telling me anything, but maybe I’m not looking in the right place.

Is this an artifact of the fact that I’m using my live graphics card simultaneously with this computation? The whole screen freezes when I run my code, which I take to be a good sign that it’s actually using the GPU.

Thanks for any help y’all can give!

```
#include <stdio.h>
#include <stdlib.h>

#define M 30000000
#define N 4096

__global__ void iterate(float *xin, float *xout){

    // for(int i = 0 ; i < 10 ; i++){
    //   xout[blockIdx.x] =  xin[blockIdx.x] * xin[blockIdx.x]  -2.0;
    //   xin[blockIdx.x] = xout[blockIdx.x];
    // }

    //    for(int j = 0 ; j < 100 ; j++){
    for(int i = 0 ; i < M  ; i++){
        xout[blockIdx.x] =  xin[blockIdx.x] * xin[blockIdx.x]  -2.0;
        xin[blockIdx.x] = xout[blockIdx.x];
    }
    //    }

}

int main(void){
    float *xin;
    float *xout;

    float *d_xin;
    float *d_xout;

    float size = N * sizeof(float);

    cudaMalloc((void**)&d_xin,size);
    cudaMalloc((void**)&d_xout,size);

    xin = (float*)malloc(size);
    xout = (float*)malloc(size);

    for(int i = 0 ; i < N ; i++){
        xin[i] = -0.1 + 0.2 * (float)rand()/(float)RAND_MAX;
    }

    cudaMemcpy(d_xin,xin,size,cudaMemcpyHostToDevice);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf( "deviceCount= %d\n", deviceCount );

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop,0);
    printf("totalGlobalMem:%d\n",prop.totalGlobalMem);

    iterate<<<N,1>>>(d_xin,d_xout);
    cudaError_t t = cudaPeekAtLastError();
    printf("ERROR CODE%d - %s\n",t,cudaGetErrorString(t));

    cudaMemcpy(xout,d_xout,size,cudaMemcpyDeviceToHost);

    for(int i = 0 ; i < N ; i++){
        printf("%d %f %f\n",i,xin[i],xout[i]);
    }

    free(xin);
    free(xout);

    cudaFree(d_xin);
    cudaFree(d_xout);

    return 0;

}
```

Your error checking is incomplete. Google “proper CUDA error checking” and take the first hit, then apply that throughout your code. You need two checks after a kernel call (one for launch errors and one for errors that occur while the kernel runs), and every API call returns an error code that should be checked.
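As a minimal sketch of what that advice usually looks like, the program below wraps every runtime call in a checking macro and performs both checks after the kernel launch. The macro name `CUDA_CHECK` and the toy kernel `scale` are made up for illustration; they are not part of the CUDA API or of the code in the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print a readable message
// and exit if a runtime API call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",         \
                    (int)err_, cudaGetErrorString(err_),             \
                    __FILE__, __LINE__);                             \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Toy kernel for illustration: one thread per block doubles one element.
__global__ void scale(float *x) {
    x[blockIdx.x] *= 2.0f;
}

int main(void) {
    const int n = 16;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));

    scale<<<n, 1>>>(d);
    CUDA_CHECK(cudaGetLastError());       // check 1: did the launch itself fail?
    CUDA_CHECK(cudaDeviceSynchronize());  // check 2: did the kernel fail while running?

    CUDA_CHECK(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("h[3] = %f\n", h[3]);

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

Kernel launches are asynchronous, so a check placed immediately after the launch can only report launch problems. A kernel that fails or is killed while running is reported by the synchronizing call or by a later API call such as `cudaMemcpy`, which is why unchecked copies can hand back stale or zeroed data without any visible error.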

Also, you don’t mention how you are compiling this code, but that can be important. The first kernel config parameter (N, in this case) is limited to 65535 unless you compile for a cc3.0 (compute capability 3.0) or higher target. So I suggest compiling like:

```
nvcc -arch=sm_30 ...
```

although that is probably not an issue unless you are making N much larger than 4096.
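To see what the device itself allows, a small sketch like the one below queries the device properties and compares the grid size from the question against the reported limit. The variable name `n_blocks` is an assumption for illustration. Note that `maxGridSize` describes the hardware; the compile target can still impose the lower limit described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const int n_blocks = 4096;  // same value as N in the question

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Compute capability and the largest grid x-dimension the device reports.
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("maxGridSize[0]: %d\n", prop.maxGridSize[0]);

    if (n_blocks > prop.maxGridSize[0]) {
        printf("%d blocks exceeds the device limit\n", n_blocks);
    } else {
        printf("%d blocks is within the device limit\n", n_blocks);
    }
    return 0;
}
```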

You’re using an incorrect printf format specifier here:

```
printf("totalGlobalMem:%d\n",prop.totalGlobalMem);
```
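The `totalGlobalMem` field of `cudaDeviceProp` is a `size_t`, so `%d` (which expects an `int`) does not match it. One way to print it, assuming a compiler and C library that support the C99 `%zu` specifier, is sketched here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // totalGlobalMem is a size_t, so use %zu rather than %d.
    printf("totalGlobalMem: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```

If `%zu` is not available in a given toolchain, casting to `unsigned long long` and printing with `%llu` is a common alternative.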

Finally, this kernel will take a long time as written (probably at least around 30 seconds). Both Windows and Linux can impose a kernel timeout if a kernel takes too long. If you are running the Linux display on this GTX 645 GPU, you are likely hitting the kernel timeout as M and N get larger. You can get a more detailed understanding of this here: