Loop limit in CUDA kernel? Too large loop => loop not launched

Hello,

I am wondering whether my GPU is being lazy: I have to run a pretty big loop in my kernel, and even if I split it into several smaller loops, it just does not run them.

float dt = 0.005f;

...

__global__ void kernel(float *array, ...) {

    float it;
    float result;   // renamed from "new", which is a reserved keyword in C++
    ...

    for (it = 5000; it < 50000; it += dt)
        result = do_stuff(...);   // pseudocode for the per-iteration work

    array[idx] = result;   // retrieve stuff

}

When I cudaMemcpy the results back to host RAM, nothing has changed between my host initialization and the result. But the kernel does take real time to run, even if the CUDA timer says otherwise. I checked every small detail (memcpys, initializations, ...). I also retrieved my loop index, and it does NOT change after the loop.

Whereas if I change 50000 to 8000, the loop runs.

I hope this is not too simple a question, but is there a limit on loops in CUDA? What am I doing wrong?

Thanks

Yes, there is a limit for loops in CUDA. It is the same as in C: you cannot loop over more iterations than the loop counter has distinct states. Once the counter grows large enough that dt is less than half the spacing between adjacent single-precision floats, it += dt rounds back to the same value and the loop stops advancing.
Increase dt, or switch to double precision.
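
You can see the mechanism on the host. A minimal sketch (plain C, hypothetical values): with dt = 0.005f, the spacing between adjacent floats reaches 0.015625 at 2^17 = 131072, so the increment rounds away entirely at that point.

#include <stdio.h>

int main(void)
{
    /* At 131072 (2^17) adjacent floats are 0.015625 apart, so adding
       0.005 rounds back to the starting value: a loop driven by this
       counter would never make progress past here. */
    float it = 131072.0f;
    float next = it + 0.005f;
    printf("it = %.6f, it + dt = %.6f, advanced: %s\n",
           it, next, next > it ? "yes" : "NO");
    return 0;
}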

Actually, the loop does not cause trouble in C. And as I said, splitting my loop into 2 different ones does not do the trick, and double precision doesn't help either (I have a GTX 470 and I am compiling with -arch=sm_20).

Ok, so you may have a second problem. Do you check return codes?

Yes, but no error occurs.

Could you post the code that checks the error code?

I wonder if it is a watchdog problem. Is your GTX 470 connected to a display?

Taken from Dr. Dobb's site.

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
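
Note that kernel launches are asynchronous, so a kernel killed by the watchdog only reports its error after a synchronization point. A usage sketch (blocks, threads, and d_array are placeholder names):

kernel<<<blocks, threads>>>(d_array /*, ... */);
cudaThreadSynchronize();     /* wait for the kernel to finish (or be killed) */
checkCUDAError("kernel");    /* now cudaGetLastError() reflects the launch */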

Yes I do. I also time my whole program with the Unix "time" command, and it takes more time when I call my kernel than when I don't.

You are right, my program works (my results are wrong, but maybe that is another story) when I stop GNOME. Do you know what causes that?

If you mean that you have stopped the X server entirely, then you are probably seeing the effect of the watchdog timer. To keep the GUI responsive, the driver will terminate any kernel that takes longer than ~5 seconds. If no GUI is running, then there is no time limit. When the kernel is terminated this way, you should get some error return value from the next CUDA call.
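
If you need to keep X running, one common workaround is to split the big loop across several short kernel launches, so that each launch stays well under the ~5 second limit. A sketch, assuming a hypothetical kernel_chunk that takes start/stop bounds (blocks, threads, d_array are placeholders):

float chunk = 5000.0f;   /* range of iterations per launch; tune as needed */
for (float start = 5000.0f; start < 50000.0f; start += chunk) {
    float stop = (start + chunk < 50000.0f) ? start + chunk : 50000.0f;
    kernel_chunk<<<blocks, threads>>>(d_array, start, stop, dt);
    cudaThreadSynchronize();          /* return control to the driver */
    checkCUDAError("kernel_chunk");
}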

I have the same exact problem but I don’t think it’s a watchdog issue. My code is structured very similarly to the example code:

__global__ void advanceSystem(float *cx, unsigned long int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int i;
    float temp;
    // here I declare other stuff that I need on this thread

    unsigned int lrng_state;

    temp = cx[idx];

    i = 1;
    while (i <= N) {
        temp = do_stuff(...);   // pseudocode for the per-step work
        i++;
    }

    cx[idx] = temp;
}

I have even tried saving N in constant memory with cudaMemcpy instead of passing it as an argument to the kernel, and I get the same issues. I managed to run the kernel with N=10000 most of the time, but sometimes it also works with N=1000000. Sometimes (more rarely) it fails even with N=10000.
If it is of any importance: in the instances where it fails even for N=10000, I am calling the kernel in a loop from my C++ code:

while (k < something_big) {
    advanceSystem<<<num_threads / 64, 64>>>(cx, N);
    cudaThreadSynchronize();
    k++;
}
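
(For what it's worth, a checked variant of this loop, reusing the checkCUDAError helper posted above, would catch an intermittent failure on the exact iteration where it happens:)

while (k < something_big) {
    advanceSystem<<<num_threads / 64, 64>>>(cx, N);
    cudaThreadSynchronize();           /* surface any asynchronous error */
    checkCUDAError("advanceSystem");   /* helper from earlier in the thread */
    k++;
}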

I'm starting to think that it must be a memory leak or something.
For further clarification, the thread-dependent stuff I declare in the kernel consists of 8 vectors of type float (2 floats per vector) and two additional floats.

If someone could help me I would be immensely grateful.
I'm going crazy over this.

Daniele

To be sure the kernel is not interrupted by the watchdog timer, just run it with your X server shut down.

An easy way to see if your loop is running is to do:

__global__ void advanceSystem(float *cx, unsigned long int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int i;
    // here I declare other stuff that I need on this thread

    i = 1;
    while (i <= N) {
        i++;
    }

    cx[idx] = (float)i;
}

Then you check whether the whole cx array is filled with N+1 (the value i has when the loop exits).
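
A host-side check could look like this (a sketch; h_cx is a hypothetical host buffer already filled by cudaMemcpy, and num_threads is the total thread count):

int ok = 1;
for (int j = 0; j < num_threads; j++) {
    if (h_cx[j] != (float)(N + 1)) {   /* i equals N+1 when the loop exits */
        ok = 0;
        break;
    }
}
printf("loop %s on all threads\n", ok ? "completed" : "did NOT complete");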

What is the number of threads in the kernel call?