I am wondering whether my GPU is just lazy: I have to run a fairly big loop in my kernel, and even if I split it into several loops, it simply does not run it/them.
When I cudaMemcpy the data back to host RAM, nothing has changed between my host initialization and the result. Yet the kernel does take real time to run, even though the CUDA timer says otherwise. I checked every small detail such as the memcpys and initializations. I also retrieved my loop index and it does NOT change after the loop.
Whereas if I change 50 000 to 8000, the loop runs.
I hope this is not too simple a question, but is there a limit for loops in CUDA? What am I doing wrong?
Yes, there is a limit for loops in CUDA. It is the same as in C: you cannot loop over more iterations than the loop counter has distinct states.
Increase dt, or switch to double precision.
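To illustrate the point about counter states: the original kernel is not shown in this thread, so this is only a sketch that assumes the loop advances a floating-point counter with something like t += dt. The helper name counterStalls is made up for this example. It reports whether adding dt to t leaves t unchanged at a given precision, which is the situation where such a loop stops advancing.

```cuda
// Host-only sketch (also valid as plain C++), no CUDA calls needed.
// counterStalls is a hypothetical helper, not part of any library.

// Returns true when t + dt rounds back to t in type T, i.e. a loop that
// advances with `t += dt` would stop making progress at this value of t.
template <typename T>
bool counterStalls(T t, T dt)
{
    volatile T next = t + dt;   // volatile: force the sum to be rounded to T
    return next == t;
}
```

For example, counterStalls<float>(2048.0f, 1e-4f) is true because the spacing between neighbouring floats near 2048 is larger than 2 * dt, while counterStalls<double>(2048.0, 1e-4) is false. Whether this is what happens in the kernel above depends on the actual dt and loop bound, which we have not seen.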
Actually, the loop does not cause trouble in C. And as I said, even splitting my loop into two separate ones does not do the trick, and double precision doesn't help either (I have a GTX 470 and I am compiling with -arch=sm_20).
If you mean that you have stopped the X server entirely, then you are probably seeing the effect of the watchdog timer. To keep the GUI responsive, the driver will terminate any kernel that takes longer than ~5 seconds. If no GUI is running, then there is no time limit. When the kernel is terminated this way, you should get some error return value from the next CUDA call.
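To actually see that error value, something along these lines should work. It is a minimal sketch: the kernel longLoop and the helper launchAndCheck are placeholder names invented here, and the loop body is dummy work, not the code from the question. The runtime calls used (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard; on older toolkits the synchronize call was named cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical long-running kernel standing in for the one in the question.
__global__ void longLoop(float *out, unsigned long long n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = out[idx];
    for (unsigned long long i = 0; i < n; ++i) {
        acc = acc * 0.999f + 1.0f;   // placeholder work
    }
    out[idx] = acc;
}

// Launches the kernel and returns the first error seen: either a launch
// error (e.g. a bad configuration) or an error raised during execution.
cudaError_t launchAndCheck(float *d_out, int blocks, int threads,
                           unsigned long long n)
{
    longLoop<<<blocks, threads>>>(d_out, n);
    cudaError_t err = cudaGetLastError();   // errors detected at launch time
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();      // errors raised while the kernel ran
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

If the watchdog is the culprit, I would expect the returned code to be cudaErrorLaunchTimeout, but check the printed string on your own setup rather than taking my word for it.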
I have the exact same problem, but I don't think it's a watchdog issue. My code is structured very similarly to the example code:
__global__ void advanceSystem(float *cx, unsigned long int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int i;
    float temp;
    // here I declare other stuff that I need on this thread
    unsigned int lrng_state;
    temp = cx[idx];
    i = 1;
    while (i <= N) {
        // temp = do stuff (placeholder for the actual update)
        i++;
    }
    cx[idx] = temp;
}
I have even tried storing N in constant memory with cudaMemcpy instead of passing it as an argument to the kernel, and I get the same issues. I managed to run the kernel with N=10000 for the most part, but sometimes it works with N=1000000. Sometimes (more rarely) it fails even with N=10000.
If it is of any importance, in the instances where it fails even for N=10000, I am calling the kernel in a loop from my C++ code:
I’m starting to think that it must be a memory leak or something.
For further clarification, the thread-dependent stuff I declare in the kernel consists of 8 vectors of type float (2 floats per vector) and two additional floats.
If someone could help me I would be immensely grateful.
I’m going crazy over this.
To be sure the kernel is not interrupted by the watchdog timer, just run it with your X server shut down.
An easy way to see whether your loop is running is to do:
__global__ void advanceSystem(float *cx, unsigned long int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int i;
    // here I declare other stuff that I need on this thread
    i = 1;
    while (i <= N) {
        i++;
    }
    cx[idx] = (float)i;
}
Then check whether the whole cx array is filled with N + 1, which is the value of i when the loop exits.
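The host side of that check could look roughly like this. It is a sketch under assumptions: countLoop is a renamed copy of the test kernel above, and countMismatches is a helper name made up for this post. Note that the compiler is free to replace such an empty loop with its final value, so a clean result only shows that the launch and the copy back work, not how long the real loop body would take.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same idea as the test kernel above: count up to N and store the counter.
__global__ void countLoop(float *cx, unsigned long long N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long i = 1;
    while (i <= N) {
        i++;
    }
    cx[idx] = (float)i;   // i == N + 1 when the loop exits
}

// Hypothetical helper: runs countLoop and returns how many entries differ
// from N + 1, or -1 if any CUDA call reports an error.
int countMismatches(int blocks, int threads, unsigned long long N)
{
    const size_t count = (size_t)blocks * threads;
    const size_t bytes = count * sizeof(float);
    std::vector<float> h_cx(count, 0.0f);

    float *d_cx = nullptr;
    if (cudaMalloc(&d_cx, bytes) != cudaSuccess) return -1;
    cudaError_t err = cudaMemcpy(d_cx, h_cx.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        countLoop<<<blocks, threads>>>(d_cx, N);
        err = cudaGetLastError();           // launch errors
    }
    if (err == cudaSuccess) {
        // This copy waits for the kernel and also reports execution errors.
        err = cudaMemcpy(h_cx.data(), d_cx, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_cx);
    if (err != cudaSuccess) return -1;

    const float expected = (float)(N + 1);  // same conversion as in the kernel
    int bad = 0;
    for (float v : h_cx) {
        if (v != expected) ++bad;
    }
    return bad;
}
```

A return value of 0 means every thread wrote N + 1. A return of -1 means a CUDA call failed, and anything positive means some threads never wrote their result.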