I noticed a topic posted by coderart, and I may have the same problem he does.
In my case, a nested loop with an unknown iteration count inside a kernel does not seem to work,
though it sometimes works when the iteration count is known to the kernel.
I do not know how to fix it.
Any help would be appreciated.
I am having an odd problem with memory access when using CUDA and I hope that I am writing this in the correct place. If I am not, I would be happy to move it elsewhere.
I am using CUDA to process two large sets of data: each element of the first set gets its own thread, which must then access every element of the second set. Neither set of data is being written to, but I do have room in global memory for results to be written, indexed by the thread ID.

Inside each thread I have a for loop that iterates over each element of the second data set, does a calculation, and stores the result in the global memory set aside for that thread. If I do this calculation outside of the loop, on a single element of the second data set, everything works fine and I get results back.

The problem is that when I do the calculation for each element of the second data set and write the result to the same place (it is used later in another calculation which takes place inside the loop and works fine, but only writes its results once, after the loop is finished), the device process appears to quit early and I get no usable data out of the program. I say the process quits early because it takes 0.1 seconds to finish instead of the 10 seconds it should take.
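For concreteness, the setup described above might look roughly like the sketch below. This is only an illustration of the structure, not the actual code: the names (`first_set`, `second_set`, `results`) and the placeholder calculation are assumptions.

```cuda
// Sketch: one thread per element of the first set; each thread loops over
// every element of the second set and writes into its own slot of a
// global results buffer, indexed by thread ID.
__global__ void process(const float *first_set, int n_first,
                        const float *second_set, int n_second,
                        float *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_first) return;

    for (int j = 0; j < n_second; ++j) {
        float r = first_set[tid] * second_set[j];  // placeholder calculation
        results[tid] = r;  // written every iteration, as described above
    }
}
```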
I was able to work around the problem for a small second data set by adding a __syncthreads() call after the calculation inside the loop, but this stops working once the data set grows past a certain point, and actually causes the computer to hang indefinitely, requiring a reboot.
Everything is being stored in global memory right now because I want to get the algorithm working before I start trying to optimize memory access time.
I hope someone is able to understand what I just said, I know it looks a little complex.
Thanks in advance.
yummig
January 20, 2009, 1:42pm
2
I suspect that there is a coding error somewhere. What I suggest is that, within the kernel, you write out the number of iterations the for loop must cycle through for each thread, then copy it back to the host and compare notes. If you don’t encounter a crash or driver reboot, then it’s likely your kernel is doing nothing, since the watchdog timer isn’t timing out. You may find that your iteration limit is being evaluated to 0, or that one of your threads is messing you around. Be aware that the compiler will unroll loops whose size is deterministic, i.e. for(int i = 0; i < 10; i++) will iterate 10 times irrespective of runtime conditions - this topic is covered in the programming guide and the compiler manual.
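A minimal sketch of that debugging trick (the names `trip_count` and `n` are placeholders for whatever your kernel actually uses; `trip_count` is one unsigned int per thread, allocated with cudaMalloc and copied back with cudaMemcpy):

```cuda
// Record each thread's loop trip count so the host can verify
// that the loop bound was actually what you expected.
__global__ void kernel(const float *second_set, int n,
                       unsigned int *trip_count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iters = 0;

    for (int j = 0; j < n; ++j) {
        // ... your per-element calculation ...
        ++iters;
    }

    trip_count[tid] = iters;  // inspect on the host after cudaMemcpy
}
```

If the copied-back counts are all 0 (or the kernel returns suspiciously fast), the loop bound is being evaluated to something you did not intend.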
Thank you for your help.
I have also noticed something: there is a triple nested loop in the kernel.
If I change the triple nested loop into a single loop, as the following pseudo-code shows, CUDA seems to work correctly.
My guess is that CUDA first estimates the iteration count of the loop, but a triple loop is hard to estimate, and the depth in the following code segment is not a constant but a variable.
[codebox]
for(x=1;x<=depth;x++)
for(y=1;y<=depth;y++)
for(z=1;z<=depth;z++)
[/codebox]
=====>
[codebox]
for(i=1;i<=depth*depth*depth;i++)
x=... y=... z=....
[/codebox]