I noticed a topic posted by coderart, and I may have the same problem he does.
In my case, a nested loop with an unknown iteration count inside a kernel does not seem to work,
though it sometimes works when the iteration count is known to the kernel.
I do not know how to fix it.
Any help would be appreciated.
I am having an odd problem with memory access when using CUDA and I hope that I am writing this in the correct place. If I am not, I would be happy to move it elsewhere.
I am using CUDA to process two large sets of data: each element of the first set gets its own thread, which must then access every element of the second set. Neither set of data is being written to, but I do have room in global memory for results to be written, indexed by the thread ID.

Inside each thread I have a for loop that iterates over each element of the second data set, does a calculation, and stores the result in the global memory set aside for that thread. If I do this calculation outside of the loop, on a single element of the second data set, everything works fine and I get results back.

The problem is that when I do the calculation for each element of the second data set and write the result to the same place (it is used later in another calculation which takes place inside the loop and works fine, but only writes its results once, after the loop is finished), the device process appears to quit early and I get no usable data out of the program. I say the process quits early because it takes 0.1 seconds to finish instead of the 10 seconds it should take.
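For concreteness, the setup described above might look roughly like the sketch below. This is only an illustration of the structure, not the actual code: the names (`first_set`, `second_set`, `results`) and the placeholder calculation are assumptions.

```cuda
// Sketch: one thread per element of the first set; each thread loops over
// every element of the second set and writes into its own slot of a
// global results buffer, indexed by thread ID.
__global__ void process(const float *first_set, int n_first,
                        const float *second_set, int n_second,
                        float *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_first) return;

    for (int j = 0; j < n_second; ++j) {
        float r = first_set[tid] * second_set[j];  // placeholder calculation
        results[tid] = r;  // written every iteration, as described above
    }
}
```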
I was able to work around the problem for a small second data set by adding a __syncthreads() call after the calculation inside the loop, but this stops working once the data set grows past a certain point, and actually causes the computer to hang indefinitely, requiring a reboot.
Everything is being stored in global memory right now because I want to get the algorithm working before I start trying to optimize memory access time.
I hope someone is able to understand what I just said, I know it looks a little complex.
Thanks in advance.
yummig
January 20, 2009, 1:42pm
2
I suspect that there is a coding error somewhere. What I suggest is that, within the kernel, you write out the number of iterations the for loop must cycle through for each thread, then copy it back to the host and compare notes. If you don’t encounter a crash or driver reboot, then it’s likely your kernel is doing nothing, since the watchdog timer isn’t timing out. You may find that your iteration limit is being evaluated to 0, or that one of your threads is messing you around. Be aware that the compiler will unroll loops whose size is deterministic, i.e. for(int i = 0; i < 10; i++) will iterate 10 times irrespective of runtime conditions - this topic is covered in the programming guide and the compiler manual.
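A minimal sketch of that debugging trick (the names `trip_count` and `n` are placeholders for whatever your kernel actually uses; `trip_count` is one unsigned int per thread, allocated with cudaMalloc and copied back with cudaMemcpy):

```cuda
// Record each thread's loop trip count so the host can verify
// that the loop bound was actually what you expected.
__global__ void kernel(const float *second_set, int n,
                       unsigned int *trip_count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iters = 0;

    for (int j = 0; j < n; ++j) {
        // ... your per-element calculation ...
        ++iters;
    }

    trip_count[tid] = iters;  // inspect on the host after cudaMemcpy
}
```

If the copied-back counts are all 0 (or the kernel returns suspiciously fast), the loop bound is being evaluated to something you did not intend.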
Thank you for your help.
I have also noticed something: there is a triple nested loop in the kernel.
If I change the triple nested loop into a single loop, as the following pseudo-code shows, CUDA seems to work correctly.
My guess is that CUDA first estimates the iteration count of the loop, but a triple loop is hard to estimate, and the depth in the following code segment is not a constant but a variable.
[codebox]
for(x=1;x<=depth;x++)
for(y=1;y<=depth;y++)
for(z=1;z<=depth;z++)
[/codebox]
=====>
[codebox]
for(i=1;i<=depth*depth*depth;i++)
x=... y=... z=....
[/codebox]