Thread sudden death?

phantom7026 · January 19, 2012, 7:12am

Well…
In my CUDA program, there is a kernel function containing a for loop that performs floating-point computations.
The loop typically iterates many times, on the order of 2^20.
The exact exponent depends on the input data, but in any case it is a very large number.

However, my program sometimes fails to execute the kernel function correctly.
(The MPI version of the program always works, so I believe the algorithm itself is correct.)

I have investigated, and I now know that some threads do not finish the for loop normally.
That is, they exit the loop abruptly. The point at which the threads stop differs from run to run.

I wonder why this happens.

Has anyone experienced this kind of problem?

tera · January 19, 2012, 11:14am

Run your program under cuda-memcheck to find wrong memory accesses. How long does the program run? Could you be triggering the watchdog timer?
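In case it helps, here is a minimal sketch of how both questions could be checked from the host side. The helper names watchdogEnabled and kernelFinishedOk are made up for illustration and are not from the original program. The sketch assumes the CUDA runtime API, where the kernelExecTimeoutEnabled device property reports whether a run time limit applies to kernels, and where a kernel killed by the watchdog typically surfaces as cudaErrorLaunchTimeout at the next synchronization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports whether a kernel run time limit (watchdog)
// is active on the given device.
// Returns 1 if enabled, 0 if disabled, -1 if the device query failed.
int watchdogEnabled(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

// Hypothetical helper: call right after a kernel launch.
// Returns true if the kernel launched and ran to completion.
bool kernelFinishedOk(const char *label) {
    // Errors from the launch itself (bad configuration and so on).
    cudaError_t err = cudaGetLastError();
    // Errors raised while the kernel was running show up on synchronization.
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        return true;
    }
    fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
    if (err == cudaErrorLaunchTimeout) {
        fprintf(stderr, "The kernel ran too long and was stopped by the watchdog.\n");
    }
    return false;
}
```

Calling kernelFinishedOk("my kernel") directly after the launch should make a timeout visible instead of letting the program continue with partial results. If watchdogEnabled(0) returns 1, long-running kernels on that device are at risk.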

phantom7026 · January 19, 2012, 4:26pm

Dear tera,

Thank you for your attention.

Following your suggestion, I ran my program under cuda-memcheck.
However, it reported no errors, even though the program still quits abnormally.

Still, I also suspect a memory problem,
because I found that accessing certain memory (an array in local memory) causes the program to quit.
Hmm… this is very difficult.

Keldor314 · January 19, 2012, 9:14pm

Sounds like a watchdog timer timeout.
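One common workaround, sketched below under assumptions, is to split the long loop across several shorter kernel launches so that no single launch runs long enough to hit the limit. This is not taken from the original program: the kernel iterateChunk, the driver runInChunks, the per-thread state array, and the placeholder update x = 0.5f * x + 1.0f are all hypothetical stand-ins for the real computation. It only works as written if each thread's loop state can be saved to global memory between launches.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread advances its own value through
// iterations [begin, end) and stores the result so that the next
// launch can resume from it.
__global__ void iterateChunk(float *state, int n, long long begin, long long end) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = state[i];
    for (long long k = begin; k < end; ++k) {
        x = 0.5f * x + 1.0f;  // placeholder for the real per-iteration work
    }
    state[i] = x;
}

// Hypothetical host driver: runs totalIters iterations as a series of
// launches of at most chunkIters iterations each.
// Assumes n is small enough that the grid fits in one dimension.
cudaError_t runInChunks(float *d_state, int n, long long totalIters, long long chunkIters) {
    if (n <= 0 || chunkIters <= 0) return cudaErrorInvalidValue;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (long long begin = 0; begin < totalIters; begin += chunkIters) {
        long long end = begin + chunkIters;
        if (end > totalIters) end = totalIters;
        iterateChunk<<<blocks, threads>>>(d_state, n, begin, end);
        cudaError_t err = cudaGetLastError();
        // Synchronizing after each chunk keeps every launch short and
        // reports a failure at the chunk where it happened.
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

The chunk size would need tuning so that each launch stays well under the timeout, which depends on the OS and driver. Running the computation on a GPU that does not drive a display is another option, since the watchdog normally applies only to display devices.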

phantom7026 · January 20, 2012, 4:27am

Dear tera and Keldor314,

Thank you very much.

The problem was caused by the watchdog timer.

Because this is my first CUDA program, especially one for big data, I had not encountered this kind of GPU programming problem before.
To be honest, I did not know about the watchdog timer before your comments.

Now, my program is working well.

I really appreciate your attention and comments.