CUDA limit for loops? Too large a number of iterations?

Hi!

I’m currently working on a CUDA program which essentially executes the same loop over and over.

That is, every thread is doing something like this:

  • generate random number*
  • read from shared memory array
  • write to shared memory array

I’ll keep it at this pseudo-code level (roughly like the sketch below), since everything works quite well for most inputs.
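To make that concrete, here is a rough CUDA sketch of what each thread does per iteration; the array size, iteration count, and the LCG-style “random” step are placeholders for illustration, not my actual code:

  __global__ void iterateKernel(float *globalOut, int nIterations)
  {
      // one shared array per block; 256 is a placeholder for my real size
      __shared__ float sharedBuf[256];
      unsigned int tid = threadIdx.x;

      // seed the "random" state from clock() and the thread index (see footnote below)
      unsigned int state = (unsigned int)clock() ^ (tid * 2654435761u);

      sharedBuf[tid] = 0.0f;
      __syncthreads();

      for (int i = 0; i < nIterations; ++i)
      {
          // generate random number (cheap LCG stand-in for my clock()-based hack)
          state = state * 1664525u + 1013904223u;
          unsigned int idx = state % 256;

          // read from shared memory array
          float v = sharedBuf[idx];
          __syncthreads();

          // write to shared memory array
          sharedBuf[tid] = v + (state & 0xFF) * (1.0f / 255.0f);
          __syncthreads();
      }

      globalOut[blockIdx.x * blockDim.x + tid] = sharedBuf[tid];
  }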

However, if the program runs for too long, my whole memory seems to be obliterated.

Now I’m not really sure if this has anything to do with some internal CUDA looping limit, but currently it’s my best guess. So if anyone can tell me whether such a limit really exists, the help would be appreciated!
Basically I just need to know whether this could be the source of my problem, or if there is no such limit and I have to look elsewhere.

*the random thing works with clock() calls… I’m not sure whether this can overflow, but this should be irrelevant… I think.
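As far as I know, clock() on the device returns a 32-bit cycle counter that simply wraps around, which should be harmless as long as the value is only folded into a seed or hash, e.g. something like this (the constants are arbitrary):

  // clock() wrap-around is harmless if the value is only mixed into a hash/seed
  __device__ unsigned int cheapRandom(unsigned int &state)
  {
      state ^= (unsigned int)clock();          // mix in the (possibly wrapped) counter
      state = state * 1664525u + 1013904223u;  // LCG step, constants are arbitrary
      return state;
  }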

There’s a 5-second limit on CUDA kernels.

Are you running on Windows or Linux?

Regarding a loop limit - I don’t know. I’ve never seen this discussed in the forum.

I’d guess that you have a memory leak or buffer overrun or something of the sort. What about running the emulated version in Valgrind?
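Something along these lines, assuming your source file is called myapp.cu (file names are placeholders; -deviceemu was the device-emulation switch in nvcc of that era):

  # build in device-emulation mode with debug info
  nvcc -deviceemu -g -o myapp myapp.cu

  # run the emulated binary under Valgrind to catch out-of-bounds accesses and leaks
  valgrind --leak-check=full ./myapp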


@Sarnath: 5 second limit? What exactly do you mean by that? I ask, because the (kernel) program already runs successfully for inputs which produce run times up to ten seconds.
And I’m currently working under Linux.

@kristleifur: I guess you could be right… I’m wondering though why it runs without disturbance on slightly smaller problems, which do exactly the same thing, just not as often…

But I guess it’s time to get my hands dirty with this and give the emulator a shot.

I think that is the 5-second watchdog thingy… I heard that if you execute your program under Windows on a GPU that is also driving your monitor, the program is killed after 5 seconds by a watchdog. This could also be the case under Linux, but I don’t know that for sure.

It seems to be the same on Linux - if I program a bad kernel, my system hangs for around 5 seconds.

But I have seen discussions in this forum saying there is no 5-second watchdog on Linux…

If you search on Google, people say there is some watchdog on Linux, but I don’t know whether it’s a standard thing or something you need to install.

With a single GPU:

  1. There IS a 5-second watchdog when using X Windows on Linux
  2. There is NO 5-second watchdog when NOT using X Windows on Linux (text-only console without X running in the background)

With multiple GPUs:
… I don’t know because I don’t have a multi-GPU system :(
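For what it’s worth, if your CUDA version is recent enough to expose kernelExecTimeoutEnabled in cudaDeviceProp, you can just ask the runtime whether a watchdog applies to a given device. A minimal host-side check would look something like this:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      int deviceCount = 0;
      cudaGetDeviceCount(&deviceCount);

      for (int d = 0; d < deviceCount; ++d)
      {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, d);
          // kernelExecTimeoutEnabled is 1 if a run-time limit (watchdog) applies to this device
          printf("Device %d (%s): watchdog %s\n", d, prop.name,
                 prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
      }
      return 0;
  }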

Thanks for the clarification. I think I will install Linux again or just kill my X Windows…

There’s a way to stop X Windows from starting every time you boot your machine… No need to re-install.

Check out /etc/inittab OR ask some Linux expert.

GDM or KDM are usually the daemons that keep X11 alive. On my distribution, Ubuntu, I kill X11 by doing ‘sudo /etc/init.d/gdm stop’. If you want to semi-permanently disable X11 on bootup, you’ll have to munge the ‘/etc/rc?.d’ dirs, or your distribution’s equivalent. Check ‘runlevel’ if you want to see what condition your condition is in. (’/etc/inittab’ is another way of configuring bootup services AFAIK.)
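To make that concrete (paths and service names depend on the distribution, so treat these as examples):

  # stop X immediately on Ubuntu with GDM
  sudo /etc/init.d/gdm stop

  # on inittab-based distros, boot to a text runlevel by default:
  # edit /etc/inittab so the default runlevel line reads
  id:3:initdefault: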

My computer restarts without any warning: the screen just goes black and the machine reboots. Is this the watchdog? I would expect it to kill the process, but here I get a soft reset. Is there any way to change the behavior of the watchdog?

The watchdog should only kill the process, not reset the computer. I’ve never seen a CUDA program crash reset the whole system. Are you using the latest drivers? Have you upgraded to the latest motherboard BIOS?

See http://forums.nvidia.com/index.php?showtopic=58436, where a Linux machine hangs and then reboots because of a CUDA program.

Yes, well what I meant is that I’ve never seen such a reboot personally with my own eyes. The person in that post is just asking for major problems when they try to run 120 infinite loops all at the same time.

Yes, and I don’t see any explanation for stopping the reboots. Only Wumpus came up with something I don’t quite understand: the termination criteria…

Nevermind, nobody seems to see my point.

  1. When was the last time you ran 120 processes all running infinite loops on your CPU? Was the system responsive enough that you could actually kill them? The head node on a cluster I use regularly runs up a load average of only 10 by being a file server for the nodes, and even that is enough to make the system so unresponsive that I cannot even run “top” after logging in.

The point is, when you push the system so far out of the realm of normal operating procedure you have to expect something to break. CUDA is very stable and the watchdog performs well when you run a reasonable number of applications/threads at once.

I saw instant reboots when I used driver 169.09. Driver versions 169.07 and 169.12 were OK. Probably unrelated though.