CUDA limit for loops…? Too large a number of iterations?

Apparently the new CFS scheduler in Linux (especially when using user-ID-based fairness; I forget the name they actually have for it) keeps a terminal responsive even when running thousands of infinite loops. But I have not upgraded yet ;)

But to answer your question: I do get your point, and I wonder a lot what the use of infinite loops in CUDA is, as you will have no way to get any data back to the CPU. Maybe they want to use the GPU as a heater???

“GPU as a heater” – Good one :-)

Actually, when I came back to my job after the Christmas holidays, our heating had broken. All the places next to my Dell XPS720H2C with 2x 8800GTX were taken that day :)

Please note that I have an issue with this: I actually need X11 to start for my GPU to initialize at the correct (i.e. full) clock speed. With the 169.09 driver, if I run deviceQuery before X11 starts, it reports half the clock speed. If I then start X11 and terminate it (gdm start ; gdm stop), the speed is at max. As long as you are aware of this, it is not a problem. I don’t know if it is related to my hardware (G92 with i780). With the 169.07 driver, I do not need to start X11 to get full speed.
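For reference, deviceQuery basically just reads what cudaGetDeviceProperties() reports, so a tiny sketch like the one below (device 0 assumed, which is what deviceQuery defaults to) is enough to compare the reported clock before and after starting X11:

```cpp
// Minimal sketch: print the clock rate the driver currently reports for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                        // device 0, like deviceQuery
    printf("Reported clock rate: %d kHz\n", prop.clockRate);  // clockRate is given in kHz
    return 0;
}
```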

  • Kuisma

Could you try using global memory instead of shared memory as a test?

Actually, I get the same problem even when I use global memory.

Please let me know if you find out how to solve this problem. Many thanks!

I did try global memory and rewrote the program accordingly, but unfortunately this did not solve the problem.

A bit more testing revealed that the program behaves as if the kernel was not called at all, though the program spends several seconds executing the kernel, so the GPU is at least doing something.

I’m not currently working on this, but I will get back to it within the next few days. My best guess at the moment is that the kernel execution somehow fails (is there a way to check why…?) due to bad resource management (too many registers used or something similar), and I need to rework the data load per thread/block.

(And thanks to all the other guys trying to help! The watchdog should not be the source of my problem, since I’m coding over SSH with PuTTY on a Linux machine. Also, the program already takes more than 5 seconds to complete one successful call.)

Call CUT_CHECK_ERROR after your kernel call. There is a big possibility that you hit the 5-second limitation, or that you overwrite some memory you have not allocated. Check the SDK examples for the use of CUT_CHECK_ERROR, CUDA_SAFE_CALL, etc. They are really useful!
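Roughly like the sketch below (the kernel, buffer and sizes are just placeholders, and cutil.h comes from the SDK, not from the toolkit, so the exact macros may differ with your SDK version):

```cpp
// Sketch of how the SDK samples wrap CUDA calls and kernel launches.
#include <cuda_runtime.h>
#include <cutil.h>   // from the CUDA SDK; provides CUDA_SAFE_CALL / CUT_CHECK_ERROR

__global__ void dummyKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main(void)
{
    const int n = 1024;
    float *d_out = 0;

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_out, n * sizeof(float)));  // checks the return code

    dummyKernel<<<n / 256, 256>>>(d_out, n);
    CUT_CHECK_ERROR("dummyKernel failed");   // reports a launch/execution error, if any

    CUDA_SAFE_CALL(cudaFree(d_out));
    return 0;
}
```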

The 5-second limit should not be the problem, since the program runs successfully for up to 12 seconds and returns the correct answer. If it takes longer than that, though, my problem appears.

CUT_CHECK_ERROR etc. are already in place but unfortunately don’t report anything.

As for the memory: I will check that again, though I doubt anything is wrong there. Not because I’m above such mistakes (quite the contrary ;) ), but rather because I think such a problem would show up much earlier.
But, better safe than sorry, so checking this surely won’t hurt…

Are you compiling in debug mode? If you are not, then CUT_CHECK_ERROR has no effect.
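If you want a check that also works in release builds, you can do by hand roughly what that macro does; a minimal sketch, with checkKernel as a helper name I made up:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Call this right after a kernel launch; works regardless of build mode.
void checkKernel(const char *msg)
{
    cudaError_t err = cudaGetLastError();        // catches launch-time errors
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();           // catches errors during kernel execution
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
}

// usage (myKernel, grid and block are placeholders):
//   myKernel<<<grid, block>>>(...);
//   checkKernel("myKernel");
```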

And the 5-second watchdog applies to individual kernel calls, not to entire program invocations. Or are you only calling one kernel in the whole program and have essentially zero initialization?
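If the watchdog does turn out to be the issue, one common workaround is to split the work over several shorter launches so that each individual call stays under the limit. A rough sketch under that assumption (processChunk and the chunking scheme are placeholders, not your code):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: processes only one chunk of the data per launch.
__global__ void processChunk(float *data, int offset, int chunkSize)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + chunkSize)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 22, chunkSize = 1 << 18;   // n is a multiple of chunkSize here
    float *d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // Each launch covers one chunk, so each call finishes well before the watchdog fires.
    for (int offset = 0; offset < n; offset += chunkSize) {
        processChunk<<<chunkSize / 256, 256>>>(d_data, offset, chunkSize);
        cudaThreadSynchronize();                  // wait for this chunk before the next launch
    }

    cudaFree(d_data);
    return 0;
}
```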