Dangers of programming your GPU

[screenshot of the corrupted display]

Any clues? =p

I was writing an image-training application that allocated 200 720x480 grayscale images (float), so about 260 MB on my 9800GTX, plus some additional buffers that summed to about 6 MB, and I got a cool rain of random pixels on my screen plus a message saying the device driver was not responding ^_^

One question: if I allocate memory on the host, does that take up addressing space on the video board? It would make sense, I think: if I have 300 MB allocated on the video board and another 300 MB allocated on the host (cudaMallocHost), and the video board has to “SEE” that memory, then it should take up addressing space on the GPU. Is that the case?

(I run RivaTuner there, but I just use it to keep track of the GPU’s temperature; no overclocking, as you can see in the gadget. And this was a bad screenshot: it was way noisier than that, I just grabbed it at the wrong time.)

This can happen if your kernel writes past the end of the memory that you have allocated for it.
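For instance (a hypothetical sketch, not the original poster’s code), here is the classic pattern: the grid gets rounded up to a whole number of blocks, so the last block contains threads whose index falls past the end of the array, and without a guard they write out of bounds.

    __global__ void scale(float *img, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)      // without this guard, the rounded-up last block writes past the allocation
            img[i] *= s;
    }

The launch side typically computes the grid as (n + threadsPerBlock - 1) / threadsPerBlock, which starts more threads than there are elements, so the guard is mandatory.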

cudaMalloc allocates memory on the device. (make sure you check for error return values…)
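For example, a minimal checked allocation might look like this (a sketch; the size matches the 720x480 float images mentioned above):

    #include <cstdio>

    int main()
    {
        float *d_img = 0;
        // one 720x480 grayscale float image, as in the original post
        cudaError_t err = cudaMalloc((void **)&d_img, 720 * 480 * sizeof(float));
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaFree(d_img);
        return 0;
    }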

malloc/new/cudaMallocHost/cudaHostAlloc only allocate memory on the host. All copies from host->device are done with DMA, which doesn’t use any memory on the device.
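To illustrate (a sketch with made-up names; error checking omitted for brevity), the pinned buffer below lives entirely in host RAM, and the copy out of it is a DMA transfer:

    int main()
    {
        size_t bytes = 720 * 480 * sizeof(float);
        float *h_img = 0, *d_img = 0;
        cudaMallocHost((void **)&h_img, bytes);  // page-locked host RAM; uses no device memory
        cudaMalloc((void **)&d_img, bytes);      // device memory
        /* ... fill h_img on the host ... */
        cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);  // DMA copy to the device
        cudaFreeHost(h_img);
        cudaFree(d_img);
        return 0;
    }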

No memory protection on a GPU? Perhaps this should go into the feature request thread (although it’s more of a hardware issue).

Yeah, I used to get this all the time. Just add some bounds-checking and infinite-loop-breaking code, which can be turned off by the preprocessor.
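Something along these lines, for example (a sketch; the GPU_CHECKS macro and the iteration cap are purely illustrative, enable them in debug builds with -DGPU_CHECKS):

    #ifdef GPU_CHECKS
    #define BOUNDS_CHECK(i, n) do { if ((i) >= (n)) return; } while (0)
    #define LOOP_GUARD(it, max) do { if (++(it) > (max)) return; } while (0)
    #else
    #define BOUNDS_CHECK(i, n)
    #define LOOP_GUARD(it, max)
    #endif

    __global__ void relax(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        BOUNDS_CHECK(i, n);               // compiled away in release builds
    #ifdef GPU_CHECKS
        int iter = 0;
    #endif
        while (buf[i] > 1.0f) {           // some data-dependent loop
            buf[i] *= 0.5f;
            LOOP_GUARD(iter, 100000);     // bail out if it never converges
        }
    }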

“danger” is a bad characterization; it’s just corrupting GPU memory. Also, I recommend calling sync on Linux machines (I can’t find a Windows equivalent with a quick Google search) before invoking the kernel, to make sure your edited source files have actually been written to disk.
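On Linux you can also do this from inside the program with sync(2) right before the launch (a sketch; the kernel here is just a stand-in):

    #include <unistd.h>

    __global__ void risky(float *buf) { /* ... the kernel you don't quite trust yet ... */ }

    int main()
    {
        float *d_buf = 0;
        cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
        sync();                    // flush dirty pages to disk in case the machine hangs
        risky<<<4, 256>>>(d_buf);
        cudaThreadSynchronize();   // wait for the kernel to finish (CUDA 2.x-era API)
        cudaFree(d_buf);
        return 0;
    }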

regards,
Nicholas

There is memory protection; it’s just that the recovery functions weren’t quite robust enough before. I think this has been fixed now (with the very latest driver, I can SIGKILL a Linux app running an infinitely looping CUDA kernel and the machine keeps working absolutely fine; how do you like that). I think the XP/Vista protection enhancements come with 2.2, and the Linux ones come in the driver immediately after.

I don’t quite see what that has to do with memory protection. What I’m thinking of is this: if I have a kernel that starts zeroing out all of the memory on the card running my display, will it be caught, or will I watch X11 lock up? I’ve had at least one case where X locked up on me, and it turned out (when I ran the code through valgrind in emulation mode) that I was scribbling off the end of an array. I realise I should have kept a test case, but coming up with one is tedious when the machine requires rebooting every time you have a ‘success’…

There is memory protection. All of the addresses are virtual; there’s no way for a stray address to map onto other places in memory holding sensitive data. The problem was that, before, there were times when a context could do something bad (segfault, write to bad addresses, etc.), the driver would kill it, but the driver wouldn’t entirely recover: maybe X would crash, or the driver would crash later on. It was bad. This has been fixed.

Awesome :) When’s this post-2.2 Linux driver coming?

I’ve hit the same problem a few times. In those cases I was using a lot of device memory, so I think this problem occurs when a lot of device memory is in use. The machine I tested on runs XP SP2, with a single GPU serving both as the display card and for CUDA computing.

Understood, and good to hear it has been fixed :)

Yep, that happened to me. Make sure you free any allocated memory :)