NDA expiration - new GF100 information

I have had silent segmentation faults on a CPU many times. It is not hard to exceed array bounds in a C++ program and overwrite some other variable that is also yours, so the OS does not complain. On a GPU, most of the memory is “yours”, since no programs other than your kernel are running in parallel.

Most of my (silent or not) segmentation faults on the GPU are caused by out-of-bounds array accesses. To track the problem down immediately during debugging, I simply check the index on every access.

template <typename T>
class Array {
private:
	unsigned int size;
	T *data;
public:
	[...] // some boring constructors

	__host__ __device__ inline T &operator[](int addr) {
		if (addr < 0 || addr >= (int)size) {
			debugLong(60, "<"<<blockIdx.x<<":"<<threadIdx.x<<"> Out of bounds! ["<<size<<"]->"<<addr);
			return data[0]; // redirect to a safe element instead of stray memory
		}
		return data[addr];
	}

	__host__ __device__ inline const T &operator[](int addr) const {
		if (addr < 0 || addr >= (int)size) {
			debugLong(60, "<"<<blockIdx.x<<":"<<threadIdx.x<<"> Out of bounds! ["<<size<<"]->"<<addr);
			return data[0];
		}
		return data[addr];
	}
};

(debugLong is a macro which prints a message on the screen from within the kernel, similar to cuPrintf.)

That's not entirely correct: you can use an onboard card (I think it must support CUDA), and its chip will be responsible for the screen, while the “other” GPU runs the kernels.

The second point: you cannot be sure that your program is 100% correct. So there is always a chance that your application (kernel) crashes, and when the graphics driver is running on your CUDA card, your PC can crash completely. That can happen during development (good) or when it's already running on the client's PC (very bad).

Right, much like any other software… I have caused many system deadlocks with “simple non-GPU” multi-threaded code.

You have to have a good QA department, and you should run your code through tools such as valgrind and Ocelot, or a simple test like the one detailed below from Cygnus (very nice, btw :) ).

Right again, for every piece of software in the world. Blue screens are something NVIDIA didn't invent; that was someone else.

Bottom line: the tools, debuggers, profilers and validators for GPUs will improve over time, but that doesn't mean you can't provide a stable environment with the current hardware and tools.

my 1 cent, anyway :)

eyal

I'm aware that you can also have “silent segmentation faults” on the CPU. That's why I wrote about unallocated data on the GPU. When you access unallocated data on the CPU, you always get a segmentation fault; that's the difference.

Doing error checking by hand is an option, but you must assume that your error checking code is absolutely correct. For “static algorithms” whose control flow does not depend on user input you will catch all errors, but you have a problem when the execution of your algorithm is influenced by user input: you can't check for those errors statically.

The best option would be some kind of memory protection in hardware, or at least enforced by the debugger.

NVIDIA is trying to make development easier, and the C++ support shows that they want to make CUDA development nearly as easy as on the CPU. A reasonable next step would be hardware debugging support and memory protection. All I want is information about that.

Thank you for your replies and your suggestions!

http://www.youtube.com/watch?v=iUouQy7Ohus

Also, if you’re just worried about your end product (the medical device), then I’d run a stripped-down Linux distribution on an Ion-based board. So you get stability, low power usage, and CUDA acceleration (not much, but maybe enough for what you’re doing). Another option is to build your device normally, then host some compute-only servers with Teslas or whatever in the building, and send the data there for computation.

In any case, the debugging tools available now for CUDA (those provided by nVidia, and third-party emulators like Ocelot and barra) should let you write code that is rock-solid and isn’t going to crash the system. The majority of the bugs I’ve seen people have (or had myself) had to do with out-of-bounds memory accesses, which you can find pretty easily with valgrind (or one of the emulators above). Also, for maximum stability, run some console-only Linux distribution…the next biggest group of crashes involves kernels that run too long for the display watchdog and cause the driver to reset.

This YouTube video shows an internal Windows 7 bug, and that crash is probably caused by operating system functions. I never said Windows can't crash, but try to write a matrix multiplication that will crash your PC using only user-mode instructions and a certain amount of memory allocations. You won't be able to do that on the CPU.

The advantage of GPU computing compared to an additional application-acceleration device (Tesla card, Cell card) is that you already have this device in your PC for 3D graphics output such as volume rendering. So it makes sense to use it for acceleration purposes as well, without extra cost.

Using an Atom processor on an ION platform would decrease general performance, and a Core i7 would probably be much faster than the CUDA cores on the ION GPU. Additional Tesla servers would raise the costs, and copying big data sets over a network connection would be slow. It's already a performance problem to copy data from PC RAM to the GPU. What benefit would I get? The Windows application I'm talking about already exists and has grown over many years. Switching to Linux is therefore impossible.

Due to these constraints I have to stay with Windows, and I have only one GPU. In the future it will be a Fermi card.

The debugging support on Windows is very limited (emulation mode). Unfortunately, Valgrind is only available for Linux, but I will install a virtual machine and do some error checking with Valgrind there. This solution is, however, complicated and time-consuming. Of course it is possible, but productivity decreases.

Is it now possible with NEXUS to debug on a PC with a single device?

Fermi will be able to run multiple kernels, so the problem arises that one kernel can overwrite another. This must be handled somehow.

IIRC memory is protected on the GPU and the addresses are virtual. A GPU segfault should never bring the system down.

I know it has happened, at least on older drivers. Is it still an issue? Does someone have a minimal repro that hangs the system?

I’ve actually just written an app in which I intentionally go beyond array bounds. Kernel invocation returns an error (caught by cudaGetLastError) and either the screen blips momentarily or it goes black for a couple of seconds to return after a while with Windows saying that the driver has stopped responding and successfully recovered. I can relaunch the application after this.

This is on Windows 7 x64. I know Windows XP couldn’t handle a driver crash gently but it seems it’s not that bad now. It’s still nasty that a kernel error can bring the driver down…

Big_Mac is right…the address spaces are virtual, and allocated per-context, so two (or more) concurrent kernels shouldn’t be able to write into each other’s memory.

During the time I developed a volume segmentation algorithm I had a number of PC crashes; the last one occurred on Monday. The desktop freezes, and occasionally I get strange colors on my screen. Then I have to reset my PC, but it has also happened that after waiting for some minutes the PC started working again. It is all unpredictable: starting the same kernel multiple times can lead either to a complete crash or to the error message “Unspecified launch failure”.

I observed this behavior on Windows XP x64 with Quadro Drivers. Might this problem be caused by the combination of driver and operating system?

If you’ve got something that can reliably crash your machine with the latest drivers (196.21 on Windows now, I guess), then you should post a repro case.

Not really… it all boils down to page granularity.

Page protection has a minimal granularity of 4 KB (a maximum granularity of 4 MB on Intel). Say you have a static array “int array[100]” in your data segment. If we assume “array” is page-aligned, you can still access the entire 4 KB without getting any faults on the CPU.

I am SURE the GPU has memory protection. I have written kernels probing memory at arbitrary addresses and have seen segfaults… It may be that GPU memory pages are “huge”, and hence there is a lot of scope for silent faults…

Since GPU addresses are context-based, this should NOT affect other contexts… If you are having such a problem, it should be reported as Tim suggested.

Win XP x64 SP 2

GPU: Quadro FX 4800

Quadro Driver 191.00

CUDA Version: 2.3

[attachment=15430:sysinfo.jpg]

This Kernel crashes my system reliably.

#include <stdio.h>
#include <cuda_runtime.h>
#include <limits.h>

__global__ void killGPU(float* f)
{
	// i runs 0, -1, -2, ... so f[-i] walks far past the 1024-byte allocation
	for (int i = 0; i > INT_MIN; i--)
		f[-i] = 0;
}

int main(int argc, char** argv)
{
	cudaError_t status;
	float* d_data;

	status = cudaMalloc((void**) &d_data, 1024);
	if (status != cudaSuccess)
		fprintf(stderr, "%s\n", cudaGetErrorString(status));

	for (int i = 0; i < 100; i++)
	{
		killGPU<<<256, 512>>>(d_data);
		status = cudaGetLastError();
		if (status != cudaSuccess)
			fprintf(stderr, "%s\n", cudaGetErrorString(status));
	}

	status = cudaFree(d_data);
	if (status != cudaSuccess)
		fprintf(stderr, "%s\n", cudaGetErrorString(status));
	return 0;
}

Sarnath, you are right with your example of page protection. Maybe it's an issue of the Quadro driver. Now I will install the latest GeForce driver and test whether the crash still occurs.

It seems to be a Quadro driver issue, at least with the version (191.00) I used. That was the latest Quadro driver at the time I installed CUDA 2.3.

With the latest GeForce driver, 196.21, I wasn't able to crash my system. It freezes for some seconds but recovers after that.

This makes things much better. I thought it was a general problem.

The only thing missing is better Windows debugging support on the device, but Nexus will provide that?

Now I have tested the GeForce driver 196.21. So far I haven't managed to crash the PC, but after the recovery the driver does not seem to be working correctly.

FurMark 1.65 (and my application) show broken display output.

Here are the screens:

Before starting my killGPU kernel

[attachment=15431:fur_mark…recovery.jpg]

After starting my killGPU kernel

[attachment=15432:fur_mark…recovery.jpg]

I have also tested the latest Quadro driver, 191.78, and it crashed my PC. Not on every run, but after a few runs of the test kernel.

From time to time (like once every 100 crashes) I see similar artifacts ;)

It probably depends on what you have overwritten with your kernel. I tried it several times and I always get these artifacts.

I can confirm that it gives me artifacts even on Windows 7. It doesn’t kill the system though, I get the popup saying the driver died and was restored. I get all kinds of random pixels on my desktop afterwards and it seems the only way to clear it is a reboot. Good catch!

Could we have a comment on this by NVIDIA?

Gave this a quick try on a newer driver on my Server 2008 machine with a G84 and a GT200, and while the G84 eventually timed out there was still no display corruption. I’ll poke around a bit more later, but at the moment I haven’t been able to repro it.

With Windows XP x64 I get no general display corruption on the Windows desktop, only when I start a 3D application (FurMark, for example). The display corruption on Windows 7 probably comes from Aero, which uses 3D. That is my guess.

Indeed, I use Aero.

Another good article came out… it includes more details but also more speculation.

The eye-opening quote from page 5 of this article is that the consumer Fermi may have its DP throughput reduced by 75%… the full DP performance would be reserved for Tesla. This is unconfirmed; it's the first I've seen of it, anyway.