NDA expiration - new GF100 information

If you’ve got something that can reliably crash your machine with the latest drivers (196.21 on Windows now, I guess), then you should post a repro case.

Not really… It all boils down to page granularity.

Page protection has a minimum granularity of 4 KB (and a maximum of 4 MB on Intel). Say you have a static array “int array[100]” in your data segment. If we assume “array” is page-aligned, you can still access the entire 4 KB page without getting any faults on the CPU, even though the array itself occupies only 400 bytes (with 4-byte ints).

I am SURE the GPU has memory protection. I have written kernels that probe memory at arbitrary addresses and have seen segfaults… It may be that GPU memory pages are “huge”, which would leave a lot of scope for silent faults…
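
To put a number on the page-granularity point, here is a minimal host-side sketch in C. It assumes a 4 KB page and 4-byte ints; `PAGE_SIZE`, `page_base` and `slack_after_object` are names made up for this illustration, not part of any API. It computes how many bytes past the end of an object still fall inside the same page, which is the region where an overrun would go unnoticed by page protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: 4 KB pages, the minimum x86 page size. */
#define PAGE_SIZE ((uintptr_t)4096)

/* Start address of the page that contains addr. */
static uintptr_t page_base(uintptr_t addr)
{
    return addr & ~(PAGE_SIZE - 1);
}

/* Bytes between the end of the object [start, start + size) and the end
   of the last page it touches. Accesses in this slack overrun the object
   but stay inside a page that is already mapped. */
static size_t slack_after_object(uintptr_t start, size_t size)
{
    assert(size > 0);
    uintptr_t last_byte = start + size - 1;
    uintptr_t page_end = page_base(last_byte) + PAGE_SIZE; /* one past the page */
    return (size_t)(page_end - (start + size));
}
```

For a page-aligned “int array[100]” (400 bytes), `slack_after_object(0x10000, 400)` gives 3696 bytes of silently accessible space after the array.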

Since GPU addresses are context-based, this should NOT affect other contexts… If you are having such a problem, it should be reported as Tim suggested.

Win XP x64 SP 2

GPU: Quadro FX 4800

Quadro Driver 191.00

CUDA Version: 2.3

[attachment=15430:sysinfo.jpg]

This kernel crashes my system reliably:

#include <stdio.h>
#include <cuda_runtime.h>
#include <limits.h>

// Deliberately writes far out of bounds: -i runs from 0 up to INT_MAX,
// while f points to an allocation of only 1024 bytes (256 floats).
__global__ void
killGPU(float* f)
{
	for(int i=0; i > INT_MIN; i--)
		f[-i] = 0;
}

int main(int argc, char** argv)
{
	cudaError_t status;
	float* d_data;

	// Allocate a small 1024-byte buffer on the device.
	status = cudaMalloc((void**) &d_data, 1024);
	if(status != cudaSuccess)
	{fprintf(stderr, "%s\n", cudaGetErrorString(status));}

	// Launch the kernel repeatedly and print any error reported so far.
	for(int i= 0; i < 100; i++)
	{
		killGPU<<<256, 512>>>(d_data);
		status = cudaGetLastError();
		if(status != cudaSuccess)
		{fprintf(stderr, "%s\n", cudaGetErrorString(status));}
	}

	status = cudaFree(d_data);
	if(status != cudaSuccess)
	{fprintf(stderr, "%s\n", cudaGetErrorString(status));}
}
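
For anyone wondering how far out of bounds this kernel goes, here is a rough host-side calculation in C. It assumes a 32-bit int and a 4-byte float; `last_index_written` and `bytes_spanned` are helper names invented for this sketch. It only does the index arithmetic of the loop. It says nothing about how the hardware or driver treats the resulting addresses.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Highest index written by:
       for (int i = 0; i > INT_MIN; i--) f[-i] = 0;
   The last iteration has i == INT_MIN + 1, so -i == INT_MAX. */
static int64_t last_index_written(void)
{
    int64_t last_i = (int64_t)INT_MIN + 1;
    return -last_i;
}

/* Bytes spanned by element indices 0..last_index when each element
   is elem_size bytes wide. */
static int64_t bytes_spanned(int64_t last_index, int64_t elem_size)
{
    assert(last_index >= 0 && elem_size > 0);
    return (last_index + 1) * elem_size;
}
```

Under those assumptions, `bytes_spanned(last_index_written(), 4)` comes to 8589934592 bytes (8 GiB) of offsets per thread, against an allocation of 1024 bytes.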

Sarnath, you are right with your example of page protection. Maybe it’s an issue with the Quadro driver. I will now install the latest GeForce driver and test whether the crash still occurs.

It seems to be a Quadro driver issue, at least with the version (191.00) I used. This was the latest Quadro driver at the time I installed CUDA 2.3.

With the latest GeForce driver, 196.21, I wasn’t able to crash my system. It freezes for a few seconds, but recovers after that.

This makes things much better. I thought it was a general problem.

The only thing missing is better Windows debugging support on the device, but Nexus will provide that?

Now I have tested the GeForce driver 196.21. So far I haven’t managed to crash the PC, but after the recovery the driver does not seem to work correctly.

FurMark 1.65 (and my application) show a broken display output.

Here are the screens:

Before starting my killGPU kernel

[attachment=15431:fur_mark…recovery.jpg]

After starting my killGPU kernel

[attachment=15432:fur_mark…recovery.jpg]

I have also tested the latest Quadro driver, 191.78, and it crashed my PC. Not on every run, but after a few runs of the test kernel.

From time to time (like once every 100 crashes) I see similar artifacts ;)

It probably depends on what you have overwritten with your kernel. I tried it several times and I always get these artifacts.

I can confirm that it gives me artifacts even on Windows 7. It doesn’t kill the system though, I get the popup saying the driver died and was restored. I get all kinds of random pixels on my desktop afterwards and it seems the only way to clear it is a reboot. Good catch!

Could we have a comment on this by NVIDIA?

Gave this a quick try on a newer driver on my Server 2008 machine with a G84 and a GT200, and while the G84 eventually timed out there was still no display corruption. I’ll poke around a bit more later, but at the moment I haven’t been able to repro it.

With Windows XP x64 I get no general display corruption on the Windows desktop, only when I start a 3D application (FurMark, for example). The display corruption in Windows 7 probably comes from Aero, which uses 3D. That is my guess.

Indeed, I use Aero.

Another good article came out… it includes more details but also more speculation.

The eye-opening quote from page 5 of this article is that the consumer Fermi may have its DP throughput reduced by 75%… the DP powerhouse would be reserved for Tesla.
This is unconfirmed… it’s the first I’ve seen of it, anyway.

Sad, but if true, I’m not surprised. I’m certain the GeForce has cannibalized Tesla sales for workstation customers. Double precision is a reasonable way to segment the market, since the majority of the GeForce market doesn’t care about double precision anyway. (Although I would argue with the author that more than just hackers in Eastern Europe are bummed about this. Not all of us have the budget to spend $2k per card when we would like a dozen of them.)

The only bonus would be if this increases the yield of chips going to the GeForce (perhaps by allowing chips with defective double-precision units to be sold) and therefore helps lower the price. If this turns out to be purely a limitation enforced in firmware with no yield benefits, then that will be very depressing.

I use Windows 7 64-bit, but I have Aero disabled, and it looks more or less like my old Windows 98. (I like simple rectangular windows without any fancy stuff.) Nevertheless, I believe that some basic 3D system might still be active, even if it is not really used…


Will there be an update on this problem? Have you been able to repro it yet?

The names of the cards were announced…they’ll be the 400-series:

http://www.anandtech.com/weblog/showpost.aspx?i=668

EDIT: Link to the Twitter account: http://twitter.com/NVIDIAGeForce

The new Kirk and Hwu book focuses on the G80 architecture, but there are a few pages talking in general about “the new Fermi architecture which arrived as this book was going to press.”

The clues mentioned by the book start with the 64-bit address space. It specifically says how useful this is for mapping host and device memory into a single address space, making it easier for the GPU to seamlessly access host data. (This is obviously done via zero-copy-like transparent transfers over the PCIe bus.) This wasn’t news, though I don’t think it had actually been announced before. The book also mentions that this extends to MULTIPLE GPUs, giving each non-colliding addresses and allowing device<->device transfers in a peer-to-peer fashion. This implies bus mastering, though the book doesn’t use that term. The book does say the true potential of the peer-to-peer memory transfer may take “years to fully exploited in the SDK”. Maybe that means we won’t get this feature immediately, even though the hardware is capable?
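
To make the “non-colliding addresses” idea concrete, here is a purely hypothetical sketch in C. The `addr_range` layout, the `owner_of` helper and the owner numbering are all invented for illustration. Nothing here comes from the book or from any NVIDIA API. It only shows that if host and device ranges never overlap within one 64-bit space, the address alone is enough to tell where the data lives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one region in a shared 64-bit address space. */
typedef struct {
    uint64_t base;  /* first address of the range */
    uint64_t size;  /* length in bytes */
    int owner;      /* illustrative convention: -1 = host, 0..n = GPU index */
} addr_range;

#define OWNER_UNMAPPED (-2)

/* Return the owner of addr, or OWNER_UNMAPPED if no range contains it.
   Assumes the ranges do not overlap. */
static int owner_of(const addr_range *ranges, size_t count, uint64_t addr)
{
    assert(ranges != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (addr >= ranges[i].base && addr - ranges[i].base < ranges[i].size)
            return ranges[i].owner;
    }
    return OWNER_UNMAPPED;
}
```

With a table holding one host range and one range per GPU, a copy routine could call `owner_of` on the source and destination pointers to decide whether a transfer is host-to-device or device-to-device. How Fermi actually lays this out is not something the thread says.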

The book does say that certain standard library functions are supported… specifically, using printf() in kernels. This doesn’t sound like the cuPrintf() library; it sounds lower level, since “this can lead to system call traps.”

Next, there’s a small section that talks about the multiple-kernel feature of Fermi. The book says this means it’s now better to use small block counts, since you don’t need to worry about idle SMs any more. This answers a question we had, though… in my interpretation, it seems like kernel blocks now DYNAMICALLY span SMs… meaning that your kernel might start out using 10 of 16 SMs, then during its lifetime use 12, or 4, and then go back to 10… it’s load balanced with the other kernels that are also running.

Kernels are now interruptible. This is an FAQ here on the forum!

We already knew this, but it says that kernels even when debugging will not interfere with the system or display, even during a kernel crash. This implies to me that Nexus will work with a single Fermi card. (Currently it requires 2 GPUs to run on a single system, one for display, one for compute).

Interruptible as in pause/resume, or asynchronously killable?

Looks like cancelled. But again, the wording is vague.