Issues with GTX280 and Mandelbrot

Hi!

We received some of the new GTX 280 cards a few days ago and found that most of our programs freeze after some time. The same applications ran on the 8800 GTX without problems. After some testing we found the following:

*) OpenSuse 64-bit + 8800 GTX with display driver 169.12 + CUDA 1.1 with a scripted stress test works.

*) Ubuntu 32-bit + 8800 GTX + newest drivers and CUDA 2.0: This is the setup we have used until now, and our programs worked on it. Since the update to CUDA Toolkit 2.0 beta 2, the stress tests mentioned above show some problems.

*) Ubuntu 32- and 64-bit + GTX 280, Windows 32-bit + GTX 280: The same as above (freeze :-( )

*) Ubuntu 32- and 64-bit + GTX 280, Windows 32-bit + GTX 280: A program that was compiled on the old setup with 32-bit + 8800 GTX + 169.12/ CUDA 1.1 worked (at least did not freeze in the testing period).

Has anybody tested the GTX 280 with Mandelbrot or similar programs using the current drivers? Everything points to a CUDA toolkit issue. Did we miss some nvcc compiler flags?

Kind regards,
Manuel

To enable double-precision support, compile with: -arch sm_13
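As a rough illustration of why the flag matters: double precision needs compute capability 1.3 (the GTX 280), while the 8800 GTX reports 1.0. As far as I know, nvcc of this era silently demotes double to float when built without -arch sm_13, and current toolkits no longer accept sm_13 at all. The sketch below only queries the device; `supportsDouble` is a hypothetical helper name, not part of the CUDA API, and device 0 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the compute capability is 1.3 or higher,
// which is the minimum for double-precision hardware.
static bool supportsDouble(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 3);
}

int main() {
    const int dev = 0;  // assumption: query the first CUDA device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d, double precision %s\n",
           prop.name, prop.major, prop.minor,
           supportsDouble(prop.major, prop.minor) ? "available"
                                                  : "not available");
    return 0;
}
```

If the device reports less than 1.3, building with -arch sm_13 will not help, and the kernel should stay with float.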

I had freezes too in the beginning on Linux, but I did not have the latest toolkit installed. So make sure you have the very latest driver and toolkit installed. I have a Tesla, so I am not sure whether Mandelbrot will work when the display has to be driven by an 8800GTX. If I don’t forget, I will test on Monday.

169.12 has no support for the GTX280. If that’s the driver that you were using for all your testing, then that could certainly explain instability.

  1. Please ensure that you’re testing with 177.13 on Linux and 177.35 on Windows
  2. Attach an nvidia-bug-report.log from the system(s) where you’re experiencing the instability.
  3. Can you setup a serial console to capture the crash output?
  4. Are you using Mandelbrot as released in the 2.0 SDK, or an older version?
  5. What is the output from “nvcc -V” on your system where you’ve built the app(s) that are causing instability?

Everything on the GTX280 was tested with the newest driver and CUDA 2.0 beta 2. Mandelbrot was tested especially on the GTX280 machines (Ubuntu 32 + 64 bit and Windows – all with the same result of crashing after some time).

I will do that when I’m back in the office (Monday).

Is there a quick way to describe how to do this?

Regards,

Manuel

I already found a nice tutorial. I will try that as soon as possible. What logs and reports do you want to see? Or just a backtrace or something like that?

Regards,

Manuel

I’d like to see the complete serial console output starting from boot until the system crashes, along with an nvidia-bug-report.log.

I did some tests with the GTX280 and the Mandelbrot example (we have the same problems with our programs…)

done

mwerlberger@tweety:~$ dmesg | grep NVRM

[   35.435507] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  177.13  Tue Jun 10 16:42:55 PDT 2008

I attached two logs. One was taken in idle mode and the other while the XServer was hanging (collected remotely over ssh).

I also attached to the process with gdb and ran bt full. That was not possible when I compiled with debug flags. For release mode the output is:

(gdb) bt full

#0  0x00007fc463814d53 in select () from /lib/libc.so.6

No symbol table info available.

#1  0x00007fc4614872b6 in ?? () from /usr/lib/libxcb.so.1

No symbol table info available.

#2  0x00007fc461488e5a in xcb_wait_for_reply () from /usr/lib/libxcb.so.1

No symbol table info available.

#3  0x00007fc461ae3f78 in _XReply () from /usr/lib/libX11.so.6

No symbol table info available.

#4  0x00007fc464a0475e in ?? () from /usr/lib/libGL.so.1

No symbol table info available.

#5  0x00007fc4649dae03 in ?? () from /usr/lib/libGL.so.1

No symbol table info available.

#6  0x00007fc462773150 in ?? () from /usr/lib/libGLcore.so.1

No symbol table info available.

#7  0x00007fc4649c8ae6 in ?? () from /usr/lib/libGL.so.1

No symbol table info available.

#8  0x00007fc4649ce961 in glXSwapBuffers () from /usr/lib/libGL.so.1

No symbol table info available.

#9  0x00007fc464257ac3 in glutSwapBuffers () from /usr/lib/libglut.so.3

No symbol table info available.

#10 0x00000000004054b8 in displayFunc ()

No locals.

#11 0x00007fc46425ef13 in ?? () from /usr/lib/libglut.so.3

No symbol table info available.

#12 0x00007fc464262169 in fgEnumWindows () from /usr/lib/libglut.so.3

No symbol table info available.

#13 0x00007fc46425f7df in glutMainLoopEvent () from /usr/lib/libglut.so.3

No symbol table info available.

#14 0x00007fc46425fc48 in glutMainLoop () from /usr/lib/libglut.so.3

No symbol table info available.

#15 0x0000000000405980 in main ()

No locals.

Not done yet. Is a remote ssh connection also OK for that? Or is the serial console needed so that the program can be executed directly from there?

The newest SDK was used for testing.

mwerlberger@tweety:~$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2007 NVIDIA Corporation

Built on Tue_Jun_10_04:42:57_PDT_2008

Cuda compilation tools, release 1.1, V0.2.1221

I also did some testing with the nvcc flags for double precision. That just makes Mandelbrot freeze Xorg right at program start.

Thanks for any suggestions.

Manuel

Thanks. Are you able to add another GPU to the system such that the GTX280 is only used for CUDA, and not display rendering?

Hi!

Yes, this is possible. I already added an 8800GTX to the existing setup. Until now I have used the 8800GTX for CUDA calculations so that I can continue with development. Tomorrow I will change my xorg setup so that the 280GTX is used for calculation only.

Is there a particular test I should run?

Regards,

Manuel

done… same problems as before. Mandelbrot again freezes the XServer. Just change the color, zoom and pan a little bit and the fun is over again (fast in every sense). The only positive thing is that with our programs we sometimes get an undefined launch failure instead of a hanging X. But in fact this does not make the situation much better…
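For anyone trying to narrow down where a launch failure like the one above comes from, a minimal sketch of checking every runtime call and the kernel launch itself might look like this. `CUDA_CHECK`, `blocksFor` and the `fill` kernel are hypothetical names used only for illustration, not taken from the Mandelbrot sample, and `cudaDeviceSynchronize` is the current spelling (toolkits of the 2.0 era used `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file/line and exit when a runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
static int blocksFor(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Trivial stand-in kernel; a real test would launch the failing kernel here.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main() {
    const int n = 1 << 16;
    const int threads = 256;
    int *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_out, n * sizeof(int)));

    fill<<<blocksFor(n, threads), threads>>>(d_out, n, 7);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    printf("first element: %d\n", first);
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Synchronizing right after the launch costs performance, but it pins an error to the kernel that caused it instead of to whichever call happens to come next. It will not help when the whole X server hangs before the call returns.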

The setup did not change except the 8800GTX as device 0 for display rendering of course. Mandelbrot was adapted in a way that cuda uses device 1 (the 280GTX) for calculations.

Any further suggestions? In my opinion everything still points to a bug within CUDA 2.0b2.

[edit]

I played around with my two-card setup. It turns out that the 8800 GTX also freezes when Mandelbrot is compiled with CUDA 2.0b. It takes much longer until the XServer hangs, but it is still an issue because our programs need to run for longer than just a few minutes. I attached a log of a freeze with the Mandelbrot application using the 8800GTX as the CUDA device.

[/edit]

Regards,

Manuel

From your bug report, it looks like you’re running both X and CUDA on the 8800GTX while the GTX280 sits 100% idle. Are you certain that you’re using the 8800GTX for X only?

I just tested with a cudaGetDevice call inside the fps calculation of the Mandelbrot application. It returns device 1, which is the GTX280 in my setup.

Manuel

Does nobody else have problems with the new cards in combination with CUDA? Since we can reproduce the error with different setups using CUDA Toolkit 2.0 (Win/Linux, 32/64-bit, 8800GTX/GTX280 and all variations), both with an SDK sample and with other algorithms, I do not think it is a problem with our specific setup.

It would be great if anybody else could report what they have experienced.

Regards,
Manuel

I’m not able to reproduce any stability problems with Mandelbrot using 177.13 and a GTX280.

Were all of your tests using the same motherboard?
Have you verified that you’ve applied the most recent motherboard BIOS?

We tested the algorithms on three different systems with three different motherboards (Intel & Nvidia chipsets). I also installed the most recent bios updates, but the problems still remain.

Regards,
Markus

Could you try to lower memory/core clock by 20-40% and try again?

I lowered all clocks by 30% … the programs still crash.

As I already reported, cudaGetDevice is useless: it reports whatever you last passed to e.g. cudaSetDevice, regardless of whether it worked or not. For me a more reliable method was to look at the GPU temperatures with nvidia-settings.
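A slightly stronger check than cudaGetDevice alone, sketched under a few assumptions: list the devices by name, select one, and force context creation so that a bad selection shows up as an error. The index 1 is taken from the setup described in this thread, `validDeviceIndex` is a hypothetical helper, and the CUDA enumeration order does not have to match the order X uses. This still does not prove where a kernel runs, so watching the temperatures remains a useful cross-check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if `requested` is a valid index for `count` devices.
static bool validDeviceIndex(int requested, int count) {
    return requested >= 0 && requested < count;
}

int main() {
    const int requested = 1;  // assumption: the GTX 280 is CUDA device 1
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print every device so the index-to-card mapping is visible.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("device %d: %s\n", i, prop.name);
    }

    if (!validDeviceIndex(requested, count)) {
        fprintf(stderr, "device %d does not exist (%d devices found)\n",
                requested, count);
        return 1;
    }

    err = cudaSetDevice(requested);
    if (err == cudaSuccess)
        err = cudaFree(0);  // no-op free that forces context creation
    if (err != cudaSuccess) {
        fprintf(stderr, "could not use device %d: %s\n", requested,
                cudaGetErrorString(err));
        return 1;
    }
    printf("context created on device %d\n", requested);
    return 0;
}
```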

I looked at the temperature too, and the right card was bound to the CUDA work. Maybe this is important and I did not mention it before: we use an EVGA FTW card. I don’t know if there are known issues with specific cards. But normally EVGA knows what they are doing?

Thx,

Manuel

It does not work with a GTX 280 from ZOTAC either …