185.18.10 CUDA does not work, 180.X series sorta works....

Hi All,

I posted this on the NVNews site, but they’re referring me here. None of the 185.X series drivers work with GPUGrid. The 185 drivers try to initialize the work unit, but immediately report it as 100% complete and subsequently fail the task. The newer 180.X series drivers (180.37 and up, I suppose) work, but with a lot of temporary screen freezes and an occasional screen lock that does not recover. Here are the bug reports for an earlier 180 series driver (which worked, with screen freezes) and a 185 series driver (which did not work) for debugging purposes.

Mike Doerner

Anybody at Nvidia working on this?

Mike D

Bueller?

This is not much of a repro case…

You stated that you originally reported this on NVNews. Did anyone else confirm that they were also experiencing the same problem?

It’s not entirely clear to me what this failure even looks like from your current description. I’d like to see the log(s) from GPUGRID/BOINC that include the failure.

Does this reproduce if you run GPUGRID while X is NOT running?

I’ve only tried running BOINC in X. How do you run it in the shell?

Basically, under the 185 driver, a task will download, and as soon as CUDA starts to work on it, it claims the task is 100% complete and tries to upload. Under the 180 driver, the task will start the timer and progress to 0.295% within a few minutes. The card is a 9600 GSO, if that helps. OpenSUSE 11.1 and KDE 4.2.2.

Mike D

I’ve never tried to run it under X, but I certainly wouldn’t expect stellar performance under those conditions. GPUGRID tends to consume all free time on a GPU, so even if everything were working perfectly, performance wouldn’t be zippy.

All you need to do is run the ‘boinc’ executable. I’d suggest reading the BOINC documentation for more information on all the options.

Also, I’d still like to see the following information:

  1. You’re the only person that I’m aware of who has reported this. Has anyone else reported this problem that you know of?

  2. Please attach the BOINC log that includes the failure.

??? Sorry if the nvidia-bugreport doesn’t show what you need. Have you run any tasks from GPUGrid under the 185 driver?

Mike D

Since I don’t know of anyone else running CUDA on Linux for GPUGrid, that may make my case unique… :">

After poking around some more documentation, I found this little gem in the CUDA release notes, here: CUDA 2.2 Release Notes

The ‘offending’ snippet is here…

When compiling with GCC, special care must be taken for structs that contain 64-bit integers. This is because GCC aligns long longs to a 4 byte boundary by default, while NVCC aligns long longs to an 8 byte boundary by default. Thus, when using GCC to compile a file that has a struct/union, users must give the -malign-double option to GCC. When using NVCC, this option is automatically passed to GCC.

OK, now when I recompile the 185.18.14 driver with -malign-double, CUDA exhibits a different behavior. Instead of CUDA trying to process the tasks, and immediately calling them 100% complete, it now just sits there with the tasks in the queue and doesn’t do anything. (I think this is an improvement??!?!)

To me, it looks like it’s a flag issue with the way gcc compiles the driver. Either I need ALL the other flags to make this thing work, or I need to grab a copy of Nvidia’s NVCC compiler and re-compile the driver with that compiler. Anybody know where I can grab NVCC? Thanks.

Mike Doerner

OK, I’ve got the CUDA toolkit on my system (OpenSUSE 11.1 and KDE 4.2.2), but I don’t see a flag in NVIDIA-Linux-x86_64-185.18.14-pkg2.run to use the nvcc compiler. I think using nvcc or getting the appropriate gcc flags will solve this problem, since -malign-double fixed the “immediate 100% completion” problem. Presently, BOINC grabs the tasks from GPUGrid, but does not change the status of the 1st task from “Ready to Start” to “Running”.

Mike Doerner

You can’t use nvcc to compile the host driver interface. nvcc is designed for compiling CUDA code into executable GPU payloads and preprocessing/annotating host source with CUDA driver API functions to get those payloads running on the GPU. The intermediate host code produced by nvcc must be passed to the host C compiler for compilation into host object code. It has nothing to do with driver installation or compilation.
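
To make that split concrete, here’s a rough, hypothetical sketch (nothing to do with the GPUGRID application itself): nvcc turns the kernel below into a GPU payload, rewrites the <<<...>>> launch into CUDA API calls, and hands the remaining host code to gcc.

```
// Hypothetical minimal .cu file, purely to illustrate what nvcc compiles.
// The __global__ function becomes a GPU payload; everything in main() is
// ordinary host code that nvcc passes through to the host C++ compiler (gcc).
#include <cstdio>

__global__ void addOne(int *x)          // device code, compiled by nvcc
{
    *x += 1;
}

int main()                              // host code, compiled by gcc
{
    int h_x = 41;
    int *d_x = NULL;
    cudaMalloc((void **)&d_x, sizeof(int));
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);
    addOne<<<1, 1>>>(d_x);              // rewritten by nvcc into runtime/driver calls
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    printf("result = %d\n", h_x);
    return 0;                           // none of this touches the kernel module build
}
```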

NVIDIA ship their host drivers/gpu firmware as a binary blob with a kernel interface wrapper which needs to be compiled against a given kernel configuration and source tree to produce a kernel module. That must be done with the same compiler that built the target kernel, which will be your vendor gcc.

OK, that makes sense. Then what flags should be enabled? Without the -malign-double flag, CUDA says it’s 100% complete on a task even though in reality it has just begun computation. With -malign-double set in CFLAGS, CUDA doesn’t start computation, but then again it doesn’t screw up either. I’d like to know what flags MUST be enabled to get the driver to compile properly. This worked in the 180.X series drivers, but has never worked correctly on the 185.X series drivers on my system. FWIW, gcc 4.3.3 is the default in OpenSUSE 11.1.

Mike Doerner

You shouldn’t have to set any compiler flags. The NVIDIA installer will (and must) use the exact compiler flags used to build the running kernel, which it will read out of the kernel configuration file. It should be a complete “hands off” process.

You seem to have latched onto something in the CUDA release notes, but it is a complete red herring. What is discussed in the release notes is a remark about data alignment when compiling user space host code which will share data with CUDA kernels running on the GPU. It has absolutely nothing to do with building the kernel driver module.
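
To illustrate what that note is actually about, here’s a minimal hypothetical sketch of the situation it describes: a struct shared between host code and a CUDA kernel.

```
// Hypothetical example of the alignment issue the release notes describe.
// 'Mixed' pairs a 32-bit int with a 64-bit long long. On 32-bit x86, gcc
// packs the long long on a 4-byte boundary by default (sizeof == 12), while
// nvcc-compiled device code assumes 8-byte alignment (sizeof == 16). Unless
// the host objects are built with -malign-double, the two sides disagree on
// the layout and the kernel reads garbage. None of this applies to building
// the kernel driver module.
#include <cstdio>

struct Mixed {
    int       flag;    // 4 bytes
    long long value;   // 8 bytes; its alignment depends on the host compiler flags
};

__global__ void readValue(const Mixed *in, long long *out)
{
    *out = in->value;  // only correct if host and device agree on the struct layout
}

int main()
{
    // If this differs from what the device code assumes, data copied to the
    // GPU is silently misinterpreted.
    printf("host sizeof(Mixed) = %zu\n", sizeof(Mixed));
    return 0;
}
```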

OK, so I’m back to square 1. How do you get the 185.X series of CUDA to function? The standard installation does not work. Using the same installation procedure with any 180.X series driver gets CUDA working, though with frozen screens on occasion…

Mike Doerner

How is 185.x not functioning? Can you point me to where you provided the information that I requested last week?

I presume you are basing this “it doesn’t work” purely on the fact that gpugrid doesn’t complete tasks. You mentioned you have installed the CUDA 2.2 toolkit. If you install the 2.2 SDK you should be able to build the examples therein and see if they pass or not. That will provide independent confirmation of whether the CUDA driver is working properly or not.

Bug reports are included in the top post. What other information do you require? CUDA doesn’t run without the X-server, I don’t think.

Compared to the help I’ve received on the NVNews site, this isn’t very helpful. If you want me to help you get more information beyond the 2 bug reports and a description of the problem, you need to give me a specific step-by-step procedure for what you’d like me to try.

Dammit, Jim, I’m a mechanical engineer, not a CUDA developer… :thumbup:

PS I’ll add a BOINC log like you’ve requested in the next post…

Mike Doerner

How do I do this? What test examples should I use? How does this test BOINC/GPUGrid? Please be as specific as possible, I’m a mechanical engineer, not a computer engineer. A step-by-step procedure would be helpful.

Mike Doerner

Go to the same place you downloaded the CUDA toolkit from. Follow the onscreen instructions to select the CUDA SDK version for your linux distribution and then install it.

Any of them will achieve the desired result, but the simplest is deviceQuery. It will use the CUDA driver API to query your GPU and print out its CUDA capabilities and hardware features. If that works, your CUDA installation and drivers work.

It doesn’t. It tests CUDA and confirms you have working drivers. If you can compile and run the SDK examples, then your problem lies somewhere other than the CUDA drivers.
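
For reference, deviceQuery boils down to roughly the following (a sketch, not the actual SDK source): enumerate the CUDA devices through the runtime API and print a few of their capabilities.

```
// Rough sketch of what the SDK's deviceQuery sample does. If this builds
// with nvcc and runs without an error, the driver and toolkit are installed
// correctly.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d, %lu MB of global memory\n",
               i, prop.name, prop.major, prop.minor,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }
    return 0;
}
```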

OK, it seems 185.18.14 is exhibiting a slightly different problem. 185.18.14 does not start any GPUGrid tasks within BOINC. Here’s the latest bug report, screenshot, and BOINC log…

Mike Doerner