CUDA 2.1 discussion

I don’t believe that the official NVIDIA display driver installs the kernel module in /lib/modules/

Can you confirm where that came from?

You are a wise man… It looked funny to me as well, though I did not explicitly recognize it at the time. I don’t know where it came from, but it is in the same directory as a kqemu.ko module, suspiciously…and it was a link to an nvidia.ko module from an earlier kernel version. Very peculiar. I deleted the damn thing, and lo and behold, problem solved! Starts up o.k.; hopefully the violent interaction with firefox (or the stupid firefox flash player?) has been resolved. Thanks for the prompt!

This is just a standard Suse installation, so someone has befuddled their application packaging…

After testing for an hour or so running flash-infested sites in Firefox + FlightGear + framegrabber in TV-time + SDK examples from an earlier 2.1 release:

… I was actually still up and running :-D

Then, I am not sure what happened, but suddenly I got out-of-memory errors from CUDA. I have now recompiled my own stuff against 2.1 and am giving it all another try.

Pinned bandwidthTest (device to host) is down (here) from 2.0 GB/s to 1.5 GB/s.
OTOH, alignedTypes does not fail, which is new to me.

Edit: No problems whatsoever for the last 5 days (until a power outage hit us this morning). The improvements to the 2D X-server are tremendous though and I think I will stick with this release for a while.
One thing I have avoided this time around is experimenting with the over / underclock feature. It does not work anyway … or perhaps it does, but is doing something funny?

All the links should be working now–sorry about that.

The 181.20 driver to go with this SDK doesn’t quite work with some of my PCs that have one or more nVidia 9600GSO (384/768 MB respectively) installed. Has it been verified that this driver is compatible with the 9600GSO? (Windows XP Prof. and Vista 64 bit editions respectively)


If it’s not, I assume the INF can be modified to support whatever you want.

Is there any more information with this version about OpenGL interoperability? In particular I’m looking for which OpenGL calls are legal on a PBO that is currently registered with CUDA. The docs say ‘draw commands’, but what exactly does this encompass? Is glTexSubImage valid? The 2.1 docs don’t seem to have more info.

Just to recap, I have a standard 64-bit Suse 10.3 system on an Intel Q6600-based PC using a GTX 260, with a 650 Watt power supply to feed the beast.

This latest CUDA release, recently installed successfully, seemed to be working o.k. for a time, but it has shown instabilities - I just lost X Windows and a number of running processes that I wish I hadn’t lost. It has also seemed slower, alas. The CUDA process that I strongly suspect triggered the crash was a matlab mex routine that I was running as a test - I have that pretty well debugged by now and have run it many times before without a problem. The routine gets called from a matlab loop, and the whole thing runs for about a half hour or so. It basically just calls the CUBLAS.

I think it’s back to CUDA 2.0 for me; I really need the stability on my desktop! (and not so much CUDA 2.1 just yet).

I am,
your most humble servant,


Later: " It has also seemed slower, alas." I take that back; I have no good evidence that the newer version is slower or faster. No reliable measurement.

The driver installs but I get a black screen on boot (“No Signal” on my LCD screen) - that is more worrisome than just a broken INF.

I agree the documentation could be a bit clearer here.

Drawing commands are anything that only reads from the buffer object (e.g. glDrawPixels, glTex(Sub)Image2D). You only have to unregister the buffer object when you want to do any calls that write to it (e.g. glReadPixels, glBufferData etc.).

The postProcessGL sample in the SDK shows how to do this.

Hope this helps.

Compilation with 2.1 takes much longer than with 2.0.
With 2.0 my kernel compiles in a few seconds, but 2.1 spends nearly a minute doing something in be.exe with 100% CPU load and produces roughly the same code (I am judging by the number of registers used and overall performance).

They should be porting the compiler to CUDA ;)

I don’t have the same experience with my own kernels. I do, however, have a benchmark kernel from someone else that took a long time to compile in 2.0, and seems to take even longer in 2.1. To me it looks like large kernels with lots of loop unrolling take more time to compile. Maybe the compiler tries harder to optimize things?

Everything seems A-OK so far on the servers, though the final word on stability won’t be known until these guys have been running jobs for a few weeks w/o problems.

One minor annoyance is that the text mode display blanking problem has gotten worse. I’m on linux amd64, rhel5.2 with NO X running and just a text mode console. With CUDA 1.1/2.0/and their betas, running a CUDA app would blank the screen. Not a big problem: once the app was done pressing enter brought it back. Well, in CUDA 2.1, the screen seems to stay blanked until I switch to another virtual console and back (ctrl-alt-F2, ctrl-alt-F1). This could potentially be very annoying if the machine were ever to appear to crash from a remote login and I go to the connected keyboard/monitor to try and debug the issue. If I hit enter a few times and nothing showed up on the screen, I’d think the system was dead when it may not be!

Oops, I spoke too soon. CUDA 2.1 doesn’t work on the box with two 9800 GX2s in it. HOOMD has some really weird behavior and bandwidthTest reports something like 30000000000 MB/s of bandwidth and then segfaults. Attaching nvidia-bug-report.log. The configuration is: ASUS P5Q pro MB, 8 GiB DDR2 memory, 2 9800 GX2 cards, running rhel5.2 x86_64.

all right, will test this as soon as I get a chance. thanks.

Yes, I think they’ve adjusted something so that nvcc tries harder to optimize code/unroll loops/allocate registers.

I have another kernel for which 2.0 produced much slower code than 1.1; I will try it with 2.1 tomorrow.

Section of the programming guide - the text refers to __alignof, yet the code example doesn’t reflect this at all…

You’ll save a lot of people headaches if you updated this appropriately - speaking from experience having to figure this out myself over a period of 2 working days… is frustrating.
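For anyone hitting the same wall, here is a minimal sketch of the alignment bookkeeping in question, written in standard C++ so it builds with any host compiler. It is not the guide's code: `Float4`, `alignUp`, `paramOffset` and the `(int, Float4, int)` kernel signature are hypothetical names for illustration, `alignof` stands in for the compiler-specific `__alignof`, and the offsets in the comments assume a 4-byte `int`.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 16-byte-aligned vector type such as float4.
// A host compiler only reproduces the device layout if the alignment is
// spelled out explicitly.
struct alignas(16) Float4 {
    float x, y, z, w;
};

// Round `offset` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Offset at which a parameter of type T must be stored when `bytesUsed`
// bytes of earlier parameters already occupy the buffer.
template <typename T>
constexpr std::size_t paramOffset(std::size_t bytesUsed) {
    return alignUp(bytesUsed, alignof(T));
}

// Layout for a hypothetical kernel taking (int, Float4, int).
constexpr std::size_t kOffsetA   = paramOffset<int>(0);                          // 0
constexpr std::size_t kOffsetB   = paramOffset<Float4>(kOffsetA + sizeof(int));  // 16, not 4
constexpr std::size_t kOffsetC   = paramOffset<int>(kOffsetB + sizeof(Float4));  // 32
constexpr std::size_t kTotalSize = kOffsetC + sizeof(int);                       // 36
```

The point is the second parameter: packing it directly after the `int` would put it at offset 4, while its alignment requires offset 16. With the driver API, offsets computed this way would be the ones handed to the parameter-setting calls.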

Edit (more stuff):

  • Is there a reason “builtin_types.h” isn’t included by the cuda driver api header? (or at least “vector_types.h”)
    ** It would be helpful if “vector_types” had the proper alignment set for the Driver API (eg: when the host compiler isn’t nvcc)…
    ** Also, if you want to avoid filename clashes with other libraries that clients might use in conjunction with CUDA, the CUDA includes should be nested in a ‘cuda’ directory or some such - ‘builtin_types.h’ isn’t exactly ‘unique’ to cuda, and I wouldn’t be at all surprised if there were all sorts of other C/C++ APIs out there with headers called that (also, probably not inside of their own directory… sigh)
  • I know you guys don’t care about compiler support outside of gcc/msvc, but cudaRoundMode & cudaChannelFormatKind don’t have a starting value for their first entry - nor do any of the “texture_types.h” enums.
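To make the last request concrete, here is a small sketch of an enum whose first entry is pinned explicitly. The `sketch` namespace and the enumerator names are hypothetical and only loosely modelled on a rounding-mode enum; this is not the toolkit header. Standard C and C++ already start an unannotated enum at 0, so the explicit `= 0` only matters for tools that do not follow that rule or for readers who want the value documented.

```cpp
namespace sketch {

// First entry given an explicit value; the remaining entries
// count up from it (1, 2, 3) as usual.
enum RoundMode {
    RoundNearest = 0,
    RoundZero,
    RoundPosInf,
    RoundMinInf
};

}  // namespace sketch

// Compile-time check that the numbering is what callers expect.
static_assert(sketch::RoundNearest == 0, "first enumerator must be 0");
```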

We are super-lazy and do not feel that renaming files is necessary ever.

(I am lying, they have since been renamed. Sorry about that. No differences in the files, though.)

@Tim, Thanks for fixing this. The SDK section still needs to be fixed though.

Best Regards,


A secret goodie in the 2.1 toolkit is found in the smokeParticles source folder. It’s a super-duper implementation of radix sorting, discussed in this paper. Performance is roughly twice as good as CUDPP’s current sort!

The paper is excellent, too, well worth understanding just to help learn ways to approach CUDA, modifying the algorithm to match the architecture.
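For readers who have not met radix sorting before, here is a minimal sequential sketch of the basic idea in plain C++. It is not the smokeParticles code and not the paper's GPU algorithm; the function name `radixSort` and the choice of 8 bits per pass are assumptions for illustration. Each pass does a histogram, a prefix sum and a stable scatter, which are the kinds of steps a GPU version has to restructure to suit the hardware.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Least-significant-digit radix sort of 32-bit keys, 8 bits per pass.
std::vector<std::uint32_t> radixSort(std::vector<std::uint32_t> keys) {
    constexpr int kBitsPerPass = 8;
    constexpr std::size_t kBuckets = std::size_t{1} << kBitsPerPass;
    std::vector<std::uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kBitsPerPass) {
        // 1. Histogram: count how many keys fall into each digit bucket.
        std::array<std::size_t, kBuckets> count{};
        for (std::uint32_t k : keys) {
            ++count[(k >> shift) & (kBuckets - 1)];
        }

        // 2. Exclusive prefix sum: turn counts into starting offsets.
        std::size_t running = 0;
        for (std::size_t& c : count) {
            const std::size_t n = c;
            c = running;
            running += n;
        }

        // 3. Stable scatter: keys with equal digits keep their order,
        //    which is what lets later passes build on earlier ones.
        for (std::uint32_t k : keys) {
            scratch[count[(k >> shift) & (kBuckets - 1)]++] = k;
        }

        keys.swap(scratch);
    }

    // Postcondition check, active in debug builds only.
    assert(std::is_sorted(keys.begin(), keys.end()));
    return keys;
}
```

Calling `radixSort` on a vector of unsigned keys returns a sorted copy. The work per pass is linear in the number of keys and there are four passes for 32-bit keys.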